Research on a simplified variable analysis of credit rating in Chinese
small enterprises based on a support vector machine
Ying Chen a,*, Kwok Leung Tam b, Yongping Ren c
a. SILC Business School, Shanghai University, Shanghai, China
b. Insearch Institute, University of Technology Sydney, Sydney, Australia
c. Management School, Shanghai University, Shanghai, China
* Corresponding author
Tel: +86 21 69980028 ext. 53121
E-mail: [email protected]
Address: Chengzhong Road 20, Jiading District, Shanghai, China
Zip code: 201899
a. Lecturer at SILC Business School, Shanghai University, and Ph.D. candidate at the Management School, Shanghai University; primarily responsible for the framework of the paper.
b. Lecturer at the Insearch Institute, University of Technology Sydney, Ph.D. in Statistics; primarily responsible for the methodology and data analysis.
c. Professor at the Management School, Shanghai University, Ph.D. in Accounting; primarily responsible for the theory and concepts.
Research on a simplified variable analysis of credit rating in Chinese
small enterprises based on a support vector machine
Abstract: Small enterprises are an important component of the national economy: they play an important role in people's lives, provide social employment and generate tax revenue. Small enterprises are also valuable customers of commercial banks, which need a decision-making model to analyze whether or not to provide loans to an enterprise. Commercial banks use the credit rating of new customers as a method of analyzing small enterprises before committing to long-term cooperation, drawing on various kinds of information, including financial and non-financial indexes, in a credit rating system. This paper selects a support vector machine algorithm to establish an imbalanced multi-classification model and compares the results with other methods. Furthermore, commercial banks need a simplified variable analysis of the credit rating for small enterprises, one that uses less information and fewer variables and that rapidly and accurately obtains a credit rating for the small enterprise, to assist decision making and improve the efficiency of the process. In this paper, we perform a series of tests and obtain better results with the experimental data than the other methods.
Key words: small enterprises, support vector machine, imbalanced multi-classification, credit rating
1. Introduction
The Chinese government has always attached great importance to the development of small enterprises and has constantly worked to improve their financial ecological environment. However, it has always been very difficult for small enterprises to obtain financing. In July 2014, at State Council executive meetings, Premier Li Keqiang repeatedly raised the question of reducing the cost of financing for enterprises, especially small enterprises. It can thus be seen that limited sources of funds and the lack of continuous financial support have become a bottleneck in the sustainable development of small enterprises. Therefore, it is urgent to solve the financing problem of small enterprises by facilitating effective financial support to promote their development.
At the same time, commercial banks, as money suppliers, are also under huge pressure. After several years of rapid development, Chinese commercial banks and the financial system in general have established a good foundation for enterprise financing. Against the macro background of interest rate liberalization and financial innovation increasingly driven by the Internet, competition between commercial banks has become increasingly fierce: with the rising cost of deposits and pressure on loan interest rates, the profits from the interest spread between deposits and loans have been compressed. Facing fierce competition in the credit market, commercial banks need to acquire customer resources to increase market share, and small enterprises in particular have become important customers. The credit rating of small enterprises is the key to solving the problem of small enterprise financing. The practical problem is to find an efficient and accurate method of obtaining credit ratings for new customers: how to effectively identify and analyze the performance of small enterprises and screen out those with poor performance.
With the increasing demand for identifying small enterprises in commercial banks, credit managers need help to analyze enterprise performance quickly and efficiently. A decision on a loan application is needed before a small enterprise and a commercial bank establish long-term cooperation. Commercial banks select small enterprises according to a variety of information, so the credit manager also needs an effective, scientific method, especially for small enterprises. Therefore, it is necessary to use different kinds of methods, including machine learning algorithms based on data mining, to support these decisions.
2. Literature review
(1) Credit rating of enterprises
There are two categories of credit rating methods: qualitative and quantitative. Qualitative evaluation methods are called artificial expert analysis, also known as the classic credit analysis method. At present, Chinese commercial banks still mainly use this method. However, a few quantitative credit rating methods have also been used. Initially, Altman (1977) used multiple discriminant analysis, and later Li Zhihui and Li Meng (2005) and Junni L. Zhang (2010) used Logistic model analysis. CreditMetrics was introduced by J.P. Morgan in the United States in 1997; it was a Value at Risk model that estimated the risk value of loans and other assets. McKinsey & Company designed the Credit Portfolio View model, which was based on CreditMetrics. It added factors from the macroeconomic cycle, established a relationship between macroeconomic indicators, such as the economic growth rate, interest rates and government expenditures, and the transition matrix of credit ratings, and used the Monte Carlo method to simulate the changes in the transition probabilities of ratings with cyclical factors. The Credit Monitor model was developed by KMV Ltd. in the United States; this method estimated the probability of loan defaults. The CreditRisk+ model was issued by the financial products development department of Credit Suisse and calculated the probability of defaults. There have also been many other artificial intelligence methods for credit rating, such as integer programming, artificial neural networks, genetic algorithms and support vector machine algorithms.
(2) Support vector machine
The support vector machine was first proposed by Corinna Cortes and Vladimir Vapnik in 1995. It is easily combined with other methods and has been widely popularized. Since it has the advantage of solving non-linear, high-dimensional classification and regression problems with relatively high accuracy, it has been widely used in disease diagnosis, handwriting and text recognition, face recognition and image retrieval, engineering analysis, and evaluation and prediction in economics and management. Zhou Qifeng et al. (2005) selected more than 1000 sample records of light-industry enterprises from a commercial bank in 2003. The data included ratios on debt payment, profitability and operational management, and the output grades AAA, AA, A and A-. In their empirical research, the overall test accuracy of the support vector machine reached 83.15% with a faster learning speed, higher than that of the neural network method; it was therefore a suitable credit rating method for commercial banks. Ligang Zhou et al. (2009) explored how to select the parameters for credit scoring with support vector machines and validated the results on two real-world credit datasets. Kyoung-jae Kim et al. (2012) compared support vector machines with other artificial intelligence methods on multi-class problems and obtained improved performance. Terry Harris (2015) used the clustered support vector machine to solve binary-classification credit scoring problems and obtained better results. Pai, Ping-Feng (2015) proposed a new kind of decision-tree support vector machine that combined rough set theory and support vector machines to solve multi-class problems.
At present, there is a large amount of research based on the support vector machine. The existing research on the credit rating of enterprises mostly focuses on binary classification problems and much less on multiple classification. For binary classification problems, the same amount of sample data is generally chosen for the normal and control groups, and there is little research on imbalanced classification. In the index systems for the credit rating of enterprises, the indexes are mostly selected from financial statements, which reflect historical information, and few qualitative indicators are used. Therefore, the support vector machine model for determining the credit rating of enterprises could be further developed and could play a more important role in practice.
3. Problems and concepts analysis
According to the regulations and the collected sample data, this paper defines small enterprises as those with owners' equity of more than 6 million Yuan, a low number of employees, no financial reports fully audited or approved by third-party agencies, and an applied-for loan amount below 30 million Yuan. The index system of credit rating for small enterprises needs more qualitative indexes combined with the quantitative indexes. Commercial banks could initially evaluate small enterprises through a decision system, and the credit managers could then analyze and judge whether or not to grant loans to small enterprises using an artificial expert method. During this period, commercial banks still need to measure the risk exposure of existing customers regularly, pay close attention to the development of the small enterprises, and reduce the possibility of bad debts as far as possible. Therefore, it is necessary to establish a model based on data mining and a machine learning algorithm that provides information to the credit managers in the credit management and risk management departments, according to the requirements and regulations of the new Basel agreement and the China Banking Regulatory Commission.
Generally, there are only a few enterprises that default in each commercial bank's database. Based on the existing data in the database of a commercial bank, the selected model must identify possible defaulting customers among the existing customers, which in essence is an imbalanced classification problem. Imbalanced classification problems include binary imbalanced classification problems and multiple imbalanced classification problems.
At present, there is much research on binary imbalanced classification. Multiple imbalanced classification problems generally refer to problems with more than two classification categories in which there is a big difference between the sizes of the groups, and especially in which certain individual groups contain only a small amount of sample data. With many different methods it is still difficult to fully learn the characteristics of each group, which decreases the accuracy of classification. The credit rating of enterprises in commercial banks is a typical multiple imbalanced classification problem. First, there are 10 distinct grades of credit rating (AAA, AA, A, BBB, BB, B, CCC, CC, C and D), and some commercial banks even add A+ and A- to make twelve grades, based on the different characteristics of the management and performance of the enterprises. Second, the categories in the commercial bank's database contain different numbers of samples; for example, the vast majority are above Grade A and few are below Grade BBB, even though there is still a big difference among the sample groups of Grades AAA, AA and A. Third, the amount of sample data in a certain category might be zero. Since enterprises with low grades would be rejected by commercial banks, there is no sample data in some categories (such as Grade D) in the customer database. In this paper, we use a credit rating system with ten grades, AAA, AA, A, BBB, BB, B, CCC, CC, C and D, for a small commercial bank. Table 1 shows the numbers in each grade for 164 enterprises in the customer database of the commercial bank:
Table 1: The numbers of grades of 164 enterprises

Grades                  | AAA | AA | A  | BBB | BB | B  | CCC | CC | C | D | Total
Numbers of sample data  |  6  | 25 | 49 |  46 | 20 | 16 |  2  |  - | - | - |  164
It can be seen from the above table that there are only seven grades, with no sample data below Grade CCC and different numbers of samples in the other grades. Classification results over 7-10 grades might exist in different years; this differs from other multiple imbalance problems and from binary classification problems in which the quantities of sample data are nearly the same. The basic model of a machine learning algorithm cannot effectively learn the characteristics of the information from these kinds of samples.
From the existing research, there are three ways to address the problem of imbalanced classification: (1) algorithm improvement, (2) improvement of the data sampling technique, or (3) both simultaneously. Algorithm improvement changes the inherent characteristics and the original treatment principle of the algorithm, adapting the calculation and analysis of the model to the requirements of the problem. Improvement of the data sampling technique is aimed at the data selection methods and can be used independently: over-sampling increases the number of samples in minority grades, while under-sampling decreases the number of samples in majority grades (a minimal sketch of these resampling ideas is given below). A hybrid algorithm combines the sampling techniques with the algorithm, and during the training stage the characteristics of the data of the minority grades are analyzed. The advantage of the statistical learning theory underlying the support vector machine is that it solves classification and regression problems well. Therefore, an improvement of the basic support vector machine algorithm can be used to address multiple classification problems.
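As a purely illustrative aside, the following minimal Python sketch (using only NumPy; the data, class sizes and target size are hypothetical) shows the random over-sampling and under-sampling ideas mentioned above; it is not the method adopted later in this paper.

import numpy as np

def random_resample(X, y, target_per_class, seed=0):
    # Over-sample minority grades (with replacement) and under-sample
    # majority grades (without replacement) to target_per_class each.
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [], []
    for grade in np.unique(y):
        idx = np.where(y == grade)[0]
        chosen = rng.choice(idx, size=target_per_class,
                            replace=len(idx) < target_per_class)
        X_parts.append(X[chosen])
        y_parts.append(y[chosen])
    return np.vstack(X_parts), np.concatenate(y_parts)

# Toy usage: three grades with 50, 10 and 2 samples, balanced to 20 each.
X = np.random.default_rng(1).random((62, 4))
y = np.array([0] * 50 + [1] * 10 + [2] * 2)
X_bal, y_bal = random_resample(X, y, target_per_class=20)
print(np.bincount(y_bal))   # expected: [20 20 20]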
4. Methodology
According to the book 'The theory and algorithms of support vector machines' by Deng Naiyang et al. (2005) and to Cuihua Shen (2004), the basic idea of the support vector machine method is to solve regression and classification problems, including both linearly separable and linearly inseparable problems. For the linearly separable case, the training sample set of a binary classification problem is $x_i \in R^n$, $i = 1, 2, \ldots, n$, with corresponding class labels $y_i \in \{-1, 1\}$. The classifying hyperplane is $(w \cdot x_i) + b = 0$, where $w \cdot x_i$ is the dot product of $w$ and $x_i$. If the two kinds of sample data all satisfy $y_i((w \cdot x_i) + b) \ge 1$, $i = 1, \ldots, n$, then the classification margin equals $\frac{2}{\|w\|}$. Under the constraint $y_i((w \cdot x_i) + b) \ge 1$, $i = 1, \ldots, n$, the objective is to maximize the margin $\frac{2}{\|w\|}$, which is the same as minimizing $\frac{\|w\|^2}{2}$. The optimal classifying hyperplane divides the sample data and minimizes $\frac{\|w\|^2}{2}$; the support vectors on the margin hyperplanes determine the optimal hyperplane and the decision function. It is therefore unnecessary to require all the training data to satisfy $y_i((w \cdot x_i) + b) \ge 1$; introducing slack variables $\xi_i \ge 0$, $i = 1, \ldots, n$, the constraints are relaxed to $y_i((w \cdot x_i) + b) \ge 1 - \xi_i$, $i = 1, \ldots, n$. The sum $\sum_i \xi_i$ describes the degree to which the training set is wrongly classified. The penalty parameter $C > 0$ is adjustable: a larger $C$ means a heavier punishment for misclassification. This is a quadratic programming problem, and the following formulation is used to solve the optimization problem:
$$ \min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \qquad (1) $$
$$ \text{s.t.} \quad y_i((w \cdot x_i) + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, n \qquad (2) $$
$$ \xi_i \ge 0, \quad i = 1, 2, \ldots, n \qquad (3) $$
The parameter $C$ is used to balance training accuracy and generalization ability; the $\xi_i$ are slack variables that allow the problem to be solved in a larger feasible region, $w \in R^n$ is a weight vector determining the orientation of the separating hyperplane, and $b$ is the offset of the hyperplane. This quadratic programming problem is solved at the saddle point of the following Lagrange function:

$$ L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left[y_i((w \cdot x_i) + b) - 1 + \xi_i\right] - \sum_{i=1}^{n}\beta_i\xi_i \qquad (4) $$
Here $\alpha_i \ge 0$ and $\beta_i \ge 0$ are the Lagrange multipliers. At the saddle point the gradients with respect to $w$, $b$ and $\xi$ are zero, which gives:

$$ \frac{\partial L}{\partial w} = w - \sum_{i=1}^{n}\alpha_i y_i x_i = 0 \ \Rightarrow \ w = \sum_{i=1}^{n}\alpha_i y_i x_i \qquad (5) $$
$$ \frac{\partial L}{\partial b} = \sum_{i=1}^{n}\alpha_i y_i = 0 \qquad (6) $$
$$ \frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 \qquad (7) $$

Substituting (5), (6) and (7) into (4) and maximizing (4) over $\alpha$ gives the dual optimization problem of (1)-(3):

$$ \max_{\alpha} \ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) \qquad (8) $$
$$ \text{s.t.} \quad \sum_{i=1}^{n}\alpha_i y_i = 0 \qquad (9) $$
$$ 0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, n \qquad (10) $$
Solving the above formulas, each $\alpha_i$ is either $\alpha_i = 0$, $0 < \alpha_i < C$ or $\alpha_i = C$. If $0 < \alpha_i < C$ or $\alpha_i = C$, the corresponding $x_i$ is a support vector. In the support vector machine method, an $x_i$ with $\alpha_i = C$ lies on the boundary and is known as a bounded support vector, while an $x_i$ with $0 < \alpha_i < C$ lies in the interval and is known as a normal support vector. According to the KKT conditions, at the optimum each Lagrange multiplier multiplied by its constraint equals 0:

$$ \alpha_i\left[y_i((w \cdot x_i) + b) - 1 + \xi_i\right] = 0, \quad i = 1, 2, \ldots, n \qquad (11) $$
$$ \beta_i \xi_i = 0, \quad i = 1, 2, \ldots, n \qquad (12) $$

For a normal support vector ($0 < \alpha_i < C$), formula (7) gives $\beta_i > 0$, and then formula (12) gives $\xi_i = 0$. Hence any normal support vector satisfies:

$$ y_i((w \cdot x_i) + b) = 1 \qquad (13) $$

So $b$ is:

$$ b = y_i - (w \cdot x_i) = y_i - \sum_{x_j \in J}\alpha_j y_j (x_j \cdot x_i), \quad x_i \in JN \qquad (14) $$

where $JN$ is the set of normal support vectors and $J$ is the set of support vectors. The constraints of formulas (2) and (3) limit $w$ and $b$ and make the empirical error risk equal to 0. At the same time, minimizing $\|w\|$ minimizes the VC dimension. The optimization problem (1) therefore embodies the principle of structural risk minimization and has good generalization ability. This method solves linearly separable problems very well.
However, for linearly inseparable problems, any straight line would wrongly classify many data points in the training set. In this case, the support vector machine selects a kernel function $K$ that maps the sample data to a high-dimensional feature space, transforming the linearly inseparable problem into a linearly separable one and constructing the optimal hyperplane separating the nonlinearly distributed data points. Different kernel functions give different classifiers, and the parameters used in the kernel function are also very important. In this situation, formula (8) changes to the following form:

$$ \max_{\alpha} \ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j) \qquad (15) $$
$$ \text{s.t.} \quad \sum_{i=1}^{n}\alpha_i y_i = 0 \qquad (16) $$
$$ 0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, n \qquad (17) $$

Here $K(x_i, x_j) = (\phi(x_i) \cdot \phi(x_j))$ is the kernel function. Solving this dual problem determines the final decision function:

$$ f(x) = \mathrm{sgn}\left(\sum_{i=1}^{n}\alpha_i y_i K(x_i, x) + b\right) \qquad (18) $$

If the kernel function $K(x_i, x)$ is selected appropriately, the linearly inseparable problem in the input space can be transformed into a linearly separable problem in the feature space. Many different kernel functions can be used in the support vector machine model. In this paper, the Gaussian radial basis function (RBF) is applied as the kernel function:

$$ K(x, y) = \exp\left(-\gamma\|x - y\|^2\right), \quad \gamma > 0 \qquad (19) $$
The kernel function maps the sample data to a high-dimensional space, which makes it possible to handle a nonlinear relationship between the class labels and the data characteristics, as well as the problem of insufficient prior experience. $\gamma$ is the inherent parameter of the kernel function, which controls how the data are mapped into the new feature space. $C$ is the penalty parameter: the higher its value, the fewer errors the support vector machine classification model tolerates. Without prior knowledge, the parameters are chosen through a grid search, which is a common method of setting parameters.
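As a rough, non-authoritative illustration of such a grid search, the sketch below (assuming scikit-learn and NumPy; the synthetic data stand in for the scaled indicator matrix and grade labels) scans combinations of C and γ for an RBF-kernel support vector machine and reports the best cross-validated accuracy. It is only a sketch of the procedure, not the authors' implementation.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled indicator matrix and grade labels.
rng = np.random.default_rng(0)
X = rng.random((164, 17))            # 164 enterprises, 17 indicators in [0, 1]
y = rng.integers(0, 7, size=164)     # 7 observed grades (AAA ... CCC)

# Candidate values of the penalty parameter C and the RBF parameter gamma.
param_grid = {"C": [1800, 1900, 2000, 2100, 2200, 2500, 2700],
              "gamma": [0.001, 0.003, 0.005, 0.007, 0.009]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))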
In this paper an integrated (ensemble) learning algorithm is used to improve the support vector machine. The process mainly includes three steps: segmentation, training and aggregation. In our sample data sets, the positive subsets are the small enterprises with good credit ratings in the commercial bank, such as grades AAA, AA, A, BBB, BB and B. These enterprises would not default and make up a large percentage of the sample data sets. The small enterprises with poorer credit ratings, such as Grade C or D, form the negative subsets; they are the minority data in the customer database of the commercial bank, would default, and cannot easily be predicted by machine learning methods because their characteristics are inadequately represented. The first step is segmentation, which reclassifies the existing sample groups to achieve nearly balanced groups. The negative groups contain less data and do not need to be further segmented, but the data samples in the positive groups need detailed segmentation and classification. The sample group can be divided into k (k > 3) subclasses; for example, there are multiple classes in the customer data of small enterprises in the commercial bank after data cleaning. The second step is training, which consolidates the sample data of the negative classes. For example, there are only 2 records of grade 7 (CCC) in the sample data and these form a negative class. If the negative classes were simply removed, as in traditional methods, there would be no data for them, so the sample data of the negative classes are kept, while the sample data of the positive categories are the key points and are classified in detail. The support vector machine is used to classify the sample data after segmentation. The third step is aggregation. After training on the sample data, every individual class is integrated to form a suitable method for this kind of classification problem that distinguishes all classes, according to the distance between the feature vectors of the sample data in every class. First the sample data are separated into the negative classes, and then into the positive categories. A new test record is assigned to the class it belongs to by the support vector machine method.
Input:
The known training set is the sample data of small enterprises, $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x$ represents the information of the different characteristics of each small enterprise sample and $y$ represents the corresponding grade of the small enterprise. The positive categories of the training set, $P$, are the customer data sets with good credit ratings, and the negative classes of the training set, $N$, are the customer data sets with poor credit ratings (the sample size of $P$ is $m_1$, the sample size of $N$ is $m_2$, $m_1 + m_2 = L$ and $m_1 \ge m_2$). $M$ is the number of subsets of the positive sample data set.
The positive sample data set $P$ is divided into $M$ (in this case $M = 6$) data subsets $V_i$ ($i = 1, 2, \ldots, M$). Simultaneously, a support vector classifier $C_i$ ($i = 1, 2, \ldots, M$) is obtained for each subset, separating that sample data set. Amalgamating the negative sample data set $N$ with each subset, let $D_i = [V_i, N]$; the decision hyperplane $g_i(x)$ is worked out with the support vector machine formulas, giving the optimal solution for each $D_i = [V_i, N]$. $d_i$ is the distance between the separating hyperplanes of class $c_i$ and class $c_0$, and the weights are $v_i = \exp(d_i / d_m)$, $i = 1, \ldots, M$, where $d_m$ is the median of all the distances between the hyperplanes.
Output:

$$ F(x): \ y = \mathrm{sgn}\left(\sum_{i=1}^{M} v_i\, g_i(x)\right) \qquad (20) $$

$\mathrm{sgn}(\cdot)$ is the sign function. The final output is the specific category $y$ that corresponds to an arbitrary input sample vector $x$. In the learning process, attention must be paid to the choice of $M$, which is not only the number of categories in the positive data sample set but also the number of classifiers in the integrated learning algorithm. Since the size and the distribution of the data affect the efficiency and accuracy of the classification results of the support vector machine classifiers, according to the actual sample data in each year, $M$ can usually be chosen between 6 and 11.
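A minimal Python sketch of this segmentation-training-aggregation idea is given below (assuming scikit-learn and NumPy). The distance d_i is approximated here by the distance between class centroids, and the data are random stand-ins, so the sketch only illustrates the structure of the algorithm, not the authors' exact implementation.

import numpy as np
from sklearn.svm import SVC

def train_ensemble(X_pos, X_neg, M=6, C=2000, gamma=0.003, seed=0):
    # Segmentation: split the majority (positive) samples into M subsets.
    rng = np.random.default_rng(seed)
    subsets = np.array_split(rng.permutation(len(X_pos)), M)
    models, dists = [], []
    for idx in subsets:
        # Training: one RBF-SVM per subset V_i against the whole negative set N.
        Xi = np.vstack([X_pos[idx], X_neg])
        yi = np.hstack([np.ones(len(idx)), -np.ones(len(X_neg))])
        models.append(SVC(kernel="rbf", C=C, gamma=gamma).fit(Xi, yi))
        # Rough stand-in for d_i: distance between subset and negative centroids.
        dists.append(np.linalg.norm(X_pos[idx].mean(axis=0) - X_neg.mean(axis=0)))
    v = np.exp(np.array(dists) / np.median(dists))   # v_i = exp(d_i / d_m)
    return models, v

def predict_ensemble(models, v, X_new):
    # Aggregation: F(x) = sgn( sum_i v_i * g_i(x) ), cf. formula (20).
    g = np.array([m.decision_function(X_new) for m in models])
    return np.sign(v @ g)

# Toy usage with random stand-in data (150 positive vs 14 negative samples).
rng = np.random.default_rng(1)
X_pos, X_neg = rng.random((150, 10)), rng.random((14, 10))
models, v = train_ensemble(X_pos, X_neg)
print(predict_ensemble(models, v, rng.random((3, 10))))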
5. Data
The support vector machine method does not require the sample data to follow a normal distribution or to pass correlation tests. The authors collected the sample data of small enterprises from the customer database of a city commercial bank in Zhejiang Province, China, for financial year 2013. The following table gives the descriptive statistics of the 164 original sample records and the 17 variables.
Table 2: The descriptive statistical analysis of the original sample data

Variables                          | N   | Minimum  | Maximum    | Mean      | Standard deviation | Variance
Working years of manager           | 164 | 1        | 35         | 11.15     | 6.124              | 37.500
Education background of manager    | 164 | 0        | 7          | 2.19      | 1.965              | 3.860
Corporate life                     | 164 | 0        | 21         | 7.18      | 4.149              | 17.214
Investors' assets                  | 164 | 2        | 5          | 4.67      | 0.768              | 0.590
Sales output ratio                 | 164 | 0        | 3          | 2.34      | 0.840              | 0.705
Debt ratio                         | 164 | 0.066    | 0.770      | 0.432     | 0.136              | 0.019
Owner's equity                     | 164 | 7925567  | 590875849  | 68910686  | 65072022           | 4.234*10^15
Current ratio                      | 164 | 0.650    | 6.980      | 1.631     | 0.903              | 0.815
Accounts receivable turnover       | 164 | 2.080    | 127.218    | 10.39     | 12.037             | 144.889
Sales growth rate                  | 164 | -0.920   | 14.040     | 0.334     | 1.176              | 1.384
Profit growth rate                 | 164 | -4.060   | 130.260    | 1.210     | 10.347             | 107.062
Return on equity                   | 164 | -0.060   | 38.650     | 0.475     | 3.015              | 9.090
Personal credit records of manager | 164 | 1        | 1          | 1         | 0                  | 0
Industry policy                    | 164 | 0        | 1          | 0.71      | 0.456              | 0.208
Local environment                  | 164 | 0        | 1          | 0.77      | 0.423              | 0.179
Operating site conditions          | 164 | 0        | 1          | 0.96      | 0.188              | 0.035
Equipment utilization              | 164 | 0        | 1          | 0.87      | 0.335              | 0.112
According to the distribution characteristics of the original data, very few sample values are less than zero and most are greater than zero, so the variables are treated as roughly uniformly distributed. To reduce the differences in scale between variables, the sample data are mapped onto [0, 1] with an extreme-value (min-max) linear transformation, which preserves the ordering of the data from small to large, so that the larger the transformed value, the better the data. A minimal sketch of this scaling is given below.
The complete credit rating method for enterprises contains an index system for the credit rating. Small enterprises in this paper refer to enterprises whose owner's equity is above 6 million Yuan, whose number of employees is low, and which cannot provide complete financial statements audited or recognized by third-party agencies. The index system of credit rating for small enterprises includes both quantitative and qualitative indexes. In order to obtain more accurate results, the authors drew on the experience of the state-owned commercial banks in China and held many discussions with experts. After the discussions, the index system was redesigned as in the following table:
Table 3: The index system of credit rating in small enterprises

Variables                          | Definitions                                                                           | Description (marks)
Working years of manager (Years)   | The experience of the manager in small enterprises.                                   | Above 6 years: 100; 3-6 years: 50; Below 3 years: 0
Education background of manager    | Primary school, middle school, diploma, bachelor degree, master degree.               | Above bachelor degree: 100; Diploma: 50; Below diploma: 0
Corporate life                     | The operating years of the small enterprise.                                          | Above 5 years: 100; 2-5 years: 50; Below 2 years: 0
Investors' assets (Yuan)           | The individual property of investors.                                                 | Above 6 million: 100; 5-6 million: 80; 4-5 million: 60; 3-4 million: 40; 2-3 million: 20; Below 2 million: 0
Sales output ratio                 | Sales units / produced units                                                          | Above 90%: 100; 80-90%: 50; Below 80%: 0
Debt ratio                         | Total liabilities / total assets                                                      | 0: 100; 0-50%: 50; 50%-100%: 0
Owner's equity (Yuan)              | Owner's equity in the small enterprise                                                | The sample data are all more than 6 million: 100
Current ratio                      | Current assets / current liabilities                                                  | Above 1.2: 100; 1.1-1.2: 75; 1-1.1: 50; 0.9-1: 25; Below 0.9: 0
Accounts receivable turnover       | Net sales / net accounts receivable                                                   | Above 4: 100; 2-4: 50; Below 2: 0
Sales growth rate                  | (Net sales this year - net sales last year) / net sales last year * 100%              | Above 20%: 100; 15%-20%: 75; 10%-15%: 50; 5%-10%: 25; Below 5%: 0
Profit growth rate                 | (Total profits this year - total profits last year) / total profits last year * 100%  | Above 20%: 100; 15%-20%: 75; 10%-15%: 50; 5%-10%: 25; Below 5%: 0
Return on equity                   | Total profits / owners' equity                                                        | Above 10%: 100; 8%-10%: 75; 6%-8%: 50; 4%-6%: 25; Below 4%: 0
Personal credit records of manager | Personal credit records of managers in local banks                                    | The sample data are all good: 100
Industry policy                    | Industry policy in the local area                                                     | Good: 100; normal: 50; limited: 0
Local environment                  | Environment policy in the local area                                                  | Good: 100; normal: 50; limited: 0
Operating site conditions          | Whether the operating site is owned or leased                                         | Self-owned: 100; leased: 0
Equipment utilization              | The utilization ratio of operating equipment                                          | High: 100; medium: 50; low: 0
6. Empirical results
The result of each experiment is the predicted precision ratio:

The accuracy rate of precision = the number of correctly classified samples / the number of all samples (21)
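As a hedged illustration only, the snippet below (scikit-learn and NumPy assumed; the data are synthetic stand-ins) computes this precision ratio for one (C, γ) combination and averages it over 20 random train/test splits, in the spirit of the 20-repetition averaging described in the following paragraph; it is a sketch, not the authors' test harness.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def mean_accuracy(X, y, C, gamma, runs=20, seed=0):
    # Accuracy = correctly classified samples / all test samples,
    # averaged over `runs` random train/test splits.
    scores = []
    for r in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed + r)
        clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
        scores.append(float((clf.predict(X_te) == y_te).mean()))
    return float(np.mean(scores))

# Toy usage with synthetic stand-in data.
rng = np.random.default_rng(0)
X, y = rng.random((164, 17)), rng.integers(0, 7, size=164)
print(round(mean_accuracy(X, y, C=2000, gamma=0.003), 4))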
For example, C was initially selected in the range 100-10000, increasing by factors of 10^n, and γ was selected from a range decreasing by factors of 10^-n. According to the convergence of the results and the accuracy of precision, gradually narrowing the scope should produce higher classification accuracy. If C was small and the actual testing accuracy was low, C was gradually increased and brought closer to the optimal value range of the support vector machine. Every trial needed about 10 minutes. Because of the limitations of time and energy, after repeated testing and analysis, the parameter combinations that led to divergent rather than convergent results were ruled out of the classification model, leaving the parameters C = (1800, 1900, 2000, 2100, 2200, 2500, 2700) and γ = (0.001, 0.003, 0.005, 0.007, 0.009). The test results in this range were better than those of other parameters. The different combinations of C and γ composed several different classification models of the support vector machine, and the average precision accuracy of each classification was measured over 20 runs. There were 35 different testing results for analysis, as shown in the table, after the use of the integrated support vector model:
Table 5: The classified results of 17 variables

γ \ C   | 1800   | 1900   | 2000   | 2100   | 2200   | 2500   | 2700
0.001   | 0.7560 | 0.7560 | 0.7548 | 0.7548 | 0.7524 | 0.7560 | 0.7583
0.003   | 0.7690 | 0.7726 | 0.7738 | 0.7714 | 0.7726 | 0.7726 | 0.7702
0.005   | 0.7702 | 0.7643 | 0.7655 | 0.7655 | 0.7655 | 0.7631 | 0.7643
0.007   | 0.7631 | 0.7631 | 0.7667 | 0.7667 | 0.7607 | 0.7595 | 0.7595
0.009   | 0.7619 | 0.7607 | 0.7607 | 0.7607 | 0.7595 | 0.7583 | 0.7583
To compare with the results of the support vector machine classification method in this paper, the authors also used other methods to classify the same sample data of 164 enterprises. Table 6 shows the results:
Table 6: The comparison results of model classification

The name of model                               | The accuracy of classification
Two-step clustering method                      | Silhouette measure of cohesion and separation indicates that the cluster quality is poor.
K-means clustering method                       | 27.8%
System clustering method                        | 43.9%
The radial basis function neural network method | 24.5%
The multilayer perceptron neural network method | 29.8%
From the above table it can be seen that the accuracy of the experimental results with the other methods is low. In interviews, the credit managers in commercial banks reported that they cooperate with the risk assessment managers to carefully identify the potential risk of enterprise loans. In practice, as few variables as possible should be chosen so that small enterprises can be analyzed rapidly and accurately. The authors therefore tested reducing and eliminating a few credit rating variables of small enterprise customers, hoping to obtain acceptable results. With 17 collected variables there are numerous possible combinations, which could not be tested one by one. Dimensionality reduction by factor analysis or principal component analysis was unsuitable for the analysis of credit ratings in commercial banks: the calculated principal components and factors were not stable and their meanings were difficult to interpret. So we could only delete variables step by step according to correlation analysis, using the changes in the accuracy rate to determine the combination of variables (a minimal sketch of such a correlation-based deletion is given below).
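The sketch below is one hedged way such a correlation-guided deletion might look (pandas and NumPy assumed; the column names and threshold are purely illustrative): it drops one member of each pair of indicators whose absolute correlation exceeds a chosen threshold, whereas the paper combines correlation analysis with the observed changes in accuracy.

import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    # Drop one variable from each pair whose absolute Pearson correlation
    # exceeds `threshold` (a simple stand-in for stepwise deletion).
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy usage: three hypothetical indicators, two of them nearly collinear.
df = pd.DataFrame({
    "debt_ratio":    [0.70, 0.10, 0.50, 0.20],
    "current_ratio": [1.20, 1.60, 2.10, 1.40],
    "quick_ratio":   [1.10, 1.50, 2.00, 1.40],  # tracks current_ratio closely
})
reduced, dropped = drop_correlated(df)
print(dropped)   # expected: ['quick_ratio']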
First, 16 of the 17 variables of the valid sample data of 164 small enterprises were selected. After the standardization of the sample data, the variable owner's equity equaled 1 for every enterprise and this variable was eliminated. The new index system included 16 variables: working years of manager, education background of manager, corporate life, investors' assets, sales output ratio, debt ratio, current ratio, accounts receivable turnover, sales growth rate, profit growth rate, return on equity, personal credit records of manager, industry policy, local environment, operating site conditions, and equipment utilization. The test results are shown in the following table:
Table 7: The classification results of 16 variables

γ \ C   | 1800   | 1900   | 2000   | 2100   | 2200   | 2500   | 2700
0.001   | 0.7418 | 0.7409 | 0.7409 | 0.7436 | 0.7455 | 0.7482 | 0.7509
0.003   | 0.7664 | 0.7673 | 0.7718 | 0.7745 | 0.7745 | 0.7727 | 0.7727
0.005   | 0.7636 | 0.7591 | 0.7591 | 0.7564 | 0.7582 | 0.7527 | 0.7609
0.007   | 0.7527 | 0.7573 | 0.7573 | 0.7582 | 0.7609 | 0.7627 | 0.7591
0.009   | 0.7618 | 0.7655 | 0.7673 | 0.7664 | 0.7664 | 0.7682 | 0.7700
From the above table, the accuracy rate of the classification with 16 variables was lower than that with 17 variables, which shows that the support vector machine is more effective, with higher accuracy, on high-dimensional classification problems. After standardization, the variable personal credit records of manager was "good" for all 164 enterprises in the sample and its standardized value was 1, so another set of 16 variables was used for testing: working years of manager, education background of manager, corporate life, investors' assets, sales output ratio, debt ratio, owner's equity, current ratio, accounts receivable turnover, sales growth rate, profit growth rate, return on equity, industry policy, local environment, operating site conditions, and equipment utilization. The test results are shown in the following table:
Table 8: The classification results of 16 variables

γ \ C   | 1800   | 1900   | 2000   | 2100   | 2200   | 2500   | 2700
0.001   | 0.7727 | 0.7718 | 0.7700 | 0.7691 | 0.7709 | 0.7709 | 0.7709
0.003   | 0.7782 | 0.7845 | 0.7873 | 0.7891 | 0.7864 | 0.7909 | 0.7873
0.005   | 0.7855 | 0.7836 | 0.7800 | 0.7773 | 0.7718 | 0.7764 | 0.7736
0.007   | 0.7755 | 0.7755 | 0.7736 | 0.7764 | 0.7755 | 0.7736 | 0.7718
0.009   | 0.7718 | 0.7736 | 0.7709 | 0.7709 | 0.7691 | 0.7664 | 0.7636
This test used eight variables: debt ratio, current ratio, sales growth rate, return on equity, corporate life, industry policy, investors' assets, and sales output ratio; it deleted working years of manager, education background of manager, owner's equity, accounts receivable turnover, profit growth rate, personal credit records of manager, local environment, operating site conditions, and equipment utilization. The test results are shown in the following table:
Table 9: The classification results of 8 variables

γ \ C   | 1800   | 1900   | 2000   | 2100   | 2200   | 2500   | 2700
0.001   | 0.6298 | 0.6286 | 0.6369 | 0.6381 | 0.6357 | 0.6524 | 0.6607
0.003   | 0.6976 | 0.6917 | 0.6881 | 0.6905 | 0.6881 | 0.6929 | 0.6929
0.005   | 0.6964 | 0.7012 | 0.7000 | 0.6976 | 0.7060 | 0.7095 | 0.7095
0.007   | 0.7107 | 0.7107 | 0.7095 | 0.7190 | 0.7202 | 0.7286 | 0.7310
0.009   | 0.7262 | 0.7274 | 0.7321 | 0.7298 | 0.7298 | 0.7286 | 0.7286
This test used 8 variables, including debt ratio, current ratio, sales growth rate, return on equity, corporate life, industry policy, investors' assets, and sales output ratio, and deleted education background of manager, working years of manager, owner's equity, current ratio, accounts receivable turnover, profit growth rate, return on equity, personal credit records of manager, industry policy, local environment, operating site conditions, and equipment utilization. The test results are shown in the following table:
Table 10: The classification results of 8 variables

γ \ C   | 1000   | 1200   | 1400   | 1600   | 1800   | 2000   | 2200
0.001   | 0.6286 | 0.6190 | 0.6238 | 0.6250 | 0.6298 | 0.6369 | 0.6357
0.003   | 0.6690 | 0.6869 | 0.6905 | 0.6976 | 0.6976 | 0.6881 | 0.6881
0.005   | 0.6964 | 0.6917 | 0.6857 | 0.6929 | 0.6964 | 0.7000 | 0.7060
0.007   | 0.6857 | 0.6940 | 0.6988 | 0.7071 | 0.7107 | 0.7095 | 0.7202
0.009   | 0.6952 | 0.6988 | 0.7107 | 0.7143 | 0.7262 | 0.7321 | 0.7298
Each different combination of the parameters C and γ constituted one support vector classification machine, so there were 35 support vector classification machines in each table, all run many times, which amounted to thousands of support vector machine classifiers in total. The reported result for each combination is the average accuracy rate over 20 runs, which enhances the precision and robustness of the classification.
Table 11: The comparison results of classification

The name of model                               | The accuracy rate of classification
Two-step clustering method                      | Silhouette measure of cohesion and separation indicates that the cluster quality is poor.
K-means clustering method                       | 35.98%
System clustering method                        | 15.24%
The radial basis function neural network method | 31.9%
The multilayer perceptron neural network method | 49%
From the above table it can be seen that the accuracy of the other methods is again low.
7. Conclusion
In China, it is necessary for commercial banks to identify the credit ratings of small enterprises. The authors selected a suitable ensemble support vector machine method to analyze the sample data from the customer database of a commercial bank. Based on the test results of these models, after many tests and analyses, the index system of variables was gradually adjusted. Since the characteristics of the support vector machine are suited to high-dimensional nonlinear classification problems, the more features there are in the variable index system, the higher the attained accuracy. The support vector machine method does not require the sample data to follow a normal distribution or to pass correlation tests in order to solve this kind of imbalanced multi-classification problem, and it enhances the precision and robustness of the classification.
From the figure below it can be seen that when the index system contained 15 to 17 variables, the accuracy rate of classification was close to 80%. Decreasing the number of variables in the index system caused the accuracy rate of classification to fall. The accuracy rate with 8 variables in the index system was over 62%, and for some combinations of parameters above 70%, which is a relatively good evaluation accuracy. When the number of variables decreased to 7, the accuracy quickly fell below 60% and was no longer of reference value. Thus, 8 variables were selected as the simplified index system of credit rating for small enterprises.
[Figure: schematic diagram of the relationship between the number of variables (x-axis, 8 to 16) and the accuracy rate of classification (y-axis, roughly 60% to 80%).]
Therefore, by comparing the results, we selected 8 variables: working years of manager, debt ratio, profit growth rate, sales growth rate, corporate life, industry policy, investors' assets and sales output ratio. After normalization, enterprise owner's equity and personal credit records of manager both equaled 1, yet they are very important evaluation indexes. These two variables were not brought into the calculation when the classifiers were tested, but they cannot be ignored in the simplified index system of credit rating for small enterprises. So the simplified index system of credit rating for small enterprises includes 10 variables: working years of manager, debt ratio, profit growth rate, sales growth rate, corporate life, industry policy, investors' assets, sales output ratio, enterprise owner's equity and personal credit records of the manager.
References:
Altman, E.I., Haldeman, R. & Narayanan, P. 1977, 'ZETA analysis: A new model to identify bankruptcy risk of corporations', Journal of Banking and Finance, no. 1, pp. 29-54.
Basel Committee on Banking Supervision 2001, Consultative Document: The New Basel Capital Accord, Bank for International Settlements, January 2001.
Cristianini, N. & Shawe-Taylor, J. 2005, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, China Machine Press, English edition, July 2005.
Committee on the Global Financial System 2001, A Survey of Stress Tests and Current Practice at Major Financial Institutions, Bank for International Settlements, 2001.
Martin, D. 1977, 'Early warning of bank failure: A logit regression approach', Journal of Banking and Finance, no. 1, pp. 249-276.
Carey, M. 2002, A Guide to Choosing Absolute Bank Capital Requirements, Board of Governors of the Federal Reserve System, International Finance Discussion Papers, no. 726, May.
Cuihua Shen 2004, Research on individual credit rating of loans for consumption based on support vector machine method (in Chinese), Doctoral Dissertation, China Agricultural University, Nov. 2004.
Cheng-Lung Huang, Mu-Chen Chen & Chieh-Jen Wang 2007, 'Credit scoring with a data mining approach based on support vector machines', Expert Systems with Applications, no. 33, pp. 847-856.
Chong Wu & Hang Xia 2008, 'Research on customer credit evaluation model based on integrated support vector machine in the electronic commerce environment' (in Chinese), Chinese Journal of Management Science, Oct. 2008.
Euro-currency Standing Committee of the central banks of the Group of Ten countries 1998, On the Use of Information and Risk Management, Bank for International Settlements, Basle, 1 October 1998.
Fuyong Wan & Xiufu Shi 2009, 'Research on financial prediction on credit evaluation method of listed company based on support vector machine' (in Chinese), China Intelligent Automation Conference Proceedings 2009 (the second volume).
Grinold, R.C. & Kahn, R.N. 1999, Active Portfolio Management: A Quantitative Approach for Producing Superior Returns and Controlling Risk, McGraw-Hill.
Global Credit Research 2002, LossCalc(TM): Moody's Model for Predicting Loss Given Default (LGD), February.
Jorion, P. 2000, Value at Risk: The New Benchmark for Managing Financial Risk, McGraw-Hill.
Junni L. Zhang & Härdle, W.K. 2010, 'The Bayesian Additive Classification Tree applied to credit risk modeling', Computational Statistics and Data Analysis, no. 54, pp. 1197-1205.
KMV 1993, Portfolio Management of Default Risk, KMV Corporation, released November 15, 1993, revised May 31, 2001.
KMV 1993, Credit Monitor Overview, KMV Corporation, San Francisco, 1993.
Reining, A. 2001, Monte Carlo Simulation in the Integrated Market and Credit Risk Portfolio Model, Algorithmic Inc., June.
Kyoung-jae Kim & Hyunchul Ahn 2012, 'A corporate credit rating model using multi-class support vector machines with an ordinal pairwise partitioning approach', Computers & Operations Research, no. 39, pp. 1800-1811.
Ligang Zhou, Kin Keung Lai & Lean Yu 2009, 'Credit scoring using support vector machines with direct search for parameters selection', Soft Computing, no. 13, pp. 149-155.
Liyong Yu & Jiehui Zhan 2004, 'Research on default probability prediction based on Logistic regression analysis' (in Chinese), Journal of Finance and Economics, no. 9, pp. 15-23.
Morgan, J.P. 1997, CreditMetrics Technical Document, New York, April 2, 1997.
Gieseke, K. 2002, Credit Risk Modeling and Valuation: An Introduction, Humboldt-University zu Berlin, August 19.
Naiyang Deng & Yingjie Tian 2009, Support Vector Machines: Theory, Algorithms and Development (in Chinese), China Science Press, first edition, Aug. 2009.
O'Connor, R., Golden, J.F. & Rack, R. (year), A Value-At-Risk Calculation of Required Reserves for Credit Risk in Corporate Lending Portfolios.
Pai, Ping-Feng, Tan, Yi-Shien & Hsu, Ming-Fu 2015, 'Credit rating analysis by the decision-tree support vector machine with ensemble strategies', International Journal of Fuzzy Systems, no. 17, pp. 521-530.
Qian Li, Bing Yang, Yi Li, Naiyang Deng & Ling Jing 2012, 'Constructing support vector machine ensemble with segmentation for imbalanced datasets', Neural Computing & Applications, DOI 10.1007/s00521-021-1041-z, published online 10 July 2012.
Shu-Ting Luo, Bor-Wen Cheng & Chun-Hung Hsieh 2009, 'Prediction model building with clustering-launched classification and support vector machines in credit scoring', Expert Systems with Applications, no. 36, pp. 7562-7566.
Sinha, A.P. & Zhao, H. 2008, 'Incorporating domain knowledge into data mining classifiers: An application in indirect lending', Decision Support Systems, no. 46, pp. 287-299.
Saunders, A. 1999, Credit Risk Measurement, John Wiley & Sons, New York.
Shinong Wu & Shizhong Huang 2001, 'Research on prediction model of financial distress of listed corporations in China' (in Chinese), Economics Research Journal, no. 6, pp. 46-55.
Terry Harris 2015, 'Credit scoring using the clustered support vector machine', Expert Systems with Applications, no. 42, pp. 741-750.
Wun-Hwa Chen & Jen-Ying Shih 2006, 'A study of Taiwan's issuer credit rating systems using support vector machines', Expert Systems with Applications, no. 30, pp. 427-435.
Wenbing Xiao, Qi Fei & Hu Wan 2007, 'The credit evaluation model based on support vector machine and risk' (in Chinese), Chinese Science Abstracts (Chinese Edition), vol. 13, no. 22, p. 284.
Yifeng Zhou & Chengde Lin 2005, 'The comparison on classification method of credit risk assessment in commercial bank' (in Chinese), Proceedings of the 24th Chinese Control Conference, July 2005, pp. 1734-1737.
Zhihui Li & Meng Li 2005, 'Empirical research on the credit risk identification model in Chinese commercial banks' (in Chinese), Economic Science, no. 5, pp. 61-71.
Zongyuan Zhao, Shuxiang Xu, Byeong Ho Kang, Mir Md Jahangir Kabir, Yunling Liu & Rainer Wasinger 2015, 'Investigation and improvement of multi-layer perceptron neural networks for credit scoring', Expert Systems with Applications, no. 42, pp. 3508-3516.