Research on a simplified variable analysis of credit rating in Chinese small enterprises based on a support vector machine

Ying Chen a*, Kwok Leung Tam b, Yongping Ren c
a. SILC Business School, Shanghai University, Shanghai, China
b. Insearch Institute, University of Technology Sydney, Sydney, Australia
c. Management School, Shanghai University, Shanghai, China
* Corresponding author. Tel: 86 21 69980028*53121. E-mail: [email protected]. Address: Chengzhong Road 20, Jiading District, Shanghai, China. Zip code: 201899
a. Lecturer, SILC Business School, Shanghai University; Ph.D. candidate, Management School, Shanghai University; primarily responsible for the framework of the paper
b. Lecturer, Insearch Institute, University of Technology Sydney; Ph.D. in statistics; primarily responsible for methodology and data analysis
c. Professor, Management School, Shanghai University; Ph.D. in accounting; primarily responsible for the theory and concepts

Abstract: Small enterprises are an important component of the national economy: they play an important role in people's lives, provide social employment and generate tax payments. Small enterprises are also valuable customers of commercial banks, which need a decision-making model to analyze whether or not to provide loans to an enterprise. Commercial banks use the credit rating of new customers as a method of analyzing small enterprises before committing to long-term cooperation, drawing on various kinds of information, including financial and non-financial indexes, in the credit rating system. This paper selects a support vector machine algorithm to establish an imbalanced multi-classification model and compares the results with other methods. Furthermore, commercial banks need a simplified variable analysis of the credit rating of small enterprises, one that uses less information and fewer variables yet rapidly and accurately obtains a credit rating, to assist in making decisions and to improve the efficiency of the process. In this paper, we perform a number of tests and obtain better results on the experimental data than the other methods.

Key words: small enterprises, support vector machine, imbalanced multi-classification, credit rating

1. Introduction

The Chinese government has always attached great importance to the development of small enterprises and constantly improves their financial ecological environment to help in their development. However, it has always been very difficult for small enterprises to obtain finance. In July 2014, Premier Li Keqiang repeatedly raised questions related to reducing the cost of finance for enterprises, especially small enterprises, at the State Council executive meeting. It can thus be seen that the limited sources of funds and the lack of continuous financial support have become a bottleneck in the sustainable development of small enterprises. Therefore, it is urgent to solve the financing problem of small enterprises by facilitating effective financial support to promote their development. At the same time, the commercial banks, as money suppliers, are also under huge pressure. After a few years of rapid development, the Chinese commercial banks and the financial system in general have established a good financial foundation for enterprise financing.
Under the macro background of interest rate liberalization and financial innovation increasingly driven by the Internet, market competition between commercial banks has become increasingly fierce: the cost of deposits and the interest rates on loans have risen, and the profit from the spread between deposit and loan rates has been compressed. With fierce competition in the credit market, commercial banks need to acquire customer resources to increase market share, especially among small enterprises, which have become important customers of commercial banks. The credit rating of small enterprises is the key to solving the problem of small enterprise financing. The practical problem is finding an efficient and accurate method for rating new customers: how to effectively identify and analyze the performance of small enterprises and screen out those with poor performance. With the increasing demand for identifying small enterprises in commercial banks, credit managers need help to quickly and efficiently analyze enterprise performance. The decision on a loan application is needed before a small enterprise and a commercial bank establish long-term cooperation. Commercial banks select small enterprises according to a variety of information, so the credit manager also needs an effective, scientific method, especially for small enterprises. Therefore, it is necessary to use different kinds of methods, including machine learning algorithms based on data mining, to support these decisions.

2. Literature review

(1) Credit rating of enterprises. There are two categories of credit rating methods: quantitative and qualitative. Qualitative evaluation methods are called artificial expert analysis, also known as the classic credit analysis method. At present, Chinese commercial banks still mainly use this method. However, a few quantitative credit rating methods have been used. Initially, Altman (1977) used multiple discriminant analysis, and Li Zhihui and Li Meng (2005) and Junni L. Zhang (2010) used Logistic model analysis. CreditMetrics was introduced by J.P. Morgan in the United States in 1997; it was a Value at Risk model that estimated the risk value of loans and other assets. McKinsey & Company designed the Credit Portfolio View model, which was based on CreditMetrics. It added macroeconomic cycle factors, established a relationship between macroeconomic indicators, such as the economic growth rate, interest rate and government expenditures, and the credit rating transition matrix, and used Monte Carlo simulation to model how the transition probabilities of ratings change with cyclical factors. The Credit Monitor model was developed by KMV Ltd. in the United States, and the method estimated the probability of loan default. The CreditRisk+ model was issued by the financial products development department of Credit Suisse and calculated the probability of defaults. There were many other artificial intelligence methods for credit rating, such as integer programming, artificial neural networks, genetic algorithms and support vector machine algorithms.

(2) Support vector machine. The support vector machine was first proposed by Corinna Cortes and Vapnik in 1995. It was easily combined with other methods and conveniently popularized.
Since it had the advantage of solving non-linear, high-dimensional classification and regression problems with relatively high accuracy, it was widely used in disease diagnosis, handwriting and text recognition, face recognition and image retrieval, engineering analysis and application, and evaluation and prediction in economics and management. Zhou Qifeng et al. (2005) selected more than 1,000 sample records of light-industry enterprises from a commercial bank in 2003. The data included ratios on debt repayment, profitability and operational management, and the output classes AAA, AA, A and A-. In their empirical research using a support vector machine, the overall test accuracy reached 83.15% with a faster learning speed, higher than that of the neural network method, making it a suitable credit rating method for commercial banks. Ligang Zhou et al. (2009) explored how to select the parameters for credit scoring using support vector machines and verified the results on two real-world credit datasets. Kyoung-jae Kim et al. (2012) compared support vector machines and other artificial intelligence methods on multi-class problems and obtained improved performance. Terry Harris (2015) used the clustered support vector machine to solve binary-classification credit scoring problems and obtained better results. Pai, Ping-Feng (2015) proposed a new kind of decision-tree support vector machine that combined rough set theory and the support vector machine to solve multi-class problems. At present, there is a large body of research based on the support vector machine. The existing research on enterprise credit rating mostly focuses on binary classification problems and much less on multiple classification. For binary classification problems, the same amount of sample data is generally chosen for the normal and control groups, and there is little research on imbalanced classification. In the index systems used for enterprise credit rating, the indexes are mostly selected from financial statements that reflect historical information, and few qualitative indicators are used. Therefore, support vector machine models for enterprise credit rating can be further developed and play a more important role in practice.

3. Problems and concepts analysis

According to the regulations and the collected sample data, this paper defines small enterprises as those having owners' equity of more than 6 million Yuan, a low number of employees, no complete financial reports audited or approved by third-party agencies, and an applied-for loan amount below 30 million Yuan. The index system for the credit rating of small enterprises needs to combine qualitative indexes with quantitative indexes. Commercial banks could initially evaluate small enterprises through a decision system, and then the credit managers could analyze and judge whether or not to grant loans to small enterprises by the artificial expert method. During this period, commercial banks still need to measure the risk exposure of existing customers regularly, pay close attention to the development of the small enterprises, and reduce the possibility of bad debts as far as possible.
Therefore, it is necessary to establish a model based on data mining and a machine learning algorithm that provides information to the credit manager in the credit management and risk management departments, in accordance with the requirements and regulations of the new Basel agreement and the China Banking Regulatory Commission.

Generally, there are only a few enterprises that default in each commercial bank's database. Based on the existing data in the database of a commercial bank, the selected model has to identify possible defaulting customers among the existing customers, which is in essence an imbalanced classification problem. Imbalanced classification problems include binary imbalanced classification problems and multiple imbalanced classification problems. At present, there is much research on binary imbalanced classification. Multiple imbalanced classification problems generally refer to problems with more than two classification categories and big differences between the numbers of samples in each group, especially where only a small amount of sample data exists in certain individual groups. With many different methods it is still difficult to fully learn the characteristics of each group, which decreases the accuracy of the classification.

The credit rating of enterprises in commercial banks is a typical multiple imbalanced classification problem. First, there are 10 distinguished grades of credit rating, AAA, AA, A, BBB, BB, B, CCC, CC, C and D, and some commercial banks even add A+ and A- to make twelve grades, based on the different characteristics of management and performance of the enterprises. Second, the number of samples in each category in the commercial bank's database differs: the vast majority are above Grade A and few are below Grade BBB, and there is still a big difference among the sample groups of Grades AAA, AA and A. Third, the amount of sample data in a certain category might be zero. Since enterprises with low grades are rejected by commercial banks, there is no sample data in some categories (such as Grade D) in the customer database of the commercial bank.

In this paper, we use a credit rating system with ten grades, AAA, AA, A, BBB, BB, B, CCC, CC, C and D, for a small commercial bank. Table 1 shows the number of enterprises in each grade among the 164 enterprises in the customer database of the commercial bank:

Table 1: The numbers of grades of 164 enterprises

Grade | AAA | AA | A | BBB | BB | B | CCC | CC | C | D | Total
Number of samples | 6 | 25 | 49 | 46 | 20 | 16 | 2 | - | - | - | 164

It can be seen from the above table that only seven grades have sample data and none exists below Grade CCC, with different numbers of samples in the other grades. The classification results for 7-10 grades might exist in different years, and this differs from other multiple imbalance problems and from binary classification problems in which the quantities of sample data are nearly the same. A basic machine learning model cannot effectively learn the characteristics of the information from this kind of sample. From the existing research, there are three ways to solve the problem of imbalanced classification: (1) algorithm improvement, (2) improvement in the technique of data sampling, or (3) both improved simultaneously; the imbalance in Table 1, and the data-level resampling remedies in particular, are illustrated in the sketch below.
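To make the imbalance concrete, the following is a minimal Python sketch (not the authors' code) that reproduces the grade counts of Table 1 and shows, on a toy label array, the two data-level remedies mentioned above: random over-sampling of a minority grade and random under-sampling of a majority grade. The counts come from Table 1; the array names and target sizes are purely illustrative assumptions.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Grade counts taken from Table 1 (grades CC, C and D have no samples).
counts = {"AAA": 6, "AA": 25, "A": 49, "BBB": 46, "BB": 20, "B": 16, "CCC": 2}
y = np.array([g for g, n in counts.items() for _ in range(n)])
print(Counter(y))          # imbalance: 49 samples of grade A vs. only 2 of grade CCC

# Data-level remedy 1: random over-sampling duplicates minority-grade samples
# (the target size of 20 is an illustrative choice).
ccc_idx = np.where(y == "CCC")[0]
oversampled_idx = rng.choice(ccc_idx, size=20, replace=True)

# Data-level remedy 2: random under-sampling keeps only a subset of a majority grade.
a_idx = np.where(y == "A")[0]
undersampled_idx = rng.choice(a_idx, size=20, replace=False)

print(len(oversampled_idx), len(undersampled_idx))
```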
The algorithm improvement changes the inherent characteristics and the original treatment principle of the algorithm, adapting the calculation and analysis of the model to the requirements of the problem. The data sampling improvement is aimed at the methods of data selection and can be used independently: over-sampling increases the number of samples in minority grades, and under-sampling decreases the number of samples in majority grades. A hybrid algorithm combines the sampling techniques with the algorithm; during training, the characteristics of the minority-grade data are analyzed. The statistical learning theory behind the support vector machine is well suited to classification and regression problems. Therefore, an improvement of the basic support vector machine algorithm can be used to address multiple classification problems.

4. Methodology

Following 'The theory and algorithms of support vector machines' by Deng Naiyang et al. (2005) and Cuihua Shen (2004), the basic support vector machine method addresses regression and classification problems, including linearly separable and linearly inseparable problems. For the linearly separable case, the training sample set of the binary classification problem was $x_i \in R^n$, $i = 1, 2, \dots, n$, with corresponding class labels $y_i \in \{-1, 1\}$, $i = 1, 2, \dots, n$. The separating hyperplane was $(w \cdot x_i) + b = 0$, where $w \cdot x_i$ was the dot product of $w$ and $x_i$. Requiring that both kinds of sample data satisfy $y_i((w \cdot x_i) + b) \ge 1$, $i = 1, \dots, n$, the classification margin equals $\frac{2}{\|w\|}$. Under the constraint $y_i((w \cdot x_i) + b) \ge 1$, $i = 1, \dots, n$, the objective maximizes the margin $\frac{2}{\|w\|}$; maximizing the margin is equivalent to minimizing $\frac{\|w\|^2}{2}$. The optimal separating hyperplane divides the sample data while minimizing $\frac{\|w\|^2}{2}$. The support vectors on the margin determine the optimal hyperplane and the decision function, so it was unnecessary to require that all training data satisfy $y_i((w \cdot x_i) + b) \ge 1$, $i = 1, \dots, n$. Introducing slack variables $\xi_i \ge 0$, $i = 1, \dots, n$, the constraints were relaxed to $y_i((w \cdot x_i) + b) \ge 1 - \xi_i$, $i = 1, \dots, n$, and $\sum_i \xi_i$ described the degree to which the training set was wrongly classified. The penalty parameter $C > 0$ was adjustable; a larger $C$ meant a heavier punishment for misclassification. This was a quadratic programming problem, and the following formulation was used to solve the optimization problem:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \qquad (1)$$
$$\text{s.t.}\quad y_i((w \cdot x_i) + b) \ge 1 - \xi_i,\quad i = 1, 2, \dots, n \qquad (2)$$
$$\xi_i \ge 0,\quad i = 1, 2, \dots, n \qquad (3)$$

The parameter $C$ balanced the training accuracy and the generalization ability. The slack variables $\xi_i$ enlarged the feasible region of the problem, $w \in R^n$ was the weight vector determining the orientation of the separating hyperplane, and $b$ was the offset of the hyperplane. This quadratic programming problem was solved through the saddle points of the following Lagrange function:

$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left[y_i((w \cdot x_i) + b) - 1 + \xi_i\right] - \sum_{i=1}^{n}\beta_i\xi_i \qquad (4)$$

where $\alpha_i \ge 0$ and $\beta_i \ge 0$ were the Lagrange multipliers.
Since the gradient of $L$ with respect to $w$, $b$ and $\xi$ vanishes at the saddle point, we obtain:

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n}\alpha_i y_i x_i = 0 \ \Rightarrow\ w = \sum_{i=1}^{n}\alpha_i y_i x_i \qquad (5)$$
$$\frac{\partial L}{\partial b} = \sum_{i=1}^{n}\alpha_i y_i = 0 \qquad (6)$$
$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 \qquad (7)$$

Substituting (5), (6) and (7) into (4) and maximizing (4) over $\alpha$ gives the dual of the optimization problem (1)-(3):

$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) \qquad (8)$$
$$\text{s.t.}\quad \sum_{i=1}^{n}\alpha_i y_i = 0 \qquad (9)$$
$$0 \le \alpha_i \le C,\quad i = 1, 2, \dots, n \qquad (10)$$

Solving these formulas, the solutions fall into $\alpha_i = 0$, $0 < \alpha_i < C$ and $\alpha_i = C$. If $0 < \alpha_i \le C$, the corresponding $x_i$ is a support vector. In the support vector machine method, the $x_i$ corresponding to $\alpha_i = C$ lies on the boundary and is known as a bound support vector, while the $x_i$ corresponding to $0 < \alpha_i < C$ lies in the interval and is known as a normal support vector. According to the KKT conditions, at the optimum the Lagrange multipliers and the constraints satisfy complementary slackness:

$$\alpha_i\left[y_i((w \cdot x_i) + b) - 1 + \xi_i\right] = 0,\quad i = 1, 2, \dots, n \qquad (11)$$
$$\beta_i\xi_i = 0,\quad i = 1, 2, \dots, n \qquad (12)$$

For a normal support vector ($0 < \alpha_i < C$), formula (7) gives $\beta_i > 0$ and formula (12) then gives $\xi_i = 0$. So any normal support vector satisfies:

$$y_i((w \cdot x_i) + b) = 1 \qquad (13)$$

and $b$ is:

$$b = y_i - (w \cdot x_i) = y_i - \sum_{j \in J}\alpha_j y_j (x_i \cdot x_j),\quad x_i \in J_N \qquad (14)$$

where $J_N$ is the set of normal support vectors and $J$ is the set of support vectors. The constraints of formulas (2) and (3) limit $w$ and $b$ and make the empirical risk of error equal to 0; at the same time, minimizing $\|w\|$ minimizes the VC dimension. So the optimization problem (1) embodies the principle of structural risk minimization and has good generalization ability. This method solves linearly separable problems very well. However, in linearly inseparable problems, any straight line would wrongly classify a lot of data in the training set. For such problems, the support vector machine selects a kernel function $K$, which maps the sample data to a high-dimensional space, transforming the linearly inseparable problem into a linearly separable one and constructing the optimal hyperplane separating the nonlinear data points. Different kernel functions give different classifiers, and the parameters used in the kernel function are also very important. In this case, formula (8) changes to the following form:

$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j) \qquad (15)$$
$$\text{s.t.}\quad \sum_{i=1}^{n}\alpha_i y_i = 0 \qquad (16)$$
$$0 \le \alpha_i \le C,\quad i = 1, 2, \dots, n \qquad (17)$$

Here $K(x_i, x_j) = (\phi(x_i) \cdot \phi(x_j))$ is the kernel function, and solving the dual problem determines the final decision function:

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{n}\alpha_i y_i K(x_i, x) + b\right) \qquad (18)$$

If the kernel function $K(x_i, x)$ is selected appropriately, the linearly inseparable problem in the input space can be transformed into a linearly separable problem in the feature space. Many different kernel functions can be used in the support vector machine model. In this paper, the Gaussian radial basis function (RBF) was applied as the kernel function:

$$K(x, y) = \exp(-\gamma\|x - y\|^2),\quad \gamma > 0 \qquad (19)$$

The kernel function maps the sample data to a high-dimensional space, which makes it possible to handle the nonlinear relationship between the class labels and the characteristics of the data, as well as the problem of insufficient prior experience. $\gamma$ is the inherent parameter of the function, controlling how the data are distributed in the new feature space. $C$ is the penalty parameter: the higher its value, the fewer errors the support vector machine classification model tolerates. Without prior knowledge, the parameters are chosen through a grid search, which is a common method for setting them.
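As an illustration of the grid search described above, the following is a minimal scikit-learn sketch (not the authors' code) that searches over C and γ for an RBF-kernel SVM. The candidate values echo the ranges reported later in the paper; the data here are random placeholders standing in for the bank's indicator matrix and grades.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Placeholder data: in the paper these would be the normalized enterprise
# indicators (X) and the credit grades (y) from the bank's customer database.
rng = np.random.default_rng(0)
X = rng.random((164, 17))              # 164 enterprises, 17 indicators scaled to [0, 1]
y = rng.integers(0, 7, size=164)       # 7 grades actually present in the sample

# Grid of penalty parameter C and RBF parameter gamma, echoing the paper's ranges.
param_grid = {
    "C": [1800, 1900, 2000, 2100, 2200, 2500, 2700],
    "gamma": [0.001, 0.003, 0.005, 0.007, 0.009],
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```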
In this paper an integrated (ensemble) learning algorithm is used to improve the support vector machine; the process mainly includes three steps: segmentation, training and aggregation. In our sample data sets, the positive subsets were the good credit ratings of small enterprises in the commercial bank, such as Grades AAA, AA, A, BBB, BB and B. These enterprises would not default and made up a large percentage of the sample data sets. The poorer credit ratings of small enterprises formed the negative subsets, the minority data in the customer database of the commercial bank; these enterprises would default and could not easily be predicted by machine learning methods because of their inadequate characteristics, such as Grade C or D.

The first step was segmentation, which reclassified the existing sample groups to achieve nearly balanced groups. The negative groups contained little data and did not need further segmentation, but the data in the positive groups needed detailed segmentation and classification. The positive group could be divided into k (k > 3) subclasses; for example, there were multiple classes in the customer data of small enterprises in the commercial bank after data cleaning.

The second step was training, which consolidated the negative-class sample data. For example, there were only 2 pieces of data of grade 7 in the sample data, and these were the negative class. Directly applying the traditional approach of discarding the negative class would leave no data, so the sample data of the negative class was kept, while the sample data of the positive categories were the key points and were classified in detail. The support vector machine was used to classify the sample data after segmentation.

The third step was aggregation. After training on the sample data, every individual class was integrated to form a method suited to this kind of classification problem that could distinguish all classes, according to the distance between the feature vectors of the sample data in every class. Firstly, the sample data was separated into the negative classes, and then into the positive categories. A new test record was then assigned to its class by the support vector machine method.

Input: the known training set was the sample data of small enterprises D = {(x1, y1), (x2, y2), ... , (xn, yn)}, where x represented the characteristics of each small enterprise and y represented the corresponding grade. The positive category of the training set, P, was the set of customers with good credit ratings, and the negative class, N, was the set of customers with poor credit ratings (the sample size of P was m1, the sample size of N was m2, m1 + m2 = L and m1 ≥ m2). M was the number of subsets of the positive sample data set. The positive category P could be divided into M (in this case M = 6) data subsets Vi (i = 1, 2, ... , M). Simultaneously, support vector classifiers Ci (i = 1, 2, ... , M) could be obtained separately, one for each sample data subset. Amalgamating the negative category N with each category set Vi gave Di = [Vi, N]; the decision hyperplane gi(x) was worked out with the support vector machine formulas, giving the optimal solution for Di = [Vi, N]. di was the distance (ci, c0) associated with the hyperplane separating each category of the space, and the weights were vi = exp(di / dm), i = 1, ... , M, where dm was the median of all the distances.

Output:

$$F(x) = y = \operatorname{sgn}\left(\sum_{i=1}^{M} v_i g_i(x)\right) \qquad (20)$$

sgn( ) is the sign function. The final output is the specific category y corresponding to an arbitrary input sample vector x. In the learning process, attention must be paid to the choice of M, which is not only the number of categories in the positive data sample set but also the number of classifiers in the integrated learning algorithm. Since the size and distribution of the data affect the efficiency and accuracy of the classification results of the support vector machine classifiers, M was usually chosen between 6 and 11 according to the actual sample data in each year. A minimal sketch of this segmentation-training-aggregation scheme follows.
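The following Python sketch is one possible reading of the segmentation-training-aggregation scheme above, not the authors' implementation: the positive class is split into M subsets, one RBF SVM is trained on each subset together with the whole negative class, the weight vi = exp(di / dm) is computed here from the distance between class centroids (an assumption, since the paper does not fully specify di), and formula (20) aggregates the weighted decision functions. All data and parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def train_ensemble(X_pos, X_neg, M=6, C=2000, gamma=0.003, seed=0):
    """Segmentation + training: split the positive class into M subsets and
    fit one RBF SVM per subset against the whole negative class."""
    rng = np.random.default_rng(seed)
    subsets = np.array_split(rng.permutation(len(X_pos)), M)    # segmentation
    c0 = X_neg.mean(axis=0)                                     # negative-class centroid
    models, dists = [], []
    for idx in subsets:
        Vi = X_pos[idx]
        Di = np.vstack([Vi, X_neg])                             # D_i = [V_i, N]
        yi = np.hstack([np.ones(len(Vi)), -np.ones(len(X_neg))])
        g = SVC(kernel="rbf", C=C, gamma=gamma).fit(Di, yi)     # decision hyperplane g_i
        models.append(g)
        dists.append(np.linalg.norm(Vi.mean(axis=0) - c0))      # assumed d_i: centroid distance
    d = np.array(dists)
    v = np.exp(d / np.median(d))                                # v_i = exp(d_i / d_m)
    return models, v

def predict_ensemble(models, v, X):
    """Aggregation, formula (20): F(x) = sgn(sum_i v_i g_i(x))."""
    scores = sum(w * m.decision_function(X) for w, m in zip(v, models))
    return np.sign(scores)

# Toy usage with synthetic data standing in for the bank's customer records.
rng = np.random.default_rng(1)
X_pos = rng.normal(1.0, 1.0, size=(162, 17))   # majority: non-defaulting enterprises
X_neg = rng.normal(-1.0, 1.0, size=(2, 17))    # minority: defaulting enterprises
models, v = train_ensemble(X_pos, X_neg)
print(predict_ensemble(models, v, rng.normal(size=(3, 17))))
```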
5. Data

The support vector machine method does not require the sample data to be normally distributed or to pass correlation tests. The authors collected the sample data of small enterprises from the customer database of a city commercial bank in Zhejiang Province, China, for the 2013 financial year. The following table gives the descriptive statistics of the 164 original sample records and the 17 variables.

Table 2: The descriptive statistical analysis of the original sample data (N = 164)

Variable | Minimum | Maximum | Mean | Std. deviation | Variance
Working years of manager | 1 | 35 | 11.15 | 6.124 | 37.500
Education background of manager | 0 | 7 | 2.19 | 1.965 | 3.860
Corporate life | 0 | 21 | 7.18 | 4.149 | 17.214
Investors' assets | 2 | 5 | 4.67 | 0.768 | 0.590
Sales output ratio | 0 | 3 | 2.34 | 0.840 | 0.705
Debt ratio | 0.066 | 0.770 | 0.432 | 0.136 | 0.019
Owner's equity | 7,925,567 | 590,875,849 | 68,910,686 | 65,072,022 | 4.234*10^15
Current ratio | 0.650 | 6.980 | 1.631 | 0.903 | 0.815
Accounts receivable turnover | 2.080 | 127.218 | 10.39 | 12.037 | 144.889
Sales growth rate | -0.920 | 14.040 | 0.334 | 1.176 | 1.384
Profit growth rate | -4.060 | 130.260 | 1.210 | 10.347 | 107.062
Return on equity | -0.060 | 38.650 | 0.475 | 3.015 | 9.090
Personal credit records of manager | 1 | 1 | 1 | 0 | 0
Industry policy | 0 | 1 | 0.71 | 0.456 | 0.208
Local environment | 0 | 1 | 0.77 | 0.423 | 0.179
Operating site conditions | 0 | 1 | 0.96 | 0.188 | 0.035
Equipment utilization | 0 | 1 | 0.87 | 0.335 | 0.112

According to the distribution characteristics of the original data, very few sample values were less than zero and most were greater than zero, belonging to uniformly distributed variables. To reduce the scale differences between variables, the sample data were mapped to [0, 1] by extreme (min-max) linear processing, ordering the sample data from small to large, so that after the mapping the larger the value, the better the data; a small sketch of this scaling is given below.
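As a concrete note on the [0, 1] mapping above, the following is a minimal sketch of extreme (min-max) linear scaling; `MinMaxScaler` from scikit-learn is used here as an assumed stand-in for the paper's own preprocessing, and the toy matrix merely mimics a few of the indicator columns.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy matrix standing in for the 164 x 17 indicator matrix.
X = np.array([[11.0, 0.43, 1.6],
              [ 3.0, 0.77, 0.9],
              [35.0, 0.07, 6.9]])

# Min-max scaling: x' = (x - min) / (max - min), applied column by column, into [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)
```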
The complete credit rating method for enterprises includes the index system of the credit rating. Small enterprises in this paper refer to enterprises whose owners' equity is above 6 million Yuan, whose number of employees is low, and which cannot provide complete financial statements audited or recognized by third-party agencies. The index system of credit rating for small enterprises includes quantitative indexes and qualitative indexes. In order to obtain more accurate results, the authors drew on the experience of the state-owned commercial banks in China and held many discussions with experts. After the discussions, the index system was redesigned as in the following table:

Table 3: The index system of credit rating in small enterprises

Variable | Definition | Description (marks)
Working years of manager (years) | The experience of the manager in small enterprises | Above 6 years: 100; 3-6 years: 50; below 3 years: 0
Education background of manager | Primary school, middle school, diploma, bachelor degree, master degree | Above bachelor degree: 100; diploma: 50; below diploma: 0
Corporate life | The number of years the small enterprise has been open | Above 5 years: 100; 2-5 years: 50; below 2 years: 0
Investors' assets (Yuan) | The individual property of the investors | Above 6 million: 100; 5-6 million: 80; 4-5 million: 60; 3-4 million: 40; 2-3 million: 20; below 2 million: 0
Sales output ratio | Sales units / produced units | Above 90%: 100; 80-90%: 50; below 80%: 0
Debt ratio | Total liabilities / total assets | 0: 100; 0-50%: 50; 50%-100%: 0
Owner's equity (Yuan) | Owner's equity in the small enterprise | The sample data are all more than 6 million: 100
Current ratio | Current assets / current liabilities | Above 1.2: 100; 1.1-1.2: 75; 1-1.1: 50; 0.9-1: 25; below 0.9: 0
Accounts receivable turnover | Net sales / net accounts receivable | Above 4: 100; 2-4: 50; below 2: 0
Sales growth rate | (Net sales this year - net sales last year) / net sales last year * 100% | Above 20%: 100; 15%-20%: 75; 10%-15%: 50; 5%-10%: 25; below 5%: 0
Profit growth rate | (Total profits this year - total profits last year) / total profits last year * 100% | Above 20%: 100; 15%-20%: 75; 10%-15%: 50; 5%-10%: 25; below 5%: 0
Return on equity | Total profits / owners' equity | Above 10%: 100; 8%-10%: 75; 6%-8%: 50; 4%-6%: 25; below 4%: 0
Personal credit records of manager | Personal credit records of the manager in local banks | The sample data are all good: 100
Industry policy | Industry policy in the local area | Good: 100; normal: 50; limited: 0
Local environment | Environment policy in the local area | Good: 100; normal: 50; limited: 0
Operating site conditions | Whether the operating site is owned or leased | Self-owned: 100; leased: 0
Equipment utilization | The utilization ratio of operating equipment | High: 100; medium: 50; low: 0

6. Empirical results

The results of the experiments are reported as the predicted precision ratio:

accuracy rate of precision = number of correctly classified samples / total number of samples (21)

For example, C was selected in the range 100-10000 for the experiment and increased on a scale of 10^n, and γ was selected from its range with an increasing speed of 10^-n. According to the convergence of the results and the accuracy of precision, gradually narrowing the scope should produce higher classification accuracy. If C was small and the actual testing accuracy was low, C was gradually increased towards the optimal value range of the support vector machine. Every trial needed about 10 minutes. Because of the limitations of time and energy, after repeated testing and analysis, the parameter combinations in the classification model that would lead to divergent rather than convergent results were ruled out. This left the parameters C = (1800, 1900, 2000, 2100, 2200, 2500, 2700) and γ = (0.001, 0.003, 0.005, 0.007, 0.009); the test results of this range were better than those of other parameters. The different combinations of C and γ composed several different classification models of the support vector machine, and the average precision accuracy of each classification was measured over 20 runs, as sketched below.
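To make this evaluation protocol concrete, here is a minimal sketch (an assumption about the protocol, not the authors' code) that, for one (C, γ) pair, repeats a random train/test split 20 times and averages the classification accuracy of formula (21). The data, split ratio and helper name are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def average_accuracy(X, y, C, gamma, n_runs=20, test_size=0.3):
    """Average test accuracy of an RBF SVM over n_runs random splits."""
    accs = []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=run)
        model = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
        accs.append(accuracy_score(y_te, model.predict(X_te)))
    return float(np.mean(accs))

# Toy data standing in for the scored 164 x 17 indicator matrix and the grades.
rng = np.random.default_rng(0)
X = rng.random((164, 17))
y = rng.integers(0, 7, size=164)
print(average_accuracy(X, y, C=2000, gamma=0.003))
```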
After the use of the integrated support vector model, there were 35 different testing results for analysis, as shown in the following table:

Table 5: The classification results of 17 variables

γ \ C | 1800 | 1900 | 2000 | 2100 | 2200 | 2500 | 2700
0.001 | 0.7560 | 0.7560 | 0.7548 | 0.7548 | 0.7524 | 0.7560 | 0.7583
0.003 | 0.7690 | 0.7726 | 0.7738 | 0.7714 | 0.7726 | 0.7726 | 0.7702
0.005 | 0.7702 | 0.7643 | 0.7655 | 0.7655 | 0.7655 | 0.7631 | 0.7643
0.007 | 0.7631 | 0.7631 | 0.7667 | 0.7667 | 0.7607 | 0.7595 | 0.7595
0.009 | 0.7619 | 0.7607 | 0.7607 | 0.7607 | 0.7595 | 0.7583 | 0.7583

To compare with the results of the support vector machine classification method in this paper, the authors also used other methods to classify the same sample data of 164 enterprises. Table 6 shows the results:

Table 6: The comparison results of model classification

Model | Accuracy of classification
Two-step clustering method | Silhouette measure of cohesion and separation indicates poor cluster quality
K-means clustering method | 27.8%
System clustering method | 43.9%
Radial basis function neural network method | 24.5%
Multilayer perceptron neural network method | 29.8%

It can be seen from the above table that the accuracy of the experimental results with the other methods is low. From interviews with the credit managers in commercial banks, the credit managers cooperate with the risk assessment managers and carefully identify the potential risk of enterprise loans. In practice, as few variables as possible should be chosen so that small enterprises can be analyzed rapidly and accurately. The authors therefore tested reducing and eliminating a few credit rating variables of small enterprise customers, hoping to obtain ideal results. Since there were 17 collected variables, there were numerous possible combinations of variables, which could not be tested one by one. Dimensionality reduction by factor analysis or principal component analysis was unsuitable for the analysis of credit rating in commercial banks, because the calculated principal components and factors were not stable and their meanings were difficult to interpret. So we could only delete variables step by step according to the correlation analysis, using the changes in the accuracy rate to determine the combination of variables.

Firstly, 16 variables were selected from the 17 variables of the valid sample data of 164 small enterprises. After the standardization of the sample data, the variable owner's equity equaled 1 for every enterprise, so this variable was eliminated. The new index system included 16 variables: working years of manager, education background of manager, corporate life, investors' assets, sales output ratio, debt ratio, current ratio, accounts receivable turnover, sales growth rate, profit growth rate, return on equity, personal credit records of manager, industry policy, local environment, operating site conditions, and equipment utilization. The test results are shown in the following table:

Table 7: The classification results of 16 variables

γ \ C | 1800 | 1900 | 2000 | 2100 | 2200 | 2500 | 2700
0.001 | 0.7418 | 0.7409 | 0.7409 | 0.7436 | 0.7455 | 0.7482 | 0.7509
0.003 | 0.7664 | 0.7673 | 0.7718 | 0.7745 | 0.7745 | 0.7727 | 0.7727
0.005 | 0.7636 | 0.7591 | 0.7591 | 0.7564 | 0.7582 | 0.7527 | 0.7609
0.007 | 0.7527 | 0.7573 | 0.7573 | 0.7582 | 0.7609 | 0.7627 | 0.7591
0.009 | 0.7618 | 0.7655 | 0.7673 | 0.7664 | 0.7664 | 0.7682 | 0.7700

From the above table, the accuracy rate of the classification with 16 variables was lower than that with 17 variables; this showed that the support vector machine was more effective on higher-dimensional classification problems, attaining higher accuracy. A sketch of the stepwise variable elimination procedure used here is given below.
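The stepwise deletion described above can be sketched as follows; this is an assumed backward-elimination loop guided by the change in accuracy, not the authors' code, and it omits the correlation screening the paper mentions. All names, thresholds and data are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def cv_accuracy(X, y, C=2000, gamma=0.003):
    """Mean cross-validated accuracy of an RBF SVM on the given columns."""
    return cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=5).mean()

def stepwise_elimination(X, y, names, max_drop=0.02):
    """Repeatedly drop the variable whose removal hurts accuracy least,
    stopping once the best reduced set loses more than max_drop accuracy."""
    keep = list(range(X.shape[1]))
    base = cv_accuracy(X[:, keep], y)
    while len(keep) > 1:
        scores = [(cv_accuracy(X[:, [k for k in keep if k != j]], y), j) for j in keep]
        best_acc, drop_j = max(scores)
        if base - best_acc > max_drop:      # accuracy falls too far: stop deleting
            break
        keep.remove(drop_j)
        base = best_acc
    return [names[k] for k in keep], base

# Toy data standing in for the scored 164 x 17 indicator matrix.
rng = np.random.default_rng(0)
X = rng.random((164, 17))
y = rng.integers(0, 7, size=164)
names = [f"var_{i}" for i in range(17)]
print(stepwise_elimination(X, y, names))
```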
After standardization, the variable personal credit records of manager was good for all 164 enterprises in the sample data and its standardized value was 1, so 16 variables were used for testing: working years of manager, education background of manager, corporate life, investors' assets, sales output ratio, debt ratio, owner's equity, current ratio, accounts receivable turnover, sales growth rate, profit growth rate, return on equity, industry policy, local environment, operating site conditions, and equipment utilization. The test results are shown in the following table:

Table 8: The classification results of 16 variables

γ \ C | 1800 | 1900 | 2000 | 2100 | 2200 | 2500 | 2700
0.001 | 0.7727 | 0.7718 | 0.7700 | 0.7691 | 0.7709 | 0.7709 | 0.7709
0.003 | 0.7782 | 0.7845 | 0.7873 | 0.7891 | 0.7864 | 0.7909 | 0.7873
0.005 | 0.7855 | 0.7836 | 0.7800 | 0.7773 | 0.7718 | 0.7764 | 0.7736
0.007 | 0.7755 | 0.7755 | 0.7736 | 0.7764 | 0.7755 | 0.7736 | 0.7718
0.009 | 0.7718 | 0.7736 | 0.7709 | 0.7709 | 0.7691 | 0.7664 | 0.7636

The next test used eight variables, debt ratio, current ratio, sales growth rate, return on equity, corporate life, industry policy, investors' assets, and sales output ratio, and deleted working years of manager, education background of manager, owner's equity, accounts receivable turnover, profit growth rate, personal credit records of manager, local environment, operating site conditions, and equipment utilization. The test results are shown in the following table:

Table 9: The classification results of 8 variables

γ \ C | 1800 | 1900 | 2000 | 2100 | 2200 | 2500 | 2700
0.001 | 0.6298 | 0.6286 | 0.6369 | 0.6381 | 0.6357 | 0.6524 | 0.6607
0.003 | 0.6976 | 0.6917 | 0.6881 | 0.6905 | 0.6881 | 0.6929 | 0.6929
0.005 | 0.6964 | 0.7012 | 0.7000 | 0.6976 | 0.7060 | 0.7095 | 0.7095
0.007 | 0.7107 | 0.7107 | 0.7095 | 0.7190 | 0.7202 | 0.7286 | 0.7310
0.009 | 0.7262 | 0.7274 | 0.7321 | 0.7298 | 0.7298 | 0.7286 | 0.7286

The same 8 variables, debt ratio, current ratio, sales growth rate, return on equity, corporate life, industry policy, investors' assets, and sales output ratio, were then tested with a lower range of C, keeping the other nine variables (education background of manager, working years of manager, owner's equity, accounts receivable turnover, profit growth rate, personal credit records of manager, local environment, operating site conditions, and equipment utilization) deleted. The test results are shown in the following table:

Table 10: The classification results of 8 variables (lower range of C)

γ \ C | 1000 | 1200 | 1400 | 1600 | 1800 | 2000 | 2200
0.001 | 0.6286 | 0.6190 | 0.6238 | 0.6250 | 0.6298 | 0.6369 | 0.6357
0.003 | 0.6690 | 0.6869 | 0.6905 | 0.6976 | 0.6976 | 0.6881 | 0.6881
0.005 | 0.6964 | 0.6917 | 0.6857 | 0.6929 | 0.6964 | 0.7000 | 0.7060
0.007 | 0.6857 | 0.6940 | 0.6988 | 0.7071 | 0.7107 | 0.7095 | 0.7202
0.009 | 0.6952 | 0.6988 | 0.7107 | 0.7143 | 0.7262 | 0.7321 | 0.7298

The different combinations of the parameters C and γ each constituted a support vector classification machine, so there were 35 support vector classification machines in each table, all run many times, which built several thousand classifiers. The reported result of each combination was the average accuracy rate over 20 runs, enhancing the precision and robustness of the classification.
Table 11: The comparison results of classification

Model | Accuracy rate of classification
Two-step clustering method | Silhouette measure of cohesion and separation indicates poor cluster quality
K-means clustering method | 35.98%
System clustering method | 15.24%
Radial basis function neural network method | 31.9%
Multilayer perceptron neural network method | 49%

From the above table it can be seen that the accuracy results with the other methods are low.

7. Conclusion

In China, it is necessary for commercial banks to identify the credit rating of small enterprises. The authors selected a suitable ensemble support vector machine method to analyze the sample data from the customer database of a commercial bank. From the test results of these models, after many tests and analyses, the index system of variables was gradually adjusted. Since the characteristics of the support vector machine are suitable for high-dimensional nonlinear classification problems, the more features there were in the variable index, the higher the accuracy attained. The support vector machine method does not require the sample data to be normally distributed or to pass correlation tests to solve this kind of imbalanced multi-classification problem, and it enhanced the precision and robustness of the classification. When the index system contained 15 to 17 variables, the accuracy rate of classification was close to 80%. Decreasing the number of variables in the index system caused the accuracy rate of classification to fall. The accuracy rate with 8 variables in the index system was over 62%, and for some parameter combinations above 70%, a relatively good evaluation accuracy. When the number of variables decreased to 7, the accuracy quickly fell below 60% and had no reference value. Thus, 8 variables were selected as the simplified index system of credit rating for small enterprises.

[Figure: schematic diagram of the relationship between the number of variables (8 to 16) and the accuracy rate of classification (60% to 80%).]

Therefore, by comparing the results, we used 8 variables: working years of manager, debt ratio, profit growth rate, sales growth rate, corporate life, industry policy, investors' assets and sales output ratio. After the normalization processing, enterprise owner's equity and personal credit records of manager both equaled 1, although they are very important evaluation indexes. These two variables were not brought into the calculation when the classifiers were tested, but they cannot be ignored in the simplified index system of credit rating of small enterprises. So the simplified index system of credit rating of small enterprises includes 10 variables: working years of manager, debt ratio, profit growth rate, sales growth rate, corporate life, industry policy, investors' assets, sales output ratio, enterprise owner's equity and personal credit records of the manager.

References:

Altman, E.I., Haldeman, R. & Narayanan, P. 1977, 'ZETA analysis: A new model to identify bankruptcy risk of corporations', Journal of Banking and Finance, no. 1, pp. 29-54.
Basel Committee on Banking Supervision 2001, Consultative Document: The New Basel Capital Accord, Bank for International Settlements, January.
Cristianini, N. & Shawe-Taylor, J. 2005, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, China Machine Press, English edition, July.
Committee on the Global Financial System 2001, A Survey of Stress Tests and Current Practice at Major Financial Institutions, Bank for International Settlements.
Martin, D. 1977, 'Early warning of bank failure: A logit regression approach', Journal of Banking and Finance, no. 1, pp. 249-276.
Carey, M. 2002, A Guide to Choosing Absolute Bank Capital Requirements, Board of Governors of the Federal Reserve System, International Finance Discussion Papers, no. 726, May.
Cuihua Shen 2004, Research on individual credit rating of loans for consumption based on support vector machine method (in Chinese), doctoral dissertation, China Agricultural University, November.
Cheng-Lung Huang, Mu-Chen Chen & Chieh-Jen Wang 2007, 'Credit scoring with a data mining approach based on support vector machines', Expert Systems with Applications, no. 33, pp. 847-856.
Chong Wu & Hang Xia 2008, 'Research on customer credit evaluation model based on integrated support vector machine in the electronic commerce environment' (in Chinese), Chinese Journal of Management Science, October.
Euro-currency Standing Committee of the central banks of the Group of Ten countries 1998, On the Use of Information and Risk Management, Bank for International Settlements, Basle, 1 October.
Fuyong Wan & Xiufu Shi 2009, 'Research on financial prediction on credit evaluation method of listed company based on support vector machine' (in Chinese), China Intelligent Automation Conference Proceedings 2009 (second volume).
Grinold, R.C. & Kahn, R.N. 1999, Active Portfolio Management: A Quantitative Approach for Producing Superior Returns and Controlling Risk, McGraw-Hill.
Global Credit Research 2002, LossCalc: Moody's Model for Predicting Loss Given Default (LGD), February.
Jorion, P. 2000, Value at Risk: The New Benchmark for Managing Financial Risk, McGraw-Hill.
Junni L. Zhang & Härdle, W.K. 2010, 'The Bayesian Additive Classification Tree applied to credit risk modelling', Computational Statistics and Data Analysis, no. 54, pp. 1197-1205.
KMV 1993, Portfolio Management of Default Risk, released 15 November 1993, revised 31 May 2001.
KMV 1993, Credit Monitor Overview, KMV Corporation, San Francisco.
Reining, A. 2001, Monte Carlo Simulation in the Integrated Market and Credit Risk Portfolio Model, Algorithmics Inc., June.
Kyoung-jae Kim & Hyunchul Ahn 2012, 'A corporate credit rating model using multi-class support vector machines with an ordinal pairwise partitioning approach', Computers & Operations Research, no. 39, pp. 1800-1811.
Ligang Zhou, Kin Keung Lai & Lean Yu 2009, 'Credit scoring using support vector machines with direct search for parameters selection', Soft Computing, no. 13, pp. 149-155.
Liyong Yu & Jiehui Zhan 2004, 'Research on default probability prediction based on Logistic regression analysis' (in Chinese), Journal of Finance and Economics, no. 9, pp. 15-23.
Morgan, J.P. 1997, CreditMetrics, Technical Document, New York, 2 April.
Gieseke, K. 2002, Credit Risk Modeling and Valuation: An Introduction, Humboldt-University zu Berlin, 19 August.
Naiyang Deng & Yingjie Tian 2009, Support Vector Machines: Theory, Algorithms and Development (in Chinese), China Science Press, first edition, August.
O'Connor, R., Golden, J.F. & Rack, R., A Value-at-Risk Calculation of Required Reserves for Credit Risk in Corporate Lending Portfolios.
Pai, Ping-Feng, Tan, Yi-Shien & Hsu, Ming-Fu 2015, 'Credit rating analysis by the decision-tree support vector machine with ensemble strategies', International Journal of Fuzzy Systems, no. 17, pp. 521-530.
Qian Li, Bing Yang, Yi Li, Naiyang Deng & Ling Jing 2012, 'Constructing support vector machine ensemble with segmentation for imbalanced datasets', Neural Computing and Applications, DOI 10.1007/s00521-021-1041-z, published online 10 July 2012.
Shu-Ting Luo, Bor-Wen Cheng & Chun-Hung Hsieh 2009, 'Prediction model building with clustering-launched classification and support vector machines in credit scoring', Expert Systems with Applications, no. 36, pp. 7562-7566.
Sinha, A.P. & Zhao, H. 2008, 'Incorporating domain knowledge into data mining classifiers: An application in indirect lending', Decision Support Systems, no. 46, pp. 287-299.
Saunders, A. 1999, Credit Risk Measurement, John Wiley & Sons, New York.
Shinong Wu & Shizhong Huang 2001, 'Research on prediction model of financial distress of listed corporations in China' (in Chinese), Economic Research Journal, no. 6, pp. 46-55.
Terry Harris 2015, 'Credit scoring using the clustered support vector machine', Expert Systems with Applications, no. 42, pp. 741-750.
Wun-Hwa Chen & Jen-Ying Shih 2006, 'A study of Taiwan's issuer credit rating systems using support vector machines', Expert Systems with Applications, no. 30, pp. 427-435.
Wenbing Xiao, Qi Fei & Hu Wan 2007, 'The credit evaluation model based on support vector machine and risk' (in Chinese), Chinese Science Abstracts (Chinese Edition), vol. 13, no. 22, p. 284.
Yifeng Zhou & Chengde Lin 2005, 'The comparison on classification method of credit risk assessment in commercial banks' (in Chinese), The 24th Chinese Control Conference, July, pp. 1734-1737.
Zhihui Li & Meng Li 2005, 'Empirical research on the credit risk identification model in Chinese commercial banks' (in Chinese), Economic Science, no. 5, pp. 61-71.
Zongyuan Zhao, Shuxiang Xu, Byeong Ho Kang, Mir Md Jahangir Kabir, Yunling Liu & Rainer Wasinger 2015, 'Investigation and improvement of multi-layer perceptron neural networks for credit scoring', Expert Systems with Applications, no. 42, pp. 3508-3516.