Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Expert Systems with Applications 36 (2009) 4075–4086 Contents lists available at ScienceDirect Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa Using neural networks and data mining techniques for the financial distress prediction model Wei-Sen Chen *, Yin-Kuan Du Industrial Technology Research Institute, #195, Sec. 4, Chung-Hsing Rd., Chutung 310, HsinChu, Taiwan, ROC a r t i c l e i n f o Keywords: Financial distress prediction model Artificial neural network Data mining a b s t r a c t The operating status of an enterprise is disclosed periodically in a financial statement. As a result, investors usually only get information about the financial distress a company may be in after the formal financial statement has been published. If company executives intentionally package financial statements with the purpose of hiding the actual status of the company, then investors will have even less chance of obtaining the real financial information. For example, a company can manipulate its current ratio by up to 200% so that its liquidity deficiency will not show up as a financial distress in the short run. To improve the accuracy of the financial distress prediction model, this paper adopted the operating rules of the Taiwan stock exchange corporation (TSEC) which were violated by those companies that were subsequently stopped and suspended, as the range of the analysis of this research. In addition, this paper also used financial ratios, other non-financial ratios, and factor analysis to extract adaptable variables. Moreover, the artificial neural network (ANN) and data mining (DM) techniques were used to construct the financial distress prediction model. The empirical experiment with a total of 37 ratios and 68 listed companies as the initial samples obtained a satisfactory result, which testifies for the feasibility and validity of our proposed methods for the financial distress prediction of listed companies. This paper makes four critical contributions: (1) The more factor analysis we used, the less accuracy we obtained by the ANN and DM approach. (2) The closer we get to the actual occurrence of financial distress, the higher the accuracy we obtain, with an 82.14% correct percentage for two seasons prior to the occurrence of financial distress. (3) Our empirical results show that factor analysis increases the error of classifying companies that are in a financial crisis as normal companies. (4) By developing a financial distress prediction model, the ANN approach obtains better prediction accuracy than the DM clustering approach. Therefore, this paper proposes that the artificial intelligent (AI) approach could be a more suitable methodology than traditional statistics for predicting the potential financial distress of a company. Crown Copyright Ó 2008 Published by Elsevier Ltd. All rights reserved. 1. Introduction In Taiwan, domestic and foreign capital markets have developed rapidly in recent years, gradually giving people the idea of making a financial investment. There are various financial investment objects, such as stocks, futures, options, bond funds etc., and investment stock is the most widely accepted in society. However, capital markets are volatile, and most investors only know that a company is in financial trouble after the financial statement of the company has been made public. Therefore, forecasting corporate financial distress plays an increasingly important role in today’s society since it has a significant impact on lending decisions and the profitability of financial institutions. The ability to make accurate bankruptcy predictions are of critical importance * Corresponding author. Tel.: +886 3 5820100; fax: +886 3 5610616. E-mail address: [email protected] (W.-S. Chen). to various professionals, such as bank loan officers, creditors, stockholders, bondholders, financial analysts, governmental officials, as well as the general public, as it provides them with timely warnings (Ko & Lin, 2006). Financial failure occurs when a firm suffers chronic and serious losses or when the firm becomes insolvent with liabilities that are disproportionate to its assets (Hua, Wang, Xu, Zhang, & Liang, 2007). Common causes and symptoms of financial failure include lack of financial knowledge, failure to set capital plans, poor debt management, inadequate protection against unforeseen events and difficulties in adhering to proper operating discipline in the financial market. The common assumption underlying bankruptcy prediction is that a firm’s financial statements appropriately reflect above characteristics. Several classification techniques have been suggested to predict financial distress using ratios and data originating from these financial statements, e.g., univariate approaches (Beaver, 1966), multivariate approaches, linear multiple 0957-4174/$ - see front matter Crown Copyright Ó 2008 Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2008.03.020 4076 W.-S. Chen, Y.-K. Du / Expert Systems with Applications 36 (2009) 4075–4086 discriminant approaches (MDA) (Altman, 1968; Altman, Edward, Haldeman, & Narayanan, 1977), multiple regression (Meyer & Pifer, 1970), logistic regression (Dimitras, Zanakis, & Zopounidis, 1996), factor analysis (Blum, 1974), and stepwise (Laitinen & Laitinen, 2000). However, strict assumptions of traditional statistics such as linearity, normality, independence among predictor variables and pre-existing functional form relating to the criterion variable and the predictor variable limit their application in the real world (Hua et al., 2007). With radical changes taking place in corporate finance and the global economic environment, critical financial ratios can change dynamically (John & Robert, 2001). This means that it is both important as well as necessary to develop an evolutionary approach for coping with future dynamic financial environments. Therefore, this paper proposes a model of financial distress prediction integrating artificial neural network (ANN) and data mining (DM) techniques. The main objectives of this paper are to (1) adopt ANN and DM techniques to construct a financial distress prediction model, (2) use financial and non-financial ratios to enhance the accuracy of the financial distress prediction model, (3) employ a traditional statistical method (factor analysis) to compare the degree of accuracy with that of the artificial intelligent (AI) approach, and (4) to expand this model so that it will work within a financial distress prediction system to provide information to investors as well as investment monitoring organizations. The data for our experiment were collected from the Taiwan stock exchange corporation (TSEC) database. The rest of this paper is organized as follows. A literature review of related studies is provided in Section 2. Section 3 describes our proposed approach and the functionalities of each process. Section 4 presents the process for selecting suitable indicators by factor analysis. To prove the prediction performance of our approach, we carried out several experiments which are described in Section 5. In Section 6, we compared our results with the ANN, and DM approaches. Finally, in Section 7 we draw our conclusions about financial distress forecasting and discuss future work. 2. Literature review 2.1. Artificial neural network The ANN is composed of richly interconnected non-linear nodes that communicate in parallel. The connection weights are modifiable, allowing ANN to learn directly from examples without requiring or providing an analytical solution to the problem. The most popular forms of learning are: niques for classification and prediction (Wu, Yang, & Liang, 2006), and is considered an advanced multiple regression analysis that can accommodate complex and non-linear data relationships (Jost, 1993). It was first described by Werbos (1974), and further developed by Ronald, Rumelhart, and Hinton (1986). The details for the back-propagation learning algorithm can be found in Medsker and Liebowitz (1994). Fig. 1 shows the l m n (l denotes input neurons, mdenotes hidden neurons, and n denotes output neurons) architecture of a BPN model (Panda, Chakraborty, & Pal, 2007). The input layer can be considered the model stimuli and the output layer the input stimuli outcome. The hidden layer determines the mapping relationships between input and output layers, whereas the relationships between neurons are stored as weights of the connecting links. The input signals are modified by the interconnection weight, known as weight factor wji, which represents the interconnection of the ith node of the first layer to the jth node of the second layer. The sum of the modified signals (total activation) is then modified by a sigmoid transfer function (f). Similarly, the output signals of the hidden layer are modified by interconnection weight wkj of the kth node of the output layer to the j th node of the hidden layer. The sum of the modified signals is then modified by sigmoid transfer (f) function and the output is collected at the output layer. Let Ip = (Ip1,Ip2, . . . , Ipl), p = 1,2, . . . , N be the pth pattern among N input patterns. Where wji and wkj are connection weights between the ith input neuron to the jth hidden neuron, and the jth hidden neuron to the kth output neuron, respectively (Panda et al., 2007). Output from a neuron in the input layer is Opi ¼ Ipi ; i ¼ 1; 2; . . . ; l Output from a neuron in the hidden layer is ! 1 X Opj ¼ f ðNET pj Þ ¼ f wji opi ; j ¼ 1; 2; . . . ; m ð2Þ i¼0 Output from a neuron in the output layer is ! m X Opk ¼ f ðNET pk Þ ¼ f wkj opj ; k ¼ 1; 2; . . . ; n ð3Þ j¼0 Where f( ) is the sigmoid transfer function given by f(x) = 1/(1 + ex). BPN has been applied to various areas, such as investigating long-term tidal predictions (Lee, 2004), improving customer satisfaction (Deng, Chen, & Pei, 2007), predicting flank wear in drills (Panda et al., 2007), enhancing job completion time prediction in the semiconductor fabrication factory (Chen, 2007), and providing Supervised learning: Patterns for which both their inputs and outputs are known are presented to the ANN. The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples. ANN employing supervised learning has been widely utilized for the solution of function approximation and classification problems. Unsupervised learning: Patterns are presented to the ANN in the form of feature values. It is distinguished from supervised learning by the fact that there is no a priori output. ANN employing unsupervised learning has been successfully employed for data mining and classification tasks. The self-organizing map (SOM) and adaptive resonance theory (ART) constitutes the most popular exemplar of this class. A back-propagation network (BPN) is a neural network that uses a supervised learning method and feed-forward architecture. A BPN is one of the most frequently utilized neural network tech- ð1Þ Fig. 1. Back-propagation network architecture. W.-S. Chen, Y.-K. Du / Expert Systems with Applications 36 (2009) 4075–4086 the required accuracy for focal ventricular arrhythmias diagnosis (Yılmaz & Cunedioglu, 2007). Based on the above literatures, many researches employed the BPN techniques for many applications. However, few of them used it to carry out empirical investigations of financial distress prediction related topics. Therefore, in this study we will use the BPN technique to forecast a potential crisis in the bankruptcy prediction domain. We hope that the results of our proposed approach will provide a useful methodology for investors as well as supervisory organizations to predict and avoid investing in, a company open to a bankruptcy in the near future. 2.2. Data mining Data mining (DM), also known as ‘‘knowledge discovery in databases” (KDD), is the process of discovering meaningful patterns in huge databases (Han & Kamber, 2001). In addition, it is also an application that can provide significant competitive advantages for making the right decision. (Huang, Chen, & Lee, 2007). DM is an explorative and complicated process involving multiple iterative steps. Fig. 2 shows an overview of the data mining process (Han & Kamber, 2001). It is interactive and iterative, involving the following steps: Step 1. Application domain identification: Investigate and understand the application domain and the relevant prior knowledge. In addition, identify the goal of the KDD from the administrators’ or users’ point of view. Step 2. Target dataset selection: Select a suitable dataset, or focus on a subset of variables or data samples where data relevant to the analysis task are retrieved from the database. Step 3. Data Preprocessing: the DM basic operations include ‘data clean’ and ‘data reduction’: In the ‘data clean’ process, we remove the noise data, or respond to the missing data field. In the ‘data reduction’ process, we reduce the unnecessary dimensionality or adopt useful transformation methods. The primary objective is to improve the effective number of variables under consideration. Step 4. Data mining: This is an essential process, where AI methods are applied in order to search for meaningful or desired patterns in a particular representational form, such as association rule mining, classification trees, and clustering techniques. Step 5. Knowledge Extraction: Based on the above steps it is possible to visualize the extracted patterns or visualize the data depending on the extraction models. Besides, this process also checks for or resolves any potential conflicts with previously believed knowledge. Step 6. Knowledge Application: Here, we apply the found knowledge directly into the current application domain or in other fields for further action. Step 7. Knowledge Evaluation: Here, we identify the most interesting patterns representing knowledge based data on some measure of interest. Moreover, it allows us to improve the accuracy and efficiency of the mined knowledge. A particular data mining algorithm is usually an instantiation of the model preference search components. The more common model functions in the current data mining process include the following (Mitra, Pal, & Mitra, 2002). Classification: Classifies a data item into one of several predefined categories. Regression: Maps a data item to a real-valued prediction variable. Clustering: Maps a data item into a cluster, where clusters are natural groupings of data items based on similarity metrics or probability density models. Association rules: Describes association relationship among different attributes. Summarization: Provides a compact description for a subset of data. Dependency modeling: Describes significant dependencies among variables. Sequence analysis: Models sequential patterns, like time-series analysis. The goal is to model the state of the process generating the sequence or to extract and report deviations and trends over time. Knowledge Evaluation Knowledge Applying Knowledge Extraction Data Mining Target Dataset Application Selection Domain Identification 4077 Data Preprocessing Fig. 2. Data mining phases (Han & Kamber, 2001). 4078 W.-S. Chen, Y.-K. Du / Expert Systems with Applications 36 (2009) 4075–4086 In the recent past many research contributions have applied data mining techniques to many applications. DM has been successfully applied to several financial problem domains. Recent examples are as follows. Huang, Hsu, and Wang (2007) adopted the time-series mining approach to simulate human intelligence and discover financial database patterns automatically (Huang et al., 2007). Kirkos, Spathis, and Manolopoulos (2007) used classification mining to identify fraudulent financial statements (Kirkos et al., 2007). Chun and Park (2006) integrated the regression analysis and case-based reasoning for predicting the stock market index (Chun & Park, 2006). However, few of these studies focused on the data clustering approach, and even fewer empirical investigations were made of financial distress prediction related topics. Therefore, we will use data clustering to enhance the accuracy of predicting bankruptcy in a capital market. 3. Research methodology In this study we integrate ANN and DM techniques for financial distress prediction (FDP). The research methodology is as shown in Fig. 3. In the first phase we deal with the dataset which basically is the original huge set of records from the TSEC which will be covered by data pre-processing. The data sets then undergo cleaning and preprocessing for removing discrepancies and inconsistencies to improve their quality. The goal in this phase is to select the suitable indicators, including financial and non-financial ratios, by means of factor analysis. After the above processes, the next phase will load these indicators and discovery prediction rule sets that are ready to be used in ANN and DM clustering. The ‘‘FDP Selecting” will be discussed in detail in the following sections. In the FDP Modeling phase we collect the financial statement data sets for ANN and DM processing. In the ANN approach, we will use the BPN algorithm to discover the rules and predict the FDP. In the DM approach, we will use the clustering technique to classify and predict the FDP. Next, the selected data set is analyzed by applying algorithms in order to identify the patterns among the data that represent a relationship. The BPN and clustering algorithm are applied to separately determine the financial distress prediction patterns or rules. In the FDP Comparison phase, we compare the prediction accuracy for BPN and clustering mining by means of several times fac- tor analysis (non-factor analysis, 1st factor analysis, and 2nd factor analysis). Then, the intelligent financial distress prediction model will be constructed and initiated to validate the new data sets of the financial statement from the TSEC. 4. The FDP selecting phase 4.1. Data Our sample contained data from 68 Taiwan firms listed in the TSEC. The period of sampling was from 1999 January u/i October, 2006, amounting to 7 years and 10 months. The 34 firms in financial distress were matched with 34 non-bankruptcy firms. These firms were characterized as non-bankruptcy based on the absence of any indication or proof concerning the issuing of financial distress in the auditors’ reports, in the financial and taxation databases and in the TSEC. This of course did not guarantee that the financial statements of these firms were not falsified or that the financial distress of these firms would not be revealed in the future. It only guaranteed that no firms in financial distress had been found during an extensive search. All the variables used in the sample were extracted from formal financial statements, such as balance sheets and income statements. This implies that the usefulness of this study is not restricted by the fact that only data from Taiwanese companies was used. 4.2. Variables The selection of variables to be used as candidates for participation in the input vector was based upon prior research work linked to the topic of financial distress prediction. The work carried out by Kirkos et al. (2007), Spathis (2002), Spathis, Doumpos, and Zopounidis (2002), Fanning and Cogger (1998), Persons (1995), Stice (1991), Feroz, Park, and Pastena (1991), Loebbecke, Eining, and Willingham (1989) and Kinney and McDaniel (1989) contained the suggested indicators of financial distress prediction. Therefore, this paper adopted the related variables based on prior researches, the Taiwanese Economic Journal (TEJ), and the Taiwanese economic database. Moreover, this paper selected 37 variables and categorized them as six major types: earning ability, financial structure ability, management efficiency ability, management Fig. 3. Research methodology. 4079 W.-S. Chen, Y.-K. Du / Expert Systems with Applications 36 (2009) 4075–4086 performance, debt-repaying ability, and non-financial factors. The details of these indicators belong to each type and are listed as follows: Table 1 1st Factor analysis results Factors Variables Factor loadings Communality Eigenvalues Explained variance Earning ability: Including pretax margin, return on total assets, return on equity, earnings per share, and gross margin ratios. Financial structure ability: Including debt to assets, times interest earned, book value per share, financial leverage ratio, debt to equity, short term and long term debt to book value ratio, fixed assets to total assets ratio, gross margin to total assets ratio, inventory to total assets ratio, inventory to sales ratio, investment ratio, and current assets to total assets ratios. Management efficiency ability: Including turnover rate of inventory, turnover rate of account receivable, turnover rate of fixed assets, turnover rate of total assets, turnover rate of equity, and turnover rate of working capital ratios. Management performance: Including pretax margin growth ratio, gross margin growth ratio, and sales growth ratio ratios. Debt-repaying ability: Including current ratio, acid-test ratio, cash ratio, cash flow ratio, cash flow to long term debt, cash flow to total debt, and cash flow to short term and long term debt ratio ratios. Non-financial factors: Including dividend payout ratio, pricebook ratio, the proportion of collateralized shares by the board of directors, and the insider holding ratio. 1 Earnings per share (EPS) Return on equity (ROE) Return on asset (ROA) Pretax margin growth ratio Margin before interest and tax (BEFM) 0.870 0.919 4.023 10.874 0.862 0.889 0.850 0.866 0.641 0.524 0.638 0.814 2 Current ratio Acid-test ratio Equity per share Cash ratio Gross margin ratio Price-book ratio (PBR) 0.762 0.742 0.631 0.624 0.609 0.352 0.877 0.833 0.742 0.503 0.660 0.462 3.858 10.428 3 Gearing ratio Debt to equity ratio (DEBE) Debt/equity (DE) Debt ratio 0.949 0.948 0.969 0.968 3.738 10.103 0.923 0.625 0.962 0.820 Turnover rate of total assets Turnover rate of equity Turnover rate of fixed assets Gross margin to total assets ratio 0.858 0.824 2.886 7.800 0.798 0.793 0.635 0.803 0.479 0.716 Inventory to total assets ratio Inventory to sales ratio Current assets to total assets The proportion of collateralized shares by the broad of directors 0.899 0.889 2.558 6.912 0.848 0.802 0.578 0.871 0.422 0.397 Cash flow ratio Cash flow to total debt ratio Dividend payout ratio 0.859 0.830 0.873 0.823 2.476 6.693 0.514 0.579 Insider holding ratio Investment ratio Fixed assets to total assets ratio 0.756 0.635 2.039 5.510 0.635 0.607 0.755 0.772 Times interest earned Cash flow to long term debt 0.836 0.778 0.827 0.732 Turnover rate of working capital Turnover rate of inventory 0.788 0.777 0.714 0.736 Turnover rate of account receivable Gross margin growth ratio Sales revenue growth ratio 0.782 0.693 0.665 0.526 0.548 0.641 Cash flow to short 0.871 term and long term debt ratio Total explained variance 0.813 4 4.3. Factor analysis This paper collected the samples of 34 pairs of financial distress and non-bankruptcy firms listed in the TSEC, between 1999 and 2006. The main variables are 37 ratios for the predictive financial distress model factors. This research used the SPSS statistical software to conduct factor analysis and principle component analysis (PCA) with varimax for rotation (VARIMAX), in order to make the factor structure easier and simpler to explain. The principle for the selection of factors is based on Kaiser’s criteria, meaning that the eigenvalue greater than 1 is a common factor, the absolute value of the factor loadings is greater than 0.5 and the communality is greater than 0.8 in order to obtain suitable factors. In total, we compiled 33 financial ratios and 4 non-financial ratios. In an attempt to reduce dimensionality, we ran a factor analysis to test whether the differences between these 37 variables were significant for each variable. If the difference was not significant (low factor loadings or communality values), the variable was considered to be non-informative. Table 1 shows the factor loadings, communality, the eigenvalues and the explained variance for each variable. As a result, 18 variables presented high factor loadings or communality values. These variables were chosen to be used in the input vector, while the remaining 19 variables were discarded. In addition, the total explained variance was 75.776%. We used the factor analysis to process the experiment a second time. Table 2 shows that 5 variables were discarded, and that the total explained variance was 85.288%. Due to the better performance in the total explained variance value, we can assume that the factor analysis is not yet the optimal solution. Therefore, we used the factor analysis to process the experiment a third time. Table 3 shows that two variables were discarded, and that the total explained variance was 91.876%. Therefore we used the factor analysis to process the experiment a fourth time. However, Table 4 shows there were no suitable variables to be discarded, and the total explained variance was down to 88.228%. Therefore, we can were sure that the optimal factor analysis was the one we carried out the third time, where the performance was the highest at 91.876%. 5 6 7 8 9 10 11 75.776 2.012 4.571 1.646 4.450 1.110 2.999 4080 W.-S. Chen, Y.-K. Du / Expert Systems with Applications 36 (2009) 4075–4086 Table 2 2nd Factor analysis results Table 4 4th Factor analysis results Factors Variables Factor loadings Communality Eigenvalues Explained variance Factors Variables Factor loadings Communality Eigenvalues Explained variance 1 Return on asset (ROA) Return on equity (ROE) Earnings per share (EPS) Margin before interest and tax (BEFM) 0.913 0.904 3.497 19.427 1 0.962 0.961 0.988 0.986 2.969 26.989 0.903 0.923 Gearing ratio Debt to equity ratio Debt/equity (DE) 0.942 0.973 0.897 0.897 2 0.923 0.929 2.779 25.259 0.759 0.717 0.918 0.923 0.910 0.931 Gearing ratio Debt to equity ratio Debt/equity (DE) Debt ratio 0.962 0.961 0.978 0.976 Return on asset (ROA) Earnings per share (EPS) Return on equity (ROE) 0.915 2.047 18.613 0.968 0.789 0.920 0.393 0.916 0.194 3 Current ratio Acid-test ratio 0.896 0.892 0.939 0.940 Cash flow to total debt ratio Cash flow ratio Inventory to total assets ratio 0.937 0.939 0.656 0.909 0.868 0.936 0.921 0.975 0.975 17.366 Inventory to total assets ratio Inventory to sales ratio Current ratio Acid-test ratio 1.910 4 0.891 0.854 Turnover rate of fixed assets Turnover rate of total assets Current assets to total assets 0.840 0.775 0.811 0.761 0.649 0.869 Cash flow to total debt ratio Cash flow ratio Cash flow to short term and long term debt ratio 0.830 0.835 0.811 0.642 0.869 0.489 2 5 6 Total explained variance 3.492 19.401 3 2.296 12.755 4 11.659 Total explained variance 2.076 11.536 1.892 10.509 85.288 Factors Variables Factor loadings Communality Eigenvalues Explained variance 1 Debt to equity ratio Gearing ratio Debt/equity (DE) 0.970 0.993 3.011 23.164 0.969 0.943 0.994 0.973 Return on asset (ROA) Earnings per share (EPS) Return on equity (ROE) 0.923 0.935 2.759 21.227 0.917 0.930 0.909 0.930 3 Current ratio Acid-test ratio Current assets to total assets 0.894 0.877 0.602 0.929 0.945 0.763 2.106 16.203 4 Cash flow to total debt ratio Cash flow ratio 0.950 0.934 2.038 15.673 0.940 0.943 Inventory to total assets ratio Inventory to sales ratio 0.927 0.889 0.874 0.788 5 equity ratio, gearing ratio, debt/equity (DE), return on asset (ROA), earnings per share (EPS), return on equity (ROE), current ratio, acid-test ratio, current assets to total assets, cash flow to total debt ratio, cash flow ratio, inventory to total assets ratio, and inventory to sales ratio. 5. The FDP modeling phase 5.1. ANN experiments and results Table 3 3rd Factor analysis results 2 88.228 This process uses the finance and non-finance ratios, and constructs a financial distress prediction model after carrying out a second time factor analysis. The variables are then loaded as ANN input nodes. In addition, we also apply these experiment parameters to investigate the past 2 seasons, the past 4 seasons, the past 6 seasons, and the past 8 seasons before the financial distress occurred, for the sake of prediction accuracy. In this experiment, we will use the BPN as the ANN algorithm. In addition, the training sample and the testing sample will adopt the 80:20 ratio. In terms of bankruptcy prediction, whether or not the prediction is accurate is routinely measured by three quantities: Type I Error Rate, Type II Error Rate, and Total Error Rate. ‘‘Type I Error Rate” means that the error rate for the risk can not categorize the normal company as a normal company, ‘‘Type II Error Rate” means that the error rate for the risk can not categorize the bankruptcy company, and ‘‘Total Error Rate” means the combined ‘‘Type I Error Rate” and ‘‘Type II Error Rate”. Table 5 shows the relationship among these three error rate types. The formula for each error rate is listed as follows: Y2 Y3 Y4 Type II Error Rate ¼ Y6 ðY 2 þ Y 4 Þ Total Error Rate ¼ Y9 Type I Error Rate ¼ Total explained variance 2.029 15.609 91.876 ð4Þ ð5Þ ð6Þ Table 5 The relationship with type I, II, and total error rates Prediction After the three times factor analysis, 13 variables presented higher factor loadings or communality values. These variables were chosen to be used in the input vector, while the remaining 24 variables were discarded. The selected variables were debt to Actually Normal Bankruptcy Sum Sum Normal Bankruptcy Y1 Y4 Y7 Y2 Y5 Y8 Y3 Y6 Y9 W.-S. Chen, Y.-K. Du / Expert Systems with Applications 36 (2009) 4075–4086 5.1.1. The experiment using a non-factor analysis This experiment obtains a result after using 37 original ratio variables that have not yet obtained a result by factor analysis. As shown in Table 6, the testing data has an estimate accuracy rate as high as 82.14%, with an error rate of 17.86% for the past 2 seasons. However, the accuracy rate reduces to 60%, and the error rate rises to 40% when measured over the past 8 seasons. The closer the financial crisis the higher the accuracy will be. 5.1.2. The experiment with the 1st factor analysis This experiment obtains a result after using 18 original ratio variables of this research that have undergone 1st factor analysis. As shown in Table 7, the testing data has an estimate accuracy rate as high as 78.57%, with an error rate of 21.43% for the past 2 seasons. However, the accuracy rate reduces to 66.36%, and the error rate rises to 33.64% when measured over the past 8 seasons. Similar to the above experiment, the closer the financial crisis the higher the accuracy will be. 5.1.3. The experiment with 2nd factor analysis This experiment obtains a result after using 13 original ratio variables of this research that have undergone 2nd factor analysis. As shown in Table 8, the testing data has an estimate accuracy rate as high as 75%, with an error rate of 25% for the past 2 seasons. Table 6 The accuracy for the ANN model with non-factor analysis Training data Normal 2 4 6 8 Accuracy Average Accuracy Average Accuracy Average Accuracy Average rate rate rate rate Testing data Bankruptcy 87.03% 94.44% 90.74% 89.91% 92.67% 91.28% 91.41% 95.71% 93.56% 95.85% 93.55% 94.70% Normal Bankruptcy 92.86% 71.43% 82.14% 100.00% 55.56% 77.78% 87.80% 65.85% 76.83% 74.55% 45.45% 60.00% Table 7 The accuracy for the ANN model with 1st factor analysis Training data Normal 2 4 6 8 Accuracy Average Accuracy Average Accuracy Average Accuracy Average rate rate rate rate Testing data Bankruptcy 90.74% 84.48% 86.11% 87.16% 85.32% 86.24% 83.44% 86.50% 84.97% 93.09% 88.48% 90.78% Normal 85.71% 88.89% 65.85% 67.27% Bankruptcy 71.43% 78.57% 48.15% 68.52% 68.29% 67.07% 65.45% 66.36% Table 8 The accuracy for the ANN model with 2nd factor analysis Training data Normal 2 4 6 8 Accuracy Average Accuracy Average Accuracy Average Accuracy Average rate rate rate rate 87.04% Testing data Bankruptcy 77.78% 82.41% 86.24% 86.24% 86.24% 87.73% 83.44% 85.58% 86.18% 81.57% 83.87% Normal 78.57% 92.59% 80.49% 78.18% Bankruptcy 71.43% 75.00% 51.85% 72.22% 48.78% 64.63% 52.73% 65.45% 4081 However, the accuracy rate reduces to 65.45%, and the error rate rises to 34.55% when measured over the past 8 seasons. Similar to the above experiment, the closer the financial crisis the higher the accuracy will be. 5.2. DM experiments and results Clustering analysis finds groups, each very different from the other. However, within a group all members are very similar. Unlike classification, the class label of each group is not known. Clustering is a way to naturally segment data into groups, whereas classification is a way to segment data by assigning it into groups. Briefly, a good clustering method will produce high quality clusters with high intra-class similarity and low inter-class similarity (Chen & Chen, 2006). However, how good a cluster is ultimately depends on the opinion of the user. In our experiment, we used the partitioning methods to cluster the datasets for the financial distress prediction model. The partitioning methods construct a partition of a database of N objects into a set of k clusters. Usually, they start with an initial partition and then use an iterative control strategy to optimize an objective function. The K-means algorithm (Han & Kamber, 2001) is a wellknown and commonly used clustering algorithm. It takes input parameter k and partitions data into k clusters. First, we select k objects to represent the cluster centers. The remaining objects are then assigned to the cluster whose center is closest to the object. Then, it computes the mean value for each cluster as new cluster centers. This process is iterated until the criterion function converges. The same as with the ANN experiment, this process also uses a finance and non-finance ratio, and constructs the financial distress prediction model after a second time factor analysis. We apply the K-means algorithm to investigate the past 2 seasons, the past 4 seasons, the past 6 seasons, and the past 8 seasons before the occurrence of financial distress to ensure prediction accuracy. After the K-means algorithm implementation, we decided to adopt 10–15 clusters to analyze the prediction accuracy. 5.2.1. The experiment with non-factor analysis This experiment obtains a result after using 37 original ratio variables of this research that haven’t yet undergone a factor analysis. As shown in Table 9, the data has an estimate accuracy rate as high as 78.57%, with an error rate of 21.43% for the past 2 seasons. However, the accurate rate reduces to 56.36%, and the error rate rises to 43.64%, when measured over the past 8 seasons. The closer the financial crisis the higher the accuracy will be. 5.2.2. The experiment with 1st factor analysis This experiment obtains a result after using 18 original ratio variables of this research that have undergone a 1st factor analysis. As shown in Table 10, the data has an estimate accuracy rate as high as 75%, with an error rate of 25% for the past 2 seasons. However, the accurate rate reduces to 56.36%, and the error rate rises to 43.64% when measured over the past 8 seasons. Similar to the above experiment, the closer the financial crisis the higher the accuracy will be. 5.2.3. The experiment with 2nd factor analysis This experiment obtains a result after using 13 original ratio variables of this research that have undergone 2nd factor analysis. As shown in Table 11, the testing data has an estimate accuracy rate as high as 75%, with an error rate of 25% for the past 2 seasons. However, the accurate rate reduces to 56.36%, and the error rate rises to 43.64% when measured over the past 8 seasons. Similar to the above experiment, the closer the financial crisis the higher the accuracy will be. 4082 W.-S. Chen, Y.-K. Du / Expert Systems with Applications 36 (2009) 4075–4086 Table 9 The accuracy for the clustering model with non-factor analysis 10 Clusters Accuracy Normal 2 4 6 8 Accuracy Average Accuracy Average Accuracy Average Accuracy Average Bankruptcy 100% 57.14% 78.57% 74.07% 77.78% 75.93% 51.22% 85.37% 68.29% 47.27% 87.27% 67.27% 11 Clusters Accuracy Normal Bankruptcy 50.00% 85.71% 67.86% 88.89% 55.56% 72.22% 100% 48.78% 64.63% 85.45% 36.36% 60.91% 12 Clusters Accuracy Normal 13 Clusters Accuracy Bankruptcy 50.00% 85.71% 67.86% 88.89% 55.56% 72.22% 100% 51.22% 75.61% 41.82% 78.18% 60.00% Normal 14 Clusters Accuracy Bankruptcy 100% 57.14% 78.57% 62.96% 70.37% 66.67% 100% 51.22% 75.61% 40.00% 78.18% 59.09% Normal 15 Clusters Accuracy Bankruptcy Normal 100% 57.14% 78.57% 88.89% 55.56% 72.22% 97.56% 51.22% 74.39% 50.91% 76.36% 63.64% 85.71% 14 Clusters Accuracy 15 Clusters Accuracy Bankruptcy 57.14% 71.43% 88.89% 55.56% 72.22% 60.98% 73.17% 67.07% 96.36% 16.36% 56.36% Table 10 The accuracy for the clustering model with 1st factor analysis 2 4 6 8 Accuracy Average Accuracy Average Accuracy Average Accuracy Average 10 Clusters Accuracy 11 Clusters Accuracy Normal Normal Bankruptcy 100% 35.71% 67.86% 70.37% 51.85% 61.11% 92.68% 29.27% 60.98% 72.73% 40.00% 56.36% 12 Clusters Accuracy Bankruptcy 100.00% 50.00% 75.00% 100.00% 40.74% 70.37% 100% 19.51% 59.76% 100.00% 14.55% 57.27% Normal 13 Clusters Accuracy Bankruptcy Normal Bankruptcy Normal Bankruptcy Normal Bankruptcy 100.00% 50.00% 75.00% 100.00% 37.04% 68.52% 100% 19.51% 59.76% 100.00% 21.82% 60.91% 100% 50.00% 75.00% 100.00% 40.74% 70.37% 53.66% 51.22% 52.44% 100.00% 23.64% 61.82% 100% 50.00% 75.00% 100.00% 40.74% 70.37% 100.00% 34.15% 67.07% 100.00% 21.82% 60.91% 100.00% 50.00% 75.00% 100.00% 37.04% 68.52% 100.00% 34.15% 67.07% 100.00% 21.82% 60.91% 12 Clusters Accuracy 13 Clusters Accuracy 14 Clusters Accuracy 15 Clusters Accuracy Table 11 The accuracy for the clustering model with 2nd factor analysis 10 Clusters Accuracy Normal 2 4 6 8 Accuracy Average Accuracy Average Accuracy Average Accuracy Average 11 Clusters Accuracy Bankruptcy 100.00% 21.43% 60.71% 100.00% 37.04% 68.52% 82.93% 65.85% 74.39% 58.18% 56.36% 57.27% Normal Bankruptcy 100.00% 35.71% 67.86% 100.00% 37.04% 68.52% 82.93% 65.85% 74.39% 83.64% 29.09% 56.36% Normal Bankruptcy 100.00% 50.00% 75.00% 100.00% 37.04% 68.52% 82.93% 65.85% 74.39% 65.45% 60.00% 62.73% 6. The FDP comparing phase After the implementation for the FDP modeling phase, we will compare the BPN and clustering approaches with the accuracy rate, Type II error rate, and factor analysis. The detail descriptions will be discussed as following sections. 6.1. The accuracy rate for BPN and clustering As is evident by the above-mentioned results in Fig. 4, the BPN model presents the prediction performance by non-factor analysis, after the first-time factor analysis, and after the second time factor analysis. The result shows that the accuracy rate has the worst trend from the past 2 seasons to the past 8 seasons prior to the occurrence of the financial crisis. In addition, the BPN model shows that the closer the crisis the higher the accuracy rate becomes. As seen by the above-mentioned results shown in Fig. 5, the clustering model shows the prediction performance by non-factor analysis, after first-time factor analysis, and after the second time factor analysis. As a result, the accuracy rate is also shown the Normal 71.43% Bankruptcy 50.00% 60.71% 100.00% 25.93% 62.96% 82.93% 65.85% 74.39% 65.45% 60.00% 62.73% Normal Bankruptcy 71.43% 50.00% 60.71% 100.00% 51.85% 75.93% 68.29% 68.29% 68.29% 61.82% 61.82% 61.82% Normal Bankruptcy 92.86% 57.14% 75.00% 70.37% 74.07% 72.22% 68.29% 68.29% 68.29% 74.55% 56.36% 65.45% worse and worse trend as BPN model. In addition, the clustering model becomes more accurate the closer the crisis. 6.2. The type II error rate for BPN and clustering As seen by the above-mentioned results shown in Fig. 6, the BPN model presents the Type II error rate by non-factor analysis, after first-time factor analysis, and after the second time factor analysis. It shows that the Type II error rate increases for each factor analysis, while the accuracy rate decreases from the past 2 seasons to the past 8 seasons prior to the financial crisis. In addition, the BPN model becomes more accurate the closer the crisis and the Type II error rate becomes lower. As seen by the above-mentioned results shown in Fig. 7, the clustering model presents the Type II error rate by non-factor analysis, after first-time factor analysis, and after the second time factor analysis. It indicates that the Type II error rate has approximately the same increasing trend as the BPN model, while the accuracy rate decreases similar to the BPN model. The only exception is the Type II error rate which is better in the 2nd factor anal- 4083 W.-S. Chen, Y.-K. Du / Expert Systems with Applications 36 (2009) 4075–4086 The Accuracy Rate for the BPN 90.00% 80.00% past 2 seasons 70.00% 60.00% past 4 seasons 50.00% 40.00% past 6 seasons 30.00% 20.00% past 8 seasons 10.00% 0.00% None 1st 2nd past 2 seasons 82.14% 78.57% 75.00% past 4 seasons 77.78% 68.52% 72.22% past 6 seasons 76.83% 67.07% 64.63% past 8 seasons 60.00% 66.36% 65.45% Fig. 4. The accuracy rate for the BPN. The Accuracy Rate for Clustering 80.00% past 2 seasons 60.00% past 4 seasons 40.00% past 6 seasons 20.00% past 8 seasons 0.00% None 1st 2nd past 2 seasons 73.81% 74.31% 66.67% past 4 seasons 71.91% 68.21% 69.36% past 6 seasons 72.56% 61.18% 72.36% past 8 seasons 61.21% 59.70% 61.06% Fig. 5. The accuracy rate for clustering. ysis than in the non-factor analysis over the past 6 seasons. Nevertheless, in summary we get that the closer the crisis point, the lower the Type II error rate in the clustering model. 6.3. The factor analysis for BPN and clustering In this comparison, we average the accuracy rate of BPN and the clustering model for each factor analysis and over 2, 4, 6, and 8 seasons. In Fig. 8, we can see that the accuracy rate (non-factor analysis) with the BPN model is better than with the clustering model, with the exception of the past 8 seasons. In Fig. 9, we can see that the accuracy rates (1st factor analysis) with the BPN model are all better than with the clustering model. In Fig. 10, we can see that the accuracy rate (2nd factor analysis) with the BPN model is better than with the clustering model, with the exception over the past 6 seasons. 7. Conclusions This research aimed at the financial and the non-financial ratios in the financial statement, and used the BPN and the clustering model to compare the performance of the financial distress predictions, in order to find a better early-warning method. This research took 34 companies that were facing a financial crisis, and matched them with 34 normal companies of the similar industry. In addition, we adopted the necessary dataset from the TSEC database and sampled them into the past 2, 4, 6, 8 seasons prior to the financial crisis occurrence. This data was then used to carry out a statistical factor analysis, with each ratio variable being generated going into BPN and clustering methods in order to make a comparison. After the experiments, we summarized four critical contributions. First, the more time we used factor analysis, the less accurate 4084 W.-S. Chen, Y.-K. Du / Expert Systems with Applications 36 (2009) 4075–4086 The Type 2 Error Rate for the BPN 60.00% past 2 seasons 50.00% 40.00% past 4 seasons 30.00% past 6 seasons 20.00% 10.00% 0.00% past 8 seasons None 1st 2nd past 2 seasons 28.57% 28.57% 28.57% past 4 seasons 44.44% 51.85% 48.15% past 6 seasons 34.15% 31.71% 51.22% past 8 seasons 54.55% 34.55% 47.27% Fig. 6. The type 2 error rate for the BPN. The Type 2 Error Rate for Clustering 90.00% 80.00% past 2 seasons 70.00% 60.00% past 4 seasons 50.00% 40.00% past 6 seasons 30.00% 20.00% past 8 seasons 10.00% 0.00% None 1st 2nd past 2 seasons 33.34% 52.38% 55.95% past 4 seasons 38.27% 58.64% 56.17% past 6 seasons 39.84% 74.71% 33.34% past 8 seasons 37.88% 76.36% 46.06% Fig. 7. The type 2 error rate for clustering. the results for the BPN and clustering approaches. In our experiments, we found that when we applied all of the 37 variables with non-factor analysis into the BPN and clustering models, we could obtain a better prediction performance except for the past 8 seasons in the BPN model and for the past 2 seasons in the clustering model. Second, the closer we get to the time of the actual financial distress, the more accurate the prediction will be. For example, the accuracy rate with the non-factor analysis for 2 seasons before the financial distress occurs is 82.14% in BPN, while it is only 60% over 8 seasons. The results are similar for the clustering model, where the accuracy rate with non-factor analysis for 2 and 8 seasons before the occurrence of financial distress are 73.81% and 61.21%, respectively. Third, most investors are concerned with the Type II error rate and avoid investing in these companies. Our empirical results show that factor analysis increases the error forecasts of classifying companies with a potential financial crisis as a normal company. Moreover, we also found that the average rate of the Type II error in the clustering model is higher than in the BPN model. Therefore, the prediction performance for the clustering approach is more aggressively influenced than the BPN model. Finally, the BPN approach obtains a better prediction accuracy than the DM clustering approach in developing a financial distress prediction model, with the exception that the accuracy rate (nonfactor analysis) for the past 8 seasons model and the accuracy rate (2nd factor analysis) for the past 6 seasons is lower with the BPN model. 4085 W.-S. Chen, Y.-K. Du / Expert Systems with Applications 36 (2009) 4075–4086 BPN vs. Clustering Model Accuracy Rate 90.00% 80.00% 70.00% 60.00% 50.00% BPN 40.00% Clustering 30.00% 20.00% 10.00% 0.00% 2 Seasons 4 Seasons 6 Seasons 8 Seasons Before the Occurrence of Financial Distress Fig. 8. The accuracy rate with non-factor analysis for the BPN and clustering comparison. BPN vs. Clustering Model Accuracy rate 90.00% 80.00% 70.00% 60.00% 50.00% BPN 40.00% Clustering 30.00% 20.00% 10.00% 0.00% 2 Seasons 4 Seasons 6 Seasons 8 Seasons Before the Occurrence of Financial Distress Fig. 9. The accuracy rate with 1st analysis for the BPN and clustering comparison. BPN vs. Clustering Model Accuracy rate 80.00% 70.00% 60.00% 50.00% BPN 40.00% Clustering 30.00% 20.00% 10.00% 0.00% 2 Seasons 4 Seasons 6 Seasons 8 Seasons Before the Occurrence of Financial Distress Fig. 10. The accuracy rate with 2nd analysis for the BPN and clustering comparison. In future research, additional artificial intelligence techniques, such as other neural network models, classification mining, genetic algorithms, and others, could also be applied. And certainly, researchers could expand the system so as to deal with more financial datasets. Acknowledgements We also gratefully acknowledge the Editor and anonymous reviewers for their valuable comments and constructive suggestions. 4086 W.-S. Chen, Y.-K. Du / Expert Systems with Applications 36 (2009) 4075–4086 References Altman, E. L. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance, 23(3), 589–609. Altman, E. L., Edward, I., Haldeman, R., & Narayanan, P. (1977). A new model to identify bankruptcy risk of corporations. Journal of Banking and Finance, 1, 29–54. Beaver, W. (1966). Financial ratios as predictors of failure, empirical research in accounting: Selected studied. Journal of Accounting Research, 71–111. Blum, M. (1974). Failing company discriminant analysis. Journal of Accounting Research, 1–25. Chen, T. (2007). Incorporating fuzzy c-means and a back-propagation network ensemble to job completion time prediction in a semiconductor fabrication factory. Fuzzy Sets and Systems, 158(19), 2153–2168. Chen, A. P., & Chen, C. C. (2006). A new efficient approach for data clustering in electronic library using ant colony clustering algorithm. The Electronic Library, 24(4), 548–559. Chun, S. H., & Park, Y. J. (2006). A new hybrid data mining technique using a regression case based reasoning: Application to financial forecasting. Expert Systems with Applications, 31(2), 329–336. Deng, W. J., Chen, W. C., & Pei, W. (2007). Back-propagation neural network based importance–performance analysis for determining critical service attributes. Expert Systems with Applications. doi: 10.1016/j.eswa.2006. 12.016. Dimitras, A. I., Zanakis, S. H., & Zopounidis, C. (1996). A survey of business failure with an emphasis on prediction methods and industrial applications. European Journal of Operational Research, 90(3), 487–513. Fanning, K., & Cogger, K. (1998). Neural network detection of management fraud using published financial data. International Journal of Intelligent Systems in Accounting, Finance and Management, 7(1), 21–24. Feroz, E., Park, K., & Pastena, V. (1991). The financial and market effects of the SECs accounting and auditing enforcement releases. Journal of Accounting Research, 29(Suppl.), 107–142. Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco, CA, USA: Morgan Kaufmann. Huang, M. J., Chen, M. Y., & Lee, S. C. (2007). Integrating data mining with casebased reasoning for chronic diseases prognosis and diagnosis. Expert Systems with Applications, 32(3), 856–867. Huang, Y. P., Hsu, C. C., & Wang, S. H. (2007). Pattern recognition in time series database: A case study on financial database. Expert Systems with Applications, 33(1), 199–205. Hua, Z., Wang, Y., Xu, X., Zhang, B., & Liang, L. (2007). Predicting corporate financial distress based on integration of support vector machine and logistic regression. Expert Systems with Applications, 33(2), 434–440. John, S. G., & Robert, W. I. (2001). Tests of the generalizability of altman’s bankruptcy prediction model. Journal of Business Research, 54, 53–61. Jost, A. (1993). Neural networks: A logical progression in credit and marketing decision system. Credit World, 81(4), 26–33. Kinney, W., & McDaniel, L. (1989). Characteristics of firms correcting previously reported quarterly earnings. Journal of Accounting and Economics, 11(1), 71–93. Kirkos, E., Spathis, C., & Manolopoulos, Y. (2007). Data mining techniques for the detection of fraudulent financial statements. Expert Systems with Applications, 32(4), 995–1003. Ko, P. C., & Lin, P. C. (2006). An evolution-based approach with modularized evaluations to forecast financial distress. Knowledge-Based Systems, 19(1), 84–91. Laitinen, E. K., & Laitinen, T. (2000). Bankruptcy prediction application of the Taylor’s expansion in logistic regression. International Review of Financial Analysis, 9, 327–349. Lee, T. L. (2004). Back-propagation neural network for long-term tidal predictions. Ocean Engineering, 31(2), 225–238. Loebbecke, J., Eining, M., & Willingham, J. (1989). Auditor’s experience with material irregularities: Frequency, nature and detectability. Auditing: A Journal of Practice and Theory, 9, 1–28. Medsker, L., & Liebowitz, J. (1994). Design and development of expert systems and neural networks. New York: Macmillan. Meyer, P. A., & Pifer, H. (1970). Prediction of bank failures. The Journal of Finance, 25, 853–868. Mitra, S., Pal, S. K., & Mitra, P. (2002). Data mining in soft computing framework: A survey. IEEE Transactions Neural Networks, 13(1), 3–14. Panda, S. S., Chakraborty, D., & Pal, S. K. (2007). Flank wear prediction in drilling using back-propagation neural network and radial basis function network. Applied Soft Computing. doi:10.1016/j.asoc.2007.07.003. Persons, O. (1995). Using financial statement data to identify factors associated with fraudulent financial reporting. Journal of Applied Business Research, 11(3), 38–46. Ronald, J. W., Rumelhart, D. E., & Hinton, G. E. (1986). Learning internal representations by error propagation. In E. David Rumelhart & J. A. McClelland (Eds.). Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1). Cambridge: MIT Press/Bradford Books. Spathis, C. (2002). Detecting false financial statements using published data: Some evidence from Greece. Managerial Auditing Journal, 17(4), 179–191. Spathis, C., Doumpos, M., & Zopounidis, C. (2002). Detecting falsified financial statements: A comparative study using multicriteria analysis and multivariate statistical techniques. The European Accounting Review, 11(3), 509–535. Stice, J. (1991). Using financial and market information to identify pre-engagement market factors associated with lawsuits against auditors. The Accounting Review, 66(3), 516–533. Werbos, P. (1974), Beyond regression: New tools for prediction and analysis in the behavioral science, Ph.D. Thesis, Committee on Applied Mathematics, Harvard University, Cambridge, MA. Wu, D., Yang, Z., & Liang, L. (2006). Using DEA-neural network approach to evaluate branch efficiency of a large Canadian bank. Expert Systems with Applications, 31, 108–115. Yılmaz, B., & Cunedioglu, U. (2007). Source localization of focal ventricular arrhythmias using linear estimation, correlation, and back-propagation networks. Computers in Biology and Medicine, 37(10), 1437–1445.