Charles University in Prague
Faculty of Social Sciences
Institute of Economic Studies

MASTER THESIS
Artificial Intelligence Approach to Credit Risk

Author: Bc. Jan Říha
Supervisor: PhDr. Jozef Baruník Ph.D.
Academic Year: 2015/2016

Declaration of Authorship

The author hereby declares that he compiled this thesis independently, using only the listed resources and literature. The author grants to Charles University permission to reproduce and to distribute copies of this thesis document in whole or in part.

Prague, January 3, 2016

Acknowledgments

I would like to thank PhDr. Jozef Baruník Ph.D. for his comments and valuable feedback, my family, and my partner for her continuous moral support.

Abstract

This thesis focuses on the application of artificial intelligence techniques in credit risk management. These modern tools are compared with the current industry standard, Logistic Regression. We introduce the theory underlying Neural Networks, Support Vector Machines, Random Forests, and Logistic Regression. In addition, we present a methodology for the statistical and business evaluation and comparison of the aforementioned models. We find that models based on the Neural Network approach (specifically the Multi-Layer Perceptron and the Radial Basis Function Network) outperform Logistic Regression in the standard statistical metrics as well as in the business metrics. The performance of Random Forests and Support Vector Machines is not satisfactory, and these models do not prove superior to Logistic Regression in our application.

JEL Classification: G23, C15, C45, C53, C58
Keywords: Credit Risk, Scoring, Neural Networks, Support Vector Machines, Random Forests
Author's e-mail: [email protected]
Supervisor's e-mail: [email protected]

Abstrakt

This thesis deals with the application of artificial intelligence in credit risk management. This modern approach is compared with the current market standard, logistic regression.
We present the theory of neural networks, support vector machines, random forests, and logistic regression. We also cover the methodology for evaluating and comparing these models from the statistical and business perspectives. We find that models from the neural network family, namely the Multi-Layer Perceptron and the Radial Basis Function Network, outperform logistic regression in standard statistical and business criteria. The performance of random forests and support vector machines is not sufficient, and in our work it did not reach the performance of logistic regression.

JEL Classification: G23, C15, C45, C53, C58
Keywords: Credit Risk, Scoring, Neural Networks, Support Vector Machines, Random Forests, Logistic Regression
Author's e-mail: [email protected]
Supervisor's e-mail: [email protected]

Table of Contents

List of Tables
List of Figures
Acronyms
Thesis Proposal
1 Introduction
2 Motivation and Literature Review
  2.1 Literature Review
3 Methodology
  3.1 Linear and Non-linear Classification
  3.2 Logistic Regression
  3.3 Neural Networks
    3.3.1 Feedforward Networks
    3.3.2 Neural Networks with Radial Basis Functions
    3.3.3 Jump Connections
    3.3.4 Multilayered Feedforward Networks
  3.4 Support Vector Machines
    3.4.1 Computation of the Support Vector Classifier
    3.4.2 Support Vector Machines and Kernels
    3.4.3 Support Vector Machines and Regressions
  3.5 Random Forests and Trees
    3.5.1 Classification and Regression Trees
    3.5.2 Random Forests
4 Model Development
  4.1 Variable Selection
  4.2 Model Selection
  4.3 Performance Assessment
  4.4 Hypotheses
  4.5 Data Description and Exploratory Analysis
  4.6 Model Building
  4.7 Evaluation of Ordering Ability
  4.8 Evaluation of Profit-Making Potential
5 Conclusion
Bibliography
A Underwriting Process in Finance

List of Tables

4.1 Observations
4.2 Variables
4.3 Categorical Variables
4.4 Categorical Variables: Detailed Overview
4.5 Independent Variables
4.6 Parameter Estimation
4.7 Comparison of Final Specifications
4.8 Portfolio Risk on the 70th Percentile: Model Comparison
4.9 K-S & AUC Estimation Results
4.10 Cost Matrix
4.11 Monte Carlo Simulation Summary

List of Figures

3.1 Linear and Non-linear Separable Cases
3.2 Logit and Probit Distribution Function
3.3 Architecture of Basic Feedforward Network
3.4 Step, Tansigmoid, Gaussian and Logsigmoid Functions
3.5 Separation using Neural Network with Different Activation Functions
3.6 Feedforward Network with Jump Connections
3.7 Architecture of Multilayered Feedforward Network
3.8 Separable and Non-separable Cases
3.9 (Non)separability Demonstration
3.10 Separation using SVM with Different Kernel Functions
3.11 ε-sensitive Error Function
3.12 CART Plot and Explanation of Logic Using Simple Data
3.13 Comparison of Random Forests
4.1 K-S Statistics
4.2 ROC, AUC
4.3 Cumulative Lift
4.4 Out-of-time Sample
4.5 Non-Categorical Variables: Detailed Overview
4.6 Employment Length
4.7 Goods Price
4.8 Bad & Good: Visualisation
4.9 Categorical Variables: Visualisation
4.10 AUC Coefficient Comparison
4.11 Risk Dynamics
4.12 ROC
4.13 Difference between CDF of Goods and Bads
4.14 Cumulative Lift
4.15 Goods & Bads Distribution - Part 1
4.16 Goods & Bads Distribution - Part 2
4.17 Profit Dynamics
4.18 Monte Carlo Simulation Visualisation
Acronyms

AUC   Area Under ROC Curve
CART  Classification And Regression Trees
CDF   Cumulative Distribution Function
DPD   Days Past Due
FLM   Full Logistic Model
LLR   Land Line Refused
MCS   Monte Carlo Simulation
MLP   Multi-Layer Perceptron
MRMR  Minimum Redundancy Maximum Relevancy
NN    Neural Networks
OR    Odds Ratio
OA    Ordering Ability
RBF   Radial Basis Function
RBFN  Radial Basis Function Network
ROC   Receiver Operating Characteristic
RF    Random Forests
SVM   Support Vector Machines

Master Thesis Proposal

Author: Bc. Jan Říha
Supervisor: PhDr. Jozef Baruník Ph.D.
Proposed topic: Artificial Intelligence Approach to Credit Risk

Motivation

Neural networks, together with other artificial intelligence approaches, have been used successfully in a variety of business fields, including accounting, marketing, and production management (McNelis, 2005). The majority of studies have used these tools for forecasting stock returns, bankruptcy, exchange rates, or credit card fraud (Gouvêa et al., 2007). The granting of loans by finance companies is one of the key fields involving decision problems that require precise treatment. Models based on artificial intelligence are believed to have the potential to render effective credit assessments supporting the approval process in finance companies. Many researchers currently concentrate on employing neural network classification models to divide loan applications into good and bad ones (Angelini et al., 2008; Ghodselahi et al., 2011). Generally, loan officers in Central and Eastern Europe and in Asia rely on traditional methods such as logistic regression or decision trees to guide them in assessing the creditworthiness of a client (Steiner et al., 2006). The complexity of the various decision tools and the differences between applications pose a challenge for the neural-computing approach, which is to provide learning power not offered by other techniques.
Artificial intelligence tools, with their potential to capture complex and nonlinear relationships, are a promising alternative to the common classification and forecasting methods (McNelis, 2005; Khashman, 2010).

Hypotheses

(i) Hypothesis #1: Models based on artificial intelligence outperform conventional decision-making techniques as an ordinal risk measure. This hypothesis will be tested by ordering the loans by the score assigned by the different discrimination techniques; the relationship between risk and score will then be assessed and compared for each model.

(ii) Hypothesis #2: Models based on artificial intelligence outperform conventional decision-making techniques in terms of potential profit. This hypothesis will be tested using a cost matrix and Monte Carlo simulation.

(iii) Hypothesis #3: Models based on artificial intelligence provide more time-stable performance in risk decision-making than conventional techniques. We will compare all of our models on an out-of-time dataset.

Methodology

The first step is the collection of suitable data, the revision of inconsistencies and missing values, and the transformation into a proper format. We have decided to use a retail portfolio from one country in Asia, for which we have complex data about the applicants at hand. We then present the theory underlying the selected models based on artificial intelligence: variations of Neural Networks, Support Vector Machines, and Random Forests. The presented models will be trained on the datasets and compared with the benchmark of the risk industry, logistic regression. Afterwards, the profit-making potential of each approach will be assessed using Monte Carlo simulations. At the end of the thesis, we summarize the discrimination and risk performance of all employed models and propose the most appropriate approach.
Expected Contribution

We expect to be able to demonstrate the superior discrimination ability of methods based on artificial intelligence. Several studies have focused on similar topics (Hall et al., 2009; Huang et al., 2004; Yu et al., 2008); nevertheless, they usually used data from developed markets. In our thesis we use data from less developed countries, where more emphasis is put on the superior predictive power of models because the collection effectiveness in these countries is considerably lower (Eletter et al., 2010). The structure of defaults in these countries is also different: we observe much more fraud and more defaults in general. We therefore anticipate that artificial intelligence techniques will reach higher predictive power than the more traditional approaches, especially in these demanding conditions.

Outline

1. Motivation: Risk costs are the most significant costs in consumer finance companies, especially in less developed countries, where the local legal framework and the creditworthiness of people are substantially weaker than in western countries. Artificial intelligence models have the potential to enhance current underwriting processes in credit granting, thus lowering risk costs and increasing profitability.
2. Theory underlying the selected models based on artificial intelligence: We will introduce the theory and its development.
3. Data: We will describe the data and perform selected analyses (e.g. distributions of some variables or explanations of unusual relationships caused by local specifics).
4. Methods: We will use WEKA, Matlab and SQL Developer for data processing, model training, and subsequent model assessment.
5. Results: We will compare the discrimination power of the models and their profit-making potential. The most suitable type of model will be proposed.
6.
Concluding remarks: We will summarize the whole work and propose areas for further research on this topic.

Core bibliography

1. Angelini, E., di Tollo, G., & Roli, A. (2008). A neural network approach for credit risk evaluation. The Quarterly Review of Economics and Finance, 733-755.
2. Ashenfelter, O., Harmon, C., & Oosterbeek, H. (1999). A review of estimates of the schooling/earnings relationship, with tests for publication bias. Labour Economics, 6(4), 453-470.
3. Eletter, S. F., & Yaseen, S. G. (2010). Applying Neural Networks for Loan Decisions in the Jordanian Commercial Banking System. 209-214.
4. Ghodselahi, A., & Amirmadhi, A. (2011). Application of Artificial Intelligence Techniques for Credit Risk Evaluation. International Journal of Modeling and Optimization, 243-249.
5. Gouvêa, M. A., & Gonçalves, E. B. (2007). Credit Risk Analysis Applying Logistic Regression, Neural Networks and Genetic Algorithms Models.
6. Hall, M. J. B., Muljawan, D., Suprayogi, & Moorena, L. (2009). Using the artificial neural network to assess bank credit risk: a case study of Indonesia. Applied Financial Economics, 1825-1846.
7. Huang, Z., Chen, X., Hsu, C. J., Chen, W. H., & Wu, S. (2004). Credit Rating Analysis with Support Vector Machines and Neural Networks: A Market Comparative Study. Decision Support Systems, 543-558.
8. Khashman, A. (2010). Neural networks for credit risk evaluation: Investigation of different neural models and learning schemes. Expert Systems with Applications, 6233-6239.
9. McNelis, P. D. (2005). Neural Networks in Finance: Gaining Predictive Edge in the Market. Academic Press Advanced Finance.
10. Steiner, M. T. A., Neto, P. J. S., Soma, N. Y., Shimizu, T., & Nievola, J. C. (2006). Using Neural Network Rule Extraction for Credit-Risk Evaluation. International Journal of Computer Science and Network Security, 6-16.
11. Yu, L., Wang, S., & Lai, K. K. (2008). Credit risk assessment with a multistage neural network ensemble learning approach.
Expert Systems with Applications, 1434-1444.

Chapter 1 Introduction

The financial industry has experienced substantial growth in the past decades, and it has become vitally important to implement advanced statistical and mathematical methods to assess the possible risks and exposures resulting from various investment activities. These methods provide fast and automatic tools that help in making effective decisions. In this thesis, we focus on the field of financial companies providing consumer loans and introduce several tools used for the management of the consequent risk exposures. The main and most advanced part of this process is the scoring of loan applicants. The ultimate task of credit scoring is rather simple to state: develop a model capable of distinguishing between a bad and a good debtor or, in other words, of estimating the probability of being bad or good. For the sake of clarity, we can reformulate: utilize past data to learn relationships that generalize to the future discrimination between these two classes, with the lowest possible number of false positive and false negative classifications. Nevertheless, there are many ways to perform this task, and some are more powerful and robust than others. Two scientific fields meet here: machine learning and statistics. While the scope and methods of these fields are generally different, drawing a clear line between them in the context of classification is nearly impossible. For example, logistic regression is often used for classification and is considered a statistical tool. On the other hand, neural networks, among others, are usually referred to as tools from the field of machine learning. Statistics emphasizes the description of the relationships among
the variables in a model and the quantification of the contribution of every single one of them, while machine learning focuses on the ability to make sufficiently correct predictions without any substantial statistical or econometric assumptions. In this thesis, we introduce several representatives from both groups, outline the relevant theory, and compare them based on their ability to mitigate risk and generate profit. The main areas covered in this thesis are Neural Networks, Support Vector Machines, Random Forests, and Logistic Regression, the last serving as the current market standard.

Chapter 2 Motivation and Literature Review

2.1 Literature Review

This thesis provides a brief synthesis of the theory of Logistic Regression, Neural Networks and their specifications, Support Vector Machines, and Random Forests. A general introduction to classification and the related theory can be found in Khashman (2010), Elizondo (2006), or in the excellent book on statistical learning by Hastie et al. (2013). We use the standard theory of logistic regression introduced in Agresti (2013) and Hosmer et al. (2013) and its connection to probability-of-default modeling in Hastie et al. (2013). There is a rich theory underlying Neural Networks; the basic architecture of these models is well explained in Angelini et al. (2008) and Ghodselahi & Amirmadhi (2011). More advanced and complex structures are described in Haykin (2009), Khashman (2010), Witten & Frank (2005), and Yu et al. (2008). Another useful book on Neural Networks was written by McNelis (2005), which can again be supplemented by the book by Hastie et al. (2013). On the other hand, there are not many resources describing Support Vector Machines in detail. This is probably because the area is relatively new and the major emphasis has been placed on Neural Networks. Nevertheless, Support Vector Machines are becoming much more
popular. The first paper on this topic was published by Cortes & Vapnik (1995). Many researchers have continued in this field; we recommend Bellotti & Crook (2009), Cristianini & Shawe-Taylor (2000), and Elizondo (2006) as introductions. The mathematical foundations are well described in Schölkopf & Smola (2002), Cortes & Vapnik (1995), and Souza (2010). We look into the mathematics of Support Vector Machines in deeper detail because it is important for understanding the underlying logic and because other papers are less descriptive on this topic. The ability to estimate probabilities is described in Hastie et al. (2013). The last type of model is rather simple, and not much theory is needed. The foundations of Classification and Regression Trees were laid by Leo Breiman in his seminal 1984 work (Breiman et al., 1984). The extension of this theory, Random Forests, was published by the same author (Breiman, 2001). Random Forests and their properties have been described and researched by many authors; we can recommend Amaratunga et al. (2008), Biau et al. (2008), and Buja & Stuetzle (2006). We also briefly cover the theory of model development and assessment. A rather simple variable selection is described in Derksen & Keselman (1992). More sophisticated variable selection employing the relevancy and redundancy of variables was published in Peng et al. (2005) and Ding & Peng (2005). Descriptions of the tools used for performance assessment of the developed models can be found in Lobo et al. (2008), Rodríguez et al. (2010), Siddiqi (2012), and Zweig & Campbell (1993). Many practical aspects of the scorecard development process are covered in Siddiqi (2012), Mays (2001), and Witten & Frank (2005).

Chapter 3 Methodology

3.1 Linear and Non-linear Classification

An algorithm that performs classification is known as a classifier.
Based on the method employed, the algorithm is represented by some mathematical function whose parameters need to be estimated using a training dataset (Khashman, 2010). The main task of supervised machine learning is to derive a function that correctly describes the data in the training set and, moreover, is capable of generalizing from the training dataset to unobserved data. As prediction is the main application of classifiers, the ability to generalize is the most important property distinguishing a good classifier from a bad one.

An important subclass of classification is probabilistic classification. Algorithms in this subclass employ statistical inference to determine the best class for given input data. Whereas the usual classifiers simply select the best class, the probabilistic approach generates, for the input data, the probability of membership in each class; the best class is then usually selected as the class with the highest probability.

The classification function generates a border between two classes. This border is called the decision boundary (or decision surface in higher dimensions). Based on the shape of the decision boundary, we distinguish two main classes of classifiers: linear and non-linear. If some data can be separated by a linear decision boundary, the data are linearly separable. A more formal definition follows Elizondo (2006):

Two subsets A, B ⊂ R^n are linearly separable if there exists a hyperplane of R^n such that the elements of A and the elements of B lie on opposite sides of the hyperplane.

It is convenient to formalize the classification problem (Elizondo, 2006):

• Let the dot product space R^n be the data universe
• Let S be a sample set, S ⊂ R^n
• Let f : R^n → {−1, +1} be the target mapping function
• Let D = {(x, y) : x ∈ S ∧ y = f(x)} be the training set

Then the estimated classifier is a function f̂ : R^n → {−1, +1} constructed using D such that f̂(x) ≈ f(x) for all x ∈ R^n.
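The formalization above can be made concrete with a tiny linear classifier of the form f̂(x) = sgn(w · x + b), which is introduced formally in the following section. This is only an illustrative sketch; the weight vector, bias, and sample points below are chosen by hand, not estimated from any data.

```python
import numpy as np

def f_hat(x, w, b):
    """Estimated classifier f-hat: the sign of the decision function w . x + b."""
    value = float(np.dot(w, x) + b)
    if value > 0:
        return 1
    if value < 0:
        return -1
    return 0

# Illustrative decision boundary x1 + x2 - 1 = 0 in R^2.
w = np.array([1.0, 1.0])
b = -1.0

print(f_hat(np.array([2.0, 2.0]), w, b))  # lies above the boundary -> +1
print(f_hat(np.array([0.0, 0.0]), w, b))  # lies below the boundary -> -1
```

Different estimation methods (logistic regression, SVM, a single-neuron network) would produce different values of w and b, but all share this same decision rule.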
Figure 3.1: Linear and Non-linear Separable Cases (Source: own processing)

Depending on n, we construct a line, plane, or hyperplane that separates the two classes as well as possible. Using vectors, we can express this decision boundary in the form

w · x + b = 0 (3.1)

The advantage of the vector form (3.1) is that it is valid for any number of dimensions; the boundary can thus be a line, a plane, or an n-dimensional hyperplane. In the machine learning literature, b stands for the bias and w for the weight vector; the statistical literature refers to both as parameters. Having the weight vector and bias from (3.1), the classification of an observation is simple: if the observation belongs to the +1 class, the value of (3.1) will be positive and the point will lie above the decision boundary. Should the observation belong to the −1 class, the point will lie below the decision boundary and the value of (3.1) will be negative. The decision function f̂ might be constructed in the following way (Hastie et al., 2013):

f̂(x) = sgn(w · x + b) (3.2)

where sgn() represents the sign function, which returns +1 for positive arguments, −1 for negative arguments, and 0 for an argument of zero. This can be called a binary classification problem, because the output domain Y = {−1, +1} consists of only two elements. A generalization of this approach exists: if the output domain consists of m classes, i.e. Y = {1, 2, 3, . . . , m}, we speak of m-class classification; in the continuous case, Y ⊆ R, we speak of regression.

The obvious problem lies in the determination of the optimal bias b and weight vector w. One major approach to this task is discriminative learning. Discriminative learning does not have the ambition to infer the underlying probability distributions of all relevant variables and attempts only to find the most suitable mapping from the inputs to the outputs.
This basically means that the model tries to learn the conditional probability distribution p(y | x) directly. This class includes, among others, Neural Networks, Logistic Regression, and Support Vector Machines, the models this thesis focuses on.

The difference in performance between various linear classifiers is caused by different assumptions regarding the underlying distributions or by the various methods used for the estimation of the weight vectors and biases. Returning to the geometrical interpretation of linearity and the input data, we should realize that the decision boundaries generated by different linear classifiers should be relatively similar (provided both rest on reasonable assumptions). Nevertheless, the reason why some linear classifiers outperform others generally depends on three points:

i) the ability to handle linearly inseparable data;
ii) how well the classifier copes with random noise in the data and with outliers;
iii) whether, and how well, the classifier can utilize non-linear relationships in the dataset.

The following chapters will show that some linear classifiers lack the ability to deal with some of these points, and how other classifiers handle these issues.

3.2 Logistic Regression

Logistic regression belongs to the class of linear classifiers, although the name would indicate otherwise. Due to its stable performance and widespread application in finance and other areas, it serves as an industry standard. Other reasons for its success are its easy implementation and the interpretability of the business logic behind the values of each variable. In the following section, which is mainly based on Hosmer et al. (2013) and Agresti (2013), we introduce the mathematics underlying the logistic model. For a given vector x = (x0, x1, . . .
, xp)′ of input data, we consider a dependent variable Yx with a binary distribution (Yx = 1 for default and Yx = 0 otherwise). The expected value of the dependent variable Yx can be written as

E(Yx) = 1 · P(Yx = 1) + 0 · P(Yx = 0) = P(Yx = 1) = π(x) (3.3)

where the term π(x) = P(Yx = 1) represents the conditional probability of default for a given input vector of variables x. The primary objective is to define an appropriate model capturing the dependence of the probability of default on the vector of input variables. The ordinary linear regression model might be easy to use, but it is not suitable for binary dependent variables, mainly because the probability π(x) ranges between zero and one, whereas the values generated by linear regression can be any real numbers. Therefore, it is necessary to define the so-called odds function as the ratio of the probability of default and the probability of survival (i.e. non-default):

odds(x) = P(Yx = 1) / P(Yx = 0) = π(x) / (1 − π(x)) (3.4)

This function maps probabilities into the interval [0, ∞). Since we need a transformation of probabilities from [0, 1] onto the whole real line (Witzany, 2010), we first apply the logarithmic transformation

logit(x) = ln odds(x) = ln [π(x) / (1 − π(x))] (3.5)

and then set logit(x) = β′x in order to get an explicit formula for the logistic regression which ranges within the desired interval. The final form is defined as (Hosmer et al., 2013)

π(x) = e^{β′x} / (1 + e^{β′x}) (3.6)

Instead of the logit transformation (3.5), any other transformation from [0, 1] to R can be used. One possibility is the inverse of the distribution function Φ of the standard normal distribution, known as the probit (Hastie et al., 2013):

probit(x) = Φ⁻¹(π(x)) (3.7)
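The chain of transformations in (3.4)-(3.6) can be checked numerically in a few lines: the logistic function is the inverse of the logit, so applying both recovers the original probability. This is a minimal sketch using only the formulas above; the probability value 0.2 is an arbitrary example.

```python
import math

def odds(p):
    """Odds (eq. 3.4): p / (1 - p)."""
    return p / (1.0 - p)

def logit(p):
    """Logit transformation (eq. 3.5): maps probabilities in (0, 1) onto the real line."""
    return math.log(odds(p))

def logistic(z):
    """Inverse of the logit (eq. 3.6): pi(x) = e^z / (1 + e^z), with z = beta'x."""
    return math.exp(z) / (1.0 + math.exp(z))

p = 0.2
z = logit(p)                    # negative, since p < 0.5
print(round(logistic(z), 10))   # recovers 0.2 up to rounding
```

The probit of (3.7) would replace `logistic` with the standard normal CDF; the logit is preferred in the text precisely because its inverse has this simple closed form.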
Figure 3.2: Logit and Probit Distribution Function (Source: own processing)

Nevertheless, the logit transformation is preferred for its simplicity: unlike the probit, the logit transformation has a closed-form inverse and is easy to compute. In addition, the parameters in logistic regression have a straightforward interpretation: for a one-unit change in characteristic xi, the odds are multiplied by the factor e^{βi}.

Suppose the input data contain s categorical variables for each loan, where the i-th variable is composed of pi categories. We define the set of all pairs (i, j) of variables i with corresponding categories j as (Agresti, 2013)

Z = {(i, j) : i ∈ {1, . . . , s}, j ∈ {1, . . . , pi}} (3.8)

For each loan k, a vector of dummy variables is defined, where (xij)k = 1 if loan k lies in the corresponding category j of variable i:

xk = ((xij)k : (i, j) ∈ Z) (3.9)

Then we define a set B consisting of all bad (defaulted) loans and a set G containing only the good (non-defaulted) loans (Agresti, 2013):

Bij = {k : k ∈ B, (xij)k = 1} (3.10)

Gij = {k : k ∈ G, (xij)k = 1} (3.11)

Now the total odds can be defined as the number of bad clients against the number of good clients:

odds = |B| / |G| (3.12)

This ratio can be defined for particular categories j of variables i as well:

oddsij = |Bij| / |Gij| (3.13)

Finally, we introduce the odds ratio ORij as

ORij = oddsij / odds (3.14)

Based on Agresti (2013), we can calculate odds(x) as

odds(x) = (|B| / |G|) ∏_{(i,j)∈Z} (oddsij / odds)^{xij} = odds ∏_{(i,j)∈Z} (ORij)^{xij} (3.15)

The Full Logistic Model (FLM) can now be introduced. In this model, we put a particular weight on each category of each dummy variable. The scoring function thus has the form

S^{FLM}(x, λ) = odds ∏_{(i,j)∈Z} (ORij)^{λij xij} (3.16)

where λ = (λij : (i, j) ∈ Z) is the set of parameters.
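Equations (3.12)-(3.16) can be sketched on a toy portfolio. All counts below are invented for illustration (they are not from the thesis dataset), and the unit weights reduce the FLM score (3.16) to the plain odds formula (3.15).

```python
# Toy portfolio: total bad/good counts, plus bad/good counts per category (i, j).
total_bad, total_good = 40, 160
odds_total = total_bad / total_good                    # eq. (3.12)

bad  = {(1, 1): 30, (1, 2): 10, (2, 1): 25, (2, 2): 15}
good = {(1, 1): 70, (1, 2): 90, (2, 1): 75, (2, 2): 85}

def odds_cat(i, j):
    """Category-level odds (eq. 3.13): |B_ij| / |G_ij|."""
    return bad[(i, j)] / good[(i, j)]

def odds_ratio(i, j):
    """Odds ratio OR_ij = odds_ij / odds (eq. 3.14)."""
    return odds_cat(i, j) / odds_total

def flm_score(x, lam):
    """Full-logistic-model score (eq. 3.16); x and lam map (i, j) -> dummy / weight."""
    score = odds_total
    for key in x:
        score *= odds_ratio(*key) ** (lam[key] * x[key])
    return score

# A loan falling into category 1 of variable 1 and category 2 of variable 2:
x   = {(1, 1): 1, (2, 2): 1}
lam = {(1, 1): 1.0, (2, 2): 1.0}   # unit weights: (3.16) reduces to (3.15)
print(flm_score(x, lam))           # estimated odds of this loan being bad
```

Categories with OR_ij > 1 are riskier than the portfolio average and push the loan's odds up; categories with OR_ij < 1 pull it down.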
3.3 Neural Networks

Similarly to the linear approximation methods, a neural network transforms a set of input variables into a set of output variables. The main difference between neural networks and other approximation methods lies in the hidden layers, where the input variables are transformed at least once by multiple activation functions (Angelini et al., 2008). These hidden layers may appear hardly comprehensible; nevertheless, they constitute a very powerful approach to flexible modeling of nonlinear relationships.

3.3.1 Feedforward Networks

The architecture of feedforward neural networks is illustrated in the next figure. This simple network consists of an input layer containing n input neurons {xi}, i = 1, 2, 3, . . . , n, one hidden layer with m neurons, and one output neuron (Ghodselahi & Amirmadhi, 2011). The parallel processing of information in this system is a crucial advantage over typical linear systems, where we usually observe only sequential processing. Moreover, a neural network with a simple feedforward architecture can efficiently approximate any basic model. For example, the linear model can be synthesized by a feedforward network with one neuron in the hidden layer and a linear activation function.

Figure 3.3: Architecture of Basic Feedforward Network (inputs X1, . . . , Xn in the input layer, one hidden layer, one output layer; connection weights between the layers). Source: own processing, inspired by Hastie et al. (2013)

Neural networks process the input data in two steps. First, the data are transformed into linear combinations with different weights; then these linear combinations are processed by activation functions. The activation function can be any function, but linear, step, Gaussian, tansigmoid, or logsigmoid functions are typically used (Hastie et al., 2013). The following figure illustrates these functions.
The linear combinations of data from the input layer are transformed by these functions before they are transmitted to the next layer or to the output neuron.

[Figure 3.4: Step, Tansigmoid, Gaussian and Logsigmoid Functions. Source: own processing]

The attractiveness of the logsigmoid (also called logistic) activation function comes from its behavior, which describes reasonably well most types of responses to developments in underlying variables. For example, if the probability of default of some government on its sovereign debt is very high or very low, then small changes in this probability will have little impact on the decision whether to buy this debt or not. Between these two exceptional situations, however, even relatively minor changes in the riskiness of the debt can significantly influence the overall buy-sell opinion of the market.

The feedforward network with one hidden layer is described by these equations Hastie et al. (2013):

n_{k,t} = ω_{k,0} + Σ_{i=1}^{I} ω_{k,i} x_{i,t}    (3.17)

N_{k,t} = L(n_{k,t}) = 1 / (1 + e^{-n_{k,t}})    (3.18)

y_t = γ_0 + Σ_{k=1}^{K} γ_k N_{k,t}    (3.19)

The term L(n_k) represents the logsigmoid activation function; the index i runs over the input variables {x} and the index k over the neurons. The variable n_{k,t} is formed as a linear combination of the inputs with weights ω_{k,i} and the constant term ω_{k,0}. This variable is then transformed by the logsigmoid activation function into the neuron N_{k,t} at observation t. The set of K neurons at observation t is then combined linearly, with coefficient vector {γ_k} and constant term γ_0, to produce the forecast ŷ_t at observation t. This example of a feedforward neural network with the logsigmoid activation function is known as the multi-layer perceptron Haykin (2009). It is the fundamental neural network architecture and is often used as a benchmark for alternative architectures Khashman (2010).
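The forward pass of Eqs. (3.17)-(3.19) can be written in a few lines. This is a minimal sketch with arbitrary illustrative weights, not a trained network:

```python
import numpy as np

# Minimal forward pass of the one-hidden-layer network in Eqs. (3.17)-(3.19).
# All weight values below are arbitrary illustrations.

def logsigmoid(n):
    # Eq. (3.18): L(n) = 1 / (1 + exp(-n))
    return 1.0 / (1.0 + np.exp(-n))

def mlp_forward(x, W, w0, gamma, gamma0):
    n = w0 + W @ x          # Eq. (3.17): n_k = w_k0 + sum_i w_ki * x_i
    N = logsigmoid(n)       # Eq. (3.18): N_k = L(n_k)
    return gamma0 + gamma @ N   # Eq. (3.19): y = gamma_0 + sum_k gamma_k * N_k

x = np.array([0.5, -1.0])
W = np.array([[1.0, 0.0],    # two hidden neurons, identity-like weights
              [0.0, 1.0]])
w0 = np.zeros(2)
gamma = np.array([1.0, 1.0])
y = mlp_forward(x, W, w0, gamma, 0.0)   # approx. 0.8914
```

Training would adjust W, w0, gamma and gamma0 to minimize forecast error; the sketch shows only how a fixed network maps inputs to the output.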
The aforementioned activation function can be replaced by the tansigmoid or the cumulative Gaussian function. The only change in comparison with the previous equations describing the network with logistic activation function would be Hastie et al. (2013)

N_{k,t} = T(n_{k,t}) = (e^{n_{k,t}} - e^{-n_{k,t}}) / (e^{n_{k,t}} + e^{-n_{k,t}})    (3.20)

for the tansigmoid activation function, and

N_{k,t} = Φ(n_{k,t}) = ∫_{-∞}^{n_{k,t}} √(1/(2π)) e^{-z²/2} dz    (3.21)

for the cumulative Gaussian function. The following figure shows examples of decision boundaries created by neural networks with different activation functions.

[Figure 3.5: Separation using Neural Network with Different Activation Functions. Source: own processing]

3.3.2 Neural Networks with Radial Basis Functions

In radial basis function (RBF) networks, the Gaussian density function serves as the activation function. Many specialized publications on neural networks call the Gaussian density function, when it serves as an activation function, a radial basis function. Nevertheless, the architecture of the RBF network is substantially different from the feedforward neural network presented in the previous sub-chapter, and its central idea is different as well. In plain terms, an RBF network performs a classification by calculating the input's resemblance to examples from the training data.

The RBF network has the three usual layers: input, hidden and output Angelini et al. (2008). The input layer contains one neuron for each input variable. The processing before the input layer usually standardizes the range of the values, and the information is then fed to each neuron in the subsequent hidden layer. The hidden layer contains a variable number of neurons, each consisting of a radial basis function centered on a point that has as many dimensions as there are input variables Yu et al. (2008). It is important to mention that the radius (also called spread) can be different for each dimension.
When processing data, the neurons in the hidden layer calculate the Euclidean distances of the processed observation from the neurons' center points and subsequently transform these distances by the RBF kernel functions (here the Gaussian functions), using the radius values Hastie et al. (2013). The computed values are passed to the output layer, where they are multiplied by the corresponding weights and summed together. For classification problems, this value is the probability that the evaluated case falls into a particular category. The RBF network is described by the following system of equations Hastie et al. (2013):

min_{ω,µ,γ} Σ_{t=0}^{T} (y_t - ŷ_t)²    (3.22)

n_t = ω_0 + Σ_{i=1}^{I} ω_i x_{i,t}    (3.23)

R_{k,t} = φ(n_t, µ_k) = √(1/(2π σ_k)) exp(-(n_t - µ_k)² / σ_k²)    (3.24)

ŷ_t = γ_0 + Σ_{k=1}^{K} γ_k R_{k,t}    (3.25)

where x again represents the input variables and n the linear transformation of the input variables with weights ω. There are K different centers µ_k for the transformation by the radial basis function; we calculate the K spreads σ_k and obtain K different functions R_k. In the final step, the outputs of these functions are combined linearly with the weights γ to calculate the forecast of y.

The estimation of the center points can in general be done by any clustering algorithm. We have decided for K-means clustering because of its ease of application and robust results; for more information see Hastie et al. (2013). The number of nodes in the input layer determines the dimensionality of the clusters' centers. The clusters' centers are established using the K-means algorithm and become the centers of the RBF units. Once the RBF units' centers are established, the spread of each RBF unit can be estimated, for example with the K-nearest neighbors algorithm Hastie et al. (2013). For a given K, the K nearest centers of each center are found.
Given that the current cluster's center is c_j, the spread is calculated as

r_j = √( (1/K) Σ_{i=1}^{K} (c_j - c_i)² )    (3.26)

3.3.3 Jump Connections

An alternative to the ordinary feedforward networks are feedforward networks with jump connections, where the neurons in the input layer have a direct connection to the output layer McNelis (2005). Figure 3.6 shows an example of this architecture.

[Figure 3.6: Feedforward Network with Jump Connections - input layer (inputs X1, ..., Xn), hidden layer, output layer, jump connections. Source: own processing, inspired by Hastie et al. (2013)]

The mathematical description of this architecture is very similar to that of the ordinary feedforward network:

n_{k,t} = ω_{k,0} + Σ_{i=1}^{I} ω_{k,i} x_{i,t}    (3.27)

N_{k,t} = L(n_{k,t}) = 1 / (1 + e^{-n_{k,t}})    (3.28)

y_t = γ_0 + Σ_{k=1}^{K} γ_k N_{k,t} + Σ_{i=1}^{I} β_i x_{i,t}    (3.29)

The main disadvantage of this architecture is that the number of parameters in the network increases by the number of inputs, I. On the other hand, a substantial advantage is that it hybridizes the pure linear model (represented by the jump connections) with the feedforward network. The consequence is straightforward: this system allows for a function that may have a nonlinear component as well as a linear component McNelis (2005).

3.3.4 Multilayered Feedforward Networks

If the problem we want to approximate is more complex, we can use a more complex architecture of the feedforward network; we can add one or more layers and jump connections.
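Before moving on, the RBF-unit setup just described - centers from K-means, spreads from the K nearest centers as in Eq. (3.26) - can be sketched in Python. The bare-bones K-means routine and the synthetic two-cluster data are illustrations only:

```python
import numpy as np

# Sketch of the RBF-unit setup: centers from a bare-bones K-means,
# spreads from the K nearest centers (Eq. 3.26). Synthetic data.

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)),   # cluster near (0, 0)
               rng.normal(3, 0.1, (20, 2))])  # cluster near (3, 3)

def kmeans(X, k, iters=20):
    # Initialize with points spread across the dataset, then alternate
    # between assigning points to centers and recomputing the centers.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def spreads(centers, K=1):
    # r_j = sqrt( (1/K) * sum over K nearest centers of ||c_j - c_i||^2 )
    d2 = ((centers[:, None] - centers[None]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude the center itself
    nearest = np.sort(d2, axis=1)[:, :K]
    return np.sqrt(nearest.mean(axis=1))

centers = kmeans(X, 2)
r = spreads(centers)   # each spread is roughly the inter-center distance
```

With two centers the spread of each unit is simply its distance to the other center; with more units and larger K the spreads adapt to the local density of the centers.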
For the sake of mathematical lucidity, this architecture can be described as follows (assuming two hidden layers and logsigmoid activation functions):

n_{k,t} = ω_{k,0} + Σ_{i=1}^{I} ω_{k,i} x_{i,t}    (3.30)

N_{k,t} = 1 / (1 + e^{-n_{k,t}})    (3.31)

p_{l,t} = ρ_{l,0} + Σ_{k=1}^{K} ρ_{l,k} N_{k,t}    (3.32)

P_{l,t} = 1 / (1 + e^{-p_{l,t}})    (3.33)

y_t = γ_0 + Σ_{l=1}^{L} γ_l P_{l,t} + Σ_{i=1}^{I} β_i x_{i,t}    (3.34)

[Figure 3.7: Architecture of Multilayered Feedforward Network - input layer (inputs X1, ..., Xn), multiple hidden layers, output layer. Source: own processing, inspired by Hastie et al. (2013)]

This architecture with multiple hidden layers (as Figure 3.7 shows) allows for higher complexity, which can lead to improvements in predictive power. However, there are negative impacts - we need to estimate many more parameters, which consumes valuable degrees of freedom and computation time Witten & Frank (2005). With more parameters to estimate, there is also an increased probability that the estimates converge to a local, rather than the global, optimum Hastie et al. (2013).

3.4 Support Vector Machines

The Support Vector Machine (SVM) is a relatively new learning method used for regression and binary classification. The crucial idea of the SVM is to find the optimal hyperplane which separates the multi-dimensional data into two classes. However, the input data are usually not linearly separable; the SVM therefore introduces the novel approach of a "kernel induced feature space" Cortes & Vapnik (1995), which transfers the data into a higher-dimensional space where the data are linearly separable. The SVM was first introduced by Vladimir Vapnik and colleagues in a research paper in 1995. In the following sections, the basic idea of the SVM will be introduced, and we then investigate this topic in greater detail than we did the Neural Networks. There are many research papers applying the SVM, but only few resources cover its mathematics thoroughly; hence we focus on this aspect as well.
The crucial idea of classification by SVM is finding a hyperplane that maximizes the margin between the two classes Cortes & Vapnik (1995). In two-dimensional space a line is used, in three-dimensional space a plane, and in spaces with more dimensions a hyperplane Elizondo (2006). The following figure shows the separation of two classes in two-dimensional space by a line.

[Figure 3.8: Separable and Non-separable Cases - decision boundary xβ + β_0 = 0 with margin of width 2M in both panels; slack variables ξ*_1, ..., ξ*_4 mark margin violations in the non-separable panel. Source: own processing, based on Cortes & Vapnik (1995)]

The first panel shows the separable case, where the solid line is the decision boundary and the broken lines delimit the margin of width 2M = 2/‖β‖. The other panel depicts the overlapping, linearly non-separable case. The points labeled ξ*_j are on the wrong side of the margin. The size of this overstepping is ξ*_j = M ξ_j; for points on the correct side of the margin this quantity is equal to zero Bellotti & Crook (2009). The margin is maximized under the condition that Σ ξ_j ≤ constant; the sum Σ ξ*_j is the total distance of all violating points from the margin.

To start with, assume an input dataset with N pairs {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i ∈ R^p and y_i ∈ {-1, 1}. A hyperplane is defined by

{x : f(x) = x^T β + β_0 = 0}    (3.35)

where β represents a unit vector, ‖β‖ = 1. The classification rule for the two classes is determined by

G(x) = sgn[x^T β + β_0]    (3.36)

The function f(x) produces a positive or negative distance from the hyperplane to the point x_i. If the classes are separable, it can be shown that there exists a function f(x) = x^T β + β_0 with y_i f(x_i) > 0 for all i. Hence, it is possible to find the hyperplane that produces the biggest margin between the two classes. In the separable case, the optimization problem is as follows Hastie et al. (2013):

minimize_{β,β_0} ‖β‖    (3.37)

subject to y_i(x_i^T β + β_0) ≥ 1, i = 1, ..., N    (3.38)

The criterion is quadratic with linear inequality constraints, hence this problem is convex and easily solvable. Nevertheless, we can face data where the classes are non-separable and thus overlap each other. We can deal with this feature by still maximizing the margin, but allowing some points to be on the wrong side of it. For this purpose we need to define the slack variables ξ = (ξ_1, ξ_2, ..., ξ_N). The optimization problem remains the same, but the constraints are different Hastie et al. (2013):

minimize_{β,β_0} ‖β‖    (3.39)

subject to y_i(x_i^T β + β_0) ≥ 1 - ξ_i, i = 1, ..., N    (3.40)

ξ_i ≥ 0    (3.41)

Σ ξ_i ≤ constant    (3.42)

The logic of this adjustment is straightforward. The value ξ_i represents the amount by which the prediction f(x_i) = x_i^T β + β_0 falls on the wrong side of the margin. Misclassification occurs when ξ_i > 1; hence bounding the sum of these slack variables at some value V limits the total number of training misclassifications to at most V. One attractive property of the SVM is now obvious: by the nature of the aforementioned optimization constraints, the points far away from their decision boundary do not play a substantial role in shaping that boundary Cristianini & Shawe-Taylor (2000). Hence the SVM is less prone to be affected by outliers.

3.4.1 Computation of the Support Vector Classifier

The following subsection is merely of a mathematical nature; nevertheless, for the sake of lucidity in the subsequent section, it is important to describe at least briefly the way the support vector classifier is calculated. For a deeper understanding of this issue we advise the reader to consult a book treating it at a much higher level, e.g. Hastie et al. (2013). The optimization problem in the non-separable case from the previous section is a convex optimization problem, and thus a quadratic programming solution using Lagrange multipliers can be stated.
For computation purposes, it is more convenient to express this optimization problem in the form Schölkopf & Smola (2002)

minimize_{β,β_0} (1/2)‖β‖² + C Σ ξ_i    (3.43)

s.t. ξ_i ≥ 0; y_i(x_i^T β + β_0) ≥ 1 - ξ_i    (3.44)

where the parameter C replaces the "constant" from inequality (3.42). In the case of linear separability, the cost parameter C is equal to infinity. The primal Lagrange function is then as follows Schölkopf & Smola (2002)

L_P = (1/2)‖β‖² + C Σ_{i=1}^{N} ξ_i - Σ_{i=1}^{N} α_i [y_i(x_i^T β + β_0) - (1 - ξ_i)] - Σ_{i=1}^{N} µ_i ξ_i    (3.45)

This primal Lagrange function is minimized with respect to the parameters β, β_0 and ξ_i. If we differentiate this function and set the relevant derivatives to zero, we get (for the sake of brevity we exclude the calculations of the derivatives)

β = Σ_{i=1}^{N} α_i y_i x_i    (3.46)

0 = Σ_{i=1}^{N} α_i y_i    (3.47)

α_i = C - µ_i, ∀i    (3.48)

α_i, µ_i, ξ_i ≥ 0 ∀i    (3.49)

By substituting, we get the dual Lagrangian objective function

L_D = Σ_{i=1}^{N} α_i - (1/2) Σ_{i=1}^{N} Σ_{i'=1}^{N} α_i α_{i'} y_i y_{i'} x_i^T x_{i'}    (3.50)

This dual Lagrangian function generates a lower bound on the primal Lagrange function (3.45) for all feasible points Schölkopf & Smola (2002). We maximize L_D subject to 0 ≤ α_i ≤ C and Σ_{i=1}^{N} α_i y_i = 0. We also employ the Kuhn-Tucker conditions and get the following constraints

α_i [y_i(x_i^T β + β_0) - (1 - ξ_i)] = 0    (3.51)

µ_i ξ_i = 0    (3.52)

y_i(x_i^T β + β_0) - (1 - ξ_i) ≥ 0    (3.53)

for i = 1, ..., N. These Kuhn-Tucker constraints together with the previous ones ((3.46)-(3.49)) uniquely define the solution to the primal and dual problem. From (3.46) we get the solution for β Schölkopf & Smola (2002):

β̂ = Σ_{i=1}^{N} α̂_i y_i x_i    (3.54)

where the coefficients α̂_i are nonzero only for the observations i for which the constraints (3.53) are exactly met (because of (3.51)). These observations are the so-called support vectors, because β̂ is defined by them alone. Some of these support points lie exactly on the edge of the margin (i.e. ξ̂_i = 0), and thus from (3.52) and (3.48) we get for them 0 < α̂_i < C. For support vectors with ξ̂_i > 0 (points beyond the margin), α̂_i = C. From (3.51) it follows that any margin point (i.e. ξ̂_i = 0, 0 < α̂_i) can be used to solve for β_0. Given the solutions β̂_0 and β̂, the desired decision function can finally be written as Schölkopf & Smola (2002)

Ĝ(x) = sgn f̂(x) = sgn(x^T β̂ + β̂_0)    (3.55)

3.4.2 Support Vector Machines and Kernels

Up to this part of the thesis we have described the application of the support vector classifier in cases where the input feature space is linearly separable. We now enlarge the feature space using basis expansions; linear boundaries in the extended space then correspond to nonlinear boundaries in the original feature space. The following plots show a basic example with two circles, each representing one class of the data Hastie et al. (2013). These two classes are not linearly separable in the two-dimensional space. However, if we employ a suitable function for mapping these data into a higher-dimensional space, the two classes can be separated by a linear hyperplane. This decision boundary is then transformed back into the original space. The second and third graphs show the mapping into three-dimensional space; the last graph shows the decision boundary calculated using the SVM approach.

[Figure 3.9: (Non)separability Demonstration. Source: own processing, partially inspired by Hastie et al. (2013)]

If we have selected the basis functions h_z(x), z = 1, ..., Z, then the classification procedure is in essence the same as without the kernel transformation. We estimate the support vector classifier using the transformed input data h(x_i) = (h_1(x_i), h_2(x_i), ..., h_Z(x_i)), i = 1, ..., N, and then generate the nonlinear function

f̂(x) = h(x)^T β̂ + β̂_0    (3.56)

The classifier is the same as before: Ĝ(x) = sgn f̂(x) Hastie et al. (2013). The number of dimensions of the extended space is allowed to get high, even infinite in selected cases.
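Once the multipliers α̂_i and the intercept β̂_0 are known, evaluating the classifier of Eqs. (3.54)-(3.55) requires only inner products with the support vectors. A minimal sketch, with hypothetical (not fitted) support vectors and multipliers:

```python
import numpy as np

# Evaluate G(x) = sgn( sum_i alpha_i * y_i * <x, x_i> + beta_0 ), the
# support-vector decision rule of Eqs. (3.54)-(3.55). The support vectors
# and multipliers below are invented for illustration, not fitted values.

def svm_decision(x, support_x, support_y, alpha, beta0):
    f = sum(a * y * (x @ xi)
            for a, y, xi in zip(alpha, support_y, support_x)) + beta0
    return np.sign(f)

support_x = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
support_y = [1.0, -1.0]
alpha = [0.5, 0.5]          # hypothetical multipliers, both on the margin
label = svm_decision(np.array([2.0, 0.0]), support_x, support_y, alpha, 0.0)
```

Because the rule touches the data only through inner products, replacing `x @ xi` by a kernel evaluation is all that is needed to work in the extended feature space discussed next.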
At first glance, it might appear in these cases that the computations would be prohibitively expensive, and that with an appropriate basis function the input data could always be made separable, so that overfitting might occur. Fortunately, the SVM approach deals with both of these serious issues successfully Souza (2010). The optimization problem (3.45) can be re-specified in a way where the input data are involved only in the form of inner products Souza (2010). This is done directly for the transformed input vectors h(x_i). It can be shown that, for appropriate choices of the function h, the inner products can be computed without excessive effort Schölkopf & Smola (2002). The Lagrange dual function (3.50) can be restated as follows

L_D = Σ_{i=1}^{N} α_i - (1/2) Σ_{i=1}^{N} Σ_{i'=1}^{N} α_i α_{i'} y_i y_{i'} ⟨h(x_i), h(x_{i'})⟩    (3.57)

The solution function f(x) can be rewritten as

f(x) = h(x)^T β + β_0 = Σ_{i=1}^{N} α_i y_i ⟨h(x), h(x_i)⟩ + β_0    (3.58)

The parameter β_0 can then be easily determined (given the α_i) by solving y_i f(x_i) = 1 for any x_i for which 0 < α_i < C. We can see one crucial thing in the two previous equations ((3.57) and (3.58)): both involve the function h(x) only through its inner products. In fact, there is no need to specify this function at all; we are only required to know the corresponding kernel function Cortes & Vapnik (1995)

K(x, x') = ⟨h(x), h(x')⟩    (3.59)

that is used for the calculation of inner products in the transformed higher-dimensional space Souza (2010). Based on Hastie et al. (2013), the most popular choices of K in SVM applications are

Linear: K(x, x') = ⟨x, x'⟩    (3.60)

Polynomial of d-th degree: K(x, x') = (1 + ⟨x, x'⟩)^d    (3.61)

Radial basis: K(x, x') = exp(-γ ‖x - x'‖²)    (3.62)

Sigmoid: K(x, x') = tanh(k_1 ⟨x, x'⟩ + k_2)    (3.63)

The role of the tuning parameter C is clearer in the much more extended feature space, because perfect separation is possible there.
A larger value of the parameter C suppresses any positive ξ_i and will with high probability lead to overfitting in the original feature space. A lower value of the parameter C induces a lower value of ‖β‖, which makes f(x) smoother; a smoother function f(x) in turn implies a smoother boundary. The following figure shows the same random data as in the previous section (focused on Neural Networks), here with SVMs with different kernel functions applied.

[Figure 3.10: Separation using SVM with Different Kernel Functions. Source: own processing]

There exist no general rules regarding the selection of the optimal kernel, at least to our knowledge. The suitability of a certain kernel function highly depends on the substance of the problem being solved and on the relationships among the input variables. For simple tasks with entirely linear relationships, the linear kernel function is sufficient and the application of more complex kernels should not bring any improvement. On the other hand, if the input data are more complex, with nonlinearities and interdependences, then the more advanced kernels can bring major improvements in classification ability; face recognition, genetics or natural language processing are examples. Credit scoring lies in general between these extremes, although this depends heavily on the quality and availability of the input data.

3.4.3 Support Vector Machines and Regressions

In the following section we show how the SVM can be adjusted for regression, producing a quantitative response. We first recall the linear regression model

f(x) = x^T β + β_0    (3.64)

and then focus on its nonlinear generalizations. In order to estimate β, we consider minimization of the following expression

H(β, β_0) = Σ_{i=1}^{N} V(y_i - f(x_i)) + (λ/2)‖β‖²    (3.65)

where the error function V can be defined in the following way (for more versions of this error function see Hastie et al. (2013)):

V_ε(r) = 0 if |r| < ε; |r| - ε otherwise    (3.66)

The logic of this ε-insensitive error measure is simple - errors of size smaller than ε are ignored. We can see here some analogy with the support vector classification logic, where the points on the right side of the decision boundary and far away from it (i.e. not the support vector points) are ignored during the optimization process. The error function V_ε(r) has one very desirable property - the fitting of this regression model is not very sensitive to outliers. Moreover, this error measure possesses linear tails (beyond ε) and, in addition, it diminishes the contributions of the cases with small residuals. The next figure shows the ε-insensitive error function employed by the SVM.

[Figure 3.11: ε-sensitive Error Function. Source: Hastie et al. (2013)]

Given that β̂, β̂_0 minimize H in (3.65), the solution function has the following form Hastie et al. (2013)

β̂ = Σ_{i=1}^{N} (α̂*_i - α̂_i) x_i    (3.67)

f̂(x) = Σ_{i=1}^{N} (α̂*_i - α̂_i) ⟨x, x_i⟩ + β_0    (3.68)

This problem is solvable by minimizing the following expression

min_{α_i, α*_i}  ε Σ_{i=1}^{N} (α*_i + α_i) - Σ_{i=1}^{N} y_i (α*_i - α_i) + (1/2) Σ_{i,i'=1}^{N} (α*_i - α_i)(α*_{i'} - α_{i'}) ⟨x_i, x_{i'}⟩    (3.69)

subject to the following constraints

0 ≤ α_i, α*_i ≤ 1/λ;  Σ_{i=1}^{N} (α*_i - α_i) = 0;  α_i α*_i = 0    (3.70)

The solution depends on the input values only through their inner products ⟨x_i, x_{i'}⟩; the same is valid in the classification case. Due to this property, we can generalize this approach to spaces of higher dimensions by defining an appropriate inner product. The mathematics of the kernel trick in the case of SVM regression is demanding and beyond the scope of this thesis; we can nevertheless recommend Hastie et al. (2013).

3.5 Random Forests and Trees

In this section another prospective part of machine learning theory, the Random Forests, will be introduced.
The Random Forest algorithm, first developed by Breiman (2001), is a representative of ensemble methods. This means that the model consists of many other models, and the final predictions and other relevant quantities are obtained by combining the outputs of all of the underlying models. First, we provide an overview of classification and regression trees (CART), because they are the constituents of Random Forests; then we describe how CARTs are compiled into Random Forests.

3.5.1 Classification and Regression Trees

The following part is based on the seminal work presented by Leo Breiman and his colleagues in 1984, Breiman et al. (1984). The underlying logic of CART lies in repeated partitioning of the input data in order to estimate the conditional distribution of the dependent variable. Let the response of interest be a vector with observations

y = (y_1, y_2, ..., y_n)^T    (3.71)

and the set of explanatory variables (i.e. features or predictors) a matrix

X = (x_1, x_2, ..., x_p),    (3.72)

where

x_j = (x_{1j}, ..., x_{nj})^T for j ∈ {1, ..., p}    (3.73)

The ultimate goal of the algorithm is to divide y, conditional on the values of the inputs X, in such a way that the resulting subgroups of y are as homogeneous (e.g. in terms of riskiness) as possible Breiman et al. (1984). The CART algorithm considers every unique value of each input variable as a potential candidate for a binary split and then calculates the homogeneity of the resulting subgroups of the dependent variable. A simple example with an explanation follows.

[Figure 3.12: CART Plot and Explanation of Logic Using Simple Data - new clients are split by income (those below 30 000 further by elementary-only education), repeating clients by age (those below 30 years further by sex); each leaf is labeled High Risk or Low Risk. Source: own processing]

Now we can extend the CART logic from this uncomplicated example of binary classification to other types of outcome variables.
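The decision logic plotted in Figure 3.12 can be written as nested rules. The following is only a transcription of one plausible reading of that toy figure, not a fitted model:

```python
# Transcription of the illustrative tree in Figure 3.12: new clients are
# screened on income and education, repeating clients on age and sex.
# Thresholds and labels come from the toy figure, not from real data.

def classify_client(client_type, income=None, age=None,
                    only_elementary=None, sex=None):
    if client_type == "new":
        if income < 30_000:
            # low-income new clients: education decides
            return "high risk" if only_elementary else "low risk"
        return "low risk"
    else:  # repeating client
        if age < 30:
            # young repeating clients: sex decides
            return "high risk" if sex == "male" else "low risk"
        return "low risk"

risk = classify_client("new", income=20_000, only_elementary=True)
```

Each call traverses exactly one root-to-leaf path, which is the defining property of a CART prediction: the splits partition the input space into disjoint regions, each carrying one label.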
The crucial part here is the loss function and its role within the algorithm. For the sake of accuracy and clarity, we need to introduce a few definitions. The data at the current node m can be written as

y^m = (y_1^m, ..., y_{n_m}^m); X^m = (x_1^m, ..., x_p^m)    (3.74)

The explanatory variable used for the current split, x_s^m, has the unique values

C^m = {x_{is}^m}_{i ∈ {1,...,n_m}}    (3.75)

and the value c ∈ C^m is the value of the respective explanatory variable considered for a split. The data in the corresponding daughter nodes created by a split at c are y^{ml} and y^{mr}: y^{ml} contains every element of y^m whose value of x_s^m ≤ c, and y^{mr} contains all elements where x_s^m > c. The reduction in error (i.e. the gain) from a split at node m on variable x_s at value c can finally be defined as Breiman et al. (1984)

Δ(y^m) = L(y^m) - [ (n_{ml}/n_m) L(y^{ml}) + (n_{mr}/n_m) L(y^{mr}) ]    (3.76)

Here the terms n_{ml} and n_{mr} represent the numbers of cases that fall to the left and right side of the split, and L(·) represents the aforementioned loss function. The logic of the loss function is straightforward - it measures the level of impurity, or misclassification, at each node. For categorical outcomes, let us denote the set of unique categories of y^m as

D^m = {y_i^m}_{i ∈ {1,...,n_m}}    (3.77)

In order to evaluate the level of impurity of the node, we calculate the proportion of cases belonging to each class d ∈ D^m and denote it p^m(d). The class occurring most frequently will be denoted ŷ^m. The impurity of the node can then be obtained from the following relationship Breiman et al. (1984):

L_mc(y^m) = (1/n_m) Σ_{i=1}^{n_m} I(y_i^m ≠ ŷ^m) = 1 - p^m(ŷ^m).    (3.78)

The function I(·) is the so-called indicator function, which is equal to one if the argument is true.
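The misclassification impurity of Eq. (3.78) and the split gain of Eq. (3.76) can be computed directly from class labels. The toy node below is illustrative:

```python
# Misclassification impurity (Eq. 3.78) and split gain (Eq. 3.76) for a
# categorical response; the toy node below is invented for illustration.

def impurity(y):
    # L_mc(y) = 1 - p(majority class)
    majority = max(set(y), key=y.count)
    return 1.0 - y.count(majority) / len(y)

def gain(y, y_left, y_right):
    # Delta = L(parent) - weighted impurities of the daughter nodes
    n, nl, nr = len(y), len(y_left), len(y_right)
    return impurity(y) - (nl / n) * impurity(y_left) - (nr / n) * impurity(y_right)

parent = ["bad"] * 4 + ["good"] * 6        # impurity 1 - 6/10 = 0.4
g = gain(parent, ["bad"] * 4 + ["good"],   # left: impurity 0.2, weight 0.5
                 ["good"] * 5)             # right: pure, impurity 0.0
# g = 0.4 - 0.5*0.2 - 0.5*0.0 = 0.3
```

In a full CART implementation this gain would be evaluated for every variable and every candidate split value c, and the split with the highest gain would be chosen, as described below.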
The definitions and relationships stated above describe the following logic - the level of impurity of a given node can be expressed as the proportion of cases that would be incorrectly classified under the majority rule.

For continuous outcomes, a different loss function is needed to measure the level of impurity of the node. The most widespread loss function in this case is the mean squared error (MSE) Ghodselahi & Amirmadhi (2011):

L_mse(y^m) = Σ_{i=1}^{n_m} (y_i^m - ŷ^m)²    (3.79)

The predicted value ŷ^m usually represents the mean of the observations in y^m.

A special case are unordered categorical variables, which need to be handled with different logic. The reason is that if we make a split at a certain category of an unordered discrete variable, then sorting values to the right and left of this split has no meaning, since there is no ordering. Therefore all feasible combinations of the elements of D^m need to be considered.

When an appropriate loss function is selected, the quantity Δ(y^m) is calculated at each node for all variables and for all feasible splits in the variables. Then the combination of variable and corresponding split that produces the highest Δ is selected. This process continues in all resulting new nodes until the stopping criterion is met. The importance of the stopping criterion, or criteria, is obvious - it is necessary to avoid trees that are overly complex and overfit the data; such a tree would try to generalize the noise and not the real signal in the data. Commonly used stopping criteria include the minimum number of observations in a terminal node and the homogeneity of the terminal nodes Hastie et al. (2013).

The most substantial deficiency of CART is the high variance of the fitted values, i.e., CART is prone to overfitting.
These fitted values can be very unstable: CART can produce very different classifications when even minor changes are made to the data used to fit the model (e.g. if a client's age increases slightly, the case may be processed by a different branch of the tree and assessed by different criteria). This problem is inevitable in CART applications, but it can be substantially mitigated by employing Random Forests, which are a sort of extension of the CART logic.

3.5.2 Random Forests

Leo Breiman, Breiman (1996), proposed bootstrap aggregating, which can be used to decrease the variance of the fitted values from CART. This bootstrap aggregating, called bagging, has one pivotal idea - in order to decrease the prediction variance of one model, several models can be fitted and their predictions averaged to obtain the final prediction. Each component model is trained only on a bootstrap sample of the data, in order to decrease the risk of overfitting. Each of these data samples thus excludes some part of the original dataset, which is known as the out-of-bag data. Using each of these samples a CART is built, and from these components the final Random Forest is composed. By combining the predictions for each observation, an ensemble estimate is produced. This ensemble estimate has lower variance than a prediction made by only one CART trained on the original data Buja & Stuetzle (2006).

Breiman (2001) further extended the logic of Random Forests. Instead of selecting the split from all available explanatory variables at each node and in each tree, only a random subset of the explanatory variables is considered. The intended direct consequence is diversifying the splits across the trees.
If there are some highly predictive variables that could overshadow the impact of weaker (but still predictive) variables, then this approach gives these weaker variables the chance to be selected into some underlying tree. This not only decreases the risk of overlooking these variables, it additionally allows a large set of input variables to be analyzed. With individual trees trained on different bootstrap samples and with different subsets of the available variables, the trees produce predictions that are far more diverse, and the ensemble has lower variance. For the sake of clarity: the out-of-bag data, that is, the data not drawn in the bootstrap process used to train a particular tree, is used to evaluate that tree's predictions. For continuous outcomes, the prediction of the Random Forest is given by the simple average of the predictions generated by each underlying tree Breiman (2001):

f̂(X) = (1/T) Σ_{t=1}^{T} f^t(X_{i ∈ B̄^t})    (3.80)

The term T represents the number of all trees in the final forest, f^t(·) the t-th tree, and B̄^t the out-of-bag data for the t-th tree. For discrete outcomes, the final prediction of the forest is the majority prediction from all underlying trees.

The number of trees and the number of variables available at each node are tuning parameters, and their optimal choice depends on the data available and the task at hand. Therefore, these parameters should be chosen to minimize the expected generalization error. This can be done, for example, by application of resampling methods such as the widespread cross-validation. More details can be found in Amaratunga et al. (2008) or Biau et al. (2008).

The following figure shows a comparison of various specifications of Random Forests. The same dataset was used as for the Neural Networks and SVM. It is obvious that the number of trees in the Random Forest has a significant influence on possible overfitting.
Figure 3.13: Comparison of Random Forests
Source: own processing

Chapter 4 Model Development

In the following chapter, selected aspects of the development of the final models will be described. We briefly outline the theory of variable selection, of selecting a particular model specification and of assessing the performance of various models. Then follow the introduction of the hypotheses, the data description and exploratory analysis, and the building of the models; we conclude the chapter with an evaluation of the hypotheses. For a brief introduction to the application of scorecards in the underwriting process and portfolio management in finance, please see the appendix.

4.1 Variable Selection

In practice, the available dataset contains a very high number of potential input variables, albeit usually not all of them are relevant to the classification or regression problem. In the past, credit scoring was not a typical example of a high-dimensional problem because not much information about the client was available; the datasets rarely contained more than 100 input variables. Nevertheless, with the expansion of the internet, various smart technologies and devices and high-speed data connections, the availability of data has increased dramatically. It is not uncommon today for the risk department to have at its disposal hundreds of potential input variables containing, for example, the history of banking transactions, information from pension funds or mobile phone operators, credit bureaus or social media. Even when the client calls the company's call-center, his voice can be analyzed and used for scoring. As we have outlined earlier, decreasing the dimensionality will become a more and more important task in the future. The expected impacts of dimensionality reduction are generally lower computational costs and a performance increase of the final model.
In particular, the latter is crucial, as irrelevant input variables may actually bring additional noise into the model and hence decrease its performance and, in the case of credit scoring, increase the overall credit losses. For these reasons, some sort of variable selection or filtering needs to be done before the model development can commence. There are several established methods (and many new and progressive ones); many of them can be used more widely, not only for the models introduced in this thesis.

Forward selection is a simple sequential iterative procedure for selecting an appropriate set of relevant input variables (Derksen & Keselman, 1992). The algorithm starts with an empty set of input variables and then iterates over all potential input variables, using exactly one at a time and hence creating n candidate models in the first round of iterations. All these models are assessed, and the input variable with the best performance is chosen and included in the preliminary model. In the next step, the remaining n − 1 variables are used to create n − 1 models, each incorporating two input variables: the one from the first round of iterations and one of the remaining n − 1 variables. Subsequently, each model's performance is assessed and the best performing variable is added to the preliminary model. There is no generally accepted optimal stopping criterion for this iterative process; the criterion can be reaching a certain number of input variables or a minimum performance improvement for the last included input variable.

Backward selection is basically a modification of the forward selection process. The difference is that all potential input variables are included in the initial model; then, in each round of iterations, the performance of all variables is evaluated and the worst performing variable is dropped.
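The forward selection loop can be sketched as follows (a toy illustration; the `score_model` callback and the additive toy scores are our own assumptions, not part of the thesis):

```python
def forward_selection(candidates, score_model, max_vars=3, min_gain=0.0):
    """Greedy forward selection: in each round, add the candidate variable that
    most improves the model score; stop at max_vars or when the gain is too small."""
    selected, best = [], float("-inf")
    while candidates and len(selected) < max_vars:
        scores = {v: score_model(selected + [v]) for v in candidates}
        winner, s = max(scores.items(), key=lambda kv: kv[1])
        if selected and s - best <= min_gain:
            break  # stopping criterion: no sufficient performance improvement
        selected.append(winner)
        candidates = [c for c in candidates if c != winner]
        best = s
    return selected

# Toy score: pretend each variable contributes a fixed, additive amount.
value = {"income": 0.30, "age": 0.10, "term": 0.05, "noise": -0.02}
score = lambda vars_: sum(value[v] for v in vars_)
print(forward_selection(list(value), score, max_vars=2))  # → ['income', 'age']
```

Backward selection is the mirror image: start from the full set and repeatedly drop the variable whose removal hurts the score least.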
One of the state-of-the-art variable selection methods is the Minimum Redundancy Maximum Relevance (MRMR) algorithm (Peng et al., 2005). This approach is an extension of backward and forward variable selection and its logic is based on two ideas. The first is that we should not employ variables which are significantly correlated in the model. In other words, the redundancy among variables needs to be taken into account, so we should keep variables with high levels of dissimilarity among them. Let U represent a set of one-dimensional discrete variables {X1, X2, ..., Xn} and let C be a known class which takes its values in {c1, c2, ..., cm}. The set S ⊆ U represents any subset of the set U. One way of globally measuring the redundancy among the variables in the subset S is (Ding & Peng, 2005):

W_I(S) = (1/|S|²) Σ_{Xi,Xj∈S} MI(Xi, Xj),    (4.1)

where MI(Xi, Xj) represents the mutual information between the two input variables; more details about this measure can be found in Peng et al. (2005). The second idea underlying this concept of variable selection is that minimized redundancy should be accompanied by a maximum relevance criterion of the input variables with respect to the dependent variable. An acceptable measure of the overall relevance of the variables in S with respect to the dependent variable is

V_I(S) = (1/|S|) Σ_{Xi∈S} MI(C, Xi)    (4.2)

The combination of relevance and redundancy used to obtain a suitable subset of variables is then

S* = argmax_{S⊆U} [V_I(S) − W_I(S)]    (4.3)

The description of MRMR above is designed for discrete variables; the MRMR variable selection for continuous variables employs the same logic (Peng et al., 2005). Traditionally, the relevance of a variable is the most important selection criterion, and the majority of variable selection algorithms therefore concentrate
nearly exclusively on the relevance of the given variable. Nevertheless, in order to obtain a variable subset with superior generalization potential, the selected variables need to be non-redundant; in other words, each variable needs to bring new information into the problem. The backward and forward variable selection approaches focus on the relevance condition. The MRMR selection algorithm focuses on relevance as well and, moreover, introduces a limitation on the redundancy of the variables. By its nature, this also decreases the risk of multicollinearity in traditional econometric models.

4.2 Model Selection

In practice, there are usually many models to choose from and it is not clear in advance which one will provide the highest performance for a given task. For example, in the case of support vector machines, a suitable kernel function should be selected and its optimal parameters found. In general, there are several methods based on cross-validation that differ in the required computing resources and precision. We will introduce the k-fold cross-validation process, as this method is not extremely computationally demanding but still provides good results (Rodríguez et al., 2010). In this method, the dataset S is randomly divided into k subsets S1, S2, ..., Sk. One of them is left out for testing purposes and the remaining k − 1 are used for training the model. Repeating this for each fold generates k sets of performance statistics; these statistics are then compared and the best model is chosen.

4.3 Performance assessment

In this section, we introduce a suitable definition of a bad client and, subsequently, performance indices based on distribution and density functions. Probably the most important step in building a predictive model is the appropriate definition of the dependent variable. In credit scoring it is therefore utterly important to precisely define the good and the bad client.
The common practice is a definition based on days past due (henceforth DPD). Furthermore, it is crucial to determine the time horizon over which this metric is traced. For example, a bad client can be defined as a client who was at least 60 DPD on any of the first 5 installments. The selection of the proper default definition depends predominantly on the type of debt product: certainly, different definitions will be employed for consumer loans with maturities around one year and for mortgages, where maturities are measured in decades. Besides the maturity of the loan, the purpose of the model is crucial. For fraud prevention purposes very short defaults are used, e.g. only a default on the first installment is taken into consideration, while for consumer loan underwriting longer time horizons are generally used, ranging from three to twelve months. Once the definition of the good and bad client is set and the score is estimated (this will be done in the subsequent parts), the evaluation of the predictive model has to be conducted. We will employ indices based on the cumulative distribution function, the main representatives being the Kolmogorov-Smirnov statistic, Lift and the AUC coefficient. Nevertheless, these indices focus on the statistical properties of the model and do not properly address the business needs, i.e. the profit-making potential. This aspect will also be assessed in this thesis, using Monte-Carlo simulation and a cost matrix.
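The DPD-based definition above maps directly to a small labeling function (a sketch; the argument layout and thresholds are illustrative):

```python
def is_bad(dpd_per_installment, dpd_threshold=60, horizon=5):
    """Flag a client as bad if any of the first `horizon` installments reached
    `dpd_threshold` or more days past due. `dpd_per_installment` holds one DPD
    value per installment, in chronological order."""
    return any(dpd >= dpd_threshold for dpd in dpd_per_installment[:horizon])

print(is_bad([0, 0, 75, 0, 0]))        # → True: 75 DPD on the third installment
print(is_bad([10, 20, 30, 0, 0, 90]))  # → False: the 90 DPD falls outside the horizon
# Fraud-style definition: only the first installment matters.
print(is_bad([40, 0, 0], dpd_threshold=30, horizon=1))  # → True
```

Changing `dpd_threshold` and `horizon` is exactly the lever discussed above: a short horizon targets fraud, a longer one targets ordinary credit risk.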
Let us assume that a score s is available for each client/observation and define the following notation:

D_k = 1 if client k is good, 0 otherwise    (4.4)

The empirical cumulative distribution functions of the scores of the bad and good clients are given by (Siddiqi, 2005):

F_{m,bad}(a) = (1/m) Σ_{i=1}^{m} I(s_i ≤ a ∧ D_i = 0),    (4.5)

F_{n,good}(a) = (1/n) Σ_{i=1}^{n} I(s_i ≤ a ∧ D_i = 1), a ∈ [L, H]    (4.6)

The quantity s_i represents the score of the i-th client, n is the number of good clients, m the number of bad clients and I stands for an indicator function with I(true) = 1 and I(false) = 0. L and H represent the minimum and maximum value of the score, respectively. The empirical distribution function of the scores of all clients is given by

F_{n+m,all}(a) = (1/(n+m)) Σ_{i=1}^{n+m} I(s_i ≤ a), a ∈ [L, H]    (4.7)

The Kolmogorov-Smirnov statistic (sometimes written as K-S or KS) is defined as the maximal difference between the cumulative distribution functions of good and bad clients (Mays, 2005). More precisely,

KS = max_{a∈[L,H]} |F_{n,good}(a) − F_{m,bad}(a)|    (4.8)

The logic of the Kolmogorov-Smirnov statistic is illustrated by the following figure. We see the cumulative distribution functions of the Goods and the Bads and the difference between these two curves; the peak of this difference curve defines the value and location of the KS statistic.

Figure 4.1: K-S Statistics
Source: own processing

The ROC (Receiver Operating Characteristic) curve can also be used to show the discriminatory power provided by the scoring function (Warnock & Peck, 2010). This curve can be described as

y = F_{n,good}(a), x = F_{m,bad}(a), a ∈ [L, H]    (4.9, 4.10)

The logic of the ROC is intuitive: each point of the curve represents some share of accepted good and bad clients. The following figure exhibits this relationship. We have created a ROC curve based on only 20 loans ordered by score; the Bads are concentrated at the lower values of the score. Using these scores we can accept, for example,
38% of the Goods and only 8% of the Bads at the same time. This is illustrated by the red point (8%, 38%).

Figure 4.2: ROC, AUC
Source: own processing, inspired by Warnock & Peck (2010)

The following terms are fundamental for understanding the theory related to the ROC (Zweig & Campbell, 1993):

• True positive: the loan is good and the model predicts good
• False positive: the loan is bad and the model predicts good
• True negative: the loan is bad and the model predicts bad
• False negative: the loan is good and the model predicts bad

When evaluating the accuracy of the models, the terms sensitivity and specificity are used. The sensitivity of the scoring model refers to its ability to correctly identify the good loans:

Sensitivity = True Positives / (True Positives + False Negatives)

A model with 100% sensitivity correctly identifies all good loans. A model with 80% sensitivity detects 80% of the good loans (true positives) but leaves 20% of the good loans undetected (false negatives). The specificity of the model refers to its ability to correctly identify the not-good loans (i.e. the bad ones):

Specificity = True Negatives / (True Negatives + False Positives)

A model with high sensitivity but low specificity results in many bad loans being labeled as good; in other words, the model cannot recognize the bad loans. Another term used to describe the utility of models is the likelihood ratio, defined as the ratio of the probability that a good loan is labeled good to the probability that a bad loan is labeled good:

Likelihood ratio = Sensitivity / (1 − Specificity)

Another quality measure, the AUC (Area Under the ROC Curve), is directly connected to the ROC and describes the global quality of a given scoring function (Lobo et al., 2008). The AUC lies between 0 and 1, with a perfect model reaching the value 1; it measures the overall discriminatory power. Another useful indicator of the discriminatory power is the Cumulative Lift (Bhattacharyya, 2000).
The Cumulative Lift is a local rather than global performance measure: it tells us how many times better, at a given level of acceptance (or rejection), the scoring model is than a random model, or, put differently, how many times riskier a certain part of the dataset is than the average. More precisely,

Lift(a) = BadRate(a) / BadRate = F_{m,bad}(a) / F_{n+m,all}(a), a ∈ [L, H]    (4.11)

The Cumulative Lift is demonstrated in the subsequent figure: the worst 10% of loans are approximately five times riskier than the whole portfolio.

Figure 4.3: Cumulative Lift
Source: own processing

4.4 Hypotheses

In order to assess the performance of the aforementioned approaches on real data and compare them with logistic regression, we have defined three hypotheses:

(i) The models based on artificial intelligence outperform the conventional decision-making techniques as an ordinal risk measure.

(ii) The models based on artificial intelligence outperform the conventional decision-making techniques in terms of potential profit.

(iii) The models based on artificial intelligence provide more time-stable performance in risk decision-making than the conventional techniques.

The first hypothesis will be tested by ordering the loans by the score assigned by the different discrimination techniques. Then, the relationship between risk and score will be assessed and compared for each model. Specifically, we will employ and analyze the ROC curve and the AUC coefficient, the difference between the cumulative distribution functions for good and bad clients, the Kolmogorov-Smirnov statistic, the dynamics of score and expected risk, the cumulative lift and others. These metrics allow us to assess the ordering ability of each method. Thus, the absolute value of the score is not vital; we only need better scores for better clients and vice versa.
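On toy data, the empirical-CDF machinery of equations (4.5) to (4.8) and (4.11) reduces to a few lines (a hedged sketch with an illustrative ten-loan sample; a real implementation would run on the full portfolio):

```python
def ks_statistic(scores, labels):
    """Kolmogorov-Smirnov: maximal gap between the empirical score CDFs
    of the good (label 1) and bad (label 0) clients."""
    goods = [s for s, d in zip(scores, labels) if d == 1]
    bads = [s for s, d in zip(scores, labels) if d == 0]
    gaps = []
    for a in sorted(set(scores)):
        f_good = sum(s <= a for s in goods) / len(goods)
        f_bad = sum(s <= a for s in bads) / len(bads)
        gaps.append(abs(f_good - f_bad))
    return max(gaps)

def cumulative_lift(scores, labels, share):
    """Bad rate among the worst `share` of loans divided by the overall bad rate."""
    worst_first = sorted(zip(scores, labels))            # ascending score
    k = max(1, int(share * len(worst_first)))
    head_bad_rate = sum(1 - d for _, d in worst_first[:k]) / k
    overall_bad_rate = sum(1 - d for d in labels) / len(labels)
    return head_bad_rate / overall_bad_rate

# Ten illustrative loans: the two bads carry the two lowest scores.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
labels = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
print(ks_statistic(scores, labels))          # → 1.0 (perfect separation)
print(cumulative_lift(scores, labels, 0.2))  # → 5.0 (worst 20% is 5x riskier)
```

The AUC can be obtained from the same CDFs by integrating the ROC curve, e.g. with a trapezoidal rule over the points (F_bad(a), F_good(a)).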
The second hypothesis will be tested on the basis of the maximal achievable potential profit. In order to do this with a high level of diligence, we will employ Monte-Carlo simulation and a cost matrix. The Monte-Carlo simulation provides a certain level of assurance that the superior performance of some approaches is not coincidental. The other fundamental part is the cost matrix. The majority of current methods for assessing various scoring approaches do not take one major aspect into consideration: the real cost of rejecting a good client and of approving a bad client is radically different. The previously mentioned tools, such as the ROC curve, Cumulative Lift or Kolmogorov-Smirnov statistic, do not reflect this. Hence, in order to assess the various approaches with regard to their business potential, we have to employ this information as well.

The third hypothesis will be evaluated indirectly, by using an out-of-time dataset for evaluation. Time stability is crucial for the successful business implementation of a scoring model. Many models perform well on the training sample, but their performance on an out-of-time dataset decreases rapidly; this is usually caused by overfitting and an improper set-up of the models on the training sample. We will take the most recent 20% of the data as the out-of-time dataset, and the remaining part will be split randomly between the training and validation datasets. This approach ensures a more reliable appraisal of the scoring precision and time stability.

Figure 4.4: Out-of-time Sample (time axis split into Training, Validation and Out-of-time Samples)
Source: own processing

4.5 Data Description and Exploratory Analysis

Our models will be trained and evaluated using a dataset from a financial institution from the western part of Asia. We have a dataset with many variables at hand. Nevertheless, data mining is not the topic of this thesis, so we have selected eleven input variables with substantial discriminating power.
This was done on an expert basis, but it does not limit us, as we need to compare several models under the same realistic conditions. The default (i.e. being a bad client) was defined as being at least 30 days overdue on the first installment. This definition was selected because the financial institution had certain issues with external fraudsters who did not pay even the first installment. These fraudulent clients have a different typical profile than the usual risky clients (who tend to default on later installments, not on the first one), so the usual relationships might not hold. This is therefore a nice opportunity to challenge logistic regression with the artificial intelligence approach. The summary for the out-of-time data sample can be found below; the characteristics of the validation and training samples are naturally very similar. As the initial exploratory analysis is crucial for further success, we provide the statistical distributions of good and bad clients for the non-categorical variables and Risk versus Category plots for the categorical variables. It can be immediately seen from these plots that the variables have a certain discrimination potential. The binary variable Land Line Refused is equal to one if the client refused to give his land line number, the variable Historical Applications represents the number of all historical loan applications in the credit bureau and the variable Term is the length of the loan in months. The other variables are intuitive. The dataset contains only consumer loans for buying specific goods (e.g. a mobile phone, notebook or e-bike). The creditworthiness of these clients is rather average or sub-prime. There were no problems with data quality, e.g. no missing values or obviously erroneous values.
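The sampling scheme of Figure 4.4 (the most recent 20% held out as the out-of-time sample, the rest split at random) can be sketched as follows; the 30% validation share and the seed are our illustrative choices, not figures from the thesis:

```python
import random

def time_split(loans, oot_share=0.2, valid_share=0.3, seed=1):
    """`loans` must be ordered by application date. The most recent `oot_share`
    becomes the out-of-time sample; the remainder is split at random into
    training and validation samples."""
    cut = int(len(loans) * (1 - oot_share))
    history, oot = list(loans[:cut]), list(loans[cut:])
    rng = random.Random(seed)
    rng.shuffle(history)
    v = int(len(history) * valid_share)
    return history[v:], history[:v], oot  # training, validation, out-of-time

train, valid, oot = time_split(list(range(100)))
print(len(train), len(valid), len(oot))  # → 56 24 20
print(oot == list(range(80, 100)))       # → True: out-of-time holds the newest loans
```

The key point is that the out-of-time cut is made before any shuffling, so no future information leaks into the training or validation samples.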
Table 4.1: Observations

Total: 16,676; Goods: 16,090; Bads: 586; Bad Rate: 3.51%

Table 4.2: Variables

Total: 11; Categorical: 6; Non-Categorical: 5

Table 4.3: Categorical Variables

Variable             Categories
Client Type          2
Education            4
Income Type          7
Land Line Refused    2
Number of Children   3
Sex                  2

Table 4.4: Categorical Variables: Detailed Overview

Variable             Category         Share     Risk
Client Type          New              56.77%    5.11%
                     Repeating        43.23%    1.41%
Income Type          Blue Collar      34.49%    5.88%
                     Businessman      1.24%     3.38%
                     Government       24.57%    3.30%
                     Academic         0.67%     2.70%
                     Maternity Leave  28.69%    1.92%
                     White Collar     10.27%    0.64%
                     Other            0.08%     0.00%
Sex                  Male             44.72%    4.96%
                     Female           55.28%    2.34%
Education            University       0.33%     5.45%
                     Elementary       62.08%    4.19%
                     Other            1.03%     3.49%
                     High school      36.56%    2.35%
Number of Children   Zero             56.19%    4.32%
                     One              26.88%    2.63%
                     More than two    16.93%    2.23%
Land Line Refused    Yes              38.20%    6.09%
                     No               61.80%    1.92%

Figure 4.5: Non-Categorical Variables: Detailed Overview (Mean, St. Dev., Min, Max and percentiles of Employment Length, Goods Price, Historical Applications, Income and Term)
Figure 4.6: Employment Length
Figure 4.7: Goods Price
Figure 4.8: Bad & Good: Visualisation
Figure 4.9: Categorical Variables: Visualisation

4.6 Model Building

We have decided to employ five different models:

• Multi-Layer Perceptron (MLP)
• Radial Basis Function Network (RBFN)
• Random Forest (RF)
• Support Vector Machine (SVM)
• Logistic Regression (Logit)

The neural networks are represented by two models because the logic of the RBFN is quite different from that of the MLP, and it will be interesting to compare these two models. We employ the decision trees in their advanced
variation, the Random Forests. The core idea of a decision tree is very simple and has many drawbacks; on the other hand, it will be instructive to see how this simple model performs in comparison with much more complicated models from the artificial intelligence field. The main idea of the SVM is very elegant: the transformation of linearly inseparable datasets into a different space and subsequent separation with a linear hyperplane. But this advantage can be its weak point as well, as it is still unclear how to find the optimal mapping into the different space. The last model is the Logit, as the representative of the current industry standard.

The variable selection was done by the MRMR algorithm. We have decided to use the seven variables with the highest added value (high relevance and low redundancy). One of the many advantages of this approach is that we do not have to control much for the correlation among the input variables. One of the assumptions of logistic regression is low or even no multicollinearity; this is very difficult to attain in real applications, so a low correlation among the variables has to be accepted (e.g. < 0.20). We have decided to use the same seven input variables for all models because we need to compare the models among themselves: if each model were built on a different dataset, we would not be able to say whether a difference in performance is caused by the model itself or by the different potential of the input variables. Using the same input variables for all models ensures a more level playing field. The final variables are these:

• Historical Applications
• Client Type
• Employment Length
• Goods Price
• Income
• Land Line Refused
• Sex

We have tried to fit many specifications of the aforementioned models on the training and validation samples and then assess them. This process was very demanding in terms of time and computation, especially in the case
of more complex neural networks, where the learning algorithm needed several hours to finish. It was thus impossible to try all possible combinations of all input parameters for all models. Based on preliminary results, we have decided to use the following specifications. These specifications delivered the best performance (assessed by the AUC on the out-of-time sample), many of them are recommended by various research papers and all of them are relatively simple, without any unexpected or unusual characteristics. The final specifications are as follows:

• MLP - Two hidden layers with the logit function as the activation function; jump connections are allowed.
• RBFN - The K-means algorithm was selected for clustering purposes.
• SVM - Polynomial kernel of second order, the cost parameter equal to 1.
• RF - The forest comprises 80 trees with a maximal depth of 3.

The estimated Logit model is below. The values of the coefficients are in line with our expectations and all variables are statistically and economically significant. The tables with the independent variables (predictors) and the parameter estimates follow.

Table 4.5: Independent Variables

Variable  Description               Categories  p-value
x1        Historical Applications   n.a.        < 0.0001
x2        Client Type               2           < 0.0001
x3        Employment Length         n.a.        < 0.0001
x4        Price                     n.a.        < 0.0001
x5        Income                    n.a.        < 0.0001
x6        Land Line Refused         2           < 0.0001
x7        Sex                       2           < 0.0001

Table 4.6: Parameter Estimation

Parameter  Variable Description      Category    Estimate
b0         Intercept                 n.a.        −3.2518
b1         Historical Applications   n.a.        −1.9412
b21        Client Type               New         −2.3827
b22        Client Type               Repeating   0
b3         Employment Length         n.a.        2.8493
b4         Price                     n.a.        −0.8428
b5         Income                    n.a.        1.9358
b61        Land Line Refused         Yes         −2.4692
b62        Land Line Refused         No          0
b71        Sex                       Male        −1.3051
b72        Sex                       Female      0

The comparison of the final specifications of all models in terms of Sensitivity, Specificity, Likelihood ratio and AUC is crucial for assessing the discrimination power of the models. The sensitivity is high for all models, but the specificity is much more important here. The specificity is highest for the MLP and lowest for the SVM; in other words, the MLP can recognize the Bads much better than any other model. This will be clearly visible in the ROC and Lift charts.

Table 4.7: Comparison of Final Specifications

                  MLP    Logit  RBFN   RF     SVM
Sensitivity       0.993  0.992  0.992  0.997  1.000
Specificity       0.305  0.148  0.174  0.111  0.089
Likelihood ratio  1.430  1.166  1.201  1.122  1.097
AUC               0.841  0.812  0.819  0.796  0.789

All of the final models were assessed on the training sample, the validation sample and the out-of-time sample. We have compared the AUC coefficients for all models and all samples. There was only a slight decrease between the training and validation samples. We observed a decrease of the AUC coefficient between the validation sample and the out-of-time sample; however, this was expected, as the population of clients changes over time.

Figure 4.10: AUC coefficient comparison
Source: own processing

4.7 Evaluation of Ordering Ability

Probably the most important ability of an application scorecard is ordering: a less risky client needs to get a better score. Of course, this relationship does not always hold, otherwise we would have a perfect model; but we need a model that is as close to the perfect model as possible. The following will be used for the assessment of the first hypothesis:

• Dynamics of risk and score
• ROC curve
• Difference between the cumulative distribution functions for good and bad clients
• Cumulative Lift
• Distribution of Goods and Bads

The following chart depicts the relationship between the risk and the score for the five models.
Each line represents one model. The riskiness of the total portfolio (y-axis) is 3.51%. The x-axis represents the clients (or loans) ordered by score, from the best to the worst. This axis is in percentiles because each model assigned each client a different score. For a better understanding of this graph, we have compared the risk of the portfolio at the 70th percentile. We can interpret this in the following way: if we approve only the best 70% of clients, the risk would be 0.97% with the MLP model or 1.32% with the SVM model (i.e. a risk higher by approximately one third). We can see that the MLP performs substantially better than the other models, followed by the RBFN, Logit, RF and SVM.

Table 4.8: Portfolio Risk on the 70th Percentile: Model Comparison

Percentile: 70.00%; Logit: 1.17%; MLP: 0.97%; RF: 1.30%; SVM: 1.32%; RBFN: 1.12%

Figure 4.11: Risk Dynamics
Source: own processing

The ROC curves show the ratio of cumulative Bads to cumulative Goods; specifically, they show how many bad clients are among the best x% of good clients. Again, the MLP shows the best discriminatory ability, followed by the RBFN, Logit and RF, with the SVM being the worst model.

Table 4.9: K-S & AUC Estimation Results

      Logit  MLP    RF     SVM    RBFN
K-S   0.489  0.528  0.473  0.460  0.500
AUC   0.812  0.837  0.796  0.789  0.819

Figure 4.12: ROC
Source: own processing

The following plot shows the difference between the cumulative distribution functions of the Goods and the Bads. The highest point of each curve represents the Kolmogorov-Smirnov statistic. Using this plot, we can see the development of the discriminatory power of the various models. The shapes are rounded and wide, which means that, for example in the case of the MLP, the discriminatory power is high and relatively stable between the 60th and 85th percentiles.
Figure 4.13: Difference between CDF of Goods and Bads
Source: own processing

The Cumulative Lift graph shows the riskiness of the bottom x% of loans in comparison with the riskiness of the whole portfolio. This measure is convenient for the local assessment of the discriminatory power. We see that the performance of the MLP is the best in the bottom 10%, and the SVM performs significantly worse than any other model.

Figure 4.14: Cumulative Lift
Source: own processing

We have calculated the distribution of Goods and Bads for each model. These distributions do not provide a clear answer as to which model is better; nevertheless, it is always beneficial to see these plots, as they help to understand what kind of scores the models produce. The distribution of Goods is as expected: the majority of Goods is highly concentrated at the top scores. The only exception is the RBFN, where the Goods are spread over a wider interval. We did not expect this, and it is interesting, especially considering that the RBFN model usually had the second best performance according to the aforementioned criteria. The distribution of Bads, however, is revealing. Some of the models concentrated the Bads at higher scores (RF, SVM and partially RBFN), while the other models spread the Bads over a longer interval at lower scores (MLP and Logit).

Figure 4.15: Goods & Bads Distribution - part 1
Figure 4.16: Goods & Bads Distribution - part 2

4.8 Evaluation of Profit-Making Potential

The previous section focused mainly on the statistical properties of the models; however, all these models are intended for use in a business environment where the main criterion is the profit-making potential.
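The profit evaluation used in this section, ordering loans by score, assigning each a payout from the cost matrix and bootstrapping sub-samples, can be sketched as follows (the payouts 1.0 and −12.5 come from the thesis's cost matrix; the ten-loan sample, the run count and the seed are illustrative):

```python
import random

PAYOUT = {1: 1.0, 0: -12.5}   # approved good earns 1.0, approved bad costs 12.5

def max_profit(scores, labels):
    """Order loans from best to worst score and find the cut-off with the
    maximal cumulative profit; return that profit and the approval rate."""
    order = sorted(zip(scores, labels), reverse=True)
    best, cum, best_k = 0.0, 0.0, 0
    for k, (_, d) in enumerate(order, start=1):
        cum += PAYOUT[d]
        if cum > best:
            best, best_k = cum, k
    return best, best_k / len(order)

def simulate_profit(scores, labels, approval_rate, runs=1000, frac=0.2, seed=3):
    """Monte-Carlo step: profit of random sub-samples at a fixed approval rate."""
    rng = random.Random(seed)
    loans = list(zip(scores, labels))
    profits = []
    for _ in range(runs):
        sample = sorted(rng.sample(loans, max(1, int(frac * len(loans)))), reverse=True)
        k = max(1, int(approval_rate * len(sample)))
        profits.append(sum(PAYOUT[d] for _, d in sample[:k]))
    return profits

# Ten illustrative loans: the two bads carry the two lowest scores.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
labels = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
profit, rate = max_profit(scores, labels)
print(profit, rate)  # → 8.0 0.8: approve the best 80% and earn 8 good payouts
```

The distribution of the `simulate_profit` values is what the Monte-Carlo summary table of this section reports per model (mean, standard deviation, percentiles).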
We have estimated the following cost matrix: Table 4.10: Cost Matrix Observed Prediction Good Bad Good 1.0 0.0 -12.5 0.0 Bad We have focused on measurable costs so we are excluding the opportunity costs (rejecting Good by error). The ratio between the good one and bad one is 1 : 12.5. The direct financial implication is that we need to provide 12.5 good loans in order to cover for one bad one. We have used this higher ratio because the definition of default is delinquency on first installment and hence fraud. In these cases the probability of curing the client (repaying the overdue installments and then repaying as promised) is practically zero and the chance of success in the legal collection process is very low, too. We have employed the out-of-time sample only and calculated the maximal profit for each model. The following chart explains this logic. We have ordered the loans based on the estimated score (from the best one to the worst one). Payout based on the cost matrix (1 for good loan, −12.5 for bad loan) was assigned for each of the loan. The curve shows the cumulative profit. 4. Model Development 64 Figure 4.17: Profit Dynamics Source: own processing Process of Monte Carlo Simulation: 1. The maximal cumulative profit together with the corresponding cumulative share of loans (approval rate) using the whole out-of-time sample were calculated for each model. This approval rate can be different for each model. 2. Then we have always taken randomly 20% of the out-of-time dataset for each model and calculated the cumulative profit for the predefined approval rate (calculated in 1st point). 3. We have repeated the 2nd step 10 000 times. Table with summary of this simulation and graph showing the distribution of the profit follow. Table 4.11: Monte Carlo Simulation Summary Mean St. Dev. 
Table 4.11: Monte Carlo Simulation Summary

                     Logit     MLP    RBFN      RF     SVM
  Mean                2102    2285    2183    2061    2051
  St. Dev.             161     176     164     160     161
  10th percentile     1892    2060    1973    1854    1846
  90th percentile     2312    2519    2402    2270    2262
  Kurtosis          -26.7%  -31.4%  -31.6%  -29.0%  -24.6%
  Skewness            3.6%   21.7%   19.1%    7.3%   10.8%

Figure 4.18: Monte Carlo Simulation Visualisation
Source: own processing

These results confirm the previous findings. The MLP and RBFN outperform the other models in terms of profit. The only surprise is the SVM, for which we had expected much lower performance and a higher deviation of the profit distribution.

Chapter 5

Conclusion

The aim of this thesis has been to introduce a number of approaches to credit risk based on artificial intelligence, outline the underlying theory and compare them. The theory supporting all the artificial intelligence approaches has been presented. We have placed more emphasis on the mathematics supporting the Support Vector Machines because this area is not as well covered as the mathematics related to the Neural Networks and their various specifications. The Random Forests were covered from the theoretical perspective as well; nevertheless, the logic behind them is not as complicated as in the case of the other, substantially more complex approaches. The current market standard, Logistic Regression, has been described as well. Each of these approaches could easily be described over tens of pages, thus we refer frequently to advanced papers and books. We have focused on the major ideas and properties of all the models; due to their complex nature it is impossible to describe them all in great detail in one thesis.

Moreover, we have outlined further theory needed for the assessment of the aforementioned models. The standard measures for performance assessment have been described; the main representatives have been the ROC curve, the K-S statistic and the AUC statistic. Later, the dataset from an unnamed financial company from Asia has been described.
The risk management policies and even the name of the data provider are not published, as these are a business secret. The data remain private for the same reason. Nonetheless, we were allowed to perform an exploratory analysis of the variables and focused on their ability to determine the riskiness of the loan application. Then, the advanced MRMR algorithm was used for the selection of the final set of variables. These variables have been employed in the model development itself.

We have developed five different models based on the introduced theory and compared them using the previously stated metrics. However, these metrics evaluate the models only from the statistical perspective. Hence, we have also evaluated these models from the business perspective: we have focused on the dynamics of risk and the share of approved loans, the differences between the cumulative distribution functions for good and bad loans, the cumulative lift curve, and the maximal achievable profit for each of these models.

Based on the aforementioned metrics and comparisons, we can conclude that the approaches based on artificial intelligence can improve the risk performance of a financial company, especially in comparison with Logistic Regression. The best performing model according to all criteria has been the Multi-Layer Perceptron (a specification of Neural Networks). In comparison with the current market standard (Logistic Regression), it provides significantly lower risk for the same share of approved loans (Figure 4.11) and it allows for higher profit (as shown using the Monte Carlo simulation and the cost matrix). The second best model has been the Radial Basis Function Network (also a specification of Neural Networks). The performance of this model has been comparable to the Logistic Regression. The other models, Random Forests and Support Vector Machines, have performed worse.
In particular, the performance of the Support Vector Machines has been lower than expected. In our opinion, the reason for this is the complexity of this approach and possibly suboptimal optimization and calculation techniques.

Despite the fact that the ordinary approach, the Logistic Regression, has been outperformed only by some, not all, of the presented artificial intelligence models, we can conclude that the artificial intelligence approach to credit risk can be more suitable than the Logistic Regression. Hence, all of the hypotheses have been confirmed: we have been able to develop a model based on artificial intelligence that has superior ordering ability and profit-making potential in comparison with the Logistic Regression.

Many interesting topics have emerged during the work on this thesis. One of them is the low performance of the Support Vector Machines, which might be improved by further research into the optimization of this model. Another interesting topic is the application of artificial intelligence to variable selection and preprocessing. Many of the variables can be grouped and mixed together, and hence new variables can be created. This is currently done based on the developer's experience, but a sound theory is missing, and multi-class classification based on artificial intelligence might be helpful.
Appendix A

Underwriting Process in Finance

Consumer credit risk can be defined as the risk of suffering a loss due to the customer's default on a corresponding credit product, such as an unsecured personal loan, credit card, mortgage, overdraft etc. The majority of companies involved in lending to ordinary consumers have divisions dedicated to consumer credit risk management, where the main aspects are the measurement, prediction and mitigation of losses attributable to credit risk.

One of the most widespread methods for predicting credit risk is the credit scorecard. The scorecard is usually a statistical model assigning a number to a client; this number reflects the estimated probability that the client will behave in a certain manner. Within the process of score estimation, a variety of data sources can be employed, including information from credit bureaus, the application form, databases with historical behavior, the internet, mobile operators or pension funds.

The most common and widespread type of scorecard is the application scorecard, which is employed by finance companies when a client applies for a credit product. The ultimate goal of this scorecard is to predict the probability that the client, if the product is provided, would turn bad within a certain time horizon, and hence induce losses to the lender.
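A minimal sketch of how such a scorecard can map client attributes to a probability, assuming a linear model passed through the logistic function (the weights and attribute layout here are invented for illustration; real scorecards vary):

```python
import math

def scorecard_probability(attributes, weights, intercept):
    """Toy application scorecard: a linear score of client attributes is
    passed through the logistic function, yielding P(good) in (0, 1)."""
    linear_score = intercept + sum(w * x for w, x in zip(weights, attributes))
    return 1.0 / (1.0 + math.exp(-linear_score))
```

The logistic transform guarantees the output lies strictly between 0 and 1, which is why the score can be read directly as a probability.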
Nevertheless, the crucial word here is "bad", which will be explained in the following sections. The definition of "being bad" can vary across lenders or product types.

As we have mentioned above, the score represents a probability, thus it should range between 0 and 1. Usually the estimated client's score is transformed, used for underwriting purposes, and then possibly communicated to the client, which in some countries is required by law. Nevertheless, for internal purposes the estimated probability is often used without any transformation. Because the transformation of the score brings no additional value, we will work in this thesis only with the probabilities. Moreover, we will predict the probability that the client will not become bad, because we want to preserve the logic that a higher score is better.

There are other important types of scorecards: for example, behavioral scorecards, which attempt to predict the probability of an existing client becoming bad, and collections scorecards, which predict the reaction to different strategies for collecting overdue installments and outstanding principal. Also widespread are propensity scorecards, which aim to predict the probability that the client will accept another product, leave, stop using a credit card or fully utilize his credit card's limit.

The loan underwriting process is concerned with employing the predictions provided by scorecards in the decision whether to accept or reject a loan application. If the scorecard is the main tool for the underwriting purposes and the final yes/no decision, then "cut-off" points are used. A cut-off point is a certain value of the score below which clients have their application declined. If the client's score is above the cut-off, then the application may be approved or some additional process may follow. The setting of this threshold is closely linked to the price (i.e.
usually the interest rate and fees) that the lender charges for the corresponding product. Higher pricing allows for greater credit losses while remaining profitable; thus, with a higher price, the company can accept clients with a higher estimated probability of becoming bad and move the cut-off point down. However, the majority of sophisticated lending companies go further and charge clients based on their score. This compensates the lenders for the higher risk of the less creditworthy clients and also allows charging less to the better clients with higher scores.

Credit strategies in consumer finance also deal with the ongoing controlling of clients' accounts, particularly with products with a revolving feature such as credit cards, flexible loans or overdrafts, where the client's balance (and the exposure of the company) can go down as well as up. Behavioral scorecards can be applied on a regular basis to provide an up-to-date status of the credit quality of the portfolio and of its sub-segments.
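The cut-off rule with score-based pricing described above can be illustrated with a minimal sketch; the thresholds and rates below are invented purely for illustration:

```python
def underwriting_decision(score, cutoff, price_bands):
    """Decline applications below the cut-off; otherwise price by score band.
    price_bands: (minimum score, interest rate) pairs, best band first."""
    if score < cutoff:
        return ("decline", None)
    for min_score, rate in price_bands:
        if score >= min_score:
            return ("approve", rate)
    # passed the cut-off but below every band: fall back to the worst band
    return ("approve", price_bands[-1][1])
```

With bands [(0.9, 0.10), (0.7, 0.15)] and a cut-off of 0.7, a client scoring 0.95 is approved at a 10% rate, a client scoring 0.75 pays 15%, and a client scoring 0.5 is declined, which mirrors the logic of charging less creditworthy clients more.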