Charles University in Prague
Faculty of Social Sciences
Institute of Economic Studies
MASTER THESIS
Artificial Intelligence Approach to
Credit Risk
Author: Bc. Jan Říha
Supervisor: PhDr. Jozef Baruník Ph.D.
Academic Year: 2015/2016
Declaration of Authorship
The author hereby declares that he compiled this thesis independently, using
only the listed resources and literature.
The author grants to Charles University permission to reproduce and to distribute copies of this thesis document in whole or in part.
Prague, January 3, 2016
Signature
Acknowledgments
I would like to thank PhDr. Jozef Baruník Ph.D. for his comments and
valuable feedback, and my family and my partner for their continuous moral support.
Abstract
This thesis focuses on the application of artificial intelligence techniques in credit risk
management and compares these modern tools with the current industry standard,
Logistic Regression. We introduce the theory underlying Neural Networks, Support
Vector Machines, Random Forests and Logistic Regression, and present a methodology
for the statistical and business evaluation and comparison of these models. We find
that models based on the Neural Network approach (specifically the Multi-Layer
Perceptron and the Radial Basis Function Network) outperform Logistic Regression
in both the standard statistical metrics and the business metrics. The performance of
Random Forests and Support Vector Machines is not satisfactory, and these models
do not prove superior to Logistic Regression in our application.
JEL Classification: G23, C15, C45, C53, C58
Keywords: Credit Risk, Scoring, Neural Networks, Support Vector Machines, Random Forests
Author's e-mail: [email protected]
Supervisor's e-mail: [email protected]
Abstrakt
This thesis deals with the application of artificial intelligence in credit risk management.
This modern approach is compared with the current market standard, logistic regression.
We present the theory of neural networks, support vector machines, random forests and
logistic regression, and cover the methodology for evaluating and comparing these models
from a statistical and business perspective. We find that models from the neural network
family, namely the Multi-Layer Perceptron and the Radial Basis Function Network,
outperform logistic regression in the standard statistical and business criteria. The
performance of random forests and support vector machines is not sufficient, and in
our work it did not reach the performance of logistic regression.

JEL Classification: G23, C15, C45, C53, C58
Keywords: Credit Risk, Scoring, Neural Networks, Support Vector Machines, Random Forests, Logistic Regression
Author's e-mail: [email protected]
Supervisor's e-mail: [email protected]
Table of Contents

List of Tables . . . vii
List of Figures . . . viii
Acronyms . . . x
Thesis Proposal . . . xi
1 Introduction . . . 1
2 Motivation and Literature Review . . . 3
   2.1 Literature Review . . . 3
3 Methodology . . . 5
   3.1 Linear and Non-linear Classification . . . 5
   3.2 Logistic Regression . . . 8
   3.3 Neural Networks . . . 11
      3.3.1 Feedforward Networks . . . 12
      3.3.2 Neural Networks with Radial Basis Functions . . . 15
      3.3.3 Jump Connections . . . 17
      3.3.4 Multilayered Feedforward Networks . . . 19
   3.4 Support Vector Machines . . . 20
      3.4.1 Computation of the Support Vector Classifier . . . 23
      3.4.2 Support Vector Machines and Kernels . . . 24
      3.4.3 Support Vector Machines and Regressions . . . 29
   3.5 Random Forests and Trees . . . 31
      3.5.1 Classification and Regression Trees . . . 31
      3.5.2 Random Forests . . . 34
4 Model Development . . . 37
   4.1 Variable Selection . . . 37
   4.2 Model Selection . . . 40
   4.3 Performance Assessment . . . 40
   4.4 Hypotheses . . . 45
   4.5 Data Description and Exploratory Analysis . . . 47
   4.6 Model Building . . . 49
   4.7 Evaluation of Ordering Ability . . . 56
   4.8 Evaluation of Profit-Making Potential . . . 63
5 Conclusion . . . 66
Bibliography . . . 72
A Underwriting Process in Finance . . . I
List of Tables

4.1 Observations . . . 48
4.2 Variables . . . 48
4.3 Categorical Variables . . . 48
4.4 Categorical Variables: Detailed Overview . . . 49
4.5 Independent Variables . . . 54
4.6 Parameter Estimation . . . 55
4.7 Comparison of Final Specifications . . . 55
4.8 Portfolio Risk on the 70th Percentile: Model Comparison . . . 57
4.9 K-S & AUC Estimation Results . . . 58
4.10 Cost Matrix . . . 63
4.11 Monte Carlo Simulation Summary . . . 64
List of Figures

3.1 Linear and Non-linear Separable Cases . . . 6
3.2 Logit and Probit Distribution Function . . . 10
3.3 Architecture of Basic Feedforward Network . . . 12
3.4 Step, Tansigmoid, Gaussian and Logsigmoid Functions . . . 13
3.5 Separation using Neural Network with Different Activation Functions . . . 15
3.6 Feedforward Network with Jump Connections . . . 18
3.7 Architecture of Multilayered Feedforward Network . . . 19
3.8 Separable and Non-separable Cases . . . 21
3.9 (Non)separability Demonstration . . . 25
3.10 Separation using SVM with Different Kernel Functions . . . 28
3.11 ε-sensitive Error Function . . . 30
3.12 CART Plot and Explanation of Logic Using Simple Data . . . 32
3.13 Comparison of Random Forests . . . 36
4.1 K-S Statistics . . . 42
4.2 ROC, AUC . . . 43
4.3 Cumulative Lift . . . 45
4.4 Out-of-time Sample . . . 47
4.5 Non-Categorical Variables: Detailed Overview . . . 50
4.6 Employment Length . . . 50
4.7 Goods Price . . . 50
4.8 Bad & Good: Visualisation . . . 51
4.9 Categorical Variables: Visualisation . . . 52
4.10 AUC Coefficient Comparison . . . 56
4.11 Risk Dynamics . . . 57
4.12 ROC . . . 58
4.13 Difference between CDF of Goods and Bads . . . 59
4.14 Cumulative Lift . . . 60
4.15 Goods & Bads Distribution - Part 1 . . . 61
4.16 Goods & Bads Distribution - Part 2 . . . 62
4.17 Profit Dynamics . . . 64
4.18 Monte Carlo Simulation Visualisation . . . 65
Acronyms

AUC    Area Under ROC Curve
CART   Classification And Regression Trees
CDF    Cumulative Distribution Function
DPD    Days Past Due
FLM    Full Logistic Model
LLR    Land Line Refused
MCS    Monte Carlo Simulation
MLP    Multi-Layer Perceptron
MRMR   Minimum Redundancy Maximum Relevancy
NN     Neural Networks
OR     Odds Ratio
OA     Ordering Ability
RBF    Radial Basis Function
RBFN   Radial Basis Function Network
ROC    Receiver Operating Characteristic
RF     Random Forests
SVM    Support Vector Machines
Master Thesis Proposal

Author: Bc. Jan Říha
Supervisor: PhDr. Jozef Baruník Ph.D.
Proposed topic: Artificial Intelligence Approach to Credit Risk
Motivation  Neural networks, together with other artificial intelligence approaches, have been successfully used in a variety of business fields including accounting, marketing and production management (McNelis, 2005). The majority of
studies have used these tools for forecasting stock returns, bankruptcy, exchange rates or credit card fraud (Gouvêa et al., 2007).
The granting of loans by finance companies is one of the key decision problems that require precise treatment. Models based on artificial
intelligence are believed to have the potential to render effective credit assessments that support the approval process in finance companies. Many researchers are
currently concentrating on employing neural network classification models to
divide loan applications into good and bad ones (Angelini et al., 2008;
Ghodselahi et al., 2011).
Generally, loan officers in Central and Eastern Europe and in Asia rely on
traditional methods such as logistic regression or decision trees to guide them in
assessing the creditworthiness of a client (Steiner et al., 2006). The complexity
of various decision tools and the differences between applications pose a challenge for
a neural-computing approach, which is meant to provide learning power not
offered by other techniques. Artificial intelligence tools, with their potential
to capture complex and nonlinear relationships, are a promising alternative to
the common classification and forecasting methods (McNelis, 2005; Khashman,
2010).
Hypotheses
(i) Hypothesis #1: Models based on artificial intelligence outperform
the conventional decision-making techniques as an ordinal risk measure.
This hypothesis will be tested by ordering the loans by the score assigned
by each discrimination technique; the relationship between risk and score will then be assessed and compared across models.
(ii) Hypothesis #2: Models based on artificial intelligence outperform
the conventional decision-making techniques in terms of potential profit.
This hypothesis will be tested using a cost matrix and Monte Carlo simulation.
(iii) Hypothesis #3: Models based on artificial intelligence provide more
time-stable performance in risk decision-making than conventional techniques.
We will compare all of our models on an out-of-time dataset.
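As an illustrative sketch of how a cost matrix and Monte Carlo simulation can be combined to estimate potential profit for Hypothesis #2 (the cost matrix, scores and default probabilities below are hypothetical, not taken from the thesis data):

```python
import random

def portfolio_profit(decisions, outcomes, cost_matrix):
    """Sum profit over loans given approve/reject decisions and good/bad outcomes."""
    return sum(cost_matrix[(d, o)] for d, o in zip(decisions, outcomes))

# Hypothetical cost matrix: profit per (decision, outcome) pair.
COSTS = {
    ("approve", "good"): 100,   # interest earned on a repaid loan
    ("approve", "bad"): -500,   # loss given default
    ("reject", "good"): 0,      # rejected loans generate no cash flow
    ("reject", "bad"): 0,
}

def simulate(threshold, scores, bad_probs, n_runs=1000, seed=42):
    """Monte Carlo estimate of expected profit: in each run, draw each loan's
    outcome from its default probability and price the portfolio."""
    rng = random.Random(seed)
    decisions = ["approve" if s >= threshold else "reject" for s in scores]
    profits = []
    for _ in range(n_runs):
        outcomes = ["bad" if rng.random() < p else "good" for p in bad_probs]
        profits.append(portfolio_profit(decisions, outcomes, COSTS))
    return sum(profits) / n_runs
```

Comparing the simulated average profit across models (each model supplying its own scores) is the essence of the profit-based evaluation described above.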
Methodology  The first step is the collection of suitable data, the revision of inconsistencies and missing values, and transformation into a proper format. We have decided
to use a retail portfolio from one country in Asia, for which we have complex data
about the applicants at hand.
We will then present the theory underlying the selected models based on artificial intelligence: variations of Neural Networks, Support Vector Machines and
Random Forests. The presented models will be trained on the datasets and
compared with the benchmark of the risk industry, logistic regression. Afterwards,
the profit-making potential of each approach will be assessed using Monte Carlo simulations. At the end of the thesis, we summarize the discrimination
and risk performance of all employed models and propose the most appropriate
approach.
Expected Contribution  We expect to be able to demonstrate the superior discrimination ability of the methods based on artificial intelligence. Some
research has focused on similar topics (Hall et al., 2009; Huang et
al., 2004; Yu et al., 2008); nevertheless, it has usually used data from developed markets. In this thesis we use data from less developed countries,
where more emphasis is put on the superior predictive power of models because
collection effectiveness in these countries is considerably lower (Eletter et
al., 2010). The structure of defaults in these countries also differs; we can
observe much more fraud and defaults in general. We therefore anticipate that
artificial intelligence techniques will reach higher predictive power than
more traditional approaches, especially in these demanding conditions.
Outline
1. Motivation: The risk costs are the most significant costs in consumer
finance companies, especially in less developed countries, where the local
legal framework and the creditworthiness of the population are substantially weaker than
in western countries. Artificial intelligence models have the potential
to enhance current underwriting processes in credit granting, thus
lowering risk costs and increasing profitability.
2. Theory underlying the selected models based on artificial intelligence: We
will introduce the theory and its development.
3. Data: We will describe the data and perform selected analyses of the
data (e.g. distribution of some variables or explanation of some unusual
relationships that are caused by local specifics).
4. Methods: We will use WEKA, Matlab and SQL Developer for data processing, model training and subsequent model assessment.
5. Results: We will compare the discrimination power of the models and
their profit-making potential. The most suitable type of model will be
proposed.
6. Concluding remarks: We will summarize the whole work and propose area
for further research in this topic.
Core bibliography
1. Angelini, E., di Tollo, G., & Roli, A. (2008). A neural network approach for credit risk
evaluation. The Quarterly Review of Economics and Finance, 733-755.
2. Ashenfelter, O., Harmon, C., Oosterbeek, H., (1999). A review of estimates of the
schooling/earnings relationship, with tests for publication bias. Labour Econ. 6 (4),
453-470.
3. Eletter, S. F., & Yaseen, S. G. (2010). Applying Neural Networks for Loan Decisions
in the Jordanian Commercial Banking System. 209-214.
4. Ghodselahi, A., & Amirmadhi, A. (2011). Application of Artificial Intelligence Techniques for Credit Risk Evaluation. International Journal of Modeling and Optimization, 243-249.
5. Gouvea, M. A., & Gonçalves, E. B. (2007). Credit Risk Analysis Applying Logistic
Regression, Neural Networks and Genetic Algorithms Models.
6. Hall, M. J. B., Muljawan, D., Suprayogi, & Moorena, L. (2009). Using the artificial
neural network to assess bank credit risk: a case study of Indonesia. Applied Financial
Economics, 1825-1846.
7. Huang, Z., Chen, X., Hsu, C. J. , Chen, W. H., and Wu, S., (2004) Credit Rating
Analysis with Support Vector Machines and Neural Networks: A Market Comparative
Study, Decision Support System, 543-558.
8. Khashman, A. (2010). Neural networks for credit risk evaluation: Investigation of
different neural models and learning schemes. Expert Syst. Appl., 6233-6239.
9. McNelis, P. D. (2005). Neural Networks in Finance: Gaining Predictive Edge in the
Market. Academic Press Advanced Finance.
10. Steiner, M. T. A., Neto, P. J. S., Soma, N. Y., Shimizu, T., & Nievola, J. C. (2006).
Using Neural Network Rule Extraction for Credit-Risk Evaluation. International Journal of Computer Science and Network Security, 6-16.
11. Yu, L., Wang, S., & Lai, K. K. (2008). Credit risk assessment with a multistage neural
network ensemble learning approach. Expert Systems with Applications, 1434-1444.
Author
Supervisor
Chapter 1
Introduction
The financial industry has experienced substantial growth in the past decades
and it has become vitally important to implement advanced statistical and
mathematical methods to assess the possible risk and exposures resulting from
various investment activities. These methods provide fast and automatic tools
that help in making effective decisions. In this thesis, we focus on the field
of financial companies providing consumer loans and introduce several tools
used to manage the resulting risk exposures. The main and most
advanced part of this process involves the scoring of loan applicants.
The ultimate task of credit scoring is rather simple to state: develop
a model capable of distinguishing between a bad and a good debtor
or, in other words, of estimating the probability of being bad or good. For the sake of clarity we can reformulate: utilize past data to learn the
relationships that generalize to future discrimination between these
two classes, with the lowest number of false positive and false negative classifications. Nevertheless, there are many ways to perform this task, and some
are more powerful and robust than others.
Two scientific fields meet here: machine learning and statistics.
While the scope and methods of these fields are generally different, drawing
a clear line between them in the context of classification is
nearly impossible. For example, logistic regression is often used for classification and is considered a statistical tool, whereas
neural networks and others are usually referred to as tools from the field of
machine learning. Statistics emphasizes describing the relationships among
the variables in a model and quantifying the contribution of every single one of
them, while machine learning focuses on the ability to make sufficiently correct
predictions without any substantial statistical or econometric assumptions.
In this thesis, we introduce several representatives from both groups, outline the relevant theory and compare them based on their ability to mitigate
risk and generate profit. The main areas covered are Neural
Networks, Support Vector Machines, Random Forests and Logistic Regression,
the latter serving as the current market standard.
Chapter 2
Motivation and Literature Review
2.1 Literature Review
This thesis tries to provide a brief synthesis of the theory of Logistic Regression, Neural Networks and their variants, Support Vector Machines
and Random Forests.
A general introduction to classification and the related theory can be found
in Khashman (2010), Elizondo (2006) or in the excellent book on statistical
learning by Hastie et al. (2013).
We use the standard theory of logistic regression introduced in Agresti
(2013) and Hosmer et al. (2013) and its connection to probability of default
modeling in Hastie et al. (2013).
There is a rich theory underlying Neural Networks; the basic architecture of these
models is well explained in Angelini et al. (2008) and Ghodselahi & Amirmadhi
(2011). More advanced and complex structures are described in Haykin
(2009), Khashman (2010), Witten & Frank (2005) and Yu et al. (2008). Another
useful book on Neural Networks was written by McNelis (2005), which
can again be supplemented by Hastie et al. (2013).
On the other hand, there are not many resources describing Support Vector Machines in detail, probably because the area is
relatively new and the major emphasis has been placed on Neural Networks. Nevertheless, Support Vector Machines are becoming much more
popular. The first paper on this topic was published by Cortes & Vapnik (1995). Many researchers have continued in this field; we recommend Bellotti &
Crook (2009), Cristianini & Shawe-Taylor (2000) and Elizondo (2006) as an
introduction. The mathematical foundations are well described in
Schölkopf & Smola (2002), Cortes & Vapnik (1995) and Souza (2010). We
look into the mathematics of Support Vector Machines in greater detail
because it is important for understanding their logic and because other papers
are less descriptive on this topic. The ability to estimate probabilities is
described in Hastie et al. (2013).
The last type of model is rather simple and does not require much theory.
The foundations of Classification and Regression Trees were laid by Leo Breiman in his seminal 1984 work, Breiman et al. (1984). The extension of this
theory, Random Forests, was published by the same author (Breiman, 2001).
Random Forests and their properties have been described and researched by
many authors; we can recommend Amaratunga et al. (2008), Biau et al. (2008)
and Buja & Stuetzle (2006).
We also briefly cover the theory of model development and assessment. A rather simple variable selection procedure is described in Derksen & Keselman
(1992). More sophisticated variable selection employing the relevancy and redundancy of the variables was published in Peng et al. (2005) and Ding & Peng
(2005).
A description of the tools used for performance assessment of the developed models can be found in Lobo et al. (2008), Rodríguez et al. (2010), Siddiqi (2012)
and Zweig & Campbell (1993).
Many practical aspects of the scorecard development process are covered in
Siddiqi (2012), Mays (2001) and Witten & Frank (2005).
Chapter 3
Methodology
3.1 Linear and Non-linear Classification
An algorithm that performs classification is known as a classifier. Depending on the
method employed, the algorithm is represented by some mathematical function
whose parameters need to be estimated using a training dataset (Khashman, 2010).
The main task of supervised machine learning is to derive a function that
correctly describes the data in the training set and, moreover, is capable of generalizing from the training dataset to unobserved data. As prediction
is the main application of classifiers, the ability to generalize is the most
important property that distinguishes a good classifier from a bad one.
An important subclass of classification is probabilistic classification. Algorithms in
this subclass employ statistical inference to determine the best class for given
input data. While usual classifiers simply select the best class, the probabilistic approach generates, for the input data, the probability of
membership in each class; the best class is then usually selected as the class
with the highest probability.
The classification function generates a border between the two classes, called the
decision boundary (or decision space in higher dimensions).
Based on the shape of the decision boundary, we distinguish two main classes
of classifiers: linear and non-linear. If some data can be separated by a linear decision boundary, the data are linearly separable. A
more formal definition follows Elizondo (2006).
Two subsets A, B ⊂ ℝⁿ are linearly separable if there exists a
hyperplane of ℝⁿ such that the elements of A and the elements of
B lie on opposite sides of the hyperplane.
It is convenient to formalize the classification problem (Elizondo, 2006):
• Let the dot product space ℝⁿ be the data universe.
• Let S be a sample set, S ⊂ ℝⁿ.
• Let f : ℝⁿ → {−1, +1} be the target mapping function.
• Let D = {(x, y) : x ∈ S ∧ y = f(x)} be the training set.
Then the estimated classifier is a function f̂ : ℝⁿ → {−1, +1}, built using D, such
that f̂(x) ≈ f(x) for all x ∈ ℝⁿ.
Figure 3.1: Linear and Non-linear Separable Cases
Source: own processing
Depending on n, we construct a line, plane or hyperplane that separates
the two classes as well as possible. Using vectors, we can express this decision
boundary as

w · x + b = 0    (3.1)

The advantage of the vector form (3.1) is that it is valid for any number of dimensions; the boundary can thus be a line, a plane or an n-dimensional hyperplane. In
the machine learning literature, b stands for the bias and w for the weight vector;
the statistical literature refers to both as parameters.
Given the weight vector and bias from (3.1), the classification of an observation is simple: if the observation belongs to the +1 class, the value of
(3.1) will be positive and the point lies above the decision boundary. Conversely,
if the observation belongs to the −1 class, the point lies below the decision boundary and the value of (3.1) will be negative. The decision function
f̂ might be constructed in the following way (Hastie et al., 2013):

f̂(x) = sgn(w · x + b)    (3.2)

where sgn() represents the sign function, which returns +1 for a positive argument,
−1 for a negative argument and 0 for an argument equal to zero.
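The decision function (3.2) can be sketched in a few lines; the weights and the example boundary below are illustrative, not estimated from any data:

```python
import numpy as np

def sgn(z):
    # sign function: +1 for a positive argument, -1 for negative, 0 at zero
    return int(np.sign(z))

def classify(x, w, b):
    """Linear decision function f(x) = sgn(w . x + b), as in eq. (3.2)."""
    return sgn(np.dot(w, x) + b)

# Illustrative decision boundary x1 + x2 - 1 = 0 in R^2
w, b = np.array([1.0, 1.0]), -1.0
```

A point such as (2, 2) lies above this boundary and is assigned class +1, while the origin is assigned class −1; points exactly on the boundary evaluate to 0.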
This is called a binary classification problem, because the output domain Y = {−1, +1} consists of only two elements. A generalization of this approach exists: if the output domain consists of m classes, i.e.
Y = {1, 2, 3, . . . , m}, we speak of m-class classification. In the
continuous case, Y ⊆ ℝ, we speak of regression.
The obvious problem lies in determining the optimal bias b and weight
vector w. The major approach to this task is discriminative
learning.
Discriminative learning does not have the ambition to infer the underlying probability distributions of all relevant variables; it attempts only to
find the most suitable mapping from the inputs to the outputs. This basically
means that the model tries to learn the conditional probability distribution
p(y | x) directly. Within this class we should mention, among others, Neural
Networks, Logistic Regression and Support Vector Machines. This thesis focuses
on these models.
Differences in the performance of various linear classifiers are caused by different
assumptions regarding the underlying distributions or by the various methods
used for estimating the weight vectors and biases. Returning to the
geometrical interpretation of linearity and the input data, we should realize
that the decision boundaries generated by different linear classifiers should be
relatively similar (if both rest on reasonable assumptions). Nevertheless,
why some linear classifiers outperform others generally depends
on three points:
i) The ability to handle linearly inseparable data.
ii) How well the classifier copes with random noise and outliers in the data.
iii) Whether, and how well, the classifier can utilize non-linear relationships
in the dataset.
The following chapters will show that some linear classifiers lack the
ability to deal with some of these points, and how other classifiers handle
these issues.
3.2 Logistic Regression
Logistic regression belongs to the class of linear classifiers, although its
name might indicate otherwise. Due to its stable performance and widespread
application in finance and other areas, it serves as the industry standard. Further
reasons for its success are its easy implementation and the interpretability of
the business logic behind the values of each variable.
In the following section, which is mainly based on Hosmer et al. (2013) and
Agresti (2013), we introduce the mathematics underlying the logistic model.
For a given vector x = (x₀, x₁, . . . , xₚ)′ of input data we consider a dependent
variable Yx with binary distribution (Yx = 1 for default and Yx = 0 otherwise).
The expected value of the dependent variable Yx can be written as

E(Yx) = 1 · P(Yx = 1) + 0 · P(Yx = 0) = P(Yx = 1) = π(x)    (3.3)
where π(x) = P(Yx = 1) represents the conditional probability of default for a given input vector of variables x. The primary objective is to define
an appropriate model capturing the dependence of the probability of default
on the vector of input variables. The ordinary linear regression model might
be easy to use, but it is not suitable for systems with binary dependent
variables, mainly because the probability π(x) ranges between
zero and one whereas the values generated by linear regression can be any real numbers.
Therefore, it is necessary to define the so-called odds function as the ratio of the
probability of default to the probability of survival (i.e. non-default):

odds(x) = P(Yx = 1) / P(Yx = 0) = π(x) / (1 − π(x))    (3.4)

This function maps probabilities from (0, 1) onto the interval [0, ∞). Nevertheless, we
need a mapping onto the whole real line (Witzany, 2010), so we first apply a
logarithmic transformation

logit(x) = ln odds(x) = ln [π(x) / (1 − π(x))]    (3.5)
and then set logit(x) = β′x in order to get an explicit formula for our logistic
regression, which ranges within the desired interval. The final form is defined
as (Hosmer et al., 2013)

π(x) = e^{β′x} / (1 + e^{β′x})    (3.6)

Instead of the logit transformation (3.5), any other transformation from
(0, 1) to ℝ can be used. One possibility is the inverse of the distribution
function Φ of the standard normal distribution, known as probit (Hastie et al., 2013):

probit(x) = Φ⁻¹(π(x))    (3.7)
Figure 3.2: Logit and Probit Distribution Function
Source: own processing
Nevertheless, the logit transformation is preferred for its simplicity: unlike
probit, it has a closed-form inverse and is easy to compute. In addition, the
parameters of logistic regression have a straightforward interpretation: for a
one-unit change in characteristic xᵢ, the odds are multiplied by the factor e^{βᵢ}.
Suppose the input data contain s categorical variables for each loan, where
the i-th variable is composed of pᵢ categories. We define the set of all pairs
(i, j) of variables i with corresponding categories j as (Agresti, 2013)

Z = {(i, j) : i ∈ {1, . . . , s}, j ∈ {1, . . . , pᵢ}}    (3.8)

For each loan k we define a vector of dummy variables, where (x_ij)_k = 1 if loan
k lies in category j of variable i and 0 otherwise:

x_k = ((x_ij)_k : (i, j) ∈ Z)    (3.9)
We then define the set B of all bad (defaulted) loans and the set G of good
(non-defaulted) loans (Agresti, 2013), together with their category-level counterparts:

B_j^i = {k : k ∈ B, (x_ij)_k = 1}    (3.10)

G_j^i = {k : k ∈ G, (x_ij)_k = 1}    (3.11)
Now the total odds can be defined as the number of bad clients over the
number of good clients:

odds = |B| / |G|    (3.12)

This ratio can be defined for particular categories j and variables i as well:

odds_j^i = |B_j^i| / |G_j^i|    (3.13)
Finally, we introduce the odds ratio OR_j^i as

OR_j^i = odds_j^i / odds    (3.14)

Based on Agresti (2013) we can calculate odds(x) as

odds(x) = (|B| / |G|) · ∏_{(i,j)∈Z} (odds_j^i / odds)^{x_ij} = odds · ∏_{(i,j)∈Z} (OR_j^i)^{x_ij}    (3.15)
The Full Logistic Model (FLM) can now be introduced. In this model, we assign a
particular weight to each category of each dummy variable. The scoring
function thus takes the form

S^{FLM}(x, λ) = odds · ∏_{(i,j)∈Z} (OR_j^i)^{λ_j^i x_ij}    (3.16)

where λ = (λ_j^i : (i, j) ∈ Z) is the set of parameters.
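A small sketch of evaluating the FLM score (3.16); the total odds, odds ratios, weights λ and categories below are hypothetical illustrations, not values from the thesis data:

```python
def flm_score(odds_total, odds_ratios, lambdas, x):
    """Full Logistic Model score, eq. (3.16):
    S = odds * product over (i, j) in Z of OR[i, j] ** (lambda[i, j] * x[i, j])."""
    s = odds_total
    for key, OR in odds_ratios.items():
        s *= OR ** (lambdas[key] * x.get(key, 0))
    return s

# Hypothetical setup: one variable (i = 1) with two categories (j = 1, 2)
odds_total = 0.1                     # |B| / |G|
ORs = {(1, 1): 2.0, (1, 2): 0.5}     # odds ratio per (variable, category) pair
lam = {(1, 1): 1.0, (1, 2): 1.0}     # category weights
x = {(1, 1): 1, (1, 2): 0}           # loan falls in category 1 of variable 1
score = flm_score(odds_total, ORs, lam, x)   # 0.1 * 2.0**1 * 0.5**0 = 0.2
```

Only the categories the loan actually falls into (x_ij = 1) contribute their odds ratio to the product; all other factors reduce to 1.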
3.3 Neural Networks
Similarly to linear approximation methods, a neural network transforms a set
of input variables into a set of output variables. The main difference between neural
networks and other approximation methods lies in the hidden
layers, where the input variables are transformed at least once by multiple activation functions (Angelini et al., 2008). These hidden layers may appear
hard to comprehend; nevertheless, they constitute a very powerful approach
to flexible modeling of nonlinear relationships.
3.3.1 Feedforward Networks
The architecture of feedforward neural networks is illustrated in Figure 3.3. This simple network consists of an input layer containing n input neurons
{xᵢ}, i = 1, 2, . . . , n, one hidden layer with m neurons and one output neuron
(Ghodselahi & Amirmadhi, 2011).
The parallel processing of information in this system is a crucial advantage
over typical linear systems, where we usually observe only sequential processing. Moreover, a neural network with a simple feedforward architecture can
efficiently approximate any basic model. For example, the linear model can
be synthesized by a feedforward network with one neuron in the hidden layer
using a linear activation function.
Figure 3.3: Architecture of Basic Feedforward Network
[Diagram: inputs X1, X2, X3, …, Xn feed through connection weights from the input layer into a hidden layer and then to a single neuron in the output layer.]
Source: own processing inspired by Hastie et al. (2013)
Neural networks process the input data in two steps. First, the data are transformed into linear combinations with different weights; these linear combinations are then processed by activation functions. The activation function can be any function, but linear, step, Gaussian, tansigmoid or logsigmoid functions are usually used (Hastie et al. 2013). The following figure illustrates these functions. The linear combinations of data from the input layer are transformed by these functions before they are transmitted to the next layers or to the output neuron.
Figure 3.4: Step, Tansigmoid, Gaussian and Logsigmoid Functions
Source: own processing
The attractiveness of the logsigmoid (also called logistic) activation function comes from its behavior, which describes reasonably well the majority of types of responses to developments in underlying variables. For example, if the probability of default of some government on its sovereign debt is very high or very low, then small changes in this probability will have little impact on the decision whether to buy this debt or not. Nonetheless, between these two extreme situations, even relatively minor changes in the riskiness of this debt could significantly influence the overall buy-sell opinion of the market.
The feedforward network with one hidden layer is described by these equations (Hastie et al. 2013):

n_{k,t} = ω_{k,0} + Σ_{i=1}^{i} ω_{k,i} x_{i,t}    (3.17)

N_{k,t} = L(n_{k,t}) = 1 / (1 + e^{−n_{k,t}})    (3.18)

y_t = γ_0 + Σ_{k=1}^{k} γ_k N_{k,t}    (3.19)
The term L(n_k) represents the logsigmoid activation function, index i the number of input variables {x}, and index k the number of neurons. The variable n_k is formed by a linear combination of the inputs based on the weights ω_{k,i} together with the constant term ω_{k,0}. This variable is then transformed by the logsigmoid activation function into neuron N_{k,t} at observation t. The set of k neurons at observation t is then used to create a linear combination with the coefficient vector {γ_k} and the constant term γ_0 in order to forecast ŷ_t at observation t. This example of a feedforward neural network with the logsigmoid activation function is known as the multi-layer perceptron network (Haykin 2009). It is the fundamental architecture of neural networks and is often used as a benchmark for alternative architectures (Khashman 2010).
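Equations (3.17)-(3.19) can be written out as a minimal forward pass in Python. The weights below are purely illustrative (the thesis estimates its own parameters); the sketch only shows the mechanics of the three equations:

```python
import numpy as np

def logsigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

def mlp_forward(x, omega0, omega, gamma0, gamma):
    """One observation through a one-hidden-layer perceptron, eqs. (3.17)-(3.19)."""
    n = omega0 + omega @ x      # (3.17): one linear combination per hidden neuron
    N = logsigmoid(n)           # (3.18): logsigmoid activation
    return gamma0 + gamma @ N   # (3.19): linear output layer

# illustrative weights: 3 inputs, 2 hidden neurons
x = np.array([1.0, 0.5, -0.5])
omega0 = np.zeros(2)
omega = np.array([[0.2, -0.1, 0.4],
                  [0.0, 0.3, 0.1]])
y_hat = mlp_forward(x, omega0, omega, gamma0=0.1, gamma=np.array([1.0, -1.0]))
```

In practice these weights would be fitted by backpropagation; the sketch only demonstrates how an input vector flows through the network.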
The aforementioned activation function can be replaced by the tansigmoid or cumulative Gaussian function. The only change in comparison with the previous equations describing the network with the logistic activation function would be (Hastie et al. 2013)

N_{k,t} = T(n_{k,t}) = (e^{n_{k,t}} − e^{−n_{k,t}}) / (e^{n_{k,t}} + e^{−n_{k,t}})    (3.20)

for the tansigmoid activation function and

N_{k,t} = Φ(n_{k,t}) = ∫_{−∞}^{n_{k,t}} √(1/(2π)) e^{−z²/2} dz    (3.21)

for the cumulative Gaussian function (with z the integration variable).
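Both alternative activations from (3.20) and (3.21) are available in closed form in Python's standard library; a small sketch (function names are ours, not the thesis's):

```python
import math

def tansigmoid(n):
    """Tansigmoid activation, eq. (3.20); identical to the hyperbolic tangent."""
    return (math.exp(n) - math.exp(-n)) / (math.exp(n) + math.exp(-n))

def gaussian_cdf(n):
    """Cumulative Gaussian activation, eq. (3.21), evaluated via the error function."""
    return 0.5 * (1.0 + math.erf(n / math.sqrt(2.0)))
```

The tansigmoid squashes its input into (−1, 1) while the cumulative Gaussian, like the logsigmoid, maps into (0, 1).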
The following figure shows examples of decision boundaries created by neural networks with different activation functions.
Figure 3.5: Separation using Neural Network with Different Activation Functions
Source: own processing
3.3.2 Neural Networks with Radial Basis Functions
In radial basis function (RBF) networks, the Gaussian density function serves as the activation function. In many specialized publications on neural networks, this Gaussian density function (serving as the activation function) is called a radial basis function. Nevertheless, the architecture of the RBF network is substantially different from the feedforward neural network presented in the previous sub-chapter. Moreover, the central idea of RBF networks is different as well. In plain terms, an RBF network performs classification by calculating the inputs' resemblance to examples from the training data.
The RBF network has the three usual layers: input, hidden and output (Angelini et al. 2008). The input layer contains one neuron for each input variable. The processing before the input layer usually standardizes the range of the values, and the information is then fed to each neuron in the subsequent hidden layer. The hidden layer contains a variable number of neurons, each consisting of a radial basis function centered on a point that has as many dimensions as there are input variables (Yu et al. 2008). It is important to mention that the radius (also called the spread) can be different for each dimension. When processing data, the neurons in the hidden layer calculate the Euclidean distances of the processed observations from the neurons' center points and subsequently transform these distances by the RBF kernel functions (here the Gaussian functions), using the radius values (Hastie et al. 2013). The computed values are passed to the output layer, where they are multiplied by the corresponding weights and summed together. For classification problems, this value is the probability that the evaluated case falls into a particular category.
The RBF network is described by the following system of equations (Hastie et al. 2013):

min_{ω,µ,γ} Σ_{t=0}^{T} (y_t − ŷ_t)²    (3.22)

n_t = ω_0 + Σ_{i=1}^{i} ω_i x_{i,t}    (3.23)

R_{k,t} = φ(n_t, µ_k) = √(1/(2πσ_{n−µ_k})) exp(−[n_t − µ_k]² / (σ_{n−µ_k})²)    (3.24)

ŷ_t = γ_0 + Σ_{k=1}^{k} γ_k R_{k,t}    (3.25)
where x again represents the input variables and n the linear transformation of the input variables with the weights ω. There are k different centers (µ_k) for the transformation by radial basis functions; we calculate the k spreads and obtain k various functions R_k. In the penultimate step, the outputs from these functions are combined in a linear manner, employing the weights γ, to calculate the forecast of y.
The estimation of the center points can in general be done by any clustering algorithm. We have decided on K-means clustering because of its ease of application and robust results; for more information see Hastie et al. (2013). A set of clusters, each with an n-dimensional center, is determined by the number of nodes in the input layer. Using the K-means algorithm, the clusters' centers are then established, and these become the centers of the RBF units.
Once the RBF units' centers are established, the spread of each RBF unit can be estimated, for example, again with the K-nearest neighbors algorithm (Hastie et al. 2013). For a certain K, the K nearest centers are found for each center. Given that the current cluster's center is c_j, the spread is calculated as

r_j = √( Σ_{i=1}^{k} (c_j − c_i)² / k )    (3.26)
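The spread formula (3.26) can be sketched in a few lines of Python. The function name and the toy one-dimensional centers are illustrative only:

```python
import numpy as np

def rbf_spreads(centers, k):
    """Spread of each RBF unit from its k nearest centers, eq. (3.26)."""
    spreads = []
    for j, cj in enumerate(centers):
        d2 = np.sum((centers - cj) ** 2, axis=1)  # squared distances to all centers
        nearest = np.sort(np.delete(d2, j))[:k]   # k nearest other centers
        spreads.append(np.sqrt(nearest.sum() / k))
    return np.array(spreads)

# three one-dimensional centers; spread from the 2 nearest neighbors of each
centers = np.array([[0.0], [1.0], [3.0]])
r = rbf_spreads(centers, k=2)
```

Centers that sit in dense regions thus receive small spreads, and isolated centers receive wide ones, which is the intended behavior of the RBF units.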
3.3.3 Jump Connections
An alternative to the ordinary feedforward networks is the feedforward network with jump connections, where the neurons in the input layer have a direct connection to the output layer (McNelis 2005). Figure 3.6 shows an example of this architecture.
Figure 3.6: Feedforward Network with Jump Connections
[Diagram: inputs X1, X2, X3, …, Xn feed from the input layer into a hidden layer and on to the output layer; jump connections additionally link the input neurons directly to the output.]
Source: own processing inspired by Hastie et al. (2013)
The mathematical description of this architecture is very similar to the ordinary feedforward network:

n_{k,t} = ω_{k,0} + Σ_{i=1}^{i} ω_{k,i} x_{i,t}    (3.27)

N_{k,t} = L(n_{k,t}) = 1 / (1 + e^{−n_{k,t}})    (3.28)

y_t = γ_0 + Σ_{k=1}^{k} γ_k N_{k,t} + Σ_{i=1}^{i} β_i x_{i,t}    (3.29)
The main disadvantage of this architecture is the increase in the number of parameters in the network by the number of inputs, i. On the other hand, a substantial advantage is that it hybridizes the pure linear model (represented by the jump connections) with the feedforward network. The consequence is straightforward: this system allows for a function that may have a nonlinear component as well as a linear component (McNelis 2005).
3.3.4 Multilayered Feedforward Networks
In the case of higher complexity of the problem we want to approximate, we can use a more complex architecture of the feedforward network: we can add one or more layers and jump connections. For the sake of mathematical lucidity, this architecture can be described as follows (assuming two hidden layers and logsigmoid activation functions):
n_{k,t} = ω_{k,0} + Σ_{i=1}^{i} ω_{k,i} x_{i,t}    (3.30)

N_{k,t} = 1 / (1 + e^{−n_{k,t}})    (3.31)

p_{l,t} = ρ_{l,0} + Σ_{k=1}^{k} ρ_{l,k} N_{k,t}    (3.32)

P_{l,t} = 1 / (1 + e^{−p_{l,t}})    (3.33)

y_t = γ_0 + Σ_{l=1}^{l} γ_l P_{l,t} + Σ_{i=1}^{i} β_i x_{i,t}    (3.34)
Figure 3.7: Architecture of Multilayered Feedforward Network
[Diagram: inputs X1, X2, X3, …, Xn feed from the input layer through multiple hidden layers to the output layer.]
Source: own processing inspired by Hastie et al. (2013)
This architecture with multiple hidden layers (as Figure 3.7 shows) allows for higher complexity, which can lead to improvements in predictive power. However, there are negative impacts: we need to estimate many more parameters, which consumes valuable degrees of freedom and computation time (Witten & Frank 2005). With more parameters to estimate, there is also an increased probability that the estimates converge to a local, rather than the global, optimum (Hastie et al. 2013).
3.4 Support Vector Machines
Support Vector Machines (SVM) are a relatively new learning method used for regression and binary classification. The crucial idea of SVM is to find an optimal hyperplane which separates the multi-dimensional data into two classes. However, the input data are usually not linearly separable; thus SVM introduce the novel approach of a "kernel induced feature space" (Cortes & Vapnik 1995), which transfers the data into a higher-dimensional space where they are linearly separable. SVM were first introduced by Vladimir Vapnik and colleagues in a research paper in 1995.
In the following sections, the basic idea of the SVM is introduced, and we then investigate this topic in greater detail than the Neural Networks. There are many research papers applying this method, but only a few resources cover the mathematics of SVM; hence we focus on this aspect as well (Cortes & Vapnik 1995).
The crucial idea of classification by SVM is finding a hyperplane that maximizes the imaginary margin between two classes. In two-dimensional space a line is used, in three-dimensional space a plane is applied, and in spaces with more dimensions a hyperplane is used (Elizondo 2006). The following figure shows the separation of two classes in two-dimensional space with a line.
Figure 3.8: Separable and Non-separable Cases
[Two panels: each shows the decision boundary xβ + β0 = 0 with a margin of half-width M on each side; in the non-separable panel, points labeled ξ*_1, …, ξ*_4 lie on the wrong side of the margin.]
Source: own processing, based on Cortes & Vapnik (1995)
The first panel shows the separable case, where the solid line is the decision boundary and the broken lines specify the margin of width 2M = 2/‖β‖. The other panel depicts the overlapping, linearly non-separable case. The points labeled ξ*_j are on the wrong side of the margin. The size of this overstepping is equal to ξ*_j = M ξ_j; this metric is equal to zero for points on the correct side of the margin (Bellotti & Crook 2009). The size of the margin is maximized subject to the condition that Σ ξ_j ≤ constant. The summation Σ ξ*_j is the adjusted total distance of all points from the margin.
To start with, assume an input dataset with N pairs {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, where x_i ∈ R^p and y_i ∈ {−1, 1}. A hyperplane is defined by

{x : f(x) = x^T β + β_0}    (3.35)

where β represents a unit vector, ‖β‖ = 1. The classification rule for two classes is determined by

G(x) = sgn[x^T β + β_0]    (3.36)

The function f(x) produces a positive or negative distance from the margin to point x_i. If the classes are separable, then it can be shown that there exists a function f(x) = x^T β + β_0 with y_i f(x_i) > 0 for all i. Hence, it is possible to find
some hyperplane that produces the biggest margin between the two classes. In the separable case, the optimization problem is as follows (Hastie et al. 2013):

minimize_{β,β_0} ‖β‖    (3.37)

subject to  y_i(x_i^T β + β_0) ≥ 1, i = 1, …, N    (3.38)

The criterion is quadratic with linear inequality constraints; hence this problem is convex and easily solvable.
Nevertheless, we can face data where the classes are non-separable and thus overlap each other. We can deal with this feature by still maximizing the margin, but allowing some points to be on the other side of it. For this purpose we need to define the slack variables ξ = (ξ_1, ξ_2, …, ξ_N). The optimization problem remains the same, but the constraints are different (Hastie et al. 2013):

minimize_{β,β_0} ‖β‖    (3.39)

subject to  y_i(x_i^T β + β_0) ≥ M(1 − ξ_i), i = 1, …, N    (3.40)

Σ ξ_i ≤ constant    (3.41)

ξ_i ≥ 0    (3.42)
The logic of this adjustment is straightforward. The value ξ_i represents the amount by which the prediction f(x_i) = x_i^T β + β_0 falls on the wrong side of the margin. A misclassification occurs when ξ_i > 1; hence bounding the sum of these slack variables at some value V limits the total number of training misclassifications to V.
One attractive property of the SVM is now obvious. By the nature of the aforementioned optimization constraints, we see that the points far away from their decision boundary do not play a substantial role in shaping that boundary (Cristianini & Shawe-Taylor 2000). Hence the SVM are less prone to be affected by outliers.
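The slack-variable logic can be made concrete with a short Python sketch that evaluates ξ_i = max(0, 1 − y_i f(x_i)) for a fixed hyperplane; the hyperplane and toy points are illustrative, not fitted:

```python
import numpy as np

def slack_variables(X, y, beta, beta0):
    """Slack xi_i = max(0, 1 - y_i f(x_i)) for a fixed hyperplane (cf. (3.40)).

    xi_i = 0      : point on the correct side of the margin,
    0 < xi_i <= 1 : inside the margin but still correctly classified,
    xi_i > 1      : misclassified.
    """
    f = X @ beta + beta0
    return np.maximum(0.0, 1.0 - y * f)

# illustrative 2-D points and the hyperplane beta = (1, 0), beta0 = 0
X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.2, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
xi = slack_variables(X, y, beta=np.array([1.0, 0.0]), beta0=0.0)
```

Only the points with nonzero slack influence the optimization; the two points far on the correct side contribute nothing, which illustrates the robustness to outliers noted above.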
3.4.1 Computation of the Support Vector Classifier
The following subsection is merely of a mathematical nature; nevertheless, for the sake of lucidity in the subsequent section, it is important to describe at least briefly the way the Support Vector classifiers are calculated. For a deeper understanding of this issue, we advise consulting a book covering it at a much higher level (Hastie et al. 2013).
The optimization problem for the non-separable case from the previous section is a convex optimization problem, and thus a quadratic programming solution using Lagrange multipliers can be stated. For computational purposes, it is more convenient to express this optimization problem in the form (Schölkopf & Smola 2002)

minimize_{β,β_0} (1/2)‖β‖² + C Σ ξ_i    (3.43)

s.t. ξ_i ≥ 0; y_i(x_i^T β + β_0) ≥ 1 − ξ_i    (3.44)

where the parameter C replaces the "constant" from inequality (3.41). In the case of linear separability, the cost parameter C is equal to infinity.
The primal Lagrange function is then as follows (Schölkopf & Smola 2002)

L_P = (1/2)‖β‖² + C Σ_{i=1}^{N} ξ_i − Σ_{i=1}^{N} α_i[y_i(x_i^T β + β_0) − (1 − ξ_i)] − Σ_{i=1}^{N} µ_i ξ_i    (3.45)
This primal Lagrange function is minimized with respect to the parameters β, β_0 and ξ_i. If we differentiate this function and set the relevant derivatives to zero, we get (for the sake of brevity we exclude the calculations of the derivatives)

β = Σ_{i=1}^{N} α_i y_i x_i    (3.46)

0 = Σ_{i=1}^{N} α_i y_i    (3.47)

α_i = C − µ_i, ∀i    (3.48)

α_i, µ_i, ξ_i ≥ 0 ∀i    (3.49)
By substituting, we get the dual Lagrangian objective function

L_D = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{i′=1}^{N} α_i α_{i′} y_i y_{i′} x_i^T x_{i′}    (3.50)
This dual Lagrangian function generates a lower bound on the primal Lagrange function (3.45) for all feasible points (Schölkopf & Smola 2002). We maximize L_D subject to 0 ≤ α_i ≤ C and Σ_{i=1}^{N} α_i y_i = 0. We also employ the Kuhn-Tucker conditions and get the following constraints

α_i[y_i(x_i^T β + β_0) − (1 − ξ_i)] = 0    (3.51)

µ_i ξ_i = 0    (3.52)

y_i(x_i^T β + β_0) − (1 − ξ_i) ≥ 0    (3.53)

for i = 1, …, N. These Kuhn-Tucker constraints, together with the previous ones ((3.46)–(3.49)), uniquely define the solution to the primal and dual problem. From (3.46) we get the solution for β (Schölkopf & Smola 2002):

β̂ = Σ_{i=1}^{N} α̂_i y_i x_i    (3.54)
where the coefficients α_i are nonzero only for the observations i for which the constraints (3.53) are exactly met (because of (3.51)). These observations are the so-called support vectors, because β̂ is defined by them alone. Some of these support points lie exactly on the edge of the margin (i.e. ξ̂_i = 0), and thus from (3.52) and (3.48) we get for them 0 < α_i < C. For support vectors with ξ̂_i > 0 (points beyond the margin), α_i = C. From (3.51) it emerges that any margin point (i.e. ξ̂_i = 0, 0 < α_i) can be used to solve for β_0. Given the solutions β̂_0 and β̂, the desired decision function can finally be written as (Schölkopf & Smola 2002)

Ĝ(x) = sgn f̂(x) = sgn[x^T β̂ + β̂_0]    (3.55)

3.4.2 Support Vector Machines and Kernels
Up to this part of the thesis, we have described the application of the support vector classifier in cases where the input feature space is linearly separable. The support vector machine extends this idea by enlarging the feature space through basis function transformations; linear boundaries in the extended space then correspond to nonlinear boundaries in the original feature space.
The following plots show a basic example with two circles, each of them representing one class of the data (Hastie et al. 2013). These two classes are not linearly separable in the two-dimensional space. However, if we employ a suitable function for mapping these data into a higher-dimensional space, then the two classes can be separated with a linear hyperplane. This decision boundary is then transformed back into the original space. The second and third graphs show the mapping into three-dimensional space; the last graph shows the decision boundary calculated using the SVM approach.
Figure 3.9: (Non)separability Demonstration
Source: own processing partially inspired by Hastie et al. (2013)
If we have selected the basis functions h_z(x), z = 1, …, Z, then the classification procedure is in essence the same as without the kernel transformation. We estimate the support vector classifier using the input data h(x_i) = (h_1(x_i), h_2(x_i), …, h_Z(x_i)), i = 1, …, N, and then generate the nonlinear function

f̂(x) = h(x)^T β̂ + β̂_0    (3.56)

The classifier is the same as before: Ĝ(x) = sgn f̂(x) (Hastie et al. 2013).
The number of dimensions of the extended space is allowed to get high, even infinite in selected cases. At first glance, it might appear that in these cases the computations would be prohibitive, and that with an appropriate basis function the input data can always be made separable, so that overfitting might occur. Fortunately, the SVM approach deals with both of these serious issues successfully (Souza 2010).
The optimization problem (3.45) can be re-specified in a way where the input data are involved only in the form of inner products (Souza 2010). This is done directly for the transformed input vectors h(x_i). It can be shown that, for appropriate choices of the function h, the inner products can be computed without excessive computations (Schölkopf & Smola 2002). The Lagrange dual function (3.50) can be restated as follows

L_D = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{i′=1}^{N} α_i α_{i′} y_i y_{i′} ⟨h(x_i), h(x_{i′})⟩    (3.57)

The solution function f(x) can be rewritten as

f(x) = h(x)^T β + β_0 = Σ_{i=1}^{N} α_i y_i ⟨h(x), h(x_i)⟩ + β_0    (3.58)
Then the parameter β_0 can be easily determined (given α_i) by solving y_i f(x_i) = 1 for any x_i for which 0 < α_i < C. We can see one crucial thing in the two previous equations ((3.57) and (3.58)): both involve the function h(x) only in the form of its inner products. In fact, there is no necessity to specify this function at all; we are only required to know the corresponding kernel function (Cortes & Vapnik 1995)

K(x, x′) = ⟨h(x), h(x′)⟩    (3.59)
that is used for the calculation of inner products in the transformed higher-dimensional space (Souza 2010). Based on Hastie et al. (2013), the most popular choices for K in SVM applications are

Linear: K(x, x′) = ⟨x, x′⟩    (3.60)

Polynomial of d-th degree: K(x, x′) = (1 + ⟨x, x′⟩)^d    (3.61)

Radial basis: K(x, x′) = exp(−γ‖x − x′‖²)    (3.62)

Sigmoid: K(x, x′) = tanh(k_1⟨x, x′⟩ + k_2)    (3.63)
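The four kernels (3.60)-(3.63) are one-liners in Python; a minimal sketch with illustrative default hyperparameters:

```python
import numpy as np

def linear_kernel(x, z):
    return float(np.dot(x, z))                            # (3.60)

def poly_kernel(x, z, d=2):
    return float((1.0 + np.dot(x, z)) ** d)              # (3.61)

def rbf_kernel(x, z, gamma=1.0):
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))  # (3.62)

def sigmoid_kernel(x, z, k1=1.0, k2=0.0):
    return float(np.tanh(k1 * np.dot(x, z) + k2))        # (3.63)

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
```

Note that the radial basis kernel of any point with itself equals one, so it acts as a similarity measure that decays with squared Euclidean distance.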
The role of the tuning parameter C is clearer in a much more extended feature space, because perfect separation is possible there. A larger value of the parameter C will suppress any positive ξ_i and will most likely lead to overfitting in the original feature space. A lower value of the parameter C induces a lower value of ‖β‖, which causes f(x) to be smoother; a smoother function f(x) implies a smoother boundary. The following figure shows the same random data as in the previous section (focused on Neural Networks); here, SVM with different kernel functions were applied.
Figure 3.10: Separation using SVM with Different Kernel Functions
Source: own processing
There exist no general rules regarding the selection of the optimal kernel, at least to our knowledge. The suitability of a certain kernel function highly depends on the substance of the problem being solved and on the relationships among the input variables. In the case of simple tasks with entirely linear relationships, the linear kernel function is sufficient, and the application of more complex kernels should not bring any improvement. On the other hand, if the input data are more complex and exhibit nonlinearities and interdependences, then the more advanced kernels can bring major improvements in classification ability; we can mention, for example, face recognition, genetics or natural language processing. Credit scoring lies in general between these extremes. However, this is heavily dependent on the quality and availability of the input data.
3.4.3 Support Vector Machines and Regressions
In the following section we show how the SVM can be adjusted for regression, producing a quantitative response. We first recall the linear regression model

f(x) = x^T β + β_0    (3.64)

and then focus on its nonlinear generalizations. In order to estimate β, we should consider minimization of the following expression

H(β, β_0) = Σ_{i=1}^{N} V(y_i − f(x_i)) + (λ/2)‖β‖²    (3.65)
where the error function V can be defined in the following way (for more versions of this error function see Hastie et al. (2013)):

V(r) = 0 if |r| < ε;  |r| − ε otherwise    (3.66)
The logic of this ε-insensitive error measure is simple: errors of size lower than ε are ignored. We can see here an analogy with the support vector classification logic, where the points on the right side of the decision boundary and far away from it (thus not the support vector points) are ignored during the optimization process. There is one very desirable property of the error function V(r): the fitting of this regression model is not very sensitive to outliers. Moreover, this error measure possesses linear tails (beyond ε), and in addition it diminishes the contributions of the cases with small residuals. The next figure shows the ε-insensitive error function employed by the SVM.
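Definition (3.66) translates directly into Python; a minimal sketch with an illustrative ε:

```python
def eps_insensitive(r, eps=0.1):
    """Epsilon-insensitive error V from eq. (3.66): residuals below eps cost nothing."""
    return 0.0 if abs(r) < eps else abs(r) - eps
```

Residuals inside the ε-tube contribute zero loss, and beyond the tube the loss grows only linearly, which is the source of the robustness to outliers described above.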
Figure 3.11: ε-insensitive Error Function
Source: Hastie et al. (2013)
Given that β̂, β̂_0 minimize H (3.65), the solution function can have the following form (Hastie et al. 2013)

β̂ = Σ_{i=1}^{N} (α̂*_i − α̂_i) x_i    (3.67)

f(x) = Σ_{i=1}^{N} (α̂*_i − α̂_i) ⟨x, x_i⟩ + β_0    (3.68)
This problem is solvable by minimizing the following expression
min
∗
αi ,αi
N
X
i=1
N
N
X
1 X
αi∗ −αi −
yi αi∗ −αi +
αi∗ −αi αi∗0 −αi0 hxi , xi0 i (3.69)
2 i,i0 =1
i=1
subject to the following constrains
αi ≥ 0;
αi∗
N
1 X ∗
≤ ;
(αi − αi ) = 0; αi αi∗ = 0
λ i=1
(3.70)
The solution of this problem depends on the input values only through their inner product ⟨x_i, x_{i′}⟩; the same is valid in the classification case. Due to this property, we can generalize this approach to spaces of higher dimensions by defining an appropriate inner product. The mathematics of the kernel trick in the case of SVM regression is demanding and beyond the scope of this thesis; nevertheless, we can recommend Hastie et al. (2013).
3.5 Random Forests and Trees
In this section, another prospective part of machine learning theory will be introduced: the Random Forests. The Random Forest algorithm, as first developed by Breiman (2001), is a representative of ensemble methods. This means that the model consists of many other models, but the final predictions and other relevant quantities are obtained by certain combinations of the outputs of all of the underlying models.
First, we provide an overview of classification and regression trees (CART), because they are the constituents of the Random Forests, and then we describe the compilation of CARTs into Random Forests.
3.5.1 Classification and Regression Trees
The following part is based on the seminal work presented by Leo Breiman and his colleagues in 1984 (Breiman et al. 1984). The underlying logic of CART lies in repeated partitioning of the input data in order to estimate the conditional distribution of the dependent variable. Let the response of interest be a vector of observations

y = (y_1, y_2, …, y_n)^T    (3.71)

and the set of explanatory variables (i.e. features or predictors) a matrix

X = (x_1, x_2, …, x_p),    (3.72)

where

x_j = (x_{1j}, …, x_{nj})^T for j ∈ {1, …, p}    (3.73)
The ultimate goal of the algorithm is to divide y, conditional on the values of the inputs X, in such a way that the resulting subgroups of y are as homogeneous (e.g. in terms of riskiness) as possible (Breiman et al. 1984). The CART algorithm considers every unique value in each input variable as a potential candidate for a binary split and then calculates the homogeneity of the resulting subgroups of the dependent variable.
For better understanding, a simple example with an explanation follows.
Figure 3.12: CART Plot and Explanation of Logic Using Simple Data
[Decision tree: the root splits on Client Type (New vs. Repeating). New clients are split by Income (< 30 000 vs. ≥ 30 000) and by Only Elementary School (Yes/No); repeating clients are split by Age (< 30 vs. ≥ 30 years) and by Sex (Male/Female). Each leaf is labeled High Risk or Low Risk.]
Source: own processing
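A fitted CART of this kind is simply a chain of nested splits. The sketch below hand-codes one plausible reading of the toy tree in Figure 3.12; the exact split wiring is our illustrative assumption, not the thesis's estimated model:

```python
def classify(client_type, income, age, only_elementary, sex):
    """One plausible hand-coded reading of the toy tree in Figure 3.12."""
    if client_type == "New":
        if income < 30_000:
            return "High Risk" if only_elementary else "Low Risk"
        return "Low Risk"
    # repeating client
    if age >= 30:
        return "Low Risk"
    return "High Risk" if sex == "Male" else "Low Risk"
```

Each path from the root to a leaf corresponds to one rule, which is why CART output is easy to communicate to business users.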
Now we can extend the CART logic from this uncomplicated example of binary classification to other types of outcome variable. The crucial part here is the loss function and its role within the algorithm. For the sake of accuracy and clarity, we need to introduce a few definitions. The data at the current node m can be written as

y^m = (y^m_1, …, y^m_{n^m});  X^m = (x^m_1, …, x^m_p)    (3.74)
The explanatory variable used for the current split, x^m_s, has the unique values

C^m = {x^m_i}_{i∈{1,…,n^m}}    (3.75)

and the value c ∈ C^m is the value of the respective explanatory variable considered for a split. The data in the corresponding daughter nodes created by the previous split at c are then y^{m_l} and y^{m_r}: y^{m_l} contains every element of y^m whose value of x^m_s ≤ c, and y^{m_r} contains all elements where x^m_s > c. The reduction in error (i.e. gain) from a split at node m in variable x_s at value c can finally be defined as (Breiman et al. 1984)
Δ(y^m) = L(y^m) − [ (n^{m_l}/n^m) L(y^{m_l}) + (n^{m_r}/n^m) L(y^{m_r}) ]    (3.76)
Here the terms n^{m_l} and n^{m_r} represent the number of cases that fall to the left and right side of the split, and L(·) represents the aforementioned loss function. The logic of the loss function is straightforward: it measures the level of impurity, or misclassification, at each node.
For categorical outcomes, let us denote the set of unique categories of y^m as

D^m = {y^m_i}_{i∈{1,…,n^m}}    (3.77)

In order to evaluate the level of impurity of the node, we need to calculate the proportion of cases belonging to each class d ∈ D^m and denote it as p_m(d). The class occurring most frequently will be denoted as ŷ^m. Then the impurity of the node can finally be obtained from the following relationship (Breiman et al. 1984):

L_mc(y^m) = (1/n^m) Σ_{i=1}^{n^m} I(y^m_i ≠ ŷ^m) = 1 − p_m(ŷ^m)    (3.78)
The function I(·) is the so-called indicator function, which is equal to one if the argument is true. The definitions and relationships stated above describe the following logic: the level of impurity of a given node can be expressed as the proportion of cases that would be incorrectly classified under application of the majority rule.
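The misclassification impurity (3.78) is easy to sketch in Python; the function name and the toy node labels are illustrative:

```python
from collections import Counter

def misclassification_impurity(labels):
    """Node impurity as 1 minus the share of the majority class, eq. (3.78)."""
    counts = Counter(labels)
    majority_share = counts.most_common(1)[0][1] / len(labels)
    return 1.0 - majority_share

node = ["bad", "good", "good", "good"]
```

A pure node (all observations in one class) has impurity zero; the impurity grows as the class mix becomes more balanced.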
For continuous outcomes, a different loss function is needed to measure the level of impurity of the node. The most widespread loss function in this case is the mean squared error (MSE) (Ghodselahi & Amirmadhi 2011):

L_mse(y^m) = Σ_{i=1}^{n^m} (y^m_i − ŷ^m)²    (3.79)
The predicted value ŷ^m usually represents the mean of the observations in y^m.
A special case are unordered categorical variables, which need to be handled with different logic. The reason is that if we make a split at a certain category of an unordered discrete variable, then the categorization into values to the right and left of this split has no meaning, since there is no ordering. Therefore, all feasible combinations of the elements from D^m need to be considered.
When an appropriate loss function is selected, the quantity Δ(y^m) is calculated at each node for all variables and for all feasible splits in the variables. Then the combination of variable and corresponding split that produces the highest Δ is selected. This process continues in all resulting new nodes until the stopping criterion is met. The importance of the stopping criterion, or criteria, is obvious: it is necessary to avoid trees that are overly complex and overfit the data. In that case, the tree would try to generalize the noise and not the real signal in the data. Commonly used stopping criteria include the number of observations in each terminal node and the homogeneity of the terminal nodes (Hastie et al. 2013).
The most substantial deficiency of the CART is the high variance of the fitted values, i.e., the CART is prone to overfitting. The fitted values can be very unstable; thus the CART can produce very different classifications when even minor changes are made to the data used to fit the model (e.g. the age increases a little, and this case is then processed by a different branch of the tree and assessed based on different criteria). This problem is inevitable in CART applications, but it can be substantially mitigated if we employ the Random Forests, which are a sort of extension of the CART logic.
3.5.2 Random Forests
Leo Breiman (Breiman 1996) proposed bootstrap aggregating, which can be used to decrease the variance of the fitted values from the CART. This bootstrap aggregating, called bagging, has one pivotal idea: in order to decrease the variance of one model's predictions, more models can be fitted and their predictions averaged to obtain the final prediction. Each component model is trained only on a bootstrap sample of the data, because we need to decrease the risk of overfitting. Each of these data samples thus excludes some part of the original dataset, which is known as the out-of-bag data. Then a CART is built using each of these samples, and from these components the final Random Forest is composed. By combining the predicted values for each observation, an ensemble estimate is produced. This ensemble estimate has lower variance than a prediction made by only one CART trained on the original data (Buja & Stuetzle 2006).
Breiman (2001) further extended the logic of the Random Forests. Instead of selecting the split from all available explanatory variables at each node and in each tree, only a random subset of the explanatory variables is considered. The intended direct consequence is diversifying the splits across the trees. If there are some highly predictive variables that could overshadow the impact of weaker (but still predictive) variables, then this approach gives these weaker variables the chance to be selected into some underlying tree. Thus this not only decreases the risk of overlooking these variables, but it additionally allows a large set of input variables to be analyzed. With individual trees trained on independent datasets and with different subsets of the available variables, the trees produce predictions that are far more diverse and have lower variance.
For the sake of clarity, the out-of-bag data, that is, the data not drawn in
the bootstrap process used to train a particular tree, is used for each tree's
prediction. For continuous outcomes, the prediction of the Random
Forest is given by the simple average of the predictions generated by each of the
underlying trees (Breiman 2001):
$$\hat{f}(X) = \frac{1}{T}\sum_{t=1}^{T} f^t\left(X_{i\in\bar{B}^t}\right) \qquad (3.80)$$
The term $T$ represents the number of trees in the final forest, $f^t(\cdot)$ is the
$t$-th tree, and $\bar{B}^t$ is the out-of-bag data for the $t$-th tree. For discrete outcomes, the
final prediction of the forest is the majority prediction of all underlying trees.
The number of trees and the number of available variables at each node are
the tuning parameters and the optimal choice of these depends on the data
available and the task at hand. Therefore, these parameters should be chosen
to minimize the expected generalization error. This can be done for example
by application of resampling methods such as the widespread cross-validation.
More details can be found in Amaratunga et al. (2008) or Biau et al. (2008).
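The tuning loop described above can be sketched with scikit-learn; the dataset, parameter grids and random seeds below are illustrative stand-ins (the thesis data is not public), not the thesis's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a credit dataset with eleven input variables.
X, y = make_classification(n_samples=800, n_features=11, n_informative=6,
                           random_state=0)

param_grid = {
    "n_estimators": [10, 50, 200],   # number of trees T in the forest
    "max_features": [2, "sqrt"],     # variables considered at each split
}
# 5-fold cross-validation picks the parameter pair with the best AUC.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"cross-validated AUC: {search.best_score_:.3f}")
```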
The following figure shows a comparison of various specifications of the Random Forests. The same dataset was used as for the Neural Networks and the SVM. It is
evident that the number of trees in the Random Forest has a significant influence
on possible overfitting.
Figure 3.13: Comparison of Random Forests
Source: own processing
Chapter 4
Model Development
In the following section, selected aspects of the development of the
final models will be described. We will briefly outline the theory behind the
selection of variables, the selection of a particular model specification and
the performance assessment of various models. We then introduce the
hypotheses, describe the data and the exploratory analysis, build the models,
and conclude the section with an evaluation of the hypotheses.
For a brief introduction to the application of scorecards in the underwriting process
and in portfolio management in finance, please see the appendix.
4.1
Variable Selection
In practice, the available dataset contains a very high number of potential input variables, although usually not all of them are relevant to the classification or
regression problem. In the past, credit scoring was not a typical example of a
high-dimensional problem because little information about the client
was available; the datasets rarely contained more than 100 input variables.
Nevertheless, with the expansion of the internet, various smart technologies
and devices and high-speed data connections, the availability of data has
increased dramatically. It is not uncommon today for the risk department to have
at its disposal hundreds of potential input variables containing, for example, the history
of banking transactions, information from pension funds, mobile phone operators, credit bureaus or social media. Even when the client calls the company's
call-center, his voice can be analyzed and used for scoring.
As we have outlined earlier, decreasing the dimensionality will be an increasingly
important task in the future. The expected impacts of dimensionality
reduction are generally lower computational costs and a performance increase of
the final model. The latter is especially crucial, as irrelevant input variables
may actually bring additional noise into the model and hence decrease the performance and, in the case of credit scoring, increase the overall credit losses.
For these reasons, some sort of variable selection or filtering needs to
be done before model development can commence. There are several established methods (and many new and progressive ones), and many of them can be
used more widely, not only for the models introduced in this thesis.
Forward selection is a simple sequential iterative procedure for selecting
an appropriate set of relevant input variables (Derksen & Keselman 1992). The
algorithm starts with an empty set of input variables and then iterates over
all potential input variables, using exactly one at a time, hence creating n
different models in the first round of iterations. All these models are assessed
and the input variable with the best performance is chosen and included in the
preliminary model.
In the next step, the other n − 1 variables are used to create n − 1
models, each incorporating two input variables: the one from the first
round of iterations and one of the remaining
n − 1 variables. Each model's performance is then assessed and the best
performing variable is added to the preliminary model. There is no generally
accepted optimal stopping criterion for this iterative process. The criterion can
be reaching a certain number of input variables or a minimum performance improvement for the last included input variable.
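The forward selection procedure just described can be sketched as follows; the base model (a logistic regression), the AUC criterion, the synthetic data and the 0.005 improvement threshold are illustrative assumptions, not choices made in the thesis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           random_state=1)

def auc_of(columns):
    """Cross-validated AUC of a logistic regression on the chosen columns."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, columns], y, cv=5,
                           scoring="roc_auc").mean()

selected, best_auc = [], 0.5
while len(selected) < X.shape[1]:
    remaining = [j for j in range(X.shape[1]) if j not in selected]
    # One candidate model per remaining variable, each adding exactly one.
    scores = {j: auc_of(selected + [j]) for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] - best_auc < 0.005:   # minimum-improvement stopping rule
        break
    selected.append(j_best)
    best_auc = scores[j_best]

print("selected variables:", selected, f"AUC: {best_auc:.3f}")
```

Backward selection would start from the full variable set and drop the worst performer each round instead.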
The backward selection algorithm is basically a modification of the forward selection
process. The difference is that all potential input variables are
included in the initial model. Then, in each round of iterations, the performance
of all variables is evaluated and the worst performing variable is dropped.
One of the state-of-the-art variable selection methods is the Minimum Redundancy
Maximum Relevancy (MRMR) algorithm (Peng et al. 2005). This approach is an
extension of backward and forward variable selection and its logic is based
on two ideas. The first idea behind this approach is that we should not employ
variables in the model which are significantly correlated. In other words, the
redundancy among variables needs to be taken into account; thus we should
keep variables with high levels of dissimilarity among them.
Let U represent some set of uni-dimensional discrete variables {X1, X2, . . . , Xn}
and let C be a known class which can take values in {c1, c2, . . . , cm}. The set
S ⊆ U represents any subset of the set U.
One way of globally measuring the redundancy among the variables in
the subset S is (Ding & Peng 2005):
$$W_I(S) = \frac{1}{|S|^2}\sum_{X_i,X_j\in S} MI(X_i, X_j), \qquad (4.1)$$
where $MI(X_i, X_j)$ represents the measure of mutual information between the two input variables. More details about this measure can be found
in Peng et al. (2005).
The second idea underlying this concept of variable selection is that minimized
redundancy should be accompanied by the maximum relevance criterion of the
input variables with respect to the dependent variable. An acceptable measure
of the overall relevance of the variables in S with respect to the dependent
variable is
$$V_I(S) = \underset{S\subseteq U}{\operatorname{arg\,max}} \sum_{X_i\in S} MI(C, X_i) \qquad (4.2)$$
The combination of the relevancy and redundancy used to obtain a suitable
subset of variables is as follows
$$S^* = \underset{S\subseteq U}{\operatorname{arg\,max}}\left[V_I(S) - W_I(S)\right] \qquad (4.3)$$
The aforementioned description of MRMR is designed for discrete variables. The MRMR variable selection for continuous variables employs the same
logic (Peng et al. 2005).
Traditionally, the relevance of a variable is the most important selection
criterion, and thus the majority of variable selection algorithms concentrate
nearly exclusively on the relevance of the given variable. Nevertheless,
in order to obtain a more complex variable subset with superior generalization
potential, the selected variables need to be non-redundant. In other words,
each variable needs to bring new information into the problem. The backward
and forward variable selection approaches focus on the relevancy condition. The
MRMR selection algorithm focuses on relevancy as well; moreover, it
introduces a limitation on the redundancy of the variables. By its nature, this
decreases the risk of multicollinearity in traditional econometric models.
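The MRMR idea can be sketched for discrete variables with a greedy loop: at each step, add the variable maximizing relevance $MI(C, X_i)$ minus average redundancy with the already-selected variables. Note the exhaustive argmax over all subsets S in (4.3) is replaced here by this standard incremental approximation, and the data is synthetic; both are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
C = rng.integers(0, 2, size=2000)                 # class labels
X = np.column_stack([
    C ^ (rng.random(2000) < 0.1),                 # relevant variable
    C ^ (rng.random(2000) < 0.1),                 # relevant, redundant with the first
    rng.integers(0, 3, size=2000),                # noise
    rng.integers(0, 2, size=2000),                # noise
])

def mrmr(X, C, k):
    """Greedy MRMR: maximize MI(C, Xi) minus mean MI with selected variables."""
    selected = []
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in set(range(X.shape[1])) - set(selected):
            relevance = mutual_info_score(C, X[:, j])
            redundancy = (np.mean([mutual_info_score(X[:, j], X[:, s])
                                   for s in selected]) if selected else 0.0)
            if relevance - redundancy > best_score:
                best_j, best_score = j, relevance - redundancy
        selected.append(best_j)
    return selected

selected_vars = mrmr(X, C, k=2)
print("selected variables:", selected_vars)   # a relevant column is picked first
```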
4.2
Model Selection
In practice, there are usually many models to choose from and it is not
clear in advance which one will provide the highest performance for a given task.
For example, in the case of the support vector machines, a suitable kernel
function should be selected and then its optimal parameters found.
In general, there are several methods based on cross-validation that differ in
the required computing resources and precision. We will introduce the k-fold
cross-validation process, as this method is not extremely computationally demanding but still provides good results (Rodrı́guez et al. 2010). In this method,
the dataset S is randomly divided into k subsets, S1, S2, . . . , Sk. One of them
is left for testing purposes and the rest, (k − 1) subsets, is used for training the model. This process thus generates k sets of performance statistics. Subsequently,
these statistics are compared and the best model is chosen.
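A minimal illustration of this k-fold procedure follows: each fold serves once as the test set, and the k performance figures are then compared across candidate models. The two candidate models and the synthetic data are illustrative stand-ins, not the thesis's final specifications.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=8, random_state=2)
candidates = {
    "logit": LogisticRegression(max_iter=1000),
    "svm_poly2": SVC(kernel="poly", degree=2, probability=True),
}

for name, model in candidates.items():
    aucs = []
    folds = KFold(n_splits=5, shuffle=True, random_state=2)
    for train_idx, test_idx in folds.split(X):
        model.fit(X[train_idx], y[train_idx])        # train on k-1 subsets
        p = model.predict_proba(X[test_idx])[:, 1]   # score the held-out subset
        aucs.append(roc_auc_score(y[test_idx], p))
    print(f"{name}: mean AUC {np.mean(aucs):.3f} over {len(aucs)} folds")
```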
4.3
Performance Assessment
In this section, we will introduce a suitable definition of a bad client and, subsequently, performance indices based on distribution and density functions.
Probably the most important step in building a predictive model is the appropriate definition of the dependent variable. Thus, in credit scoring it is
utterly important to precisely define the good and the bad client. The common
practice is a definition based on days past due (henceforth DPD). Furthermore, it
is crucial to determine the time horizon over which this metric is traced.
For example, a bad client can be regarded as a client who was at least 60 DPD
on the first 5 installments.
The selection of the proper default definition depends predominantly on the type
of debt product. Certainly, different definitions will be employed for consumer loans with maturities around one year and for mortgages, where the
maturities are measured in decades. Besides the maturity of the loan, the purpose of the model is crucial. For fraud prevention purposes, very short defaults
are used, e.g. only default on the first installment is taken into consideration,
while for consumer loan underwriting longer time horizons are generally used,
ranging from three to twelve months.
Once the definition of the good and bad client is done and the score is estimated (this will be done in the subsequent parts), the evaluation of the predictive
model has to be conducted. We will employ indices based on the cumulative
distribution function, the main representatives being the Kolmogorov-Smirnov
statistic, Lift and the AUC coefficient. Nevertheless, these indices focus
rather on the statistical properties of the model and do not properly address
the business needs - the profit-making potential. This aspect will be assessed
in this thesis as well, using Monte-Carlo simulation and a cost matrix.
Let us assume that a score s is available for each client/observation and define
the following notation:
$$D_k = \begin{cases} 1, & \text{if client is good} \\ 0, & \text{otherwise} \end{cases} \qquad (4.4)$$
The empirical cumulative distribution functions of the scores of the bad (good) clients are given
by the following relationships (Siddiqi, 2005):
$$F_{m,bad}(a) = \frac{1}{m}\sum_{i=1}^{m} I(s_i \le a \wedge D_k = 0), \qquad (4.5)$$
$$F_{n,good}(a) = \frac{1}{n}\sum_{i=1}^{n} I(s_i \le a \wedge D_k = 1), \quad a \in [L, H] \qquad (4.6)$$
The quantity $s_i$ represents the score of the i-th client, n is the number of good
clients, m represents the number of bad clients and I stands for an indicator function
with I(true) = 1 and I(false) = 0. The L and H represent the minimum and
maximum value of the score, respectively. The empirical distribution function
of scores of all clients is given by
$$F_{n+m,all}(a) = \frac{1}{n+m}\sum_{i=1}^{n+m} I(s_i \le a), \quad a \in [L, H] \qquad (4.7)$$
The Kolmogorov-Smirnov statistic (sometimes written as K-S or KS) is defined as the maximal difference between the cumulative distribution functions of
good and bad clients (Mays, 2005). More precisely,
$$KS = \max_{a\in[L,H]} \left(F_{n,good}(a) - F_{m,bad}(a)\right) \qquad (4.8)$$
The logic of the Kolmogorov-Smirnov statistic is illustrated by the following figure.
We see the cumulative distribution functions of the Goods and the Bads and the difference
between these two curves. The peak of this difference curve defines the value
and location of the KS statistic.
Figure 4.1: K-S Statistics
Source: own processing
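The K-S statistic of equations (4.5)-(4.8) can be computed directly in numpy; the scores below are simulated for illustration, and the two-sided absolute difference is used, which is the standard form and coincides with (4.8) when the good clients' CDF dominates.

```python
import numpy as np

rng = np.random.default_rng(0)
good = rng.normal(0.7, 0.15, size=5000)   # simulated scores of good clients
bad = rng.normal(0.4, 0.15, size=300)     # simulated scores of bad clients

grid = np.sort(np.concatenate([good, bad]))               # all observed scores a
# Empirical CDFs F_good(a), F_bad(a) evaluated on the common grid.
F_good = np.searchsorted(np.sort(good), grid, side="right") / good.size
F_bad = np.searchsorted(np.sort(bad), grid, side="right") / bad.size

ks = np.max(np.abs(F_good - F_bad))   # peak of the difference curve
print(f"KS = {ks:.3f}")
```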
The ROC (Receiver Operating Characteristic) curve can also be used to show the
discriminatory power provided by the scoring function (Warnock & Peck 2010).
This curve can be described as
$$y = F_{n,good}(a), \qquad (4.9)$$
$$x = F_{m,bad}(a), \quad a \in [L, H] \qquad (4.10)$$
The logic of the ROC is intuitive. Each point of this curve represents some share of
accepted good and bad clients. The following figure exhibits this relationship.
We have created a ROC curve based on only 20 loans ordered by score.
The Bads are concentrated at the lower values of the score. Using these scores
we can accept e.g. 38% of the Goods and only 8% of the Bads at the same time. This is
illustrated by the red point (8%, 38%).
Figure 4.2: ROC, AUC
Source: own processing inspired by Warnock & Peck (2010)
The following terms are fundamental for understanding the theory related
to the ROC (Zweig & Campbell 1993):
• True positive: the loan is good and the model predicts good
• False positive: the loan is bad and the model predicts good
• True negative: the loan is bad and the model predicts bad
• False negative: the loan is good and the model predicts bad
When evaluating the accuracy of the models, the terms sensitivity and specificity are used. The sensitivity of a scoring model refers to its ability to
correctly identify the good loans.
$$Sensitivity = \frac{True\ Positives}{True\ Positives + False\ Negatives}$$
A model with 100% sensitivity correctly identifies all good loans. A model with
80% sensitivity detects 80% of good loans (true positives), but 20% of the good
loans remain undetected (false negatives).
The specificity of a model refers to its ability to correctly identify the not-good
loans (i.e. the bad ones).
$$Specificity = \frac{True\ Negatives}{True\ Negatives + False\ Positives}$$
A model with high sensitivity but low specificity results in many bad loans
being labeled as good. In other words, the model cannot recognize the bad
loans.
Another term used to describe the utility of a model is the likelihood ratio.
This ratio expresses how much more likely a good loan is to be labeled as good
than a bad loan is.
$$Likelihood\ ratio = \frac{Sensitivity}{1 - Specificity}$$
Another quality measure, AUC (Area Under the ROC Curve), is directly connected to the ROC and describes the global quality of a given scoring function (Lobo
et al. 2008). The AUC lies between 0 and +1, with a
perfect model reaching the value +1. The AUC measures the overall discriminatory power.
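The four measures above can be computed from predictions as follows; the ten loans, their scores and the 0.5 cut-off are illustrative, not taken from the thesis data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0, 1, 1])   # 1 = good loan, 0 = bad loan
score = np.array([0.9, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.4, 0.85, 0.3])
y_pred = (score >= 0.5).astype(int)                  # illustrative cut-off

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                 # share of good loans detected
specificity = tn / (tn + fp)                 # share of bad loans detected
likelihood_ratio = sensitivity / (1 - specificity)
auc = roc_auc_score(y_true, score)           # threshold-free overall measure

print(f"sensitivity {sensitivity:.3f}, specificity {specificity:.3f}, "
      f"likelihood ratio {likelihood_ratio:.3f}, AUC {auc:.3f}")
```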
Another useful indicator of the discriminatory power is the Cumulative Lift
(Bhattacharyya 2000). This is a local rather than global performance measure,
which tells us how many times better than a random model the scoring model is
at a given level of acceptance (or rejection). Put differently,
it tells us how many times riskier a certain part of the dataset is than the average.
More precisely,
$$Lift(a) = \frac{BadRate(a)}{BadRate} = \frac{F_{m,bad}(a)}{F_{n+m,all}(a)}, \quad a \in [L, H] \qquad (4.11)$$
The Cumulative Lift is demonstrated in the subsequent figure. We can
see that the worst 10% of loans are approximately five times riskier than the
whole portfolio of loans.
Figure 4.3: Cumulative Lift
Source: own processing
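A numpy sketch of equation (4.11) follows: the lift of the worst 10% of loans relative to the overall bad rate. The bad rate, score distributions and seed are simulated assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
bad = rng.random(n) < 0.04                      # ~4% bad rate, simulated
score = np.where(bad, rng.normal(0.4, 0.15, n), rng.normal(0.7, 0.15, n))

order = np.argsort(score)                       # from the worst score upwards
worst_10pct = order[: n // 10]
lift = bad[worst_10pct].mean() / bad.mean()     # BadRate(a) / BadRate
print(f"lift of the bottom 10%: {lift:.1f}")
```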
4.4
Hypotheses
In order to assess the performance of the aforementioned approaches on real
data and compare them with the logistic regression, we have defined three
hypotheses:
(i) The models based on artificial intelligence outperform the conventional
decision-making techniques as an ordinal risk measure.
(ii) The models based on artificial intelligence outperform the conventional
decision-making techniques in terms of potential profit.
(iii) The models based on artificial intelligence provide more time-stable performance in risk decision-making in comparison with conventional techniques.
The first hypothesis will be tested by ordering the loans by the scores assigned
by the different discrimination techniques. The relationship between risk and
score will then be assessed and compared for each model. Specifically, we
will employ and analyze the ROC curve and AUC coefficient, the difference between the cumulative distribution functions for good and for bad clients, the Kolmogorov-Smirnov statistic, the dynamics of
score and expected risk, the cumulative lift and others. These metrics allow us to
assess the ordering ability of each method. Thus, the value of the score is not
vital; we only need better scores for better clients and vice versa.
The second hypothesis will be tested based on the maximal achievable potential
profit. In order to do this with a high level of diligence, we will employ
Monte-Carlo simulation and a cost matrix. The Monte-Carlo simulation provides us with a certain level of assurance that the superior performance of certain
approaches is not coincidental. The other fundamental part is the cost matrix. The majority of the current methods for assessing various scoring approaches do
not take into consideration one major aspect - the real cost of rejecting a good
client and that of approving a bad client are radically different. The previous tools, such as the
ROC curve, Cumulative Lift, Kolmogorov-Smirnov statistic, etc., do not take
this into consideration. Hence, in order to assess the various approaches with
regard to their business potential, we have to employ this information as well.
The third hypothesis will be evaluated indirectly by using the out-of-time dataset for evaluation. Time stability is crucial for the successful business implementation of a scoring model. Many models perform well on the training
sample, but their performance on the out-of-time dataset decreases rapidly.
This is usually caused by overfitting and an improper set-up of the models on the
training sample. We will take the most recent 20% of the data as the out-of-time
dataset and the remaining part will be split randomly between the training and validation datasets. This approach ensures a more reliable appraisal of the scoring
precision and time stability.
Figure 4.4: Out-of-time Sample (training and validation samples drawn from the older data; the out-of-time sample is the most recent part of the time axis)
Source: own processing
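The split described above can be sketched as follows; the 75/25 in-time split ratio, array sizes and date encoding are illustrative assumptions (the thesis specifies only the 20% out-of-time cut).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16_676
# Loans assumed sorted by application date (days since portfolio start).
application_date = np.sort(rng.integers(0, 730, size=n))

cut = int(n * 0.8)
in_time_idx = np.arange(cut)       # older 80% of loans
oot_idx = np.arange(cut, n)        # most recent 20%: out-of-time sample

shuffled = rng.permutation(in_time_idx)
train_idx = shuffled[: int(0.75 * cut)]   # random split of the in-time part
valid_idx = shuffled[int(0.75 * cut):]

print(len(train_idx), len(valid_idx), len(oot_idx))
```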
4.5
Data Description and Exploratory Analysis
Our models will be trained and evaluated using a dataset from a financial institution from the western part of Asia. We have a dataset with many variables at hand.
Nevertheless, datamining is not the topic of this thesis, so we have selected
eleven input variables with substantial discriminating power. This was done on an
expert basis, but this does not limit us, as we need to compare several models
under the same realistic conditions.
The default (i.e. being a bad client) was defined as being at least 30 days overdue on the first installment. This definition was selected because the financial
institution had certain issues with external fraudsters who did not pay even
the first installment. These fraudsters (clients) have a different profile
from the usual risky clients (who tend to default on later installments, not on
the first one) and the usual relationships might not hold. Thus, this is a nice
opportunity for challenging the logistic regression with the artificial intelligence
approach.
The summary for the out-of-time data sample can be found below; the characteristics of the validation and training samples are naturally very similar. As the
initial exploratory analysis is crucial for further success, we provide the
statistical distributions of good and bad clients for the non-categorical variables
and Risk versus Category plots for the categorical variables. It can be immediately
seen from these plots that these variables have certain discrimination potential.
The binary variable Land Line Refused is equal to one if the client refused
to give his land line number, the variable Historical Applications represents the number of all historical loan applications in the credit bureau and the variable Term
the length of the loan in months. The other variables are intuitive.
The dataset contains only consumer loans for buying specific goods
(e.g. a mobile phone, notebook or e-bike). The creditworthiness of these clients
is rather average or sub-prime.
There were no problems with data quality, e.g. no missing values or obviously erroneous values.
Table 4.1: Observations

Total    Goods    Bads   Bad Rate
16,676   16,090   586    3.51%

Table 4.2: Variables

Total   Categorical   Non-Categorical
11      6             5

Table 4.3: Categorical Variables

Variable             Categories
Client Type          2
Education            4
Income Type          7
Land Line Refused    2
Number of Children   3
Sex                  2
Table 4.4: Categorical Variables: Detailed Overview

Categorical Variable   Category          Share     Risk
Client Type            New               56.77%    5.11%
                       Repeating         43.23%    1.41%
Income Type            Blue Collar       34.49%    5.88%
                       Businessman       1.24%     3.38%
                       Government        24.57%    3.30%
                       Academic          0.67%     2.70%
                       Maternity Leave   28.69%    1.92%
                       White Collar      10.27%    0.64%
                       Other             0.08%     0.00%
Sex                    Male              44.72%    4.96%
                       Female            55.28%    2.34%
Education              University        0.33%     5.45%
                       Elementary        62.08%    4.19%
                       Other             1.03%     3.49%
                       High school       36.56%    2.35%
Number of Children     Zero              56.19%    4.32%
                       One               26.88%    2.63%
                       More than two     16.93%    2.23%
Land Line Refused      Yes               38.20%    6.09%
                       No                61.80%    1.92%

4.6
Model Building
We have decided to employ five different models:
• Multi-Layer Perceptron (MLP)
• Radial Basis Function Network (RBFN)
• Random Forest (RF)
• Support Vector Machine (SVM)
• Logistic Regression (Logit)
The neural networks are represented by two models because the logic of the
RBFN is quite different from that of the MLP, and it will be interesting to see a comparison of these two models. We are employing the decision trees in their advanced
variation - Random Forests. The core idea of a decision tree is very simple and has
many drawbacks. On the other hand, it will be beneficial to see how this simple
model performs in comparison with the much more complicated models from the
artificial intelligence field. The main idea of the SVM is very elegant - transformation of a linearly inseparable dataset into a different space and subsequent
separation with a linear hyperplane - but this advantage can be its weak point
as well. It is still unclear how to find the optimal mapping into the different space.
The last model is the Logit, as a representative of the current industry standard.

Figure 4.5: Non-Categorical Variables: Detailed Overview

Variable                  Mean       St. Dev.   Min      Max         75% Pctl.   50% Pctl.   25% Pctl.
Employment Length         56.62      71.36      -        552.00      78.00       30.00       12.00
Goods Price               4,878.26   3,291.72   450.00   31,620.00   6,610.00    4,230.00    2,400.00
Historical Applications   5.53       7.71       -        103.00      8.00        3.00        -
Income                    3,359.59   4,466.05   -        69,200.00   4,500.00    2,100.00    800.00
Term                      9.65       4.79       3.00     18.00       12.00       12.00       6.00

Figure 4.6: Employment Length
Figure 4.7: Goods Price
Figure 4.8: Bad & Good: Visualisation
Figure 4.9: Categorical Variables: Visualisation
The variable selection was done by the MRMR algorithm. We have decided to
use the seven variables with the highest added value (high relevance and low redundancy). One of the many advantages of this approach is the fact that we do not
have to control much for the correlation among the input variables. One of the
assumptions of logistic regression is low or even no multicollinearity. This is
very difficult to attain in real applications, and so a low correlation among the
variables (e.g. < 0.20) has to be accepted.
We have decided on the same seven input variables for all models because
we need to compare the models among themselves. If each model were built on a
different dataset, then we would not be able to say whether the difference in
performance is caused by the model itself or by the different potential of the
input variables. Using the same input variables for all models ensures more
level conditions.
The final variables are these:
• Historical Applications
• Client Type
• Employment Length
• Goods Price
• Income
• Land Line Refused
• Sex
We have tried to learn many specifications of the aforementioned models on
the training and validation samples and then assess them. This process was
very demanding in terms of time and computation. Especially in the case
of the more complex neural networks, the learning algorithm needed several hours
to finish. Thus, it was impossible to try all possible combinations of all input
parameters for all models. Based on preliminary results, we have decided to use
the following specifications. These specifications delivered the best performance
(assessed by AUC on the out-of-time sample), many of them are recommended by
various research papers and all of them are relatively simple, without any unexpected or unusual characteristics.
The final specifications are as follows:
• MLP - Two hidden layers with the logit function as the activation function;
jump connections are allowed.
• RBFN - The K-means algorithm was selected for clustering purposes.
• SVM - Polynomial kernel of second order; the cost parameter equals 1.
• RF - The forest comprises 80 trees with a maximal depth of 3.
The estimated Logit model is below. The values of the coefficients are in line with
our expectations and all variables are statistically and economically significant.
The table with the independent variables (predictors) and the parameter estimates
follows.
Table 4.5: Independent Variables

Independent variable   Description               Categories   p-value
x1                     Historical Applications   n.a.         < 0.0001
x2                     Client Type               2            < 0.0001
x3                     Employment Length         n.a.         < 0.0001
x4                     Price                     n.a.         < 0.0001
x5                     Income                    n.a.         < 0.0001
x6                     Land Line Refused         2            < 0.0001
x7                     Sex                       2            < 0.0001
Table 4.6: Parameter Estimation

Parameter   Variable Description      Category Description   Estimate
b0          Intercept                 n.a.                   −3.2518
b1          Historical Applications   n.a.                   −1.9412
b21         Client Type               New                    −2.3827
b22         Client Type               Repeating              0
b3          Employment Length         n.a.                   2.8493
b4          Price                     n.a.                   −0.8428
b5          Income                    n.a.                   1.9358
b61         Land Line Refused         Yes                    −2.4692
b62         Land Line Refused         No                     0
b71         Sex                       Male                   −1.3051
b72         Sex                       Female                 0
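To show how an estimated Logit scores an applicant, the linear predictor is the intercept plus coefficient × value, with zero-estimate dummy categories as reference levels, passed through the logistic link. The applicant's input values below are hypothetical, and the sketch assumes the inputs are already transformed exactly as during estimation; it is an illustration of the mechanics, not a reproduction of the thesis scorecard.

```python
import math

# Estimates taken from Table 4.6; reference categories carry coefficient 0.
coef = {
    "intercept": -3.2518,
    "historical_applications": -1.9412,
    "client_type_new": -2.3827,        # 'Repeating' is the reference level
    "employment_length": 2.8493,
    "price": -0.8428,
    "income": 1.9358,
    "land_line_refused_yes": -2.4692,  # 'No' is the reference level
    "sex_male": -1.3051,               # 'Female' is the reference level
}

# Hypothetical, already-transformed inputs for one applicant.
applicant = {
    "historical_applications": 0.3,
    "client_type_new": 1,
    "employment_length": 0.5,
    "price": 0.4,
    "income": 0.6,
    "land_line_refused_yes": 0,
    "sex_male": 1,
}

z = coef["intercept"] + sum(coef[k] * v for k, v in applicant.items())
probability_good = 1.0 / (1.0 + math.exp(-z))   # logistic link
print(f"score (probability of being good): {probability_good:.4f}")
```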
The comparison of the final specifications of all models in terms of Sensitivity, Specificity, Likelihood ratio and AUC is crucial for assessing the
discriminatory power of the models. The value of sensitivity is high for all models, but the specificity is much more important here. The specificity is highest for the MLP
and lowest for the SVM. In other words, the MLP can recognize
the Bads much better than any other model. This will be nicely visible in the ROC
and Lift charts.
Table 4.7: Comparison of Final Specifications

                   MLP     Logit   RBFN    RF      SVM
Sensitivity        0.993   0.992   0.992   0.997   1.000
Specificity        0.305   0.148   0.174   0.111   0.089
Likelihood ratio   1.430   1.166   1.201   1.122   1.097
AUC                0.841   0.812   0.819   0.796   0.789
All of the final models were assessed on the training sample, validation sample
and out-of-time sample. We have compared the AUC coefficients for all models
and for all samples. There was only a slight decrease between the training and
validation samples. We have observed a decrease of the AUC coefficient between the
validation sample and the out-of-time sample. However, this was expected, as the
population of clients changes over time.
Figure 4.10: AUC coefficient comparison
Source: own processing
4.7
Evaluation of Ordering Ability
Probably the most important ability of an application scorecard is ordering - a less
risky client needs to get a better score. Of course, this relationship does not
always hold, otherwise we would have a perfect model. But we need a model that
is as close to the perfect model as possible. The following will be used for
the assessment of the first hypothesis:
• Dynamics of risk and score
• ROC curve
• Difference between the cumulative distribution functions for good and for
bad clients
• Cumulative Lift
• Distribution of Goods and Bads
The following chart depicts the relationship between the risk and the score for
the five models. Each line represents one model. The riskiness (y-axis) of the total portfolio is 3.51%. The x-axis represents the clients (or loans) ordered by score,
from the best to the worst one. This axis is in percentiles because each model
assigned each client a different score. For a better understanding of this graph, we
have compared the risk of the portfolio at the 70th percentile. We can interpret
this in the following way: if we approve only the best 70% of clients, then the
risk would be 0.97% if we use the MLP model or 1.32% in the case of the SVM model
(i.e. risk higher by approximately one third).
We can see that MLP performs substantially better than the other models,
followed by RBFN, Logit, RF and SVM.
Table 4.8: Portfolio Risk on the 70th Percentile: Model Comparison

Percentile   Logit   MLP     RF      SVM     RBFN
70.00%       1.17%   0.97%   1.30%   1.32%   1.12%
Figure 4.11: Risk Dynamics
Source: own processing
The ROC curves show the ratio of cumulative Bads to cumulative Goods. Specifically, they show how many bad clients are among the best x% of good
clients. Again, the MLP shows the best discriminatory ability, followed by RBFN,
Logit and RF, with the SVM being the worst model.
Table 4.9: K-S & AUC Estimation Results

       Logit   MLP     RF      SVM     RBFN
K-S    0.489   0.528   0.473   0.460   0.500
AUC    0.812   0.837   0.796   0.789   0.819
Figure 4.12: ROC
Source: own processing
The following plot shows the difference between the cumulative distribution
functions of the Goods and the Bads. The highest point of each curve represents the
Kolmogorov-Smirnov statistic. Using this plot, we see the development of the
discriminatory power of the various models. The shapes are rounded and wide, which
means that, in the case of e.g. the MLP, the discriminatory power is high and relatively
stable between the 60th and 85th percentiles.
Figure 4.13: Difference between CDF of Goods and Bads
Source: own processing
The Cumulative Lift graph shows the riskiness of the bottom x% of loans in
comparison with the riskiness of the whole portfolio. This measure is convenient for the local assessment of the discriminatory power. We see that the
performance of the MLP is the best in the bottom 10% and the SVM performs
significantly worse than any other model.
Figure 4.14: Cumulative Lift
Source: own processing
We have calculated the distribution of Goods and Bads for each model.
These distributions do not provide us with a clear answer as to which model
is better; nevertheless, it is always beneficial to see these plots, as they help us
understand what kind of scores the models are producing.
The distribution of Goods is as expected - the majority of Goods is highly concentrated at the top scores. The only exception is the RBFN, where the Goods
are spread over a wider interval. We did not expect this and it is interesting,
especially if we take into consideration that the RBFN model usually had the
second best performance according to the aforementioned criteria. The
distribution of Bads, however, is more revealing. Some of the models concentrated the Bads
at higher scores (RF, SVM and partially RBFN) while other models spread
the Bads over a longer interval of lower scores (MLP and Logit).
Figure 4.15: Goods & Bads Distribution - part 1
Figure 4.16: Goods & Bads Distribution - part 2
4.8
Evaluation of Profit-Making Potential
The previous section focused mainly on the statistical properties of the models;
however, all these models are meant to be used in a business environment, where the
main criterion is the profit-making potential. We have estimated the following
cost matrix:
Table 4.10: Cost Matrix

             Prediction
Observed     Good    Bad
Good         1.0     0.0
Bad          −12.5   0.0
We have focused on measurable costs, so we are excluding the opportunity
costs (rejecting a Good by error). The ratio between a good loan and a bad
one is 1 : 12.5; the direct financial implication is that we need to provide
12.5 good loans in order to cover one bad one. We have used this high ratio
because the definition of default is delinquency on the first installment
and hence fraud. In these cases the probability of curing the client
(repaying the overdue installments and then repaying as promised) is
practically zero, and the chance of success in the legal collection process
is very low, too.
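Under this cost matrix, the break-even bad rate among approved loans follows directly: the expected payout per approved loan is (1 − p) · 1 − 12.5 · p, which is non-negative only for p ≤ 1/13.5 ≈ 7.4%. A minimal check (the function name is ours, not from the thesis):

```python
def expected_payout(bad_rate, payout_good=1.0, loss_bad=12.5):
    """Expected payout per approved loan for a given share of bad loans."""
    return (1.0 - bad_rate) * payout_good - bad_rate * loss_bad

# Break-even bad rate: (1 - p) * 1 = 12.5 * p  =>  p = 1 / 13.5 ~ 7.4%
BREAK_EVEN_BAD_RATE = 1.0 / 13.5
```

Any portfolio segment with a bad rate above roughly 7.4% is therefore loss-making under this matrix.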
We have employed the out-of-time sample only and calculated the maximal
profit for each model. The following chart illustrates this logic. We have
ordered the loans based on the estimated score (from the best one to the
worst one). A payout based on the cost matrix (1 for a good loan, −12.5 for
a bad loan) was assigned to each of the loans. The curve shows the
cumulative profit.
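The construction of this curve can be sketched as follows. This is our own minimal Python illustration (function and variable names are hypothetical), assuming a higher score means a better client:

```python
import numpy as np

def profit_curve(scores, is_good, payout_good=1.0, payout_bad=-12.5):
    """Order loans from the best score to the worst, assign the cost-matrix
    payout to each, and accumulate the running profit."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # best score first
    payouts = np.where(np.asarray(is_good, dtype=bool)[order],
                       payout_good, payout_bad)
    cum_profit = np.cumsum(payouts)
    best = int(np.argmax(cum_profit))                      # profit-maximising point
    approval_rate = (best + 1) / len(cum_profit)           # share of loans approved
    return cum_profit, float(cum_profit[best]), approval_rate
```

The index of the curve's maximum gives both the maximal achievable profit and the corresponding approval rate used later in the Monte Carlo simulation.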
Figure 4.17: Profit Dynamics
Source: own processing
Process of the Monte Carlo simulation:
1. The maximal cumulative profit, together with the corresponding cumulative
share of loans (the approval rate), was calculated for each model using the
whole out-of-time sample. This approval rate can be different for each model.
2. Then we repeatedly took a random 20% of the out-of-time dataset for each
model and calculated the cumulative profit at the predefined approval rate
(calculated in step 1).
3. We have repeated the 2nd step 10,000 times.
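The steps above can be sketched in Python as follows. This is a simplified illustration with hypothetical names, not the exact code used in the thesis:

```python
import numpy as np

def monte_carlo_profit(scores, is_good, approval_rate, frac=0.2,
                       n_runs=10_000, payout_good=1.0, payout_bad=-12.5,
                       seed=0):
    """Steps 2 and 3 of the simulation: repeatedly draw a random subsample
    of the out-of-time set and compute the profit obtained by approving
    the top `approval_rate` share of its loans."""
    scores = np.asarray(scores, dtype=float)
    is_good = np.asarray(is_good, dtype=bool)
    rng = np.random.default_rng(seed)
    n = len(scores)
    k = int(frac * n)                                  # subsample size (20%)
    profits = np.empty(n_runs)
    for i in range(n_runs):
        idx = rng.choice(n, size=k, replace=False)     # random 20% of loans
        ranked = idx[np.argsort(-scores[idx])]         # best scores first
        approved = ranked[: int(approval_rate * k)]    # fixed approval rate
        profits[i] = np.where(is_good[approved],
                              payout_good, payout_bad).sum()
    return profits
```

The resulting array of 10,000 profits is what the summary statistics below (mean, standard deviation, percentiles, kurtosis, skewness) are computed from.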
A table with a summary of this simulation and a graph showing the
distribution of the profit follow.
Table 4.11: Monte Carlo Simulation Summary

                        Logit      MLP     RBFN       RF      SVM
    Mean                 2102     2285     2183     2061     2051
    St. Dev.              161      176      164      160      161
    10th percentile      1892     2060     1973     1854     1846
    90th percentile      2312     2519     2402     2270     2262
    Kurtosis           -26.7%   -31.4%   -31.6%   -29.0%   -24.6%
    Skewness             3.6%    21.7%    19.1%     7.3%    10.8%
Figure 4.18: Monte Carlo Simulation Visualisation
Source: own processing
These results confirm the previous findings. The MLP and RBFN outperform
the other models in terms of profit. The only surprise is the SVM, where we
had expected much lower performance and a higher deviation of the profit
distribution.
Chapter 5
Conclusion
The aim of this thesis has been to introduce a number of approaches to
credit risk based on artificial intelligence, outline the underlying theory
and compare them. The theory supporting all the artificial intelligence
approaches has been presented. We have laid more emphasis on the mathematics
supporting the Support Vector Machines because this area is not covered as
well as the mathematics related to the Neural Networks and their various
specifications. The Random Forests were covered from the theoretical
perspective as well; nevertheless, the logic behind them is not as
complicated as in the case of the other approaches, which are substantially
more complex. The current market standard, Logistic Regression, has been
described as well. Each of these approaches could easily be described over
tens of pages, thus we refer frequently to advanced papers and books. We
have focused on the major ideas and properties of all the models.
Nevertheless, due to the complex nature of the aforementioned models, it is
impossible to describe them all in great detail in one thesis.
Moreover, we have outlined further theory needed for the assessment of the
aforementioned models. The standard measures for performance assessment have
been described; the main representatives have been the ROC curve, the K-S
statistic and the AUC statistic.
Later, the dataset from an unnamed financial company from Asia has been
described. Neither the risk management policies nor the name of the data
provider are published, as these are business secrets. The data remain
private for the same reason. Nonetheless, we were allowed to perform
exploratory analysis of the variables and focused on their ability to
determine the riskiness of the loan application. Then, the advanced MRMR
algorithm has been used for the selection of the final set of variables.
These variables have been employed in the model development itself.
We have developed five different models based on the introduced theory and
compared them based on the previously stated metrics. However, these metrics
evaluate the models only from the statistical perspective. Hence, we have
also evaluated these models from the business perspective. We have focused
on the dynamics of risk versus the share of approved loans, the differences
between the cumulative distribution functions for good and bad loans, the
cumulative lift curve and the maximal achievable profit for each of these
models.
Based on the aforementioned metrics and comparisons, we can conclude that
the approaches based on artificial intelligence can improve the risk
performance of a financial company, especially in comparison with Logistic
Regression. The best performing model according to all criteria has been the
Multi-Layer Perceptron (a specification of Neural Networks). In comparison
with the current market standard (Logistic Regression), it provides us with
significantly lower risk for the same share of approved loans (Figure 4.11)
and it allows for higher profit (this was shown using the Monte Carlo
simulation and the cost matrix). The second best model has been the Radial
Basis Function Network (a specification of Neural Networks as well). The
performance of this model has been comparable to the Logistic Regression.
The other models, Random Forests and Support Vector Machines, have performed
worse. In particular, the performance of the Support Vector Machines has
been lower than expected. In our opinion, the reason for this is the
complexity of this approach and possibly suboptimal optimization and
calculation techniques.
Despite the fact that the ordinary approach, the Logistic Regression, has
not been outperformed by all of the presented models based on artificial
intelligence, but only by some of them, we can conclude that the artificial
intelligence approach to credit risk can be more suitable than the Logistic
Regression. Hence, all of the hypotheses have been confirmed. We have been
able to develop a model based on artificial intelligence that has superior
ordering ability and profit-making potential in comparison with the Logistic
Regression.
Many interesting topics have emerged during the work on this thesis. One of
them is the low performance of the Support Vector Machines, which could
perhaps be improved by further research into the optimization of this model.
Another interesting topic is the application of artificial intelligence in
variable selection and preprocessing. Many of the variables can be grouped
and combined, and hence new variables can be created. This is currently done
based on the developer's experience, but a sound theory is missing, and
multi-classification based on artificial intelligence might be helpful.
Bibliography
Agresti, A. (2013): Categorical Data Analysis. Wiley Series in Probability
and Statistics. Wiley.
Amaratunga, D., J. Cabrera, & Y.-S. Lee (2008): “Enriched random forests.” Bioinformatics 24(18): pp. 2010–2014.
Angelini, E., G. di Tollo, & A. Roli (2008): “A neural network approach
for credit risk evaluation.” The quarterly review of economics and finance
48(4): pp. 733–755.
Bellotti, T. & J. Crook (2009): “Support vector machines for credit scoring and discovery of significant features.” Expert Systems with Applications
36(2): pp. 3302–3308.
Bhattacharyya, S. (2000): “Evolutionary algorithms in data mining: Multiobjective performance modeling for direct marketing.” In “Proceedings of
the sixth ACM SIGKDD international conference on Knowledge discovery
and data mining,” pp. 465–473. ACM.
Biau, G., L. Devroye, & G. Lugosi (2008): “Consistency of random forests
and other averaging classifiers.” The Journal of Machine Learning Research
9: pp. 2015–2033.
Breiman, L. (1996): “Bagging predictors.” Machine learning 24(2): pp. 123–
140.
Breiman, L. (2001): “Random forests.” Machine learning 45(1): pp. 5–32.
Breiman, L., J. Friedman, C. Stone, & R. Olshen (1984): Classification
and Regression Trees. The Wadsworth and Brooks-Cole statistics-probability
series. Taylor & Francis.
Buja, A. & W. Stuetzle (2006): “Observations on bagging.” Statistica Sinica
16(2): p. 323.
Cortes, C. & V. Vapnik (1995): “Support-vector networks.” Machine learning 20(3): pp. 273–297.
Cristianini, N. & J. Shawe-Taylor (2000): An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press.
Derksen, S. & H. Keselman (1992): “Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and
noise variables.” British Journal of Mathematical and Statistical Psychology
45(2): pp. 265–282.
Ding, C. & H. Peng (2005): “Minimum redundancy feature selection from
microarray gene expression data.” Journal of bioinformatics and computational biology 3(02): pp. 185–205.
Elizondo, D. (2006): “The linear separability problem: Some testing methods.” Neural Networks, IEEE Transactions on 17(2): pp. 330–344.
Ghodselahi, A. & A. Amirmadhi (2011): “Application of artificial intelligence techniques for credit risk evaluation.” International Journal of Modeling and Optimization 1(3): pp. 243–249.
Gouvea, M. & E. B. Gonçalves (2007): “Credit risk analysis applying logistic regression, neural networks and genetic algorithms models.” In “POMS
18th Annual Conference,” .
Hastie, T., R. Tibshirani, & J. Friedman (2013): The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in
Statistics. Springer New York.
Haykin, S. (2009): Neural Networks and Learning Machines. Prentice Hall.
Hosmer, D., S. Lemeshow, & R. Sturdivant (2013): Applied Logistic Regression. Wiley Series in Probability and Statistics. Wiley.
Khashman, A. (2010): “Neural networks for credit risk evaluation: Investigation of different neural models and learning schemes.” Expert Systems
with Applications 37(9): pp. 6233–6239.
Lobo, J. M., A. Jiménez-Valverde, & R. Real (2008): “AUC: a misleading
measure of the performance of predictive distribution models.” Global
Ecology and Biogeography 17(2): pp. 145–151.
Mays, E. (2001): Handbook of Credit Scoring. Business Series. Global Professional Publishing.
McNelis, P. (2005): Neural Networks in Finance: Gaining Predictive Edge in
the Market. Academic Press Advanced Finance. Elsevier Science.
Peng, H., F. Long, & C. Ding (2005): “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy.”
Pattern Analysis and Machine Intelligence, IEEE Transactions on 27(8):
pp. 1226–1238.
Rodrı́guez, J. D., A. Perez, & J. A. Lozano (2010): “Sensitivity analysis
of k-fold cross validation in prediction error estimation.” Pattern Analysis
and Machine Intelligence, IEEE Transactions on 32(3): pp. 569–575.
Schölkopf, B. & A. Smola (2002): Learning with Kernels: Support Vector
Machines, Regularization, Optimization, and Beyond. Adaptive computation
and machine learning. MIT Press.
Siddiqi, N. (2012): Credit risk scorecards: developing and implementing intelligent credit scoring, volume 3. John Wiley & Sons.
Souza, C. R. (2010): “Kernel functions for machine learning applications.”
Creative Commons Attribution-Noncommercial-Share Alike 3.
Warnock, D. G. & C. C. Peck (2010): “A roadmap for biomarker qualification.” Nature biotechnology 28(5): pp. 444–445.
Witten, I. H. & E. Frank (2005): Data Mining: Practical machine learning
tools and techniques. Morgan Kaufmann.
Witzany, J. (2010): Credit risk management and modeling. Oeconomica.
Yu, L., S. Wang, & K. K. Lai (2008): “Credit risk assessment with a multistage neural network ensemble learning approach.” Expert Systems with
Applications 34(2): pp. 1434–1444.
Zweig, M. H. & G. Campbell (1993): “Receiver-operating characteristic (ROC)
plots: a fundamental evaluation tool in clinical medicine.” Clinical
Chemistry 39(4): pp. 561–577.
Appendix A
Underwriting Process in Finance
Consumer credit risk can be defined as the risk of suffering a loss due to
the customer's default on a corresponding credit product, such as an
unsecured personal loan, credit card, mortgage, overdraft etc.
The majority of companies involved in lending to ordinary consumers have
divisions dedicated to consumer credit risk management, where the main
aspects are the measurement, prediction and mitigation of losses
attributable to credit risk.
One of the most widespread methods for predicting credit risk is the credit
scorecard. The scorecard is usually a statistical model assigning a number
to a client. This number reflects the estimated probability that the client
will behave in a certain manner. Within the process of score estimation, a
variety of data sources can be employed, including information from credit
bureaus, the application form, databases with historical behavior, the
internet, mobile operators or pension funds.
The most common and widespread type of scorecard is the application
scorecard, which is employed by finance companies when a client applies for
a credit product. The ultimate goal of this scorecard is to predict the
probability that the client, if the product is provided, would turn bad
within a certain time horizon and hence induce losses to the lender.
Nevertheless, the crucial word here is "bad", which will be explained in the
following sections. The definition of "being bad" can vary across lenders or
product types.
As we have mentioned above, the score represents a probability, thus it
should range between 0 and 1. Usually the estimated client's score is
transformed, used for underwriting purposes, and then possibly communicated
to the client, which in some countries is required by law. Nevertheless, for
internal purposes the estimated probability is often used without any
transformation. Because the transformation of the score brings us no
additional value, we will work in this thesis only with the probabilities.
Moreover, we will predict the probability that the client will not become
bad, because we want to preserve the logic that a higher score is better.
There are other important types of scorecards, for example behavioral
scorecards, which attempt to predict the probability of a current client
becoming bad, and collections scorecards, which predict the reaction to
different strategies for collecting overdue installments and outstanding
principal. Also widespread are propensity scorecards, which aim to predict
the probability that the client will accept another product, leave, stop
using a credit card or fully utilize his credit card's limit.
The loan underwriting process is concerned with employing the predictions
provided by scorecards in the decision whether to accept or reject a loan application.
If the scorecard is the main tool for the underwriting purposes and the
final yes/no decision, then "cut-off" points are used. A cut-off point is a
certain value of the score below which clients have their applications
declined. If the client's score is above the cut-off, then the application
may be approved or some additional process may follow.
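As a minimal illustration of this logic (the function name and the optional review band are our own, hypothetical choices, not part of any lender's actual process), the cut-off rule can be written as:

```python
def underwriting_decision(score, cutoff, review_band=0.02):
    """Cut-off rule sketch: decline below the cut-off, optionally route
    scores just above it to an additional (e.g. manual) review step."""
    if score < cutoff:
        return "decline"
    if score < cutoff + review_band:
        return "review"      # optional additional process
    return "approve"
```

In practice the cut-off value itself is chosen jointly with the product's pricing, as discussed next.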
The setting of this threshold is closely linked to the price (i.e. usually
the interest rate and fees) that the lender charges for the corresponding
product. Higher pricing allows for greater credit losses while still
remaining profitable. Thus, with a higher price the company can accept
clients with a higher estimated probability of becoming bad and move the
cut-off point down. However, the majority of sophisticated lending companies
go further and charge clients based on their score. This compensates the
lenders for the higher risk of the less creditworthy clients and also allows
for charging the better clients with higher scores less.
The credit strategies in consumer finance also deal with the ongoing
monitoring of clients' accounts, particularly for products with a revolving
feature such as credit cards, flexible loans or overdrafts, where the
client's balance (and the exposure of the company) can go down as well as
up. Behavioral scorecards can be applied on a regular basis to provide an
up-to-date status of the credit quality of the portfolio and of its
sub-segments.