Predicting loss given default (LGD) for residential mortgage loans: a two-stage model and empirical evidence for UK bank data
Abstract
With the implementation of the Basel II regulatory framework, it became increasingly
important for financial institutions to develop accurate loss models. This work
investigates the Loss Given Default (LGD) of mortgage loans using a large set of
recovery data of residential mortgage defaults from a major UK bank. A Probability
of Repossession Model and a Haircut Model are developed and then combined into
an expected loss percentage. We find the Probability of Repossession Model should
comprise more than just the commonly used loan-to-value ratio, and that
estimation of LGD benefits from the Haircut Model which predicts the discount the
sale price of a repossessed property may undergo. Performance-wise, this two-stage
LGD model is shown to do better than a single-stage LGD model (which directly
models LGD from loan and collateral characteristics), as it achieves a better R-square
value, and it more accurately matches the distribution of observed LGD.
Keywords:
Regression, Finance, Credit risk modelling, Mortgage loans, Loss Given Default (LGD),
Basel II
1. Introduction
With the introduction of the Basel II Accord, financial institutions are now required to
hold a minimum amount of capital for their estimated exposure to credit risk, market
risk and operational risk. According to Pillar 1 of the new Basel II capital framework,
the minimum capital required by financial institutions to account for their exposure to
credit risk can be calculated using two approaches, either the Standardized Approach
or the Internal Ratings Based (IRB) Approach. The IRB approach is further split into
two and can be implemented using either the Foundation IRB Approach or the
Advanced IRB Approach. Under the Advanced IRB Approach, financial institutions
are required to develop their own models for the estimation of three credit risk
components, viz. Probability of Default (PD), Exposure at Default (EAD) and Loss
Given Default (LGD), and this for each section of their credit risk portfolios. The
portfolios of a financial institution can be broadly divided into either the retail sector,
consisting of consumer loans like credit cards, personal loans or residential mortgage
loans, or the wholesale sector, which would include corporate exposures such as
commercial and industrial loans. The work here pertains to residential mortgage
loans.
In the United Kingdom, as in the US, the local Basel II regulation specifies that a
mortgage loan exposure is in default if the debtor has missed payments for 180
consecutive days (The Financial Services Authority (FSA) (2009), BIPRU 4.3.56 and
4.6.20; Federal Register (2007)). When a loan goes into default, financial
institutions could contact the debtor for a re-evaluation of the loan whereby the
debtor would have to pay a slightly higher interest rate on the remaining loan but
have lower and more manageable monthly repayment amounts; or banks could
decide to sell the loan to a separate company which works specifically towards
collection of repayments from defaulted loans; or, because every mortgage loan has
a physical security (also known as collateral), i.e. a house or flat, the property could
be repossessed (i.e. enter foreclosure) and sold by the bank to cover losses. In this
case there are two possible outcomes: either the sale of the property is able to cover
the value of the loan outstanding and associated repossession costs with any excess
being returned to the customer, resulting in a zero loss rate; alternatively, the sale
proceeds are less than the outstanding balance and costs and there is a loss. Note
that the distribution of LGD in the event of repossession is thus capped at one end.
The aim of LGD modelling in the context of residential mortgage lending is to
accurately estimate this loss as a proportion of the outstanding loan, if the loan were
to go into default. In this paper, we will empirically investigate a two-stage
approach to estimate mortgage LGD on a set of recovery data of residential
mortgages from a major UK bank.
The rest of this paper is structured as follows. Section 2 consists of a literature
review and discusses some current mortgage LGD models in use in the UK, followed
by Section 3 which lists our research objectives. In Section 4, we describe the data
available, as well as the pre-processing applied to it. In Sections 5 and 6, we detail
the Probability of Repossession Model and the Haircut Model respectively. Section 7
explains how the component models are combined to form the LGD Model. In
Section 8, we look at some possible further extensions of this work and conclude.
2. Literature review
Much of the work on prediction of LGD, and to some extent PD, proposed in the
literature pertains to the corporate sector (see Schuermann (2004), Gupton & Stein
(2002), Jarrow (2001), Truck et al. (2005), Altman et al. (2005)), which can be
partly explained by the greater availability of (public) data and because the financial
health or status of the debtor companies can be directly inferred from share and
bond prices traded on the market. However, this is not the case in the retail sector,
which partly explains why the LGD models are not as developed as those pertaining
to corporate loans.
2.1. Risk models for residential mortgage lending
Despite the lack of publicly available data, particularly on individual loans, there are
still a number of interesting studies on credit risk models for mortgage lending that
use in-house data from lenders. However, the majority of these have in the past
focused on the prediction of default risk, as comprehensively detailed by Quercia &
Stegman (1992). One of the earliest papers on mortgage default risk is by von
Furstenberg (1969) where it was found that characteristics of a mortgage loan can
be used to predict whether default will occur. These include the loan-to-value ratio
(i.e. the ratio of loan amount over the value of the property) at origination, term of
mortgage, and age and income of the debtor. Following that, Campbell & Dietrich
(1983) further expanded on the analysis by investigating the impact of
macroeconomic variables on mortgage default risk. They found that loan-to-value
ratio is indeed a significant factor, and that the economy, especially local
unemployment rates, does affect default rates. This is confirmed more recently by
Calem & LaCour-Little (2004), who looked at estimating both default probability and
recovery (where recovery rate = 1 – LGD) on defaulted loans from the Office of
Federal Housing Enterprise Oversight (OFHEO). Of interest was how they estimated
recovery by employing spline regression to accommodate the non-linear relationships
that were observed between both loan-to-value ratios (LTV at loan start and LTV at
default) and recovery, which achieved an R-square of 0.25.
Similarly to Calem & LaCour-Little (2004), Qi & Yang (2009) also modelled loss
directly using characteristics of defaulted loans, using data from private mortgage
insurance companies, in particular on accounts with high loan-to-value ratios that
have gone into default. In their analysis, they were able to achieve high values of R-square (around 0.6) which could be attributed to their being able to re-value
properties at time of default (expert-based information that would not normally be
available to lenders on all loans; hence one would not be able to use it in the context
of Basel II which requires the estimation of LGD models that are to be applied to all
loans, not just defaulted loans).
2.2. Single vs. two-stage LGD models
Whereas the former models estimate LGD directly and will thus be referred to as
“single-stage” models, the idea of using a so-called “two-stage” model is to
incorporate two component models, the Probability of Repossession Model and the
Haircut Model, into the LGD modelling. Initially, the Probability of Repossession
Model is used to predict the likelihood of a defaulted mortgage account undergoing
repossession. It is sometimes thought that the probability of repossession is mainly
dependent on one variable, viz. loan-to-value, hence some probability of
repossession models currently in use only consist of this single variable. This is then
followed by a second model which estimates the amount of discount the sale price of
the repossessed property would undergo. The Haircut Model predicts the difference
between the forced sale price and the market valuation of the repossessed property.
These two models are then combined to get an estimate for loss, given that a
mortgage loan would go into default. An example study involving the two-stage
model is that of Somers & Whittaker (2007), who, although they did not detail the
development of their Probability of Repossession Model, acknowledged the
methodology for the estimation of mortgage loan LGD. In their paper, they focus on
the consistent discount (haircut) in sale price observed in the case of repossessed
properties and because they observe a non-normal distribution of haircut, they
propose the use of quantile regression in the estimation of predicted haircut.
Another paper that investigates the variability that the value of collateral undergoes
is by Jokivuolle & Peura (2003). Although their work was on default and recovery on
corporate loans, they highlight the correlation between the value of the collateral
and recovery.
In summary, despite the increased importance of LGD models in consumer lending
and the need to estimate residential mortgage loan default losses at the individual
loan level, still relatively few papers have been published in this area apart from the
ones mentioned above.
3. Research objectives
From the literature review, we observe that the few papers which looked at
mortgage loss did so either by directly modelling LGD (“single-stage” models) using
economic variables and characteristics of loans that were in default or did not look at
both components of a two-stage model, i.e. haircut as well as repossession. This
might be due to their analysis being carried out on a sample of loans which had
undergone default and subsequent repossession, and thus removed the need to
differentiate between accounts that would undergo repossession from those that
would not. We note also that there was little consideration for possible correlation
between explanatory variables.
Hence, the two main objectives of this paper are as follows. Firstly, we intend to
evaluate the added value of a Probability of Repossession Model with more than just
one variable (loan-to-value ratio). Secondly, using real-life UK bank data, we would
also like to empirically validate the approach of using two component models, the
Probability of Repossession Model and the Haircut Model, to create a model that
produces estimates for LGD. We develop the two component models before
combining them by weighting conditional loss estimates against their estimated
outcome probabilities.
4. Data
The dataset used in this study is supplied by a major UK Bank, with observations
coming from all parts of the UK, including Scotland, Wales and Northern Ireland.
There are more than 140,000 observations and 93 variables in the original dataset,
all of which are on defaulted mortgage loans, with each account being identified by a
unique account number. About 35 percent of the accounts in the dataset undergo
repossession, and time between default and repossession varies from a couple of
months to several years. After pre-processing (see later in Section 5), we retain
about 120,000 observations, with accounts that start between the years 1983 and
2001 (note that loans predating 1983 were removed because of the unavailability of
house price index data for these older loans) and default between the years of 1988
and 2001, with at least a two year outcome window (for repossession to happen, if
any). Note that this sample does not encompass observations from the recent
economic downturn.
Under the Basel II framework, financial institutions are required to forecast default
over a 12-month horizon and resulting losses at a given time (referred to here as
“observation time”). As such, LGD models developed should not contain information
only available at time of default. However, due to limitations in the dataset, in which
information on the state of the account in the months leading up to default (e.g.
outstanding balance at observation time) are unavailable, we use approximate
default time instead of observation time. When applying this model at a given time
point, a forward-looking adjustment could then be applied to convert the current
value of that variable, for example, outstanding balance, to an estimate at time of
default. Default-time variables for which no reasonable projection is available are
removed.
4.1. Multiple defaults
Some accounts have repeated observations, which means that some customers were
oscillating between keeping up with their normal repayments and going into default.
Hereby, each default is recorded as a separate observation of the characteristics of
the loan at that time. Because the UK Basel II regulations state that the financial
institution should return an exposure to non-default status in the case of recovery,
and record another default should the same exposure subsequently go into default
again (The Financial Services Authority (FSA) (2009), BIPRU 4.3.71), we include all
instances of default in our analysis, and record each default that is not the final
instance of default as having zero LGD (in the absence of further cost information).
We note that other approaches to deal with repeated defaults could be considered
depending on the local regulatory guidelines.
4.2. Time on Book
Time on Book is calculated to be the time between the start date of the loan and the
approximate date of default (see footnote 1). The variable time on book exhibits an obvious
increasing trend over time (cf. Figure 1; see footnote 2), which might be partly due to the
composition of the dataset. In the dataset, we have defaults between years 1988
and 2001, which just about coincides with the start of the economic downturn in the
UK of the early nineties. We observe that the mean time on book for observations
that default during the economic downturn is significantly lower than the mean time
on book for observations that default in normal economic times.
Figure 1: Mean time on book over time with reference to year of default
1 Date of default was estimated by the bank using the arrears status and amount of cumulated arrears
at the end of each year for each account because we are not explicitly given default date in the original
dataset.
2 Due to a data confidentiality agreement with the data provider, the scale for the y-axis has been
omitted in some of the reported figures.
4.3. Valuation of security at default and haircut
At the time of the loan application, information about the market value of the
property is obtained. As reassessing its value would be a costly exercise, no new
market value assessment tends to be undertaken thereafter and a valuation of the
property at various points of the loan can be obtained by updating the initial property
value using the publicly available Halifax House Price Index (see footnote 3; all houses, all buyers,
non-seasonally adjusted, quarterly, regional). The valuation of security at default is
calculated according to Equation 1:
\[
\text{Valuation of security}_{\text{default}}
  = \frac{\mathrm{HPI}_{\text{def yr, def qtr, region}}}{\mathrm{HPI}_{\text{start yr, start qtr, region}}}
    \times \text{Valuation of security}_{\text{start}}
\tag{1}
\]
Using this valuation of security at default, other variables are then updated. One is
valuation of property as a proportion of average property value in the region, which
gives an indication of the quality of the property relative to other properties in the
same area; another is LTV at default (DLTV) which is the ratio of the outstanding
loan at default to the valuation of the security at default; and yet another is haircut
(see footnote 4), which we define as the ratio of forced sale price to valuation of
property at default quarter (only for observations with valid forced sale price). For
example, a property estimated to have a market value of £1,000,000 but repossessed
and sold at £700,000 would have a haircut of £700,000 / £1,000,000 = 0.7.
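The derived variables of this section can be sketched in a few lines of Python. This is only an illustration: the function names and the index and balance figures are hypothetical, and only the three formulas (Equation 1, DLTV and haircut) are taken from the text.

```python
def indexed_valuation(valuation_start, hpi_start, hpi_default):
    """Equation 1: scale the valuation at loan start by the ratio of the
    regional house price index at default to the index at loan start."""
    return valuation_start * hpi_default / hpi_start


def dltv(outstanding_at_default, valuation_default):
    """LTV at default: outstanding loan over indexed valuation at default."""
    return outstanding_at_default / valuation_default


def haircut(forced_sale_price, valuation_default):
    """Haircut as defined in this paper: forced sale price over valuation
    at default (the ratio, not its complement)."""
    return forced_sale_price / valuation_default


# Hypothetical account: valued at 100,000 at loan start, regional index
# moved from 200 to 250 between the start quarter and the default quarter.
valuation_default = indexed_valuation(100_000, hpi_start=200.0, hpi_default=250.0)
print(valuation_default)                 # 125000.0
print(dltv(90_000, valuation_default))   # 0.72
print(haircut(700_000, 1_000_000))       # 0.7 (the example from the text)
```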
3 Available at: http://www.lloydsbankinggroup.com/media1/research/halifax_hpi.asp
4 A more common definition of haircut is the complement,
1 − (sale price / valuation of security at default).
However, in this paper, we will use the term “haircut” to refer to the ratio, rather than its complement,
to facilitate the interpretation of parameter signs and further notation. We note that which of these two
definitions is used does not affect the actual modelling but will make a difference for the interpretation
of coefficient signs.
4.4. Training and test set splits
To obtain unbiased estimates of model performance, we set aside an
independent test dataset. We develop each component model on a training set
before applying the models onto a separate test set that was not involved in the
development of the model itself, to gauge the performance of the model and to
ensure there is no over-fitting. To do so, we split the cleaned dataset into two-thirds
and one-third sub-samples, keeping the proportion of repossessions the same in both
sets (i.e. stratified by repossession cases). These are then used as the respective
training and test set for the Probability of Repossession Model. However, since a
haircut can only be calculated in the event of repossession and sale, all
non-repossessions will subsequently be removed from the training and test sample for
the second Haircut Model component.
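A stratified two-thirds / one-third split of this kind could look roughly as follows. This is a sketch under stated assumptions: the record layout (a dictionary with a `repossessed` flag), the function name and the random seed are all hypothetical, and the paper does not say which tool was actually used.

```python
import random


def stratified_split(records, label_key, test_fraction=1 / 3, seed=0):
    """Split records into (train, test), sampling the test share separately
    within each label group so that label proportions are preserved."""
    rng = random.Random(seed)
    by_label = {}
    for rec in records:
        by_label.setdefault(rec[label_key], []).append(rec)

    train, test = [], []
    for group in by_label.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Hypothetical defaulted accounts, one third of which are repossessed.
accounts = [{"id": i, "repossessed": i % 3 == 0} for i in range(300)]
train, test = stratified_split(accounts, "repossessed")

# The haircut samples keep repossessions only.
haircut_train = [a for a in train if a["repossessed"]]
haircut_test = [a for a in test if a["repossessed"]]
```

Splitting once and then filtering keeps the haircut test set disjoint from the repossession training set, which is what allows the two component models to be combined and evaluated on the same held-out accounts.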
4.5. Loss given default
When a loan goes into default and the property is subsequently repossessed by the
bank and sold, legal, administrative and holding costs are incurred. As this process
might take a couple of years to complete, revenues and costs have to be discounted
to present value in the calculation of Loss Given Default (LGD), and should include
any compounded interest incurred on the outstanding balance of the loan. However,
in our analysis, we simplify the definition of LGD to exclude both the extra costs
incurred and the interest lost, because we are not provided with information about
the legal and administrative costs associated with each loan default and repossession.
Hence, LGD is defined to be the final (nominal) loss from the defaulted loan as a
proportion of the outstanding loan balance at (year end of) default, and where loss is
defined to be the difference between outstanding loan at default and forced sale
amount, if the property was sold at a price that is lower than the outstanding loan at
default (i.e. outstanding loan at default > forced sale amount). If the property was
able to fetch an amount greater than or equal to the outstanding loan at default,
then loss is defined to be zero. If the property was not repossessed, or repossessed
but not sold, loss is also assumed to be zero, in the absence of any additional
information. With loss defined to be zero, LGD is of course also 0.
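The simplified LGD definition above amounts to a floored shortfall divided by the outstanding balance. A minimal sketch, with a hypothetical function name and hypothetical amounts:

```python
def loss_given_default(outstanding, forced_sale_amount=None, repossessed=True):
    """Simplified nominal LGD as defined in Section 4.5.

    Costs and lost interest are ignored. Accounts that were not repossessed,
    or were repossessed but not sold (no sale amount), get an LGD of zero.
    """
    if not repossessed or forced_sale_amount is None:
        return 0.0
    loss = max(outstanding - forced_sale_amount, 0.0)  # no negative losses
    return loss / outstanding


print(loss_given_default(100_000, 80_000))            # shortfall of 20,000
print(loss_given_default(100_000, 120_000))           # sale covers the loan
print(loss_given_default(100_000, repossessed=False)) # no repossession
```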
5. The Probability of Repossession Model
Our first model component will provide us with an estimate for the probability of
repossession given that a loan goes into default.
5.1. Modelling methodology
We first identify a set of variables that are eligible for inclusion in the Repossession
Model. Variables that cannot be used are removed, including those which contain
information that is only known at time of default and for which no reasonably precise
estimate can be produced based on their value at observation time (e.g. arrears at
default), or those that have too many missing values, are related to housing or
insurance schemes that are no longer relevant, or where the computation is simply
not known. We then check the correlation coefficient between pairs of
remaining variables, and find that none exceeds 0.6 in absolute value. Using these, a
logistic regression is then fitted onto the repossession training set and a backward
selection method based on the Wald test is used to keep only the most significant
variables (p-value of at most 0.01). We then check that the sign of each parameter
estimate is logical, and that parameter estimates of groups within categorical
variables do not contradict intuition.
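The pairwise correlation screen in this methodology could be sketched as below. Only the 0.6 threshold comes from the text; the variable names and values are made up for illustration, and the logistic regression and Wald-based backward selection steps are not reproduced here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def highly_correlated_pairs(columns, threshold=0.6):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for (name_a, xa), (name_b, xb) in combinations(columns.items(), 2):
        r = pearson(xa, xb)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical candidate variables for five accounts.
candidates = {
    "ltv": [0.5, 0.6, 0.7, 0.8, 0.9],
    "dltv": [0.55, 0.62, 0.74, 0.79, 0.93],
    "time_on_book": [3, 1, 4, 1, 5],
}
for name_a, name_b, r in highly_correlated_pairs(candidates):
    print(name_a, name_b, round(r, 3))
```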
5.2. Model variations
Using the methodology above, we obtain a Probability of Repossession Model R1,
with four significant variables, loan-to-value (LTV) ratio at time of loan application
(start of loan), a binary indicator for whether this account has had a previous default,
time on book in years and type of security, i.e. detached, semi-detached, terraced,
flat or others. In a second model, we replace LTV at loan application and time on
book with LTV at default (DLTV), referred to as Probability of Repossession Model R2.
Including all three variables (LTV, DLTV and time on books) in a single model would
cause counter-intuitive parameter estimate signs. Another simpler repossession
model fitted on the same data, against which we will compare our models, is Model
R0. The latter model only has a single explanatory variable, DLTV, which is often the
main driver in models used by the retail banking industry.
5.3. Performance measures
Performance measures applied here are accuracy rate, sensitivity, specificity, and the
Area Under the ROC Curve (AUC).
In order to assess the accuracy rate (i.e. total number of correctly predicted
observations as a proportion of total number of observations), sensitivity (i.e.
number of observations correctly predicted to be events – in this context:
repossessions – as a proportion of total number of actual events) and specificity (i.e.
number of observations correctly predicted to be non-events – in this context: non-repossessions – as a proportion of total number of actual non-events) of each logistic
regression model, we have to define a cut-off value for which only observations with
a probability higher than the cut-off are predicted to undergo repossession. How the
cut-off is defined affects the performance measures above, as it affects how many
observations shall be predicted to be repossessions or non-repossessions. For our
dataset, we choose the cut-off value such that the sample proportions of actual and
predicted repossessions are equal. However, we note that the exact value selected
here is unimportant in the estimation of LGD itself as the method later used to
estimate LGD does not require selecting a cut-off.
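The cut-off rule and the three cut-off-dependent measures can be illustrated as follows. This is a sketch only: the function names and the six scores are hypothetical, and placing the cut-off at the midpoint between two adjacent ranked scores is our assumption for making the predicted and actual counts equal (it presumes no tied scores at the boundary).

```python
def matching_cutoff(probs, actual):
    """Cut-off for which the number of predicted events equals the number of
    actual events. Assumes 0 < events < observations and no tie at the boundary."""
    n_events = sum(1 for a in actual if a)
    ranked = sorted(probs, reverse=True)
    # Midpoint between the last score classified as an event and the next one.
    return (ranked[n_events - 1] + ranked[n_events]) / 2


def classification_rates(probs, actual, cutoff):
    """Accuracy, sensitivity and specificity when p > cutoff predicts an event."""
    predicted = [p > cutoff for p in probs]
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    n_pos = sum(1 for a in actual if a)
    n_neg = len(actual) - n_pos
    return {
        "accuracy": (tp + tn) / len(actual),
        "sensitivity": tp / n_pos,
        "specificity": tn / n_neg,
    }


# Hypothetical predicted repossession probabilities and observed outcomes.
probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
actual = [1, 1, 0, 1, 0, 0]
cutoff = matching_cutoff(probs, actual)
print(cutoff, classification_rates(probs, actual, cutoff))
```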
The Receiver Operating Characteristic (ROC) curve is a 2-dimensional plot of
sensitivity and 1 – specificity values for all possible cut-off values. It passes through
points (0,0), i.e. all observations are classified as non-events, and (1,1), i.e. all
observations are classified as events. A straight line through (0,0) and (1,1)
represents a model that randomly classifies observations as either events or non-events. Thus, the more the ROC curve approaches point (0,1), the better the model
is in terms of discerning observations into either category. As the ROC curve is
independent of the cut-off threshold, the area under the curve (AUC) gives an
unbiased assessment of the effectiveness of the model in terms of classifying
observations.
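Because the AUC does not depend on a cut-off, it can be computed directly from the scores as the share of (event, non-event) pairs in which the event receives the higher score, with ties counted as one half. The sketch below uses that standard rank interpretation; the function name and data are hypothetical, and the quadratic loop is only suitable for small illustrations, not for a dataset of the size used here.

```python
def auc(scores, labels):
    """Area under the ROC curve as the probability that a randomly chosen
    event is scored above a randomly chosen non-event (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(auc(scores, labels))  # 8 of the 9 event/non-event pairs are ordered correctly
```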
We also use the DeLong, DeLong and Clarke-Pearson test (DeLong et al. (1988)) to
assess whether there are any significant differences between the AUC of different
models.
5.4. Model results
Applying the DeLong, DeLong and Clarke-Pearson test, we find that the AUC value
for model R2 is significantly better than that for R0 (cf. Table 1), whereas R1
performs worse than R0. Hence, model R2 is selected for further inclusion in our
two-stage model. Table 2 gives the direction of parameter estimates used in the
Probability of Repossession Model R2, together with a possible explanation. The
parameter estimate values and p-values of all repossession model variations can be
found in the Appendix, Tables A.7, A.8 and A.9.
Table 1: Repossession model performance statistics
Model                                                         | AUC    | Cut-off | Specificity (%) | Sensitivity (%) | Accuracy (%)
R1, Test Set (LTV, time on books, Security, Previous default) | 0.727  | 0.435   | 57.449          | 75.688          | 69.186
R2, Test Set (DLTV, Security, Previous def)                   | 0.743  | 0.432   | 59.398          | 76.203          | 70.213
R0, Test Set (DLTV)                                           | 0.737  | 0.436   | 58.626          | 76.008          | 69.812
DeLong et al p-value, R1 vs. R0                               | <0.001 |         |                 |                 |
DeLong et al p-value, R2 vs. R0                               | <0.001 |         |                 |                 |
Table 2: Parameter estimate signs for Probability of Repossession Model R2
Variable | Relation to probability of repossession (given default) | Explanation
DLTV (LTV at default) | + | If a large proportion of loan is tied up in security, likelihood of repossession increases
Previous default | + | Probability of repossession increases if account has been in default before
Security | - | Lower-range property types such as flats are more likely to be repossessed in the case of default
6. The Haircut Model
The Haircut Model is only applicable to observations that have undergone the
repossession and forced sale process, where haircut is defined to be the ratio of
forced sale price to valuation of security at default. Therefore, securities that were
not repossessed, or repossessed but not sold do not have a haircut value, and are
thus excluded from the development of the Haircut Model.
An OLS model is also developed to explicitly model haircut standard deviation, as a
function of time on books, as suggested by Lucas (2006).
Figure 2: Distribution of haircut (solid curve references the normal distribution)
The distribution of haircut is shown in Figure 2 with the solid curve referencing the
normal distribution. Statistics from the Kolmogorov-Smirnov and Anderson-Darling
Tests (Peng (2004)) suggest non-normality with p-values of <0.01 and <0.005
respectively, but for the purposes of the prediction of LGD, we approximate haircut
by a normal distribution.
6.1. Modelling methodology
The top and bottom 0.05 percent of observations (26 cases) for haircut are
truncated before we establish the set of eligible variables to be considered in the
development of an OLS linear regression model for the Haircut Model. We also
check the relationship between variables and haircut. In particular, the valuation of
security at default to average property valuation in the region ratio displays high
non-linearity (cf. Figure 3) and is binned into 6 groups for model development.
Backward stepwise regression is used to remove insignificant variables and individual
parameter estimate signs are checked for intuitiveness. We also check for intuition
within categorical variables, and examine the Variance Inflation Factors (VIF) (see footnote 5).
Figure 3: Relationship between haircut and (ranked) valuation of security at default
to average property valuation in the region ratio
5 If variables within the model are highly correlated with each other, it would be reflected in a high
value of VIF. Any value above 10 would imply severe collinearity amongst variables while values less
than 2 would mean that variables are almost independent (Fernandez, G.C.J., 2007. Effects of
Multicollinearity in All Possible Mixed Model Selection, PharmaSUG Conference (Statistics &
Pharmacokinetics), Denver, Colorado.).
6.2. Model variations
Using the methodology above, we obtain a Haircut Model H1, with seven significant
variables, loan-to-value (LTV) ratio at time of loan application (start of loan), a
binary indicator for whether this account has had a previous default, time on book in
years, ratio of valuation of property to average in that region (binned), type of
security, i.e. detached, semi-detached, terraced, flat or other, age group of property
and region. In a second model, we replace LTV at loan application and time on book
with LTV at default (DLTV), referred to as Haircut Model H2; note that, as previously,
including all three variables (LTV, DLTV and time on books) in a single model would
cause counter-intuitive parameter estimate signs.
Comparative performance measures for the two models are reported in the following
section.
6.3. Performance measures
The performance measures considered here are the R-square value, Mean Squared
Error (MSE) and Mean Absolute Error (MAE). To create a graphical representation of
the results, we also present a binned scatterplot of predicted haircut value bands
against actual haircut values, where predicted haircut values are put into ascending
order and binned into equal-frequency value bands; the mean actual haircut value is
then compared against the mean predicted haircut value in each haircut band.
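These measures, and the equal-frequency banding behind the binned scatterplot, can be sketched as follows. The function names and the four data points are hypothetical, and R-square is computed here as 1 − SSE/SST on the evaluated sample, which is an assumption since the text does not give the formula.

```python
def regression_metrics(actual, predicted):
    """Return (MSE, MAE, R-square) for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    sse = sum(e * e for e in errors)
    mse = sse / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return mse, mae, 1 - sse / sst


def banded_means(actual, predicted, n_bands):
    """Sort by predicted value, cut into n_bands equal-frequency bands and
    return (mean predicted, mean actual) per band. Needs n_bands <= len(actual)."""
    pairs = sorted(zip(predicted, actual))
    bands = []
    for b in range(n_bands):
        chunk = pairs[b * len(pairs) // n_bands:(b + 1) * len(pairs) // n_bands]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        mean_act = sum(a for _, a in chunk) / len(chunk)
        bands.append((mean_pred, mean_act))
    return bands


# Hypothetical haircut values.
actual = [0.6, 0.7, 0.8, 0.9]
predicted = [0.65, 0.65, 0.85, 0.85]
print(regression_metrics(actual, predicted))
print(banded_means(actual, predicted, n_bands=2))
```

Band means lying close to the diagonal (mean predicted roughly equal to mean actual) are what the text later reads as evidence of unbiased predictions.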
6.4. Model results
First, we note that all parameters for all models have low VIF values, the only ones
above 2 belonging to geographical indicators. In the Haircut Model, the combination
of LTV and time on books seems to be able to capture the information carried in
DLTV because, as it is observed from Table 3, Model H1 gives the better
performance. This could be because LTV gives an indication of the (initial) quality of
the customer whereas values of DLTV could be due to changes in house prices since
the purchase of the property. Based on this, Model H1 is selected as the Haircut
Model to be used in the LGD estimation.
Table 3: Haircut model performance statistics
Model        | MSE   | MAE   | R-square
H1, Test Set | 0.039 | 0.147 | 0.143
H2, Test Set | 0.039 | 0.148 | 0.131
Table 4: Parameter estimate signs of Haircut Model H1
Variable | Relation to haircut | Explanation
LTV | + | Refer to Figure 4 and explanation in Section 6.4
Ratio of valuation of security at default to average property valuation in that region, binned | +/- | Medium-end properties (relative to the region the property is in) have higher haircut than lower-end properties, but higher-end properties tend to have lowest haircut (cf. Figure 3 in Section 6.1)
Previous default | + | Haircut is higher for accounts that have previously defaulted
TOB (Time on book in years) | + | Older loans imply greater uncertainty and error in estimation of value of security at default, so higher haircut is possible
Security | + | Haircut tends to be higher for higher-end property types (e.g. detached)
Age group of property (oldest to newest) | + | Haircut tends to be higher for newer properties
Region (see footnote 6) | N/A | Haircut differs across regions
6 Since regional differences may not persist over time, alternatively, one can choose to omit the
geographic dummy variables from the model. Our robustness tests indicate that the model fit is slightly
lower without these; parameter estimate signs and estimates of the other variables remained stable.
Table 4 details parameter estimate signs. From it, we see that a greater LTV at start
implies a higher haircut (i.e. a higher forced sale price relative to the indexed
valuation). This would mean that the larger the loan a debtor took at time of
application in relation to property value, the higher the forced sale price of the
security would be, relative to its valuation, in the event of a default and
repossession. At first, it might seem as though this parameter estimate sign is an
artefact of the number of variables in the Haircut Model, or of some
hidden correlation between variables. In order to rule out this possibility, we look at
the relationship between LTV at start and haircut. From Figure 4, we observe that
there indeed appears to be a positive relationship between haircut and LTV. Part of
the explanation for this might be found in policy decisions taken by the bank. For
loans with high loan-to-value ratios, due to the large amount (relative to the loan)
the bank has committed towards the property, when the account does go into
default and subsequent repossession, the bank may be reluctant to let the
repossessed property go unless it is able to fetch a price close to the current
property valuation. Another possible reason could be that borrowers with a low LTV
are likely to sell early and only end up in repossession when they know the house is
in a bad state and unlikely to make anything near its indexed valuation.
Figure 4: Relationship between haircut and (ranked) LTV at time of loan application
To further validate the model, we also include in Figure 5 a scatter plot of mean
(grouped) predicted and actual haircut. From it, we observe that our model
produces unbiased estimates of haircut. Parameter estimates of all models can be
found in Appendix, Tables A.10 and A.11.
Figure 5: Prediction performance of haircut test set
6.5. Haircut standard deviation modelling
To be able to produce an expected value for LGD (see later, Section 7.1), we will not
only require a point estimate for haircut but also a model component for haircut
variability. Further inspection reveals that the standard deviation of haircut increases
with longer time on books (cf. Figure 6). This can be expected because the
valuation of a property is usually updated using publicly available house price indices
(instead of commissioning a new valuation process). The longer an account has
been on the books, the greater the uncertainty and error in the estimated current
valuation of the property, which in turn affects the error in the prediction of haircut.
As suggested by Lucas (2006), to model this relationship, a simple OLS model was
fitted that estimates the standard deviation for different time on books bins (see
footnote 7). Time on books is binned into equal-length intervals of 6 months, and the
standard deviation of haircut is calculated for each group based on the mean haircut in
that group. This model will later be used to calculate the expected values for LGD
(cf. Section 7.1).
Performance statistics for this auxiliary model are detailed in Table 5; parameter
estimates can be found in Appendix, Table A.12.
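The binning-and-regression step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the use of bin midpoints (in years) as the regressor, the sample (n - 1) standard deviation and the toy data are all assumptions made for the example.

```python
import math
from collections import defaultdict


def sd_by_tob_bin(tob_years, haircuts, bin_width=0.5):
    """Return (bin midpoint in years, sample std dev of haircut) per time-on-books bin."""
    groups = defaultdict(list)
    for tob, haircut in zip(tob_years, haircuts):
        groups[int(tob // bin_width)].append(haircut)
    points = []
    for bin_index in sorted(groups):
        values = groups[bin_index]
        if len(values) < 2:  # a sample standard deviation needs two observations
            continue
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
        points.append(((bin_index + 0.5) * bin_width, math.sqrt(variance)))
    return points


def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Toy data (hypothetical): the haircut spread widens with time on books.
tob_years, haircuts = [], []
for tob in (0.25, 0.75, 1.25, 1.75):
    spread = 0.10 + 0.02 * tob
    for sign in (-1, 1, -1, 1):
        tob_years.append(tob)
        haircuts.append(0.80 + sign * spread)

intercept, slope = fit_line(sd_by_tob_bin(tob_years, haircuts))
print(f"sd(haircut) ~ {intercept:.3f} + {slope:.3f} * TOB")
```

A positive fitted slope corresponds to the pattern in Figure 6, where haircut variability grows with time on books.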
Figure 6: Mean haircut standard deviation by time on book bins
Footnote 7: Alternatively, because the standard deviation of haircut differs across groups
of observations, the weighted least squares regression method was considered to adjust for
heteroscedasticity in the OLS model developed in Section 6.4. Two different weights were
experimented with: the error term variance of each observation (from running an OLS model
for haircut) and time on books. Both models produced parameter estimates similar to those
of the selected haircut model, which suggests that the OLS model was able to produce robust
parameter estimates even though the homoscedasticity assumption was violated. Also, because
neither model explicitly models and produces the standard deviation of haircut, which is
required in the calculation of expected LGD, a separate OLS model for standard deviation is
necessary.
Table 5: Haircut Standard Deviation Model performance statistics
Model | MSE | MAE | R2
Training Set | 0.0001 | 0.0046 | 0.9315
Test Set | 0.0002 | 0.0105 | 0.8304
7. Loss given default model
After having estimated the Probability of Repossession Model, the Haircut Model and
the Haircut Standard Deviation Model, we now combine these models to get an
estimate for Loss Given Default. Here we illustrate two ways of combining the
component models, report their respective LGD predictions, and advocate use of the
more conservative approach producing an expected value for LGD that takes into
account haircut variability. We also compare these results against the single-stage
model predictions and performance statistics.
7.1. Modelling methodology
A first approach, referred to in our paper as the “haircut point estimate” approach,
would be to keep the probabilities derived under the Probability of Repossession
Model and apply the Haircut Model onto all observations. The latter would give all
observations a predicted haircut value in the event of repossession. Using this
predicted value of haircut, predicted sale price and predicted loss (outstanding
balance at default less sale proceeds), if any, can be calculated. We then find
predicted LGD by multiplying the probability of repossession with this predicted loss
if repossession happens. Although this method does produce some estimate for LGD,
regardless of whether the observation is predicted to enter repossession or not, it
uses only a single value of haircut (although it is the most probable value). However,
if the true haircut happens to be lower than predicted, sale proceeds would be
overestimated, which would mean that a loss could still be incurred (provided that
haircut falls below DLTV). This is an illustration of how misleading LGD predictions
could be produced if the component models were not combined appropriately.
Hence, to produce a true expected value for LGD, one should also take into account
the distribution of the haircut estimate and the associated effect on loss in its left tail.
Hence, the second and more conservative approach, suggested e.g. by Lucas (2006)
and referred to here as the “expected shortfall” approach, also takes into account
the probabilities of other values of haircut occurring, and the different levels of loss
associated with these different levels. To do so, we first apply the Probability of
Repossession Model to get an estimate of probability of repossession given that an
account goes into default. We then apply the Haircut Model onto the same dataset
to get an estimate for haircut, $\hat{H}_j$, for each observation j, regardless of whether the
security is likely to be repossessed. A minimum value of zero is set for predicted
haircut, as there is no meaning to a negative haircut. The Haircut Standard
Deviation Model is then applied onto each observation j to get a predicted haircut
standard deviation, $\sigma_j$, depending on its value of time on books (see Section 6.5).
From these predicted values, we approximate the distribution of each predicted
haircut by a normal distribution, $h_j \sim N(\hat{H}_j, \sigma_j^2)$.
For simplicity, the subscript j, which represents individual observations, will be
dropped from here on.
As long as the haircut (sale amount as a ratio of valuation of property at default) is
greater than DLTV (outstanding balance at default as a ratio of valuation of property
at default), and ignoring any additional administrative and repossession-associated
costs, the proceeds from the sale would be able to cover the outstanding balance on
the loan, i.e. there would be no shortfall. Hence, the expected shortfall expressed as
a proportion of the indexed valuation of property is:
$$E[\text{shortfall percent} \mid \text{repossession}] = \int_{-\infty}^{DLTV} p(h)\,(DLTV - h)\,dh \qquad (2)$$
where p(.) denotes the probability density function of the distribution for h.
To convert the latter into a standard normal distribution, we let:
$$z = \frac{h - \hat{H}}{\sigma} \sim N(0, 1); \qquad D = \frac{DLTV - \hat{H}}{\sigma}$$
Hence, Expected Shortfall can easily be derived as follows:
$$
\begin{aligned}
E[\text{shortfall percent} \mid \text{repossession}]
&= \sigma \int_{-\infty}^{D} p(z)\,(D - z)\,dz \\
&= \sigma D \int_{-\infty}^{D} p(z)\,dz \;-\; \sigma \int_{-\infty}^{D} p(z)\,z\,dz \\
&= \sigma D\,\mathrm{CDF}_Z(D) + \sigma\,\mathrm{PDF}_Z(D) \qquad (3)
\end{aligned}
$$
where $\mathrm{CDF}_Z(D)$ and $\mathrm{PDF}_Z(D)$ denote the cumulative distribution function and
probability density function of the standard normal distribution, respectively.
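As a sanity check on the closed form in Equation 3, the sketch below evaluates it alongside a direct numerical integration of Equation 2. The function names, the midpoint-rule integrator and the example inputs (predicted haircut 0.8, standard deviation 0.2, DLTV 0.7) are illustrative assumptions, not values taken from the paper.

```python
import math


def norm_pdf(z):
    """Standard normal probability density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_shortfall_pct(h_hat, sigma, dltv):
    """Closed form of Equation 3: sigma * (D * CDF(D) + PDF(D))."""
    d = (dltv - h_hat) / sigma
    return sigma * (d * norm_cdf(d) + norm_pdf(d))


def expected_shortfall_numeric(h_hat, sigma, dltv, steps=20000):
    """Midpoint-rule evaluation of Equation 2, used only as a cross-check."""
    lower = h_hat - 10.0 * sigma  # the density below this point is negligible
    if dltv <= lower:
        return 0.0
    width = (dltv - lower) / steps
    total = 0.0
    for i in range(steps):
        h = lower + (i + 0.5) * width
        density = norm_pdf((h - h_hat) / sigma) / sigma
        total += density * (dltv - h) * width
    return total


closed = expected_shortfall_pct(h_hat=0.8, sigma=0.2, dltv=0.7)
numeric = expected_shortfall_numeric(h_hat=0.8, sigma=0.2, dltv=0.7)
print(f"closed form: {closed:.6f}, numerical: {numeric:.6f}")
```

Note that the expected shortfall is positive here even though the predicted haircut (0.8) exceeds DLTV (0.7), which is the left-tail effect that the point estimate approach ignores.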
Expected loss given default is then obtained from the probabilities of repossession and
non-repossession and the expected shortfall calculated for the repossession scenario
(cf. Equation 4, below). The probability of an account undergoing repossession given
that it has gone into default is multiplied by the expected loss the account would incur
in the event of repossession. We also multiply the probability of an account not going
into repossession by the expected LGD for non-repossessions (denoted by c).
We can use the average observed LGD for actual non-repossessions as the expected
conditional LGD for non-repossessions.
$$
E[\text{loss} \mid \text{default}] = \bigl(E[\text{shortfall percent} \mid \text{repossession}] \times \text{indexed valuation}\bigr) \times P(\text{Repossession} \mid \text{default}) + c \times \bigl(1 - P(\text{Repossession} \mid \text{default})\bigr) \qquad (4)
$$
where c is the loss associated with non-repossessions (assumed to be 0 in the
absence of additional information).
Finally, we obtain predicted LGD by taking the ratio of $E[\text{loss} \mid \text{default}]$ to the
(estimated) outstanding balance at default.
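To make the combination step concrete, the following sketch computes predicted LGD for a single hypothetical account under both approaches described in this section. All names and input values are assumptions for illustration; in particular, c is treated here as a monetary loss so that its units match the first term of Equation 4.

```python
import math


def _pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def predicted_lgd(p_repo, h_hat, sigma, valuation, balance, c=0.0,
                  use_haircut_distribution=True):
    """Predicted LGD for one account, following Equation 4.

    p_repo    : probability of repossession given default
    h_hat     : predicted haircut (sale price / indexed valuation)
    sigma     : predicted haircut standard deviation
    valuation : indexed property valuation at default
    balance   : outstanding balance at default
    c         : loss for non-repossessions (monetary amount; an assumption)
    """
    dltv = balance / valuation
    h_hat = max(h_hat, 0.0)  # predicted haircut is floored at zero
    if use_haircut_distribution:
        # Expected shortfall approach (Equation 3).
        d = (dltv - h_hat) / sigma
        shortfall_pct = sigma * (d * _cdf(d) + _pdf(d))
    else:
        # Haircut point estimate approach: loss only if haircut < DLTV.
        shortfall_pct = max(dltv - h_hat, 0.0)
    expected_loss = shortfall_pct * valuation * p_repo + c * (1.0 - p_repo)
    return expected_loss / balance


account = dict(p_repo=0.4, h_hat=0.8, sigma=0.2,
               valuation=100_000.0, balance=70_000.0)
lgd_point = predicted_lgd(**account, use_haircut_distribution=False)
lgd_es = predicted_lgd(**account, use_haircut_distribution=True)
print(f"point estimate LGD: {lgd_point:.4f}, expected shortfall LGD: {lgd_es:.4f}")
```

For this account the point estimate approach predicts no loss at all, whereas the expected shortfall approach assigns a small positive LGD.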
7.2. Alternative single-stage model
To provide a benchmark for this two-stage model, we also developed a simple
single-stage model using the same data. A backward stepwise selection on the same set of
eligible variables used earlier in the two-stage model building was applied, and
resulting model parameter estimates are added in Appendix, Table A.13. However, it
is noted that whatever the results of the single-stage model, because it directly
predicts LGD based on loan and collateral characteristics, it does not provide the
same insight into the two different drivers (i.e. repossession risk and sale price
haircut) of mortgage loss, and as such does not provide as rich a framework for
stress testing.
The performance measures of this single-stage model are then compared against
those of the preferred two-stage model developed in the previous section (i.e. using
the expected shortfall approach), as well as the two-stage model that would result
from the so-called “haircut point estimate” approach.
7.3. Model performance
Using the same performance measures as those used for the Haircut Model, we
compare the MSE, MAE and R-square values of our two-stage and the single-stage
models (cf. Table 6). It is observed that both two-stage model variations achieve a
substantially better R-square of just under 0.27 (compared to 0.233 for the
single-stage model) on the LGD Test set, which is competitive with other LGD models
currently used in the industry.
Table 6: Performance measures of two-stage and single-stage LGD Models
Method, Dataset | MSE | MAE | R2
Single stage, Test set | 0.026 | 0.121 | 0.233
Two-stage (haircut point estimate), Test set | 0.025 | 0.108 | 0.268
Two-stage (expected shortfall), Test set | 0.025 | 0.101 | 0.266
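For reference, the three performance measures reported in Table 6 can be computed as in the sketch below. The definitions used are the standard ones (R-square taken as one minus the ratio of residual to total sum of squares), which is an assumption about how the measures were calculated; the data are toy values, not the bank's.

```python
def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - sse / sst


# Toy observed and predicted LGD values (hypothetical).
actual_lgd = [0.0, 0.1, 0.2, 0.5]
predicted_lgd_values = [0.05, 0.1, 0.25, 0.4]

print(f"MSE = {mse(actual_lgd, predicted_lgd_values):.5f}")
print(f"MAE = {mae(actual_lgd, predicted_lgd_values):.5f}")
print(f"R2  = {r_squared(actual_lgd, predicted_lgd_values):.5f}")
```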
The distributions of predicted LGD and actual LGD for all LGD models are shown in
Figure 7. In the original empirical distribution of LGD (see top section of Figure 7),
there is a large peak near 0 (where losses were zero either because there was no
repossession, or because the sale of the house was able to cover the remaining loan
amount). Firstly, we observe that the single-stage model (shown in the bottom
section of Figure 7) is unable to produce the peak near 0. Moreover, note that the
two-stage model using the haircut point estimate seemingly is the model that most
closely reproduces the empirical distribution of LGD, as it is able to bring out the
peak near 0. Although the R-square values achieved by the two two-stage LGD
model variations are very close (see Table 6), their LGD distributions are quite
different. The haircut point estimate approach is shown to underestimate the
average loss (cf. mean predicted LGD from haircut point estimate method being
lower than mean observed LGD). Unlike the former approach, the expected shortfall
method takes a more conservative approach in its estimation of LGD, which takes
into account the haircut distribution and its effect on expected loss based on
probabilities of different haircut values occurring. This will make a difference
especially for observations that would be predicted to have low or zero LGD under
the haircut point estimate method because these very accounts are now assigned at
least some expected loss amount, hence moving observations out of the peak and
into the low LGD bins.
Figure 7: Distribution of observed LGD (Empirical), predicted LGD from two-stage
haircut point estimate model (HC pt. est), two-stage expected shortfall model
(E.shortfall), single-stage model (single stage) (from top to bottom)
To further verify to what extent these various models are able to produce unbiased
estimates at an LGD loan pool level, we create a graphical representation of the
results. We look at a binned scatterplot of predicted LGD value bands against actual
LGD values, where predicted LGD values are put into ascending order and binned
into equal-frequency value bands. For each method we used in the calculation of
LGD, we plot the mean actual LGD value against the mean predicted LGD value (for
that LGD band) onto a single graph, included in Figure 8. Observe that both of the
two-stage models are able to consistently estimate LGD fairly closely, whereas the
single-stage model either overestimates (in the lower-left hand region of the graph
which represents observations that have low LGD) or underestimates LGD (in the
upper-right hand region of the graph which represents observations with high LGD).
Furthermore, the expected shortfall approach is shown to produce the more reliable
estimates in the lower-LGD regions, outperforming the haircut point estimate
approach in the lower-left part of the graph, where the haircut point estimate
approach indeed underestimates risk (i.e. the estimates fall below the diagonal).
Figure 8: Scatterplot of predicted and actual LGD in LGD bands
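The banding procedure behind this kind of plot can be sketched as follows: predictions are sorted, split into equal-frequency bands, and the mean predicted and mean actual LGD are computed per band. The function name, the number of bands and the toy data are assumptions for illustration only.

```python
def lgd_bands(predicted, actual, n_bands):
    """Return (mean predicted LGD, mean actual LGD) for each equal-frequency band."""
    pairs = sorted(zip(predicted, actual))  # ascending order of predicted LGD
    n = len(pairs)
    bands = []
    for k in range(n_bands):
        chunk = pairs[k * n // n_bands:(k + 1) * n // n_bands]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        mean_act = sum(a for _, a in chunk) / len(chunk)
        bands.append((mean_pred, mean_act))
    return bands


# Toy predicted and observed LGD values (hypothetical).
predicted = [0.3, 0.1, 0.2, 0.4, 0.0, 0.5]
actual = [0.2, 0.0, 0.3, 0.5, 0.0, 0.4]

for mean_pred, mean_act in lgd_bands(predicted, actual, n_bands=3):
    side = "below" if mean_pred < mean_act else "on or above"
    print(f"predicted {mean_pred:.2f}, actual {mean_act:.2f}: prediction {side} actual")
```

Bands whose mean prediction lies below the mean actual value are those where a model underestimates risk at pool level.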
Finally, in order to check robustness of our two-stage LGD model, we have also
experimented with re-estimating the two component models this time including only
the first instance of default for customers with multiple defaults (i.e. not all instances
of default included for observations with multiple defaults). Detailed results are not
reported here, but for both component models and the LGD model itself, we
obtained the same parameter estimate signs, and parameter estimates were similar
in size.
8. Conclusions and further research
In this paper, we developed and validated a number of models to estimate the LGD
of mortgage loans using a large set of recovery data of residential mortgage defaults
from a major UK bank. The objectives of this paper were two-fold. Firstly, we
aimed to evaluate the added value of a Probability of Repossession Model with more
than just LTV at default as its explanatory variable. We have developed a Probability
of Repossession Model with three variables, and showed that it is significantly better
than a model with only the commonly used DLTV.
Secondly, we wanted to validate the approach of using two component models, a
Probability of Repossession Model and a Haircut Model, which consists of the Haircut
Model itself and the Haircut Standard Deviation Model, to create a model that
produces estimates for LGD. Here, two methods are explained, both of which
produce a value of predicted LGD for every default observation because the Haircut
Model, which gives a predicted sale amount and predicted shortfall, is applied
to all observations regardless of their probability of repossession. However, we then
show how the first method, which uses only the haircut point estimate, ends up
underestimating LGD. The second and preferred method (expected
shortfall) derives expected loss from an estimated normal haircut distribution having
the predicted haircut from the Haircut Model as the mean, and with the standard
deviation obtained from the Haircut Standard Deviation Model.
For comparison purposes, we also developed a single-stage model. This model
produced a lower R-square value, and was also unable to fully emulate the actual
distribution of LGD.
Having shown that the proposed two-stage modelling approach works well on real-life data, in our further research, we intend to explore the inclusion of
macroeconomic variables in either or both the Probability of Repossession Model and
the Haircut Model. These macroeconomic variables might include the unemployment
rate, the inflation rate, the interest rate or some indication of the amount of
borrowing in each economic year. Finally, we also consider the use of alternative
methods, for example, survival analysis to better predict and estimate the time
periods between each milestone (repossession and sale) of a defaulted loan account.
9. Acknowledgements
We thank the bank who has kindly provided the dataset that enabled this work to be
carried out and Professor Lyn Thomas for his guidance throughout this work. We
also thank the editor and reviewers who have contributed invaluably with their
insightful feedback and recommendations. Any mistakes are solely ours.
10. References
Altman, Edward I., Brady, B., Resti, A., Sironi, A., 2005. The Link between Default
and Recovery Rates: Theory, Empirical Evidence, and Implications. The Journal of
Business 78, 2203-2228.
Calem, P.S., LaCour-Little, M., 2004. Risk-based capital requirements for mortgage
loans. Journal of Banking & Finance 28, 647-672.
Campbell, T.S., Dietrich, J.K., 1983. The Determinants of Default on Insured
Conventional Residential Mortgage Loans. The Journal of Finance 38, 1569-1581.
DeLong, E.R., DeLong, D.M., Clarke-Pearson, D.L., 1988. Comparing the Areas under
Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric
Approach. Biometrics 44, 837-845.
Federal Register, 2007. Risk-Based Capital Standards: Advanced Capital Adequacy
Framework - Basel II; Final Rule.
Fernandez, G.C.J., 2007. Effects of Multicollinearity in All Possible Mixed Model
Selection, PharmaSUG Conference (Statistics & Pharmacokinetics), Denver,
Colorado.
Financial Services Authority, 2009. Prudential Sourcebook for Banks, Building
Societies and Investment Firms.
Gupton, G.M., Stein, R.M., 2002. LOSSCALC: Model for Predicting Loss Given Default
(LGD).
Jarrow, R., 2001. Default Parameter Estimation Using Market Prices. Financial
Analysts Journal 57, 75-92.
Jokivuolle, E., Peura, S., 2003. Incorporating Collateral Value Uncertainty in Loss
Given Default Estimates and Loan-to-value Ratios. European Financial Management 9,
299-314.
Lucas, A., 2006. Basel II Problem Solving, Conference on Basel II & Credit Risk
Modelling in Consumer Lending Southampton, UK.
Peng, G., 2004. Testing Normality of Data Using SAS, PharmaSUG, San Diego,
California.
Qi, M., Yang, X., 2009. Loss given default of high loan-to-value residential mortgages.
Journal of Banking & Finance 33, 788-799.
Quercia, R.G., Stegman, M.A., 1992. Residential Mortgage Default: A Review of the
Literature. Journal of Housing Research 3, 341-379.
Schuermann, T., 2004. What Do We Know About Loss Given Default, in: Shimko, D.
(Ed.), Credit Risk: Models and Management, 2nd ed. Risk Books.
Somers, M., Whittaker, J., 2007. Quantile regression for modelling distributions of
profit and loss. European Journal of Operational Research 183, 1477-1487.
Truck, S., Harpaintner, S., Rachev, S.T., 2005. A Note on Forecasting Aggregate
Recovery Rates with Macroeconomic Variables.
von Furstenberg, G.M., 1969. Default Risk on FHA-Insured Home Mortgages as a
Function of the Terms of Financing: A Quantitative Analysis. The Journal of Finance
24, 459-477.
Appendix
Table A7: Parameter estimates for Probability of Repossession Model R0
Variable | Variable explanation | Estimate | StdErr | WaldChiSq | ProbChiSq
Intercept | - | -3.069 | 0.028 | 12235.289 | <0.01
DLTV | Loan to value at default | 2.821 | 0.029 | 9449.349 | <0.01
Table A8: Parameter estimates for Probability of Repossession Model R1
Variable | Variable Explanation | Estimate | StdErr | WaldChiSq | ProbChiSq
Intercept | - | -1.138 | 0.040 | 795.605 | <0.01
LTV | Loan to value at loan application | 2.101 | 0.040 | 2809.703 | <0.01
TOB | Time on books (in years) | -0.188 | 0.003 | 2899.616 | <0.01
Previous default | Indicator for previous default | 0.102 | 0.034 | 8.869 | <0.01
security0 (base) | Flat or other | - | - | - | -
security1 | Detached | -0.625 | 0.031 | 413.989 | <0.01
security2 | Semi-detached | -0.670 | 0.024 | 787.436 | <0.01
security3 | Terraced | -0.421 | 0.021 | 395.497 | <0.01
Table A9: Parameter estimates for Probability of Repossession Model R2
Variable | Variable Explanation | Estimate | StdErr | WaldChiSq | ProbChiSq
Intercept | - | -2.570 | 0.034 | 5769.803 | <0.01
DLTV | Loan to value at default | 2.679 | 0.029 | 8295.648 | <0.01
Previous default | Indicator for previous default | -0.471 | 0.032 | 211.064 | <0.01
security0 (base) | Flat or other | - | - | - | -
security1 | Detached | -0.461 | 0.031 | 219.425 | <0.01
security2 | Semi-detached | -0.546 | 0.024 | 503.458 | <0.01
security3 | Terraced | -0.343 | 0.022 | 253.470 | <0.01
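As a worked illustration of how the Table A9 estimates translate into a probability of repossession, the sketch below applies them through a logistic function. That the model is a logistic regression on the event "repossession given default" is an assumption here (suggested by the Wald chi-square statistics); the function name and the example accounts are hypothetical.

```python
import math

# Parameter estimates copied from Table A9 (Model R2).
INTERCEPT = -2.570
COEF_DLTV = 2.679
COEF_PREVIOUS_DEFAULT = -0.471
COEF_SECURITY = {
    "flat_or_other": 0.0,   # base category
    "detached": -0.461,
    "semi_detached": -0.546,
    "terraced": -0.343,
}


def prob_repossession(dltv, previous_default, security):
    """Logistic transform of the linear predictor (assumed model form)."""
    linear = (INTERCEPT
              + COEF_DLTV * dltv
              + COEF_PREVIOUS_DEFAULT * (1 if previous_default else 0)
              + COEF_SECURITY[security])
    return 1.0 / (1.0 + math.exp(-linear))


for security in ("flat_or_other", "detached"):
    p = prob_repossession(dltv=1.0, previous_default=False, security=security)
    print(f"DLTV = 1.0, no previous default, {security}: P(repossession) = {p:.3f}")
```

Under this assumed form, a higher DLTV raises the predicted probability of repossession, while a previous default or a house (rather than a flat) lowers it.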
Table A10: Parameter estimates for Haircut Model H1
Variable | Variable Explanation | Estimate | StdErr | ProbT | VIF
Intercept | - | 0.508 | 0.009 | <0.01 | 0.000
LTV | Loan to value at loan application | 0.243 | 0.007 | <0.01 | 1.136
TOB | Time on book (in years) | 0.005 | 0.001 | <0.01 | 1.251
VVAratio1 (base) | Value of property / region average <= 0.9 | - | - | - | -
VVAratio2 | 0.9 < Value of property / region average <= 1.2 | -0.005 | 0.004 | 0.248 | 1.134
VVAratio3 | 1.2 < Value of property / region average <= 1.5 | -0.059 | 0.006 | <0.01 | 1.149
VVAratio4 | 1.5 < Value of property / region average <= 1.8 | -0.092 | 0.008 | <0.01 | 1.127
VVAratio5 | 1.8 < Value of property / region average <= 2.4 | -0.090 | 0.009 | <0.01 | 1.161
VVAratio6 | Value of property / region average > 2.4 | -0.138 | 0.009 | <0.01 | 1.226
Previous default | Indicator for previous default | 0.042 | 0.006 | <0.01 | 1.168
Propage1 | Very old property (before 1919) | -0.085 | 0.003 | <0.01 | 1.273
Propage2 | Old property (1919-1945) | -0.032 | 0.004 | <0.01 | 1.194
Propage3 (base) | Built after 1945 | - | - | - | -
security0 (base) | Flat or other | - | - | - | -
security1 | Detached | 0.165 | 0.006 | <0.01 | 1.875
security2 | Semi-detached | 0.129 | 0.004 | <0.01 | 1.764
security3 | Terraced | 0.094 | 0.003 | <0.01 | 1.739
region1 | North | -0.112 | 0.010 | <0.01 | 1.753
region2 | Yorkshire & Humberside | -0.095 | 0.008 | <0.01 | 2.898
region3 | North West | -0.099 | 0.008 | <0.01 | 3.163
region4 | East Midlands | -0.100 | 0.008 | <0.01 | 2.489
region5 | West Midlands | -0.065 | 0.008 | <0.01 | 2.449
region6 | East Anglia | -0.067 | 0.009 | <0.01 | 1.968
region7 | Wales | -0.115 | 0.009 | <0.01 | 2.140
region8 | South West | -0.047 | 0.008 | <0.01 | 3.272
region9 | South East | -0.062 | 0.007 | <0.01 | 6.348
region10 | Greater London | -0.010 | 0.007 | 0.166 | 5.214
region11 | Northern Ireland | -0.034 | 0.014 | 0.017 | 1.256
region12 (base) | Scotland or others / missing | - | - | - | -
Table A11: Parameter estimates for Haircut Model H2
Variable | Variable Explanation | Estimate | StdErr | ProbT | VIF
Intercept | - | 0.591 | 0.008 | <0.01 | 0.000
DLTV | Loan to value at default | 0.162 | 0.005 | <0.01 | 1.175
VVAratio1 (base) | Value of property / region average <= 0.9 | - | - | - | -
VVAratio2 | 0.9 < Value of property / region average <= 1.2 | -0.011 | 0.004 | <0.01 | 1.126
VVAratio3 | 1.2 < Value of property / region average <= 1.5 | -0.069 | 0.006 | <0.01 | 1.141
VVAratio4 | 1.5 < Value of property / region average <= 1.8 | -0.108 | 0.008 | <0.01 | 1.116
VVAratio5 | 1.8 < Value of property / region average <= 2.4 | -0.108 | 0.009 | <0.01 | 1.149
VVAratio6 | Value of property / region average > 2.4 | -0.158 | 0.009 | <0.01 | 1.209
Previous default | Indicator for previous default | 0.064 | 0.005 | <0.01 | 1.010
Propage1 | Very old property (before 1919) | -0.079 | 0.003 | <0.01 | 1.261
Propage2 | Old property (1919-1945) | -0.030 | 0.004 | <0.01 | 1.193
Propage3 (base) | Built after 1945 | - | - | - | -
security0 (base) | Flat or other | - | - | - | -
security1 | Detached | 0.162 | 0.006 | <0.01 | 1.874
security2 | Semi-detached | 0.126 | 0.004 | <0.01 | 1.761
security3 | Terraced | 0.092 | 0.003 | <0.01 | 1.736
region1 | North | -0.109 | 0.010 | <0.01 | 1.752
region2 | Yorkshire & Humberside | -0.094 | 0.008 | <0.01 | 2.897
region3 | North West | -0.098 | 0.008 | <0.01 | 3.159
region4 | East Midlands | -0.112 | 0.008 | <0.01 | 2.497
region5 | West Midlands | -0.076 | 0.008 | <0.01 | 2.454
region6 | East Anglia | -0.102 | 0.009 | <0.01 | 2.007
region7 | Wales | -0.125 | 0.009 | <0.01 | 2.141
region8 | South West | -0.080 | 0.008 | <0.01 | 3.325
region9 | South East | -0.095 | 0.007 | <0.01 | 6.489
region10 | Greater London | -0.042 | 0.007 | <0.01 | 5.323
region11 | Northern Ireland | -0.030 | 0.014 | 0.040 | 1.256
region12 (base) | Scotland or Others / Missing | - | - | - | -
Table A12: Parameter estimates for Haircut Standard Deviation Model
Variable | Variable Explanation | Estimate | StdErr | ProbT
Intercept | - | 0.181 | <0.001 | <0.01
TOB bins | Time on book (in years) | 0.010 | <0.001 | <0.01
Table A13: Parameter estimates for single-stage LGD model
Variable | Variable Explanation | Estimate | StdErr | ProbT | VIF
Intercept | - | -0.093 | 0.005 | <0.01 | 0.000
DLTV | Loan to value at default | 0.230 | 0.002 | <0.01 | 1.263
secondapp | Second applicant present | -0.003 | 0.001 | 0.012 | 1.105
VVAratio1 | Value of property / region average <= 0.9 | -0.049 | 0.004 | <0.01 | 8.976
VVAratio2 | 0.9 < Value of property / region average <= 1.2 | -0.050 | 0.004 | <0.01 | 5.416
VVAratio3 | 1.2 < Value of property / region average <= 1.5 | -0.035 | 0.004 | <0.01 | 3.093
VVAratio4 | 1.5 < Value of property / region average <= 1.8 | -0.018 | 0.005 | <0.01 | 2.148
VVAratio5 | 1.8 < Value of property / region average <= 2.4 | -0.018 | 0.005 | <0.01 | 2.037
VVAratio6 (base) | Value of property / region average > 2.4 | - | - | - | -
Previous default | Indicator for previous default | -0.032 | 0.002 | <0.01 | 1.018
Propage1 | Built before 1919 | 0.023 | 0.002 | <0.01 | 1.653
Propage2 (base) | Built between 1919 and 1945 | - | - | - | -
Propage3 | Built after 1945 | -0.010 | 0.001 | <0.01 | 1.536
Propage4 | Age unknown | -0.133 | 0.014 | <0.01 | 1.017
security0 | Flat or other | 0.065 | 0.002 | <0.01 | 1.370
security1 | Detached | -0.020 | 0.002 | <0.01 | 1.628
security2 | Semi-detached | -0.013 | 0.002 | <0.01 | 1.370
security3 (base) | Terraced | - | - | - | -
region0 | Others or Missing | 0.054 | 0.013 | <0.01 | 1.054
region1 | North | 0.041 | 0.004 | <0.01 | 1.758
region2 | Yorkshire & Humberside | 0.041 | 0.003 | <0.01 | 3.011
region3 | North West | 0.047 | 0.003 | <0.01 | 2.940
region4 | East Midlands | 0.052 | 0.004 | <0.01 | 2.233
region5 | West Midlands | 0.037 | 0.004 | <0.01 | 2.364
region6 | East Anglia | 0.047 | 0.004 | <0.01 | 1.782
region7 | Wales | 0.047 | 0.004 | <0.01 | 2.047
region8 | South West | 0.038 | 0.003 | <0.01 | 2.936
region9 | South East | 0.050 | 0.003 | <0.01 | 5.244
region10 | Greater London | 0.030 | 0.003 | <0.01 | 4.265
region11 | Northern Ireland | 0.028 | 0.006 | <0.01 | 1.333
region12 (base) | Scotland | - | - | - | -