2002 IEEE Systems and Information Design Symposium•University of Virginia
MODELING AND DATA ANALYSIS IN THE CREDIT CARD INDUSTRY:
BANKRUPTCY, FRAUD, AND COLLECTIONS
Student team: Christopher Allred, Kathryn Hite, Stephen Fonzone, Jennifer Greenspan, Josh Larew
Faculty Advisor: William Scherer
Department of Systems and Information Engineering
Graduate Advisor: Thomas Pomroy
Department of Systems and Information Engineering
Client Advisor: Douglas Fuller
Providian Financial
San Francisco, Ca
[email protected]
KEYWORDS: CART, Clustering, Distressed debt,
Fraudster, Identity theft, Regression, Probabilistic
modeling
CLASSIFYING FRAUDULENT TRANSACTIONS
ABSTRACT
Providian suffers significant losses every year from
fraudulent transactions on their credit cards. There are
three main types of fraud that cause the most significant
losses, adding up to millions of dollars each year. The
three types the accompanying analysis focused upon
were lost/stolen, forged response, and non-receipt.
Lost/stolen fraud occurs when a customer loses their
card or the card is stolen while the customer has the
card. Forged response is when a fraudster fills out an
application pretending to be someone else with a better
credit history. This is typically done after the fraudster
steals someone’s personal information, an act known as identity
theft. Non-receipt fraud occurs when the card is first
sent to the good customer once the application is
approved. The card is typically stolen in the mail, and
the good customer never receives their card. The
following figure depicts the lifetime of a credit card and
pinpoints where each instance of fraud occurs. Since
forged response involves identity theft, the figure also
shows when identity theft can take place.
In order to effectively produce quality decisions in
the modern credit card industry, knowledge must be
gained through effective data analysis and modeling.
Through the use of dynamic data-driven decision
making tools and procedures, information can be
gathered to successfully evaluate all aspects of credit
card operations. Specifically, the areas of bankruptcy,
fraud, and collections were examined to show the
benefits that implementing such practices
could provide. Methodologies ranging from Markov
chains, to clustering, to rule-based decision theory were
combined with tools such as CART, S+, Excel, and
Access to yield such insights.
INTRODUCTION
San Francisco-based Providian Financial prides
itself on the effective use of data-driven decision-making
throughout its business practices. In particular,
its lending strategies tend to target the
underserved market of high-risk borrowers. As
with any risk-oriented venture, Providian’s business
stratagem requires the utmost degree of information
quality and quantity. This necessitates the execution of
methodologies and tools discussed in the following
sections.
Background
The main achievement of this portion was the
transformation of raw data into useful information. The
first step in this process was to gain an understanding
of the data set as a whole. General descriptive statistics are
important because they provide the basic framework from
which all other conclusions are derived. Moreover, such
information acted as a benchmark for judging whether later
conclusions made sense and fit the general data or
whether those conclusions should be reevaluated for errors.
Descriptive statistics were also used to check whether smaller
subsamples of the data set were representative of the data as a
whole.
Figure 1: Fraudulent Transaction Depiction: This
graphic shows a few common ways of perpetrating
credit card fraud.
Though fraud causes significant loss, there are
proportionally few cases of fraud each month compared
to the total number of accounts. In the sample of data
given from Providian, only 0.34% of the data set was
fraudulent accounts. The following table lists the
numbers of accounts and the percentage for the data set.
General Breakdown of Accounts
Type of Account       Number of Accts   Percent of Accounts
Non-fraudulent (N)    305,688           99.66%
Fraudulent (L)        1,045             0.34%
Figure 2: Fraudulent Accounts: This chart shows the
numeric and percentage values associated with
fraudulent and non-fraudulent accounts in the data set.
Rule-Based Modeling Decisions
There were two main phases to approaching the
problem of modeling the fraud risk of individual credit
transactions: (1) gathering data for a series of
transaction characteristics and comparing the fraud and
non-fraud account averages, and (2) incorporating the
characteristics which differed significantly into a risk
scorecard capable of predicting the likelihood of fraud
in a given transaction.
The first stage of the modeling process involved
analyzing fraud and non-fraud transactions based on all
available transaction data. Of the eleven variables
analyzed, the following six were found to be
significant: hours since transaction, number of declines,
number of cash purchases, number of ATM purchases,
merchant code, and transaction amount. A variable was
considered significant whenever the percent difference
between the average for the fraudulent population and
the non-fraudulent one exceeded 30%. In the case of
merchant code, significance was determined whenever
the rate of fraudulent transactions for a merchant
significantly exceeded the overall average rate of fraud.
These characteristics form the foundation of the model,
due to their potential for flagging transactions as
fraudulent.
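The 30% significance screen described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not say which average serves as the base of the percent difference, so the non-fraud average is assumed here, and the averages and the third variable name are hypothetical values chosen for the example.

```python
def percent_difference(fraud_mean: float, non_fraud_mean: float) -> float:
    """Percent difference between the two averages, relative to the non-fraud one.

    Using the non-fraud average as the base is an assumption of this sketch.
    """
    if non_fraud_mean == 0:
        raise ValueError("non-fraud average must be non-zero")
    return abs(fraud_mean - non_fraud_mean) / abs(non_fraud_mean) * 100.0


def is_significant(fraud_mean: float, non_fraud_mean: float,
                   threshold_pct: float = 30.0) -> bool:
    """Flag a variable when the averages differ by more than the threshold."""
    return percent_difference(fraud_mean, non_fraud_mean) > threshold_pct


# Hypothetical (fraud average, non-fraud average) pairs, for illustration only.
averages = {
    "transaction_amount": (40.0, 95.0),
    "number_of_declines": (1.9, 0.4),
    "hypothetical_other_variable": (410.0, 395.0),
}
significant = [name for name, (fraud, non_fraud) in averages.items()
               if is_significant(fraud, non_fraud)]
print(significant)
```

Merchant code would need a separate check, since its significance was judged by comparing each merchant's fraud rate to the overall rate rather than by comparing averages.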
The second phase of modeling developed a risk
scorecard. This model used the significant characteristics
established in phase one to generate a score for every
transaction, based on that transaction’s data. This score was
then used to assess the likelihood of that transaction being
fraudulent. The scorecard was implemented using Visual
Basic scripts in Microsoft Access. Accuracy and
performance were then analyzed using Microsoft Excel.
Each transaction was evaluated individually for each
characteristic value. Points were awarded if a characteristic
differed from the non-fraud average by greater than 10% of
the non-fraudulent standard deviation. However, points
were only awarded for a deviation that was in the direction
indicating fraud, as those accounts statistically safer than the
average should not be punished. Through iteration, a 10%
threshold was found to maximize the
classification accuracy of the risk scorecard. In the case of
merchant code, a point was simply awarded whenever the
transaction occurred at a high-risk merchant code. Since
there are six characteristics, any transaction could have a
scorecard value from 0 to 6, depending on how many
triggers that account satisfied. For example, an account
with very little time since the last transaction, making a $1
purchase at a high risk merchant, but who had not recently
made a cash purchase, ATM withdrawal, or been declined,
would have a score of 3.
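The scoring logic described above can be sketched as follows. The original scorecard was implemented with Visual Basic scripts in Microsoft Access; this Python version is only an illustrative sketch. The non-fraud means and standard deviations, the direction assigned to each trigger, and the merchant code `5999` are all hypothetical; only the 10% rule, the one-directional award of points, and the 0 to 6 range come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    non_fraud_mean: float
    non_fraud_std: float
    fraud_direction: int  # +1 if higher values suggest fraud, -1 if lower values do


def trigger_fires(value: float, trigger: Trigger, fraction: float = 0.10) -> bool:
    """Award a point only for a deviation in the direction that suggests fraud."""
    deviation = (value - trigger.non_fraud_mean) * trigger.fraud_direction
    return deviation > fraction * trigger.non_fraud_std


# Hypothetical non-fraud statistics and directions, for illustration only.
TRIGGERS = {
    "hours_since_transaction": Trigger(48.0, 30.0, -1),
    "number_of_declines": Trigger(0.2, 0.6, +1),
    "number_of_cash_purchases": Trigger(0.3, 0.8, +1),
    "number_of_atm_purchases": Trigger(0.2, 0.7, +1),
    "transaction_amount": Trigger(80.0, 120.0, -1),
}
HIGH_RISK_MERCHANT_CODES = {"5999"}  # hypothetical high-risk list


def risk_score(transaction: dict) -> int:
    """Count how many of the six triggers a transaction satisfies (0 to 6)."""
    score = sum(trigger_fires(transaction[name], trigger)
                for name, trigger in TRIGGERS.items())
    if transaction["merchant_code"] in HIGH_RISK_MERCHANT_CODES:
        score += 1
    return score


# Mirrors the example in the text: little time since the last transaction,
# a $1 purchase at a high-risk merchant, and no cash, ATM, or decline activity.
example = {
    "hours_since_transaction": 0.5,
    "number_of_declines": 0,
    "number_of_cash_purchases": 0,
    "number_of_atm_purchases": 0,
    "transaction_amount": 1.0,
    "merchant_code": "5999",
}
example_score = risk_score(example)
```

With these assumed statistics the example transaction fires the time, amount, and merchant triggers, matching the score of 3 given in the text.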
The following figure highlights the performance of the
scorecard by showing, for each score category, the percent
of transactions that were fraudulent.
Score    % Fraud Transactions
0        15.38%
1        30.47%
2        47.25%
3        54.05%
4        71.90%
5        71.36%
6        76.92%
Figure 3: Score Fraud Frequencies: This table shows the
percent of transactions that are predicted to be fraudulent
at each risk scorecard value.
If all transactions with a risk score of 3 or greater are
predicted to be fraudulent, the accuracy in predicting
fraudulent transactions is 60.1%. If only those transactions
with scores 4 or greater are labeled fraudulent, the accuracy
level increases to 71.8%. The tradeoff faced is that the
higher the score cutoff used, the better the accuracy for that
account segment, yet a smaller number of fraudulent
accounts are actually captured. The accuracy level of 71.8%
misclassifies fewer non-fraudulent accounts as fraudulent, but
also misclassifies more fraudulent accounts as
non-fraudulent. Providian would rather contact an account
holder erroneously to check on suspicious purchases than let
fraudulent transactions slip through. Since this second, false
negative, error is the more serious for Providian, the cutoff
of 3 (60.1% accuracy) was used to take advantage of the lower
false negative rate. Therefore, any transaction with a scorecard
value of 3 or greater is considered to be fraudulent, and this
method identifies fraudulent transactions at 60% accuracy.
Clustering
A clustering technique to detect the fraudulent
accounts was also applied to the database of credit card
accounts. The clustering procedure groups accounts
according to similar characteristics using rules. These
rules use the values of account characteristics to
determine to which cluster an account belongs. This
system was applied to the database of fraudulent
accounts in an effort to classify accounts according to
how likely the account is fraudulent. For example, one
large cluster of non-fraudulent accounts consists of accounts
that make a few low charges and make a
payment in the first month.
Five clusters of non-fraudulent accounts were
identified with a significant degree of accuracy, 99.97%
or better, and the five clusters contained 28% of the
total number of accounts. This result reduced the
list of suspect accounts by
over a quarter. Providian can not only save significantly
through lower operating costs but also focus
detection efforts on the remaining accounts, which have
a higher probability of being fraudulent.
Though the clustering technique could not clearly
establish which accounts are fraudulent, it did quickly
split the accounts into suspicious and unsuspicious
groups, allowing Providian to better concentrate
resources, time, and money.
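A rule-based cluster of the kind described in this section, together with the two figures reported for it (the share of non-fraudulent accounts inside a cluster and the share of all accounts it covers), might be sketched like this. The field names, the thresholds in the rule, and the sample accounts are hypothetical; the paper does not give the actual rules.

```python
def low_activity_early_payer(account: dict) -> bool:
    """Hypothetical rule: a few low charges and a payment in the first month."""
    return (account["num_charges"] <= 3
            and account["max_charge"] <= 50.0
            and account["paid_first_month"])


def evaluate_cluster(accounts: list, rule) -> tuple:
    """Return (share of members that are non-fraudulent, share of accounts covered)."""
    members = [a for a in accounts if rule(a)]
    if not members:
        return 0.0, 0.0
    purity = sum(not a["is_fraud"] for a in members) / len(members)
    coverage = len(members) / len(accounts)
    return purity, coverage


# Hypothetical accounts, for illustration only.
accounts = [
    {"num_charges": 2, "max_charge": 20.0, "paid_first_month": True, "is_fraud": False},
    {"num_charges": 1, "max_charge": 35.0, "paid_first_month": True, "is_fraud": False},
    {"num_charges": 9, "max_charge": 900.0, "paid_first_month": False, "is_fraud": True},
    {"num_charges": 5, "max_charge": 60.0, "paid_first_month": True, "is_fraud": False},
]
purity, coverage = evaluate_cluster(accounts, low_activity_early_payer)

# Accounts outside the safe cluster remain on the list to be investigated.
remaining = [a for a in accounts if not low_activity_early_payer(a)]
```

Accounts matched by a high-purity cluster can be set aside, so detection effort goes to the `remaining` list.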
COLLECTIONS
Background
Providian’s subsidiary, First Select Corporation
(FSC), is the largest credit card debt collector in the
United States, purchasing billions of dollars worth of
defaulted credit card debt each year for approximately
six cents on the dollar. Accounts are collected through
calls, letters, and in some cases, legal action.
Throughout the payment process, FSC continually
needs to make a decision about what to do with an
account: continue to attempt collections or sell the
account. This makes knowing whether an account will
continue to pay of the utmost importance.
Value Analysis for Distressed Credit Card Debt
Isolating key account attributes proved the most
effective way to value Providian’s distressed credit card
debt portfolio. Initially, potential variables were
examined relative to desired metrics, to visually see
relationships between predictor and target variables.
Using this means of analysis, many account attributes
had either positive or negative correlations to the
account’s cash flow. The most important predictor
variables identified were recency and frequency of past
payments. If an account made a payment in any
particular month, it was determined that the account
had a 90% chance that it would make a payment in the
next two months. Correspondingly, a positive
relationship was discovered between the number of past
payments and the probability of future payments: the more
payments an account had made, the more likely it was to
pay again.
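One way to measure the recency effect described above is to count, over all account-months with a payment, how often another payment follows within the next two months. The sketch below assumes each account history is a list of 0/1 monthly payment flags; that representation and the sample histories are hypothetical, not taken from the Providian data.

```python
def prob_pay_again(histories: list, window: int = 2) -> float:
    """Estimate P(payment within the next `window` months | payment this month).

    Only months whose full look-ahead window is observed are counted.
    """
    paid = 0
    paid_again = 0
    for history in histories:
        for month in range(len(history) - window):
            if history[month]:
                paid += 1
                if any(history[month + 1: month + 1 + window]):
                    paid_again += 1
    return paid_again / paid if paid else float("nan")


# Hypothetical monthly payment flags (1 = paid that month), for illustration only.
histories = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
]
estimate = prob_pay_again(histories)
```

The same loop could be run separately for accounts grouped by their number of past payments to examine the frequency effect.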
After this step was completed, a regression model
identified important characteristics that have a
predictive nature. Using the statistical software
package S+, p-values were generated for many predictor
variables. In predicting whether an account will
pay again, three predictor variables were highly
significant: recency of past payments, frequency of
payments over the last four months, and percentage of
initial balance paid. Other variables showed significance at
the 0.05 level: initial balance, balance remaining,
frequency of calls made, frequency of right party
contacts, status, and rollout.
Once characteristics of accounts with predictive
nature were identified, both target and predictor
variables were entered into CART. CART is “the most
advanced decision-tree technology for data analysis,
preprocessing and predictive modeling. CART is a
robust data-analysis tool that automatically searches for
important patterns and relationships and quickly
uncovers hidden structure even in highly complex data”
[Steinberg].
CART identified rules that would separate the data
depending on different attributes. For example, in a
model attempting to predict if an account would pay
again, CART determined that most accounts that have
not paid more than 1 payment in the last 5 months
would not pay again. This rule applies across all
accounts; however, the rules changed with each month that
Providian had owned the accounts. Separate rules
classifying the accounts were therefore established for
each month of ownership. This
yielded a methodology that could be
applied on a monthly basis to separate non-paying
accounts from accounts that continued to pay.
The final methodology incorporated the rules
given by CART for months 1-15. These rules, if
applied every month, increase Providian’s ability to
distinguish accounts that will pay again (have worth) from
accounts that have stopped paying (no worth).
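The single rule quoted above (accounts with no more than one payment in the last five months are predicted not to pay again) can be written as a small function. This is a sketch of that one rule only, not of the full month-by-month CART rule set, which the paper does not list; the function name and the representation of payments as a list of monthly amounts are assumptions.

```python
def predicted_to_pay_again(monthly_payments: list, lookback: int = 5,
                           min_payments: int = 2) -> bool:
    """Apply the rule: more than one payment in the last `lookback` months."""
    recent = monthly_payments[-lookback:]
    payments_made = sum(1 for amount in recent if amount > 0)
    return payments_made >= min_payments


# Hypothetical monthly payment amounts, oldest month first.
one_payment = [0, 0, 50.0, 0, 0]
two_payments = [0, 25.0, 0, 25.0, 0]
old_payment_only = [25.0, 0, 0, 0, 0, 0, 25.0]

results = [predicted_to_pay_again(h)
           for h in (one_payment, two_payments, old_payment_only)]
```

A monthly rule set could be represented as a mapping from month of ownership to its own `lookback` and `min_payments` values, assuming the monthly rules share this form.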
ANALYZING BANKRUPT ACCOUNTS
The Providian bankruptcy data was grouped into
20 discrete states, allowing for a different form of
analysis. In analyzing the bankruptcy data, the flow of
an account from state to state facilitated a glimpse at the
actual state transition process account holders went
through. By tracing these paths, along with the
expected income at each state, we are able to accurately
generate an estimate of the future value of each
account.
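The value estimate described above can be sketched by propagating an account's state distribution through a one-step transition matrix and accumulating the expected income in each period. The two-state matrix, the income figures, and the convention that income is collected at the start of each period are assumptions of this example; the paper's model uses 20 states whose full transition matrix is not published.

```python
def expected_value(start_dist: list, transition: list, income: list,
                   periods: int) -> float:
    """Sum expected income over `periods` steps of a Markov chain."""
    n = len(transition)
    dist = list(start_dist)
    total = 0.0
    for _ in range(periods):
        # Expected income collected in the current period.
        total += sum(prob * pay for prob, pay in zip(dist, income))
        # Advance the state distribution by one transition period.
        dist = [sum(dist[i] * transition[i][j] for i in range(n))
                for j in range(n)]
    return total


# Hypothetical two-state example: state 0 pays 10 per period, state 1 pays nothing.
transition = [
    [0.5, 0.5],
    [0.0, 1.0],
]
income = [10.0, 0.0]
value = expected_value([1.0, 0.0], transition, income, periods=3)
```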
[Diagram: state 1 -> state 2, with the arrow labeled p]
Figure 5: One-Step Transition: This diagram depicts
the probability, p, for going from state 1 to state 2, or
rather, given that the model was in state 1 in the first
time period, p is the probability that the model is in
state 2 in the next time period.
Figure 4: Classification CART Tree: This figure depicts the
CART tree used to develop the classification rules for the
model. Each splitting node shows the criteria for that
splitter and the percentage of paying and non-paying
accounts that made it to that path. Each terminal node
shows the number of accounts of each type that were
classified in that node and the percentage of paying and
non-paying accounts that make up that node.
The first step to tracing these paths is to create a
matrix of one-step transition probabilities. Following
these states over the lifetime of Providian’s bankruptcy
process allows us to determine some underlying
characteristics of their customers.
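A matrix of one-step transition probabilities can be estimated by counting observed transitions between consecutive periods and normalizing each row. The sketch below assumes each account is recorded as a sequence of integer state indices starting at 0; that encoding and the sample sequences are hypothetical.

```python
def estimate_transition_matrix(sequences: list, n_states: int) -> list:
    """Estimate one-step transition probabilities from observed state paths."""
    counts = [[0] * n_states for _ in range(n_states)]
    for seq in sequences:
        # Each consecutive pair (current, next) is one observed transition.
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # States never observed as a starting point get a row of zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical state paths for two accounts, for illustration only.
sequences = [
    [0, 0, 1],
    [0, 1, 1, 2],
]
matrix = estimate_transition_matrix(sequences, n_states=3)

# The diagonal holds the probability of staying in the same state.
recurrence = [matrix[i][i] for i in range(3)]
```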
The transition matrix for the bankruptcy model
showed high recurrence probabilities: the tendency of
an account to stay in the same state after a transition
period. This is expected due to the slow nature of many
of the bankruptcy stages. Figure 6 shows the
probability of staying in each of the 20 states, as well as
the corresponding expected stay in each state. This can
be calculated by using the formula:
\sum_{n} n \, p^{n-1} (1-p) = \frac{1}{1-p} + p
Recurrence Probabilities
State    Transition Prob    Length of Stay
1        40.1%              2.0
2        65.3%              3.5
3        78.0%              5.3
4        64.3%              3.4
5        40.0%              2.0
6        71.5%              4.2
7        72.2%              4.3
8        10.7%              1.2
9        75.8%              4.9
10       6.1%               1.1
11       79.4%              5.6
12       81.9%              6.3
13       68.6%              3.8
14       0.0%               1.0
15       74.2%              4.6
16       93.3%              15.9
17       56.7%              2.8
18       89.6%              10.5
19       51.9%              2.6
20       80.0%              5.8
Figure 6: Transition Matrix Statistics: This chart
quantifies the recurrence probabilities associated with
the one step probability matrix, including the estimated
length of stay in each state.
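The length-of-stay figures can be checked numerically from the recurrence probabilities. The series in the formula is the mean of a geometric distribution, whose standard closed form is 1/(1-p). The tabulated values, however, agree to within about 0.1 with the right-hand side as printed, 1/(1-p) + p. The sketch below computes both so the difference is visible; the function names are illustrative.

```python
def series_mean_stay(p: float, terms: int = 10_000) -> float:
    """Numerically sum n * p^(n-1) * (1-p) over n = 1..terms."""
    return sum(n * p ** (n - 1) * (1 - p) for n in range(1, terms + 1))


def closed_form(p: float) -> float:
    """Standard closed form of the geometric mean: 1 / (1 - p)."""
    return 1.0 / (1.0 - p)


def printed_formula(p: float) -> float:
    """Right-hand side as printed in the paper: 1 / (1 - p) + p."""
    return 1.0 / (1.0 - p) + p


# (recurrence probability, tabulated length of stay) for states 1-20, from Figure 6.
FIGURE_6 = [
    (0.401, 2.0), (0.653, 3.5), (0.780, 5.3), (0.643, 3.4), (0.400, 2.0),
    (0.715, 4.2), (0.722, 4.3), (0.107, 1.2), (0.758, 4.9), (0.061, 1.1),
    (0.794, 5.6), (0.819, 6.3), (0.686, 3.8), (0.000, 1.0), (0.742, 4.6),
    (0.933, 15.9), (0.567, 2.8), (0.896, 10.5), (0.519, 2.6), (0.800, 5.8),
]

# Largest gap between the printed formula and the tabulated length of stay.
max_gap_printed = max(abs(printed_formula(p) - stay) for p, stay in FIGURE_6)

# Gap for state 16 when the standard closed form is used instead.
gap_state_16_closed = abs(closed_form(0.933) - 15.9)
```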
Ultimately, this analysis shows us the important
characteristics of the bankruptcy lifecycle. As one can
see, the average consumer that enters state 16
(Bankruptcy) stays for about 16 months, while states such as
10 and 1 do not have strong recurrent properties.
CONCLUSION
Providian is constantly modifying and updating its
data-driven decision network to formulate strategies
which best capitalize on the opportunities of this
dynamic market. By effectively using various
modeling and data analysis methods, much knowledge
was gained about the various aspects of Providian’s
credit card operations. The insight gained on basic
account operations is appreciable, because having
accurate information influences everything from policy
implementation to the bottom-line. From bankruptcy,
to fraud, to collections, our analysis proved highly
beneficial to Providian.
REFERENCES
Breiman, Friedman, Olshen, Stone, Classification and
Regression Trees, St. Louis: Wadsworth, 1984.
Dwyer, Robert. “Customer Lifetime Valuation to
Support Marketing Decision Making.” Journal of
Direct Marketing. Volume 11, Number 4 (1997): 6-13.
Lucas, Peter. “Why Recoveries are on the Rise; Scoring
Models and Databases are Helping Collectors Boost
Recovery Rates.” Collections & Recovery. Vol 13,
No 7. October 2000. 14 October 2001.
http://web.lexis-nexis.com/universe.
Steinberg, Dan and Phillip Colla. CART--Classification
and Regression Trees. San Diego, CA: Salford
Systems, 1998.
BIOGRAPHIES
Josh Larew is a fourth year Systems Engineer from
Morgantown, West Virginia. When Josh is not
cranking out SQL queries in Access, he can be found at
the Birdwood Golf Course scrambling to make par.
Next year Josh will either be working on a submarine
(no joke) or be unemployed and waiting to go to law
school.
Stephen Fonzone is a fourth year Systems Engineer
from Allentown, Pennsylvania. When not using RTPs
and PTPs to predict customer lifetime value, Steve can
be found singing Springsteen and playing Super Tecmo
Bowl (although not necessarily at the same time). Next
year Steve will live in a van down by the river.
Kathryn Hite is a fourth year Systems Engineer from
Houston, Texas. When not clustering transactions to
catch fraudsters, Kathryn can be found extolling the
virtues of her native state of Texas. Next year she will
follow Josh wherever he may go.
Jennifer Greenspan is a fourth year Systems Engineer
from Chicago, Illinois. She spends the majority of her
time establishing and analyzing fraud triggers but can
also be seen watching Office Space and running (but
she usually watches Office Space while sitting). The
only group member to actually get a real job prior to
graduation, Jen will be working in DC for Capital One.
Christopher Allred is a fourth year Systems Engineer
from Avon, Connecticut. He can usually be found
taking any kind of data and turning it into a Markov
Chain. He has also been known to drink a lot of cider
and to be surly about staying in Charlottesville for
another year, where he will be completing his master’s
degree.