2002 IEEE Systems and Information Design Symposium • University of Virginia

MODELING AND DATA ANALYSIS IN THE CREDIT CARD INDUSTRY: BANKRUPTCY, FRAUD, AND COLLECTIONS

Student team: Christopher Allred, Kathryn Hite, Stephen Fonzone, Jennifer Greenspan, Josh Larew
Faculty Advisor: William Scherer, Department of Systems and Information Engineering
Graduate Advisor: Thomas Pomroy, Department of Systems and Information Engineering
Client Advisor: Douglas Fuller, Providian Financial, San Francisco, CA

KEYWORDS: CART, Clustering, Distressed debt, Fraudster, Identity theft, Regression, Probabilistic modeling

ABSTRACT

In order to effectively produce quality decisions in the modern credit card industry, knowledge must be gained through effective data analysis and modeling. Through the use of dynamic, data-driven decision-making tools and procedures, information can be gathered to successfully evaluate all aspects of credit card operations. Specifically, the areas of bankruptcy, fraud, and collections were examined to show the benefits that implementing such practices could provide. Methodologies ranging from Markov chains to clustering to rule-based decision theory were combined with tools such as CART, S+, Excel, and Access to yield these insights.

INTRODUCTION

San Francisco-based Providian Financial prides itself on the effective use of data-driven decision making throughout its business practices. In particular, its lending strategies tend to target the underserved market of high-risk borrowers. As with any risk-oriented venture, Providian’s business strategy requires the highest possible quality and quantity of information. This necessitates the methodologies and tools discussed in the following sections.

CLASSIFYING FRAUDULENT TRANSACTIONS

Background

Providian suffers significant losses every year from fraudulent transactions on its credit cards. Three main types of fraud cause the most significant losses, adding up to millions of dollars each year. The three types on which the analysis focused were lost/stolen, forged response, and non-receipt. Lost/stolen fraud occurs when a customer loses the card or the card is stolen while in the customer's possession. Forged response occurs when a fraudster fills out an application pretending to be someone else with a better credit history, typically after stealing that person's personal information (identity theft). Non-receipt fraud occurs when the card is first mailed to the legitimate customer after the application is approved: the card is typically stolen in the mail, and the customer never receives it. The following figure depicts the lifetime of a credit card and pinpoints where each type of fraud occurs. Since forged response involves identity theft, the figure also shows when identity theft can take place.

Figure 1: Fraudulent Transaction Depiction: This graphic shows a few common ways of perpetrating credit card fraud.

The main achievement of this portion was the transformation of raw data into useful information, with the first step being to gain an understanding of the data set as a whole. General descriptive statistics are important because they provide the basic framework from which all other conclusions are derived. Moreover, such information acted as a benchmark for judging whether later conclusions make sense and fit with the general data or should be reevaluated for errors. Descriptive statistics were also used to check whether smaller subsamples of the data set are representative of the data as a whole.

Though fraud causes significant loss, there are proportionally few cases of fraud each month compared to the total number of accounts.
In the sample of data provided by Providian, only 0.34% of the accounts were fraudulent. The following table lists the number and percentage of accounts of each type in the data set.

General Breakdown of Accounts

Type of Account       Number of Accounts   Percent of Accounts
Non-fraudulent (N)    305,688              99.66%
Fraudulent (L)        1,045                0.34%

Figure 2: Fraudulent Accounts: This chart shows the numeric and percentage values associated with fraudulent and non-fraudulent accounts in the data set.

Rule-Based Modeling Decisions

There were two main phases in modeling the fraud risk of individual credit transactions:

(1) Gathering data for a series of transaction characteristics and comparing the fraud and non-fraud account averages.
(2) Incorporating the characteristics that differed significantly into a risk scorecard capable of predicting the likelihood of fraud in a given transaction.

The first phase of the modeling process involved analyzing fraud and non-fraud transactions based on all available transaction data. Of the eleven variables analyzed, the following six were found to be significant: hours since the last transaction, number of declines, number of cash purchases, number of ATM purchases, merchant code, and transaction amount. A variable was considered significant whenever the percent difference between the average for the fraudulent population and the average for the non-fraudulent population exceeded 30%. In the case of merchant code, a code was treated as significant (high risk) whenever the rate of fraudulent transactions for that merchant code significantly exceeded the overall average rate of fraud. These characteristics form the foundation of the model because of their potential for flagging transactions as fraudulent.

The second phase of modeling developed a risk scorecard. This model used the significant characteristics established in phase one to generate a score for every transaction, based on that transaction’s data.
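The phase-one screening rule (a percent difference of more than 30% between the fraud and non-fraud averages) lends itself to a short illustration. The sketch below is not the team's implementation, which the paper does not show. It assumes the percent difference is measured relative to the non-fraud average, and the function names, variable names, and toy values are hypothetical.

```python
from statistics import mean


def percent_difference(fraud_values, nonfraud_values):
    """Absolute percent difference of the fraud mean relative to the non-fraud mean.

    Assumption: the non-fraud average is the baseline; the paper does not say.
    """
    base = mean(nonfraud_values)
    if base == 0:
        raise ValueError("non-fraud mean is zero; percent difference is undefined")
    return abs(mean(fraud_values) - base) / abs(base) * 100.0


def screen_variables(fraud, nonfraud, threshold_pct=30.0):
    """Return the names of variables whose averages differ by more than threshold_pct."""
    return [
        name
        for name in fraud
        if percent_difference(fraud[name], nonfraud[name]) > threshold_pct
    ]


# Hypothetical toy data, not taken from the Providian sample.
fraud = {
    "num_declines": [2, 3, 1, 2],
    "transaction_amount": [100.0, 110.0, 105.0],
}
nonfraud = {
    "num_declines": [0, 1, 1, 0],
    "transaction_amount": [95.0, 100.0, 105.0],
}

significant = screen_variables(fraud, nonfraud)
print(significant)
```

In this toy data only `num_declines` passes the screen, because its averages differ by far more than 30% while the transaction amounts differ by 5%. Merchant code would need separate handling, since it is a categorical variable judged by its fraud rate and not by an average.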
The scorecard value generated for each transaction was then used to assess the likelihood that the transaction was fraudulent. The scorecard was implemented using Visual Basic scripts in Microsoft Access, and its accuracy and performance were then analyzed using Microsoft Excel.

Each transaction was evaluated individually on each characteristic. A point was awarded if a characteristic differed from the non-fraud average by more than 10% of the non-fraudulent standard deviation. However, points were awarded only for deviations in the direction indicating fraud, since accounts that are statistically safer than average should not be penalized. The 10% threshold was selected through iteration as the value that maximized the classification accuracy of the risk scorecard. In the case of merchant code, a point was awarded whenever the transaction occurred at a high-risk merchant code. Since there are six characteristics, any transaction could have a scorecard value from 0 to 6, depending on how many triggers it satisfied. For example, an account with very little time since the last transaction, making a $1 purchase at a high-risk merchant, but with no recent cash purchase, ATM withdrawal, or decline, would have a score of 3.

The following figure highlights the performance of the scorecard by breaking down the percent of fraudulent transactions that fell into each score category.

Score                  0        1        2        3        4        5        6
% Fraud Transactions   15.38%   30.47%   47.25%   54.05%   71.90%   71.36%   76.92%

Figure 3: Score Fraud Frequencies: This table shows the percent of transactions that are predicted to be fraudulent at each risk scorecard value.

If all transactions with a risk score of 3 or greater are predicted to be fraudulent, the accuracy in predicting fraudulent transactions is 60.1%. If only transactions with scores of 4 or greater are labeled fraudulent, the accuracy increases to 71.8%. The tradeoff is that the higher the score cutoff, the better the accuracy for that segment, yet fewer fraudulent accounts are actually captured. The cutoff with 71.8% accuracy misclassifies fewer non-fraudulent accounts as fraudulent, but it also misclassifies more fraudulent accounts as non-fraudulent. Providian would rather contact an account holder erroneously to check on suspicious purchases than let fraudulent transactions slip through. Since this second error, the false negative, is the more serious one for Providian, the cutoff of 3 (60.1% accuracy) was used to take advantage of its lower false negative rate. Therefore, any transaction with a scorecard value of 3 or greater is considered fraudulent, and this method identifies fraudulent transactions with roughly 60% accuracy.

Clustering

A clustering technique for detecting fraudulent accounts was also applied to the database of credit card accounts. The clustering procedure uses rules to group accounts with similar characteristics. These rules use the values of account characteristics to determine the cluster to which an account belongs. This system was applied to the database of fraudulent accounts in an effort to classify accounts according to how likely each account is to be fraudulent. For example, one large cluster of non-fraudulent accounts consists of accounts that make a few low charges and make a payment in the first month.

Five clusters of non-fraudulent accounts were identified with a high degree of accuracy, 99.97% or better, and these five clusters contained 28% of the total number of accounts. This result reduced the list of suspect accounts by over a quarter. Providian can not only save significantly through lower operating costs but also focus detection efforts on the remaining accounts, which have a higher probability of being fraudulent.

Though the clustering technique could not clearly establish which accounts are fraudulent, it did quickly split the accounts into suspicious and unsuspicious groups, allowing Providian to better concentrate resources, time, and money.

COLLECTIONS

Background

Providian’s subsidiary, First Select Corporation (FSC), is the largest credit card debt collector in the United States, purchasing billions of dollars' worth of defaulted credit card debt each year for approximately six cents on the dollar. Accounts are collected through calls, letters, and in some cases, legal action. Throughout the payment process, FSC must continually decide what to do with an account: continue to attempt collections or sell the account. This makes knowing whether an account will continue to pay of the utmost importance.

Value Analysis for Distressed Credit Card Debt

Isolating key account attributes proved the most effective way to value Providian’s distressed credit card debt portfolio. Initially, potential variables were plotted against the desired metrics to reveal relationships between predictor and target variables visually. This analysis showed that many account attributes were positively or negatively correlated with the account’s cash flow. The most important predictor variables identified were recency and frequency of past payments. If an account made a payment in any particular month, it had a 90% chance of making a payment in the next two months. Correspondingly, a positive relationship between the number of past payments and the probability of future payments was also discovered: the larger the number of past payments, the higher the probability of future payments.
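A recency statistic of this kind (the chance of a payment within the next two months, given a payment in the current month) can be estimated directly from monthly payment flags. The following is a minimal sketch under stated assumptions: the data layout (one list of 0/1 flags per account), the function name, and the sample histories are all hypothetical and are not drawn from FSC data.

```python
def prob_pay_again(histories, window=2):
    """Estimate P(payment within the next `window` months | payment this month).

    `histories` is a list of per-account lists of 0/1 monthly payment flags.
    """
    paid_months = 0
    paid_again = 0
    for flags in histories:
        # Only months with a complete look-ahead window are counted.
        for m in range(len(flags) - window):
            if flags[m]:
                paid_months += 1
                if any(flags[m + 1 : m + 1 + window]):
                    paid_again += 1
    if paid_months == 0:
        raise ValueError("no payment months with a complete look-ahead window")
    return paid_again / paid_months


# Hypothetical payment flags for three accounts (1 = paid that month).
histories = [
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 0],
]

estimate = prob_pay_again(histories)
print(f"{estimate:.3f}")
```

Months too close to the end of a history are skipped so that every counted month has a full two-month look-ahead. Counting them anyway would bias the estimate downward.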
After this exploratory step, a regression model was used to identify characteristics with predictive value. Using the statistical software package S+, p-values were generated for many predictor variables. In predicting whether an account will pay again, three predictor variables were highly significant with respect to the target variable: recency of past payments, frequency of payments over the last four months, and percentage of initial balance paid. Other variables were significant at the 0.05 level: initial balance, balance remaining, frequency of calls made, frequency of right-party contacts, status, and rollout.

Once the characteristics with predictive value were identified, both target and predictor variables were entered into CART. CART is “the most advanced decision-tree technology for data analysis, preprocessing and predictive modeling. CART is a robust data-analysis tool that automatically searches for important patterns and relationships and quickly uncovers hidden structure even in highly complex data” [Steinberg]. CART identified rules that separate the data according to different attributes. For example, in a model predicting whether an account would pay again, CART determined that most accounts that had made no more than one payment in the last five months would not pay again. This is a general rule; however, the rules changed with each month that Providian owned the accounts, depending on how long the accounts had been owned. Accordingly, monthly rules for classifying the accounts were established. This produced a methodology that could be performed on a monthly basis to separate non-paying accounts from accounts that continued to pay. The final methodology incorporated the rules given by CART for months 1-15.
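To make the example rule concrete, the sketch below applies it to monthly payment flags. It hand-codes the single rule quoted above and does not reproduce the CART tree or the month-by-month rules, which the paper does not list. The function name, parameter names, and sample accounts are hypothetical.

```python
def predicted_to_pay_again(payment_flags, lookback=5, max_payments_for_nonpayer=1):
    """Apply one CART-style rule to a list of 0/1 monthly payment flags.

    Accounts with no more than `max_payments_for_nonpayer` payments in the
    last `lookback` months are predicted not to pay again.
    """
    recent = payment_flags[-lookback:]
    return sum(recent) > max_payments_for_nonpayer


# Hypothetical accounts (1 = paid that month, most recent month last).
accounts = {
    "A": [0, 1, 0, 0, 0, 0],  # one payment in the last five months
    "B": [1, 0, 1, 0, 1, 1],  # three payments in the last five months
}

predictions = {acct: predicted_to_pay_again(flags) for acct, flags in accounts.items()}
print(predictions)
```

One way to represent month-specific rules would be a lookup from month of ownership to its own `lookback` and threshold values. Those values would have to come from the fitted tree for each month.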
Applied every month, these rules increase Providian’s ability to distinguish accounts that will pay again (have worth) from accounts that have stopped paying (no worth).

Figure 4: Classification CART Tree: This figure depicts the CART tree used to develop the classification rules for the model. Each splitting node shows the criterion for that splitter and the percentage of paying and non-paying accounts that reached that path. Each terminal node shows the number of accounts of each type that were classified in that node and the percentage of paying and non-paying accounts that make up that node.

ANALYZING BANKRUPT ACCOUNTS

The Providian bankruptcy data was grouped into 20 discrete states, allowing for a different form of analysis. Tracking the flow of an account from state to state provided a view of the actual state transition process that account holders went through. By tracing these paths, along with the expected income at each state, we are able to generate an estimate of the future value of each account.

Figure 5: One-Step Transition: This diagram depicts the probability, p, of going from state 1 to state 2; that is, given that the model was in state 1 in the first time period, p is the probability that the model is in state 2 in the next time period.

The first step in tracing these paths is to create a matrix of one-step transition probabilities. Following these states over the lifetime of Providian’s bankruptcy process allows us to determine some underlying characteristics of its customers. The transition matrix for the bankruptcy model showed high recurrence probabilities, that is, a strong tendency for an account to stay in the same state after a transition period. This is expected given the slow nature of many of the bankruptcy stages.
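A one-step transition matrix can be estimated by counting how often accounts move from each state to each other state between consecutive periods. The sketch below assumes a simple layout (one state sequence per account, states numbered from 0) and uses three made-up states, whereas the paper uses 20 states. The function name and sample paths are hypothetical. The recurrence probabilities are the diagonal entries of the estimated matrix.

```python
def estimate_transition_matrix(paths, n_states):
    """Estimate one-step transition probabilities by counting observed moves.

    `paths` is a list of state sequences, one per account, with states
    numbered 0..n_states-1 and one entry per time period.
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for path in paths:
        for current, nxt in zip(path, path[1:]):
            counts[current][nxt] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # States never observed as a starting point get an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical state paths for three accounts.
paths = [
    [0, 0, 1, 1, 1, 2],
    [0, 1, 1, 2],
    [0, 0, 0, 1, 2],
]

P = estimate_transition_matrix(paths, n_states=3)
recurrence = [P[i][i] for i in range(3)]  # probability of staying in each state
print(recurrence)
```

In this toy example no account is ever observed leaving state 2, so its row is left as zeros. A fuller model would need an explicit choice for such states, for example treating them as absorbing.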
Figure 6 shows the probability of staying in each of the 20 states, as well as the corresponding expected length of stay in each state. For a state with recurrence probability $p$, the number of periods spent in the state follows a geometric distribution with mean $\sum_{n=1}^{\infty} n\,p^{n-1}(1-p) = \frac{1}{1-p}$; the lengths of stay tabulated in Figure 6 correspond to $\frac{1}{1-p} + p$.

Recurrence Probabilities

State   Transition Prob   Length of Stay
1       40.1%             2.0
2       65.3%             3.5
3       78.0%             5.3
4       64.3%             3.4
5       40.0%             2.0
6       71.5%             4.2
7       72.2%             4.3
8       10.7%             1.2
9       75.8%             4.9
10      6.1%              1.1
11      79.4%             5.6
12      81.9%             6.3
13      68.6%             3.8
14      0.0%              1.0
15      74.2%             4.6
16      93.3%             15.9
17      56.7%             2.8
18      89.6%             10.5
19      51.9%             2.6
20      80.0%             5.8

Figure 6: Transition Matrix Statistics: This chart quantifies the recurrence probabilities associated with the one-step probability matrix, including the estimated length of stay in each state.

Ultimately, this analysis shows us the important characteristics of the bankruptcy lifecycle. As one can see, the average consumer who enters state 16 (Bankruptcy) stays for about 16 months, while states such as 10 and 1 do not have strong recurrent properties.

CONCLUSION

Providian is constantly modifying and updating its data-driven decision network to formulate strategies that best capitalize on the opportunities of this dynamic market. By effectively using various modeling and data analysis methods, much knowledge was gained about the various aspects of Providian’s credit card operations. The insight gained on basic account operations is appreciable, because having accurate information influences everything from policy implementation to the bottom line. From bankruptcy to fraud to collections, our analysis proved highly beneficial to Providian.

REFERENCES

Breiman, Friedman, Olshen, Stone. Classification and Regression Trees. St. Louis: Wadsworth, 1984.

Dwyer, Robert. “Customer Lifetime Valuation to Support Marketing Decision Making.” Journal of Direct Marketing. Vol. 11, No. 4 (1997): 613.

Lucas, Peter. “Why Recoveries are on the Rise; Scoring Models and Databases are Helping Collectors Boost Recovery Rates.” Collections & Recovery. Vol. 13, No. 7. October 2000. 14 October 2001. http://web.lexis-nexis.com/universe.

Steinberg, Dan and Phillip Colla. CART--Classification and Regression Trees. San Diego, CA: Salford Systems, 1998.

BIOGRAPHIES

Josh Larew is a fourth-year Systems Engineer from Morgantown, West Virginia. When Josh is not cranking out SQL queries in Access, he can be found at the Birdwood Golf Course scrambling to make par. Next year Josh will either be working on a submarine (no joke) or be unemployed and waiting to go to law school.

Stephen Fonzone is a fourth-year Systems Engineer from Allentown, Pennsylvania. When not using RTPs and PTPs to predict customer lifetime value, Steve can be found singing Springsteen and playing Super Tecmo Bowl (although not necessarily at the same time). Next year Steve will live in a van down by the river.

Kathryn Hite is a fourth-year Systems Engineer from Houston, Texas. When not clustering transactions to catch fraudsters, Kathryn can be found extolling the virtues of her native state of Texas. Next year she will follow Josh wherever he may go.

Jennifer Greenspan is a fourth-year Systems Engineer from Chicago, Illinois. She spends the majority of her time establishing and analyzing fraud triggers but can also be seen watching Office Space and running (but she usually watches Office Space while sitting). The only group member to actually get a real job prior to graduation, Jen will be working in DC for Capital One.

Christopher Allred is a fourth-year Systems Engineer from Avon, Connecticut. He can usually be found taking any kind of data and turning it into a Markov chain. He has also been known to drink a lot of cider and to be surly about staying in Charlottesville for another year, where he will be completing his master's degree.