Using Classification Tree Outcomes to Enhance Logistic Regression Models

Tim Millard, Union Bank of California, Orange, CA

ABSTRACT

Traditional Logistic Regression techniques can involve complex data preparation. Often, in order to obtain a better linear fit, it is necessary to mathematically transform continuous variables and/or convert categorical variables into binary outcomes. While these methods can be successfully applied by an experienced modeler, they are rather time consuming and do not directly address issues of non-linear variables and variable interactions. For example, using PROC FREQ to find non-linear aspects of data can be difficult at best. However, by utilizing Classification Tree techniques, non-linear as well as linear variable interactions can be captured and then incorporated into Logistic Regression modeling in an easy-to-understand and timely manner. Examples of employing this method to predict delinquency in a bank portfolio will be explored.

INTRODUCTION

Building a successful Logistic Regression model can be a complex undertaking, involving a great deal of data preparation and exploration. Additionally, finding suitable variables in your data repositories may be a difficult process, depending on the state of the data resources available. Therefore, making maximum use of your variables can be important.

HISTORY

Traditionally, modelers have used Logistic Regression for various reasons. One is that Logistic Regression is very good at detecting linear relationships and combining them into an equation that provides the odds of the dependent variable reaching a particular outcome when the various independent variables are fed into the resulting equation. Another reason Logistic Regression models are widely used is that they are considered robust and not prone to overfitting the data.
This means you can build them and have reasonable assurance that they will continue to do the job, assuming your scored population is stable. However, Logistic Regression models do require a high level of data preparation. This is much less the case with Classification Trees, which will build a highly accurate model with very little data preparation required. The price for this is that Tree models tend to overfit the data, potentially causing you to rebuild a model more often. Fortunately, new Tree models are generally easy to build, since data preparation on the scale of what Logistic Regression requires is not necessary.

Another benefit of Classification Tree modeling arises because not all data relationships are linear in nature. An example would be a variable "A" where the high and low extremes represent higher risk while the middle values represent low risk. Sometimes Logistic Regression modeling can deal with this scenario, but now suppose that the best solution is that the low extreme values are more predictive when combined in an equation with variable "B", while the high extreme values are more predictive combined with variable "C". Logistic Regression may still come through if both "B" and "C" happen to be binary in nature, but it is more likely that a Logistic Regression model would ignore one or the other of these variables, perhaps utilizing only the stronger of the two or, depending on the other variables involved, neither. The above example is exactly the kind of data interaction that Tree models are designed to discover.

DISCOVERY

Recently, when exploring a Behavior Score modeling situation, I struggled with how to improve on an existing model while having only the same variables available that had previously been used.
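To make the "A"/"B"/"C" scenario concrete, here is a minimal sketch in Python (the code in this paper is SAS; the variable names, cut points, and values here are invented for illustration). It shows how the threshold indicators a tree produces can linearize a U-shaped risk pattern that no single coefficient on the raw variable can represent:

```python
# Hypothetical variable "A": risk is high at both extremes and low in the
# middle, so a single linear term on A cannot rank these cases correctly.
def tree_style_indicators(a, low_cut=20, high_cut=80):
    """Two binary flags mimicking tree splits on A; each flag can then
    interact with a different variable ("B" or "C") in a regression."""
    return {"a_low": int(a <= low_cut), "a_high": int(a > high_cut)}

for a in (5, 50, 95):
    print(a, tree_style_indicators(a))
```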
This particular model is used to set priorities for a collection unit by predicting which accounts in their workload are most likely to reach a 30-day delinquency status. At the time of scoring, accounts will have already reached a level of 15 days delinquent. History shows that many of these accounts will cure before reaching 30 days delinquent, but a subset will reach 30 days or more of delinquency, and best practices show that contacting these customers is important in preventing further account delinquency. Also, since a large portion of customers will self-cure, money and time can be saved by "knowing" which customers are likely to cure without contact.

One way to create new variables is to try various data transformations. Of course, this is a part of the normal Logistic Regression modeling process. Creating new data transformations is often more art than science, since it involves a large amount of experimentation. While exploring possible data transformation techniques, I found, in Olivia Parr Rud's Data Mining Cookbook, the idea of using a Classification Tree to discover variable interactions and then bringing those interactions into the Logistic Regression model process as new variables.

Something further that should be understood, and which is the factor that makes it possible to utilize interactions in Logistic Regression, is that you are really only uncovering categorical aspects of continuous variables. A Classification Tree model might split a continuous variable by determining that everything less than or equal to a particular value should fall to the left and everything greater falls to the right. The model then continues this process considering all available variables, even taking into account splits of variables that may be part of another branch of the tree.
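This recursive splitting can be sketched as follows (a minimal illustration in Python rather than SAS; the variable names echo those used in this paper, but the tree shape and cut points are hypothetical):

```python
def assign_node(delqdays30, lstpay1):
    """Walk a toy two-level tree down to an end node. Note that
    delqdays30 is split twice, on different branches, at different
    values; the lstpay1 cut point is invented for illustration."""
    if delqdays30 <= 0.5:
        return 1 if lstpay1 <= 100 else 2   # left branch splits again, on lstpay1
    elif delqdays30 <= 1.5:
        return 3
    else:
        return 4

print(assign_node(delqdays30=0, lstpay1=50))   # lands in end node 1
```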
For example, if one of the variables was "the number of times that an account was 30 days delinquent" (delqdays30), this same variable could be used more than once within the tree, perhaps split at a different point each time. One split could read "delqdays30 <= 0.5" while on a different branch another could read "delqdays30 <= 1.5". The overall effect of a Classification Tree treating variables in this way is to take both continuous and categorical variables and effectively convert them into categorical variables. Since Logistic Regression works well with categorical variables that have been converted into binary equivalents, using the Classification Tree outcomes in Logistic Regression is a good way of adding new variables to the Logistic Regression process.

MAKING IT WORK

The figure below represents a simple Classification Tree showing the model that resulted from feeding all the variables available for a Behavior Score modeling set into a Classification Tree program. Most Classification Tree programs will take a large number of variables and create a model that tends to be overfitted to the particular set of data provided. So the first important step is to set any parameters that will limit the size or number of outcome nodes the Classification Tree produces. The reason you probably want a simpler tree is that adding several hundred new variables to a Logistic Regression model is not likely to be beneficial.

[Figure: a Classification Tree with splits on LSTPAY1, DELQDAYS30, DATE1, TIMESINCOLL, DATE2, LSTPAY2, and MATDTE]

In the case of the tree above, I restricted the minimum number of cases that had to appear in each end node. The minimum number will depend on the size of your sample dataset and the type of event being modeled. In this case, requiring at least 40 cases to fall into each end node represented roughly 5% of the dataset.
There is nothing magical about this number, and depending on what is being modeled and the number of cases in your data, a 5% rule may or may not be a good thing. In this situation it kept the Classification Tree fairly simple and, as we will see later, worked out well for purposes of the Logistic Regression. Counting from left to right, the tree model above has 11 end nodes. Each end node is then a potential variable to be included in your Logistic Regression model. You will either have to write your own SAS® code to determine the node an account lands in, or utilize whatever functionality your Tree generation program provides to populate the new node variable. In my case, I relied on the Tree program to determine the correct node, because the program had some complex variable substitution rules that would have been difficult to program myself. However, since my dataset had very few missing values, I probably could have written my own SAS code to implement the decision rules for the Tree model. After the node is determined, you have a categorical variable for your Logistic Regression. At this point, I ran the following SAS code to convert the single categorical variable into a series of binary variables:

   if node = 1  then n1  = 1; else n1  = 0;
   if node = 2  then n2  = 1; else n2  = 0;
   if node = 3  then n3  = 1; else n3  = 0;
   if node = 4  then n4  = 1; else n4  = 0;
   if node = 5  then n5  = 1; else n5  = 0;
   if node = 6  then n6  = 1; else n6  = 0;
   if node = 7  then n7  = 1; else n7  = 0;
   if node = 8  then n8  = 1; else n8  = 0;
   if node = 9  then n9  = 1; else n9  = 0;
   if node = 10 then n10 = 1; else n10 = 0;
   if node = 11 then n11 = 1; else n11 = 0;

Basically, for each possible node value, I determine whether the corresponding binary variable gets a 1 or a 0. There may be more elegant ways to accomplish this, but this works and is straightforward.
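One of those more elegant alternatives, sketched here in Python for illustration (the paper's code is SAS, where an ARRAY-based DATA step would be the in-language equivalent), builds the same eleven dummy variables in a single expression:

```python
def node_dummies(node, n_nodes=11):
    """One binary flag per tree end node; exactly one flag equals 1."""
    return {f"n{i}": int(node == i) for i in range(1, n_nodes + 1)}

flags = node_dummies(5)
print(flags["n5"], sum(flags.values()))   # 1 1
```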
FORWARD MARCH

At this point I continued my Logistic Regression modeling in the normal fashion, with various data transformations, relevant categorical variable divisions, and my eleven new variables based on the variable interactions from the Classification Tree model. So what was the result? Did the Logistic Regression model utilize the new variables? Yes, in this case it did. In fact, nearly a fourth of the variables used were the Classification Tree node binary variables. Since I was trying to predict accounts reaching 30-day delinquency, and some of the node values did an excellent job of separating out those kinds of accounts, it should not be surprising that some of the tree node values would be used by the Logistic Regression model.

Of course, the interesting fact to realize about binary variables in a Logistic Regression model is that they have the mathematical effect of creating a series of separate models. Since each account is assigned to only one node value in a tree, when the Logistic Regression equation uses several of the tree nodes, only one will affect the resulting equation. For example, suppose account number 123456 has n5 = 1. By definition under the Tree model, all the other "n" variables for this account equal 0. Now if the Logistic Regression equation is something along the lines of X = intercept(-0.6995) + delqdays30(1.53) + n5(2.3) + n9(-1.43), then for account 123456, where n5 = 1 and all other "n" variables = 0, the formula is really X = intercept(-0.6995) + delqdays30(1.53) + n5(2.3), because with n9 = 0 there is nothing to add. This also means that for an account where n9 = 1, the actual formula is X = intercept(-0.6995) + delqdays30(1.53) + n9(-1.43), because n5 = 0.
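The arithmetic above can be checked with a short Python sketch (the coefficients come from the example equation in this paper; the delqdays30 value of 2 is invented). Because at most one node dummy equals 1, the other node term always drops out:

```python
import math

# Coefficients from the example equation in the text.
INTERCEPT, B_DELQ, B_N5, B_N9 = -0.6995, 1.53, 2.3, -1.43

def linear_predictor(delqdays30, n5, n9):
    """X from the example equation; only the active node dummy contributes."""
    return INTERCEPT + B_DELQ * delqdays30 + B_N5 * n5 + B_N9 * n9

x_a = linear_predictor(delqdays30=2, n5=1, n9=0)   # the n9 term drops out
x_b = linear_predictor(delqdays30=2, n5=0, n9=1)   # the n5 term drops out

# Logistic Regression then converts X into a probability of the outcome:
p_a = 1 / (1 + math.exp(-x_a))
print(round(x_a, 4), round(x_b, 4))   # 4.6605 0.9305
```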
And finally, in those cases where both n5 and n9 = 0, we have a third formula: X = intercept(-0.6995) + delqdays30(1.53), because neither node term contributes. Thus your Logistic Regression model is able to use binary variables to create powerful custom formulas.

CONCLUSION

The next time you are preparing variables for a Logistic Regression modeling project, consider utilizing the node variables from a Classification Tree to enhance your set of available variables. Since most Classification Tree programs are relatively simple to run, this extra step can be easily accomplished, and you may find that some of those variables are predictive within the context of a Logistic Regression modeling effort.

REFERENCES

Allison, Paul D. (1999). Logistic Regression Using the SAS System: Theory and Application. Cary, NC: SAS Institute Inc.

Cody, Ronald P., and Jeffrey K. Smith (1997). Applied Statistics and the SAS Programming Language. Upper Saddle River, NJ: Prentice-Hall, Inc.

Rud, Olivia Parr (2001). Data Mining Cookbook. New York: John Wiley & Sons, Inc.

CONTACT INFORMATION

The author welcomes questions and comments. Please direct inquiries to:

Tim Millard
Union Bank of California
500 S. Main St., Ste. 300
Orange, CA 92868
[email protected]
(714) 565-5597
(714) 565-5575 fax

The contents of this paper are the work of the author and do not necessarily represent the opinions, recommendations, or practices of Union Bank of California, N.A.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.