Using Classification Tree Outcomes to Enhance Logistic
Regression Models
Tim Millard, Union Bank of California, Orange, CA
ABSTRACT
Traditional Logistic Regression techniques can involve complex data preparation. Often, in order to obtain a
better linear fit, it is necessary to mathematically transform continuous variables and/or convert categorical
variables into binary outcomes.
While these methods can be successfully applied by an experienced modeler, they are rather time
consuming and do not directly address issues of non-linear variables and variable interactions. For example,
using Proc Freq to find non-linear aspects of data can be difficult at best.
However, by utilizing Classification Tree techniques, non-linear as well as linear variable interactions can be
captured and then incorporated into Logistic Regression modeling in an easy-to-understand and timely
manner. Examples of employing this method to predict delinquency in a bank portfolio will be explored.
INTRODUCTION
Building a successful Logistic Regression model can be a complex undertaking, involving a great deal of
data preparation and exploration. Additionally, finding suitable variables from your data repositories may be
a difficult process, depending on the state of the data resources available. Therefore, making maximum use
of your variables can be important.
HISTORY
Traditionally, modelers have used Logistic Regression for various reasons. One of those reasons is that
Logistic Regression is very good at detecting linear relationships and then combining those relationships into
an equation that provides the odds of the dependent variable reaching a particular outcome, when the
various independent variables are fed into the resulting equation.
Another reason that Logistic Regression models are widely used is that they are considered robust and not
prone to overfitting the data. This means you can build them and have reasonable assurance that they will
continue to do the job, assuming your scored population is stable. However, Logistic Regression models do
require a high level of data preparation.
This is much less the case with Classification Trees, which can build a highly accurate model with very little
data preparation. The price for this is that Tree models tend to overfit the data, potentially forcing you to
rebuild a model more often. Fortunately, however, new Tree models are generally easy to build, since
data preparation on the scale of what Logistic Regression requires is not necessary.
Another benefit of Classification Tree modeling stems from the fact that not all data relationships are linear
in nature. An example of this would be a variable "A" where the high and low extremes represent higher risk
while the middle values represent low risk. Sometimes Logistic Regression modeling can deal with this scenario,
but now let us suppose that the best solution is that low extreme values are more predictive when combined
in an equation with Variable "B", while the high extreme values are more predictive combined with Variable
"C". Logistic Regression may still come through if both Variable "B" and "C" happen to be binary in nature,
but it is more likely that a Logistic Regression model would ignore one or the other of these variables.
Logistic Regression would perhaps only utilize the stronger of the two, or depending on the other variables
involved, not utilize either. The above example is exactly the kind of data interaction that Tree models are
designed to discover.
DISCOVERY
Recently, when exploring a Behavior Score modeling situation, I struggled with how to improve on an
existing model while having only the same variables available that had previously been used. This particular
model is used to set priorities for a collection unit by predicting which accounts in its workload are most
likely to reach a 30-day delinquency status. At the time of scoring, accounts will have already reached a
level of 15 days delinquent. History shows that many of these accounts will cure before reaching 30 days
delinquent, but a subset will reach 30 or more days of delinquency, and best practices show that contacting
these customers is important in preventing further account delinquency. Also, since a large portion of
customers will self-cure, money and time can be saved by "knowing" which customers are likely to cure
without contact.
One way to create new variables is to try various data transformations. Of course, this is a part of the normal
Logistic Regression modeling process. Creating new data transformations is often more art than science
since it involves a large amount of experimentation.
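As one illustration (a minimal sketch; the dataset and variable names below are hypothetical, not taken from
the actual portfolio data), typical transformations in SAS might look like this:

    data transformed;
        set modelset;                      /* hypothetical input dataset */
        log_balance = log(balance + 1);    /* compress a right-skewed amount */
        util_sq     = utilization**2;      /* let a curved effect enter a linear fit */
    run;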
While exploring possible data transformation techniques, I came across, in Olivia Parr Rud's "Data
Mining Cookbook," the idea of using a Classification Tree to discover variable interactions and then bringing
those interactions as new variables into the Logistic Regression modeling process. Something further that
should be understood, and which is the factor that makes it possible to utilize these interactions in Logistic
Regression, is that you are really only uncovering categorical aspects of continuous variables. A
Classification Tree model might split a continuous variable by determining that everything less than or equal
to a particular value should fall to the left and everything greater falls to the right. The
model then continues this process considering all available variables, even taking into account splits of
variables that may already be part of another branch of the tree. For example, if one of the variables was "the
number of times that an account was 30 days delinquent" (delqdays30), this same variable could be used
more than once within the tree and perhaps split at a different point each time. One split for it could read
"Delqdays30 <= 0.5" while on a different branch it could read "Delqdays30 <= 1.5". The overall effect of a
Classification Tree treating variables in this way is to take both continuous and categorical variables and
effectively convert them into categorical variables. Since Logistic Regression works well with categorical
variables that have been converted into binary equivalents, using the Classification Tree outcomes is a good
way of adding new variables to the Logistic Regression process.
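To make the idea concrete, here is a minimal sketch of how the two splits quoted above act like categorical
binning when expressed as SAS code (the dataset and category names are assumptions for illustration):

    data binned;
        set modelset;                /* hypothetical input dataset */
        /* Two tree splits on the same continuous variable are
           equivalent to assigning it to one of three categories. */
        length delq_cat $6;
        if delqdays30 <= 0.5 then delq_cat = 'low';
        else if delqdays30 <= 1.5 then delq_cat = 'mid';
        else delq_cat = 'high';
    run;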
MAKING IT WORK
The figure below represents a simple Classification Tree model that resulted from feeding all the variables
available for a Behavior Score modeling set into a Classification Tree program. Most Classification Tree
programs will take a large number of variables and create a model that tends to be overfitted to the
particular set of data provided. So the first important step is to set any parameters that will limit the size of
the tree or the number of outcome nodes it produces. The reason you probably want to start with a simpler
tree is that adding several hundred new variables to a Logistic Regression model is not likely to be beneficial.
[Figure: a simple Classification Tree with splits on LSTPAY1, DELQDAYS30, DATE1, TIMESINCOLL, DATE2,
LSTPAY2, and MATDTE; DELQDAYS30 and DATE2 each appear at more than one split.]
In the case of the tree above, I restricted tree growth by setting a minimum number of cases that had to
appear in each end node. The appropriate minimum will depend upon the size of your sample dataset and
what type of event is being modeled. In this case, requiring at least 40 cases to fall into each end node
represented roughly 5% of the dataset. There is nothing magical about this number, and depending on what
is being modeled and the number of cases in your data, a 5% rule may or may not be a good thing. In this
situation it kept the Classification Tree fairly simple and, as we will see later, worked out well for purposes
of the Logistic Regression as well.
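The paper does not name the tree program used. As a hedged illustration only, in more recent SAS releases
PROC HPSPLIT can impose this kind of restriction through its LEAFSIZE= option (the dataset and target
names below are assumptions; LEAFSIZE=40 mirrors the 40-case minimum described above):

    proc hpsplit data=modelset leafsize=40;
        class delq30;                       /* binary delinquency target */
        model delq30 = lstpay1 delqdays30 date1 timesincoll
                       date2 lstpay2 matdte;
        output out=noded;                   /* per-account output, including
                                               the assigned leaf */
    run;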
Counting from left to right, the tree model above has 11 end nodes. Each end node is then a potential
variable to be included in your Logistic Regression model. You will either have to write your own SAS® code
to determine the node that an account lands in, or utilize whatever functionality your Tree generation
program provides to populate the new node variable. In my case, I relied on the
Tree program to determine the correct node. The reason I chose that route is that the tree program actually
had some complex variable substitution rules that would have been difficult to program myself. However,
since my dataset actually had very few missing values, I probably could have written my own SAS code to
generate the decision rules for the Tree model.
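Had I gone that route, the code would simply walk each branch of the tree. A minimal sketch follows; the
branch structure and cut points are hypothetical, since the fitted splits are not reproduced here:

    data noded;
        set modelset;            /* hypothetical input dataset */
        /* Each path through the tree ends in a node number. */
        if lstpay1 <= 250 then do;
            if delqdays30 <= 0.5 then node = 1;
            else node = 2;
        end;
        else do;
            if timesincoll <= 1.5 then node = 3;
            else node = 4;
        end;
    run;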
After the node is determined, you now have a categorical variable for your Logistic Regression. At this point,
I ran the following SAS code to convert the single categorical variable into a series of binary variables.
if node = 1 then n1 = 1; else n1 = 0;
if node = 2 then n2 = 1; else n2 = 0;
if node = 3 then n3 = 1; else n3 = 0;
if node = 4 then n4 = 1; else n4 = 0;
if node = 5 then n5 = 1; else n5 = 0;
if node = 6 then n6 = 1; else n6 = 0;
if node = 7 then n7 = 1; else n7 = 0;
if node = 8 then n8 = 1; else n8 = 0;
if node = 9 then n9 = 1; else n9 = 0;
if node = 10 then n10 = 1; else n10 = 0;
if node = 11 then n11 = 1; else n11 = 0;
Basically, for each possible node value, I determine whether the corresponding binary variable has a 1 or 0
value. There may be more elegant ways to accomplish this, but this works and is straightforward.
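One such alternative is a SAS array, sketched below; it relies on the fact that a SAS comparison such as
(node = i) evaluates directly to 1 or 0 (the dataset names are the hypothetical ones used earlier):

    data binary;
        set noded;               /* dataset containing the NODE variable */
        array n{11} n1-n11;
        do i = 1 to 11;
            n{i} = (node = i);   /* 1 if the account fell into node i, else 0 */
        end;
        drop i;
    run;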
FORWARD MARCH
At this point I continued my Logistic Regression modeling in the normal fashion with various data
transformations, relevant categorical variable divisions and my eleven new variables based on the variable
interactions from the Classification Tree model.
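The modeling step itself might look like the following minimal sketch (the dataset, target, and selection
settings are assumptions; the paper does not state which selection method was used). Because n1 through
n11 sum to 1 for every account, at most ten of them can enter alongside the intercept, and selection keeps
only the predictive ones:

    proc logistic data=modelset descending;
        /* DELQ30 = 1 when the account reached 30-day delinquency. */
        model delq30 = delqdays30 lstpay1 timesincoll n1-n11
              / selection=stepwise slentry=0.05 slstay=0.05;
    run;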
So what was the result? Did the Logistic Regression model utilize the new variables? Yes, in this case, it did.
In fact, nearly a fourth of the variables used were the Classification Tree node binary variables. Since I was
trying to predict accounts reaching 30-day delinquency and some of the node values did an excellent job of
separating out those kinds of accounts, it should not be surprising that some of the tree node values would
be used by the Logistic Regression model. Of course, the interesting fact to realize about binary variables
that are part of a Logistic Regression model is that they have the mathematical effect of creating a series of
separate models.
Since each account will be assigned to only one node value in a tree, when the Logistic Regression equation
uses several of the tree nodes, at most one of them will affect the resulting equation for any given account.
For example, suppose account number 123456 has n5 = 1. By definition under the Tree model, all the other
"n" variables for this single account would equal 0. Now if the Logistic Regression equation is something
along the lines of

X = -0.6995 + 1.53(delqdays30) + 2.3(n5) - 1.43(n9),

then for account 123456, where n5 = 1 and all other "n" variables = 0, the formula is really

X = -0.6995 + 1.53(delqdays30) + 2.3

because with n9 = 0 there is nothing to add. This also means that for an account where n9 = 1, the actual
formula is

X = -0.6995 + 1.53(delqdays30) - 1.43

because n5 = 0. And finally, in those cases where both n5 and n9 = 0, we have a third formula:

X = -0.6995 + 1.53(delqdays30)

Thus your Logistic Regression model is able to use binary variables to create powerful custom formulas.
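As a quick check of the arithmetic (a sketch only; the coefficients above are illustrative), the score X
converts to a probability through the logistic function p = 1/(1 + exp(-X)):

    data _null_;
        /* Hypothetical account in node 5 with one prior 30-day event. */
        delqdays30 = 1; n5 = 1; n9 = 0;
        x = -0.6995 + 1.53*delqdays30 + 2.3*n5 - 1.43*n9;
        p = 1 / (1 + exp(-x));      /* x = 3.1305, p is roughly 0.96 */
        put x= p=;
    run;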
CONCLUSION
The next time you are preparing variables for a Logistic Regression modeling project, consider utilizing the
node variables from a Classification Tree to enhance your set of available variables. Since most
Classification Tree programs are relatively simple to run, this extra step can be easily accomplished, and you
may find that some of those variables are predictive within the context of a Logistic Regression modeling
effort.
REFERENCES
Allison, Paul D. (1999). Logistic Regression Using the SAS System: Theory and Application. Cary, NC: SAS
Institute Inc.
Cody, Ronald P., and Jeffrey K. Smith (1997). Applied Statistics and the SAS Programming Language. Upper
Saddle River, NJ: Prentice-Hall, Inc.
Rud, Olivia Parr (2001). Data Mining Cookbook. New York: John Wiley & Sons, Inc.
CONTACT INFORMATION
The author welcomes questions and comments. Please direct inquiries to:
Tim Millard
Union Bank of California
500 S. Main St., Ste. 300
Orange, CA 92868
[email protected]
(714) 565-5597
(714) 565-5575 fax
The contents of this paper are the work of the author and do not necessarily represent the opinions,
recommendations or practices of Union Bank of California, NA.
SAS and all other SAS Institute Inc. products or service names are registered trademarks of SAS Institute
Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.