Modeling Customer Choice
By David R. Roberts
Introduction:
Imagine you’ve just been assigned as a brand analyst for a particular model of Sport Utility Vehicle (SUV). Your first assignment is to determine what, if any, relationship exists between income and a decision to purchase an SUV. You are given data from a random sample of 1500 recent new car buyers, which includes each buyer’s household income and whether the buyer purchased an SUV or another type of car. Your boss expects a briefing by tomorrow morning, specifically answering the following two questions:
1) Is there a relationship between income and SUV purchase? and
2) How would you “model” this relationship in order to predict future behavior (e.g., how likely is a person with an income of X to buy an SUV)? Note: this example is simplified (one predictor only) for illustrative purposes.
The first question is easy; the second is not.
Overview:
The purpose of this paper is to: 1) discuss the shortcomings of “traditional” statistical techniques when modeling customer choice, and 2) discuss a powerful, but relatively obscure, technique called Logistic Regression. Throughout the paper, I will make use of the “SUV analysis” example mentioned above. This paper will review some of the steps that a business analyst might take to answer the two questions posed. The limitations of these techniques will then provide the motivation for the use of Logistic Regression. The Logistic Regression model is:

P(x) = e^(B0 + B1X) / (1 + e^(B0 + B1X))

(The mathematics behind this will be described later in the paper.)
“Traditional” Techniques:
How would a business analyst (perhaps a newly minted MBA) approach this problem? The first step might be to look at the distribution of all purchasers by income. This would result in the following chart:

[Figure: Household Income of 1500 Purchasers of New Vehicles. Histogram of the number of purchasers by household income (in $1,000s, from 0 to 160), with separate bars for SUV and non-SUV buyers.]

Now we can begin to get an understanding of behavior. It appears that at incomes below $70,000, buyers are more likely to purchase a non-SUV, while at incomes above $70,000, buyers are more likely to purchase an SUV. This certainly hints at a potential relationship between income and SUV purchase. Do SUV purchasers have higher incomes than non-SUV purchasers? The chart indicates YES.

This could also be answered more explicitly with a hypothesis test: the null hypothesis would be that the average income of SUV buyers is the same as the average income of non-SUV buyers. It turns out that this null hypothesis can be rejected with near-certain confidence (see Appendix I for details).
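As an aside, the hypothesis test described in Appendix I can be sketched in a few lines of Python. The group means ($86.4K and $66.1K) and group sizes (645 SUV buyers, 855 non-SUV buyers) come from the paper; the standard deviation used to simulate the incomes is an assumption for illustration, since the paper reports only the test statistic.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated stand-in for the sample of 1500 buyers (incomes in $1,000s).
# Means and sizes are from the paper; the spread (sd = 25) is assumed.
suv_inc = rng.normal(loc=86.4, scale=25.0, size=645)
non_suv_inc = rng.normal(loc=66.1, scale=25.0, size=855)

# Pooled-variance two-sample t-test, as described in Appendix I.
t_stat, p_value = stats.ttest_ind(suv_inc, non_suv_inc, equal_var=True)
print(f"t = {t_stat:.1f}, p = {p_value:.3g}")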
Alternatively, we could construct a conditional probability table of purchase decision given a level of income. The resulting test statistic has a chi-square distribution, and is also very significant for virtually any bucketing of incomes (see Appendix II for details).

So, now we’ve shown that there is, indeed, a relationship between income and an SUV purchase decision. That much is fairly straightforward – these techniques are found in many undergraduate and graduate level statistics textbooks. However, we also want to model this relationship. How should we do that?
Linear Regression:
Any card-carrying MBA will know that a linear regression model is typically well suited for such an analysis. With income as the independent variable, and SUV purchase (0/1) as the dependent variable, the regression model derived from our test data is:

SUV = -0.21 + 0.0086*INC

(where INC is household income in thousands of dollars). With an R-squared near 20% and an F-value over 300, we can say with near certainty that income explains about 1/5 of the variation in the decision to buy an SUV. This model indicates that, for a buyer with an income of $100,000, the probability of choosing an SUV for his or her next car purchase is 65%. This all looks pretty reasonable so far. As a final sanity check before submitting this to the boss, let’s find the probability of SUV purchase for someone with an income of $150,000. Plugging $150,000 into the regression model yields a result of 1.08! A new car buyer with an income of $150,000 has a 108% chance of buying an SUV! What’s going on here?
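To see the problem concretely, the fitted line can be evaluated in a few lines of Python (an illustrative sketch, using the coefficients above):

# The fitted linear model: SUV = -0.21 + 0.0086 * INC (INC in $1,000s).
def linear_prob(inc):
    return -0.21 + 0.0086 * inc

for inc in (100, 150, 300):
    print(f"income ${inc}K -> predicted 'probability' {linear_prob(inc):.2f}")
# $100K -> 0.65, $150K -> 1.08, $300K -> 2.37: beyond roughly $140K,
# the predictions are no longer valid probabilities.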
Perhaps the relationship is not linear? How about a regression model that is non-linear in X? It turns out that any “transformation” of X (e.g., square root of X, X squared, etc.) will still produce the same effect, in this case with virtually no impact on the significance of the model. This can be seen graphically in the chart below:

[Figure: Linear Regression of Pr(SUV Purchase) on Household Income. The 0/1 SUV purchase observations and the fitted regression line, plotted against household income (in $1,000s, from 0 to 160). The fitted line falls below 0 at the lowest incomes and climbs above 1 at the highest.]

Why does this happen? A linear regression model simply minimizes the sum of the squared distances from the regression line. No matter what transformation you make to X (which will change the scale on the X-axis), a linear regression model will select a line that connects one point on the bottom row with one point on the top row! Try a few simple examples on your own, and you’ll see this same effect.

A “linear” regression model is quite flexible in terms of its ability to handle non-linear predictors (via “transformations”) and dichotomous inputs (via “dummy” variables). However, it is ill suited to model dichotomous results from continuous predictors. [Note: we could model dichotomous results from discrete predictors, but such a model does not translate well to continuous predictors; see Appendix III for details.] If the result is always either a 0 or a 1, and we want to estimate a probability, we need a model whose calculated result is always between 0 and 1. What would such a model look like? How can we model dichotomous results from continuous predictors? This is the motivation for logistic regression.
Logistic Regression:
Now let’s “derive” the logistic regression model presented in the introduction. Let P(x) = the probability that some event occurs. Let I1 = an input (independent) variable. We want a relationship P(x) = f(I1), such that P(x) is always between 0 and 1. As I1 gets very large or very small, P(x) must approach 0 or 1 asymptotically (i.e., the relationship must be “S-shaped”). One way to achieve this is to let f(I1) = I2 / (1 + I2). As I2 approaches 0, f(I1) also approaches 0; as I2 gets very large, f(I1) approaches 1.

We’re all set, except for one small detail: what if I2 is negative? If I2 is negative, our model falls apart (i.e., the calculated probability will be negative). So we must “force” I2 to be positive. One way to do this is to set I2 = e raised to some power. Regardless of the exponent, I2 will be positive (see Appendix IV for details). So, let I2 = e^Y. Our model, then, is:

P(x) = e^Y / (1 + e^Y)

Now, regardless of the value of Y, P(x) will always be between 0 and 1. This model also has the desirable property that P(x) will approach 0 or 1 asymptotically as Y gets very negative or very positive, respectively.
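A quick numerical check of this property (an illustrative Python sketch, not part of the original analysis):

import numpy as np

# P = e^Y / (1 + e^Y) stays strictly between 0 and 1 for any Y,
# approaching the bounds only asymptotically.
for y in (-20.0, -2.0, 0.0, 2.0, 20.0):
    p = np.exp(y) / (1.0 + np.exp(y))
    print(f"Y = {y:+5.1f} -> P = {p:.10f}")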
The final step in the development of this model is to acknowledge that we don’t know the value of Y; it is a dependent variable (dependent on X, which is our predictor variable). We can estimate Y with the function:

Y = B0 + B1X

This looks quite a bit like a regression line; we will use this to our advantage, as you’ll see in the next section. Here, X is the independent (i.e., “predictor”) variable. Y is the dependent variable in this expression, but we really want to estimate P(x). Substituting Y = B0 + B1X into the equation for P(x) yields:

P(x) = e^(B0 + B1X) / (1 + e^(B0 + B1X))

This is the logistic regression equation we saw in the introduction section of this paper. We will use it to model the dichotomous (i.e., “binary”) behavior of the SUV analysis presented earlier. There are other types of models that are useful for dichotomous behavior, but the Logistic Regression model has a few important advantages (see Appendix V for details).
Modeling Behavior:
Now that we have a model, how do we fit the parameters of this model to our data? In order to estimate our parameters, we need a “logit transformation”. This essentially allows us to use a “linear regression”-type model.

If P(x) = the probability of some event occurring, then 1 - P(x) = the probability that the event did not occur. Now consider the ratio P(x) / [1 - P(x)]. Mathematicians refer to this as the “Odds Ratio”, because it denotes the “odds” of a certain event. For example, if the probability of an event occurring is 25%, then the “odds” of this event occurring are (.25 / .75), or “1 to 3” (any gamblers reading this will know that racetracks use “odds” instead of probabilities). Now consider the Odds Ratio in terms of our model for P(x):

Odds Ratio = [e^(B0 + B1X) / (1 + e^(B0 + B1X))] / (1 - [e^(B0 + B1X) / (1 + e^(B0 + B1X))])

We can simplify this messy equation to:

Odds Ratio = e^(B0 + B1X)

Why would we want to do that? Because now we can take the natural log of this expression to obtain B0 + B1X. This natural log of the Odds Ratio is called the logit – now our transformation is complete. The logit can take on ANY value, and it is linear in X! Hooray!
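In code, the transformation and its inverse look like this (an illustrative Python sketch, using the 25% example above):

import numpy as np

def odds(p):
    """The 'Odds Ratio' of the text: p / (1 - p)."""
    return p / (1.0 - p)

def logit(p):
    """Natural log of the odds; can take on any real value."""
    return np.log(odds(p))

def inv_logit(y):
    """Back to a probability: e^y / (1 + e^y)."""
    return np.exp(y) / (1.0 + np.exp(y))

p = 0.25
print(odds(p))              # 0.333..., i.e., odds of "1 to 3"
print(logit(p))             # -1.0986...
print(inv_logit(logit(p)))  # 0.25: the transformation is invertible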
SUV Analysis:
Now that we have derived the logistic regression model, and completed the logit transformation, let’s use this to model SUV purchase behavior. We have a dataset that contains 1500 observations of new car buyers, indicating their annual household income, and whether or not they purchased an SUV (0=No, 1=Yes). The following is a brief SAS program that fits a logit model to our SUV test data:

DATA TMP;
  INFILE 'C:\DRR\SASPAPER\LOGISTIC_DATA_N.PRN' TRUNCOVER LRECL=...;
  INPUT @... VAR... @... VAR... @... INC @... SUV;
  OBS = _N_;
RUN;

PROC LOGISTIC DATA=TMP DESCENDING;
  TITLE 'LOGISTIC REGRESSION';
  MODEL SUV = INC;
RUN;

A portion of the output of this program follows:

The LOGISTIC Procedure

Response Profile
  Ordered Value    SUV    Count
        1           1      645
        2           0      855

Analysis of Maximum Likelihood Estimates
  Variable    DF    Parameter Estimate    Standard Error    Pr > Chi-Square
  INTERCPT     1        -3.3481                ...                ...
  INC          1         0.0405              0.00271             0.0001

Association of Predicted Probabilities and Observed Responses
  Concordant = 74.4%    Discordant = ...    Tied = ...    (551475 pairs)
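For readers without SAS, an equivalent fit can be sketched in Python with the statsmodels package. The data below are simulated (the original LOGISTIC_DATA file is not available) by drawing from the fitted model reported next, so the estimates should land close to the paper’s values:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Simulated stand-in for the 1500 buyers: INC in $1,000s, SUV drawn from
# the logistic model using the paper's fitted coefficients.
inc = rng.uniform(10.0, 160.0, size=1500)
p = np.exp(-3.3481 + 0.0405 * inc) / (1.0 + np.exp(-3.3481 + 0.0405 * inc))
suv = rng.binomial(1, p)

# Equivalent of PROC LOGISTIC ... MODEL SUV = INC; (models Pr(SUV = 1)).
result = sm.Logit(suv, sm.add_constant(inc)).fit(disp=0)
print(result.params)  # estimates of B0 and B1
print(result.bse)     # their standard errors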
There is much more to the SAS output than is shown here, but this much will demonstrate how to interpret a logit model. The parameter estimates are:

B0 = -3.3481
B1 = 0.0405

The logit (L), then, is L = B0 + B1X. So what is the probability that a new car buyer will buy an SUV? Plugging the above values into our original model, we obtain:

P(x) = e^(-3.3481 + 0.0405X) / (1 + e^(-3.3481 + 0.0405X))
Graphically, this model looks like:

[Figure: Logistic Regression of Pr(SUV Purchase) on Household Income. The 0/1 SUV purchase observations and the fitted logistic curve, with the probability of SUV purchase plotted against household income (in $1,000s, from 0 to 160).]
The “fits” from a linear regression and from a logistic regression (as measured by the SSE) are virtually identical over the range of observed predictors. The linear model just becomes unreasonable as we try to predict an outcome for any predictor whose value is higher or lower than any we’ve already observed. For example, if we want to predict the probability that a new car buyer with an income of $300,000 will purchase an SUV, we get 237% with a linear model, and 99.98% from our logistic model. Not only does the logistic model yield results that are more reasonable, it also yields results with lower prediction error. It is important to remember that a good “fit” does not necessarily lead to high “predictive power”. This is especially true when we are making predictions for values of X that are outside the range of the sample data. The logistic model, for example, recognizes that in terms of probability of SUV purchase, there is very little difference between an income of $200,000 and an income of $400,000 (not so with a standard linear model).

The plot above also shows the “S-shape” that is required for our predictions to approach 0 and 1 asymptotically. Note: the S-shape does not appear to be very pronounced, partly because the plot covers only the range of predictors given in the dataset.

Finally, the logistic model is more robust in the sense that it is less sensitive to input variations. For example, adding one single observation to our data set of 1500 observations, with an income of $400,000 and SUV=1, would change the parameters of the linear model substantially, while those of the logistic model would be unchanged to four decimal places!
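This claim is easy to try for yourself. The sketch below refits both models after appending one $400,000 SUV buyer; it reuses the simulated data from the earlier fitting sketch, so the exact shifts will not match the paper’s, but the pattern (a visible move in the linear slope, a negligible move in B1) should hold:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
inc = rng.uniform(10.0, 160.0, size=1500)
p = np.exp(-3.3481 + 0.0405 * inc) / (1.0 + np.exp(-3.3481 + 0.0405 * inc))
suv = rng.binomial(1, p).astype(float)

def slopes(x, y):
    linear_slope = np.polyfit(x, y, 1)[0]
    logistic_b1 = sm.Logit(y, sm.add_constant(x)).fit(disp=0).params[1]
    return linear_slope, logistic_b1

before = slopes(inc, suv)
after = slopes(np.append(inc, 400.0), np.append(suv, 1.0))  # one $400K SUV buyer
print(f"linear slope: {before[0]:.6f} -> {after[0]:.6f}")
print(f"logistic B1:  {before[1]:.6f} -> {after[1]:.6f}")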
So, how “good” is this model? There are a few different ways to answer this question. One way is to compare the estimated coefficient to an estimate of its standard deviation. In our example, 0.0405 / 0.00271 = 14.94. If B1 is normally distributed, we can reject the hypothesis that B1 = 0 with near certain confidence. Current research indicates that, in a logistic regression, such a test can sometimes overly favor the null hypothesis (not true in a standard linear regression); the mathematics behind a logistic regression is much more complex than for a standard linear regression (see Appendix VI for details). However, in our case, we rejected the null hypothesis, so a bias in favor of the null hypothesis is not relevant to us. Another test for significance is given by the p-values in the output. The p-value for INC is 0.0001, which clearly indicates significance. Finally, we can construct a 95% Confidence Interval (C.I.) for the Odds Ratio (detailed below). Since this interval does not contain one, we have further indication that INC is a significant predictor.

Now let’s take a closer look at the meaning of the Odds Ratio. From the output, we can see that B1 = 0.0405. This means that for every 1-unit increase in X, the log of the Odds Ratio increases by 0.0405. Now let’s select a more meaningful increment in X, say 20 (remember, X is measured in thousands of dollars). As income increases by $20,000, we expect the log of the Odds Ratio to increase by 20*0.0405, or 0.81. Once we have a point estimate for the log of the Odds Ratio, we can just exponentiate to determine an estimate of the Odds Ratio itself. Applying the exponent, we can expect the Odds Ratio to increase by a factor of e^0.81, or 2.25, for every $20,000 increase in income. To construct a 95% C.I. for the Odds Ratio, we simply exponentiate the limits of a C.I. for the log of the Odds Ratio (following the logic stated above), as follows:

95% C.I. = e^(20*0.0405 ± 1.96*20*0.00271), or
95% C.I. = (2.02, 2.50)

Note that this interval does NOT contain one, again indicating that we have a statistically significant model.
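The same arithmetic in Python (the estimate and standard error are taken from the SAS output above):

import numpy as np

b1, se = 0.0405, 0.00271  # slope estimate and standard error from the output
delta = 20                # a $20,000 increase in income (X is in $1,000s)

point = np.exp(delta * b1)
low = np.exp(delta * (b1 - 1.96 * se))
high = np.exp(delta * (b1 + 1.96 * se))
print(f"Odds Ratio per $20K of income: {point:.2f}")  # ~2.25
print(f"95% C.I.: ({low:.2f}, {high:.2f})")           # ~(2.02, 2.50)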
Does it make sense that the Odds Ratio should increase by the same factor for EVERY $20,000 increase in income (i.e., should the effect from 30 to 50 be the same as the effect from 100 to 120)? On the surface, one would think NO. However, we need to keep in mind that we’re measuring the Odds Ratio, NOT the underlying probabilities. In our SUV example, a constant slope for B1 (i.e., a logit that is linear in X) DOES make sense; see Appendix VII for details.
Sanity Check:
One other piece of output that I’ve found useful as a sanity check is the concordant / discordant information. Consider every possible pair of observations where one observation has SUV=0 and the other has SUV=1. In our dataset of 1500 observations, there are 855 with SUV=0 and 645 with SUV=1. Therefore, there are 551,475 (= 645*855) “pairs” that contain one of each. Now for each pair, you need to guess which one has SUV=1, based only on the model and the value of the predictor variable, INC. The observation that shows the higher calculated probability would be our guess for the one with SUV=1. If we guess correctly, the pair is labeled “concordant”, otherwise “discordant”. In our simple model, a higher INC leads to a higher calculated probability of SUV purchase. Based on the concordant results, we can say that if we randomly selected one SUV purchaser and one non-SUV purchaser, the SUV purchaser would have the higher income 74.4% of the time. This concordant / discordant information serves as a very useful sanity check, particularly as our models become more complex (e.g., more predictors).
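Counting these pairs directly is straightforward in NumPy (simulated data again, so the percentage will only approximate the paper’s 74.4%):

import numpy as np

rng = np.random.default_rng(1)
inc = rng.uniform(10.0, 160.0, size=1500)
p = np.exp(-3.3481 + 0.0405 * inc) / (1.0 + np.exp(-3.3481 + 0.0405 * inc))
suv = rng.binomial(1, p)

# One entry per (SUV=1, SUV=0) pair. With a single positive-slope predictor,
# "higher calculated probability" is equivalent to "higher income".
diff = inc[suv == 1][:, None] - inc[suv == 0][None, :]
print(f"pairs: {diff.size:,}")
print(f"concordant: {(diff > 0).mean():.1%}, tied: {(diff == 0).mean():.1%}")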
Conclusions:
Logistic Regression is a very useful technique for modeling dichotomous results from continuous predictors. Though we used only one predictor in our example here, a logistic regression model can handle multiple predictors of varying types (continuous, categorical, or dummy). For example, in our SUV analysis, we could probably increase the predictive power of the model by including a continuous predictor variable for Age, and perhaps a dummy variable (0/1) indicating whether the purchaser had young children.

Logistic regression is relatively more common in medicine than in the business world, since medical results are more likely to have a binary nature (e.g., heart attack or not, pregnant or not, etc.). However, the technique also has applicability in business; modeling customer choice is just one example.

Contact Information:
The author may be contacted at [email protected].
APPENDIX I – Hypothesis test on income
In our sample of 1500 buyers, the average income of SUV buyers is $86,401. The average income of non-SUV buyers in our sample is $66,133. Taking the difference of the sample means, and dividing by the “pooled” standard deviation, yields a test statistic of 17.6. This means that the difference of the sample means is 17.6 standard deviations away from 0, so we can reject the null hypothesis with near-certain confidence.
APPENDIX II – Chi-Square test
We can construct a 2 x 2 table, as follows:

            Non-SUV    SUV    Total
  < $80K      ...      ...     ...
  > $80K      ...      ...     ...
  Total       855      645    1500

If we make one such table for “actuals” and one such table for “expected values” (calculated from the row and column totals), we can compute a chi-square statistic. In our example, the chi-square test statistic is 177 (calculations omitted). Since the critical value of a chi-square with one degree of freedom at 99% confidence is only 6.6, we can conclude that the association between income and vehicle type is NOT due to sampling error.
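The chi-square computation itself is easy to reproduce in Python. The cell counts below are hypothetical stand-ins; only the 855 / 645 column totals come from the paper:

import numpy as np
from scipy import stats

# Hypothetical 2x2 counts; rows = income bucket, columns = (non-SUV, SUV).
# Column totals match the paper's 855 non-SUV and 645 SUV buyers.
table = np.array([[600, 255],    # income < $80K
                  [255, 390]])   # income > $80K

chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
print(f"chi-square = {chi2:.1f}, dof = {dof}, p = {p:.3g}")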
APPENDIX III – Modeling dichotomous results from discrete predictors
Modeling dichotomous results from discrete predictors does not translate well to continuous predictors. The reason is that the discrete predictors will be highly dependent on sampling. For example, how would I estimate the probability of SUV purchase for someone with an income of $95,000, if there were no observations near $95,000? I’d need to take some observations on either side of $95,000. When I start choosing how far on either side to go, or how many observations to use, I introduce some subjectivity into my model.
APPENDIX IV – Forcing I2 to be positive with an exponent
If I2 = e^Y, then I2 will always be positive. For example, e^(-1) = 0.368, e^0 = 1.000, and e^1 = 2.718.
APPENDIX V – Advantages of a Logistic Regression model
Some cumulative distribution functions can be useful for modeling dichotomous results from continuous predictors. The logistic distribution is generally favored because it is very flexible, and it has a “meaningful” interpretation (as opposed to a function that simply fits the data).
APPENDIX VI – Logistic math is more complex
In a linear regression, the likelihood equations (used to calculate the parameter estimates) are linear in B0 and B1, and are easily solved. In a logistic regression, these equations are non-linear in B0 and B1, and require fairly elaborate, iterative procedures to solve.

In a linear regression, the error term is normally distributed (with mean 0 and constant variance). In a logistic regression, the error term is NOT normal. In fact, for any given observation it can take on only two values: -P(x) or 1 - P(x), depending on whether the actual result is 0 or 1, respectively. This represents a binomial distribution with mean 0 and variance P(x) * (1 - P(x)).

Even the Odds Ratio is not quite “normal”: it tends to be skewed positive, since it cannot cross zero. The same is true, to a lesser degree, of the log of the Odds Ratio. As sample sizes get very large, these distributions become more normal.
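To make “iterative procedures” concrete, here is a minimal Newton-Raphson sketch for the one-predictor case (an illustration only; statistical packages use more elaborate, safeguarded versions of this idea):

import numpy as np

def fit_logistic(x, y, iterations=25):
    """Newton-Raphson for a one-predictor logistic regression.

    The likelihood equations are non-linear in (B0, B1), so each step
    solves a local linear approximation instead of a closed form.
    """
    X = np.column_stack([np.ones_like(x), x])
    b = np.zeros(2)
    for _ in range(iterations):
        p = 1.0 / (1.0 + np.exp(-X @ b))   # current fitted probabilities
        w = p * (1.0 - p)                  # binomial variance P(x) * (1 - P(x))
        score = X.T @ (y - p)              # gradient of the log-likelihood
        info = X.T @ (X * w[:, None])      # information (negative Hessian)
        b = b + np.linalg.solve(info, score)
    return b

# Usage: b0, b1 = fit_logistic(income_array, suv_array)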
APPENDIX VII – Does a logit that is linear in X make sense?
Should the effect from 30 to 50 really be the same as the effect from 100 to 120? This seems counter-intuitive. It turns out that the effect on the Odds Ratio is the same, while the effect on the underlying probabilities is drastically different. Let’s look at some specific numbers. At an income of $30K, P(x) = 11%, while 1 - P(x) = 89%. The Odds Ratio, then, is 0.12 (= 11%/89%). At an income of $50K, P(x) = 21%, while 1 - P(x) = 79%. The Odds Ratio, then, is 0.27 (= 21%/79%). As we increased income by $20K, the Odds Ratio increased by a factor of 2.25. Is this really valid for ANY $20K increase? Let’s look at another example. At an income of $100K, P(x) = 67%, while 1 - P(x) = 33%. The Odds Ratio, then, is 2.02 (= 67%/33%). At an income of $120K, P(x) = 82%, while 1 - P(x) = 18%. The Odds Ratio, then, is 4.53 (= 82%/18%). As before, we increased income by $20K, and the Odds Ratio increased by a factor of 2.25.

So, IN OUR EXAMPLE, a logit that is linear in X makes sense (as it does for most models). However, if you determine that a linear logit is inappropriate for your data, you can use categorical variables, dummy variables, and/or some non-linear transformation of X. This is beyond the scope of this paper, but may be my topic for next year’s conference!
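These numbers can be verified directly from the fitted model (a quick Python check using the coefficients reported in the paper):

import numpy as np

b0, b1 = -3.3481, 0.0405  # fitted coefficients; income in $1,000s

def prob(income):
    return np.exp(b0 + b1 * income) / (1.0 + np.exp(b0 + b1 * income))

for lo, hi in ((30, 50), (100, 120)):
    odds_lo = prob(lo) / (1 - prob(lo))
    odds_hi = prob(hi) / (1 - prob(hi))
    print(f"${lo}K -> ${hi}K: odds {odds_lo:.2f} -> {odds_hi:.2f}, "
          f"factor {odds_hi / odds_lo:.2f}")
# Each $20K step multiplies the odds by the same factor, e^(20*0.0405) ~ 2.25,
# even though the underlying probabilities move very differently.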