Modeling Customer Choice

By David R. Roberts

Introduction: Imagine you've just been assigned as a brand analyst for a particular model of Sport Utility Vehicle (SUV). Your first assignment is to determine what, if any, relationship exists between income and a decision to purchase an SUV. You are given data from a random sample of 1500 recent new car buyers, which includes the buyer's household income and whether the buyer purchased an SUV or other type of car. Your boss expects a briefing by tomorrow morning, specifically answering the following two questions: 1) Is there a relationship between income and SUV purchase?, and 2) How would you "model" this relationship in order to predict future behavior (e.g., how likely is a person with an income of X to buy an SUV)? Note: this example is simplified (one predictor only) for illustrative purposes. The first question is easy; the second is not.

Overview: The purpose of this paper is to: 1) discuss the shortcomings of "traditional" statistical techniques when modeling customer choice, and 2) discuss a powerful, but relatively obscure, technique called Logistic Regression. Throughout the paper, I will make use of the "SUV analysis" example mentioned above. This paper will review some of the steps that a business analyst might take to answer the two questions posed. The limitations of these techniques will then provide the motivation for the use of Logistic Regression. The Logistic Regression model is:

P(x) = e^(B0 + B1*X) / (1 + e^(B0 + B1*X))

(The mathematics behind this will be described later in the paper.)

"Traditional" Techniques: How would a business analyst (perhaps a newly minted MBA) approach this problem? The first step might be to look at the distribution of all purchasers by income. This would result in the following chart:

[Figure: Household Income of 1500 Purchasers of New Vehicles - number of purchasers by household income, plotted separately for SUV and non-SUV buyers]

Now we can begin to get an understanding of behavior. It appears that at incomes below $70,000, buyers are more likely to purchase a non-SUV, while at incomes above $70,000, buyers are more likely to purchase an SUV. This certainly hints at a potential relationship between income and SUV purchase. Do SUV purchasers have higher incomes than non-SUV purchasers? The chart indicates YES. This could also be answered more explicitly with a hypothesis test - the null hypothesis would be that the average income of SUV buyers is the same as the average income of non-SUV buyers. It turns out that this null hypothesis can be rejected with near-certain confidence (see Appendix I for details). Alternatively, we could construct a conditional probability table of purchase decision given a level of income. The test statistic computed from such a table has a chi-square distribution, and is also very significant for virtually any bucketing of incomes (see Appendix II for details).
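Both of these significance checks are easy to reproduce in SAS. The following is a minimal sketch (not part of the original analysis), assuming the data are already in a dataset TMP with variables INC (household income, in $1,000s) and SUV (0/1); the $80K cutoff for bucketing is the one used in Appendix II:

   PROC TTEST DATA=TMP;             /* two-sample test of mean income */
      CLASS SUV;                    /* 0 = non-SUV buyer, 1 = SUV buyer */
      VAR INC;
   RUN;

   DATA TMP2;                       /* bucket income for the 2 x 2 table */
      SET TMP;
      HIGHINC = (INC >= 80);        /* 1 if income >= $80K, else 0 */
   RUN;

   PROC FREQ DATA=TMP2;
      TABLES HIGHINC*SUV / CHISQ;   /* chi-square test of association */
   RUN;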
So, now we've shown that there is, indeed, a relationship between income and an SUV purchase decision. That much is fairly straightforward - these techniques are found in many undergraduate and graduate level statistics textbooks. However, we also want to model this relationship. How should we do that?

Linear Regression: Any card-carrying MBA will know that a linear regression model is typically well suited for such an analysis. With income as the independent variable, and SUV purchase (0/1) as the dependent variable, the regression model derived from our test data is:

SUV = -0.21 + 0.0086*INC

With an R-squared near 20% and an F-value over 300, we can say with near certainty that income explains about 1/5 of the decision to buy an SUV. This model indicates that, for a buyer with an income of $100,000, the probability of choosing an SUV for his or her next car purchase is 65%. This all looks pretty reasonable so far. As a final sanity check before submitting this to the boss, let's find the probability of SUV purchase for someone with an income of $150,000. Plugging $150,000 into the regression model yields a result of 1.08! A new car buyer with an income of $150,000 has a 108% chance of buying an SUV! What's going on here?

[Figure: Linear Regression of Pr(SUV Purchase) on Household Income - the SUV (0/1) observations and the fitted regression line (Reg_line), which exceeds 1 at high incomes and drops below 0 at low incomes]

Perhaps the relationship is not linear? How about a regression model that is non-linear in X? It turns out that any "transformation" of X (e.g., square root of X, X squared, etc.) will still produce the same effect, in this case with virtually no impact on the significance of the model, as the chart above illustrates. Why does this happen? A linear regression model simply minimizes the sum of the squared distances from the regression line. No matter what transformation you make to X (which will change the scale on the X-axis), a linear regression model will select a line that connects one point on the bottom row with one point on the top row! Try a few simple examples on your own, and you'll see this same effect.

A "linear" regression model is quite flexible in terms of its ability to handle non-linear predictors (via "transformations") and dichotomous inputs (via "dummy" variables). However, it is ill suited to model dichotomous results from continuous predictors. [Note: we could model dichotomous results from discrete predictors, but such a model does not translate well to continuous predictors - see Appendix III for details.] If the result is always either a 0 or a 1, and we want to estimate a probability, we need a model whose calculated result is always between 0 and 1. What would such a model look like? How can we model dichotomous results from continuous predictors? This is the motivation for logistic regression.

Logistic Regression: Now let's "derive" the logistic regression model presented in the introduction. Let P(x) = the probability that some event occurs, and let I1 = an input (independent) variable. We want a relationship P(x) = f(I1), such that P(x) is always between 0 and 1. As I1 gets very large or very small, P(x) must approach 0 or 1 asymptotically (i.e., the relationship must be "S-shaped"). One way to achieve this is to let f(I1) = I2 / (1 + I2). As I2 approaches 0, f(I1) also approaches 0; as I2 gets very large, f(I1) approaches 1. We're all set, except for one small detail - what if I2 is negative? If I2 is negative, our model falls apart (i.e., the calculated probability will be negative). So we must "force" I2 to be positive. One way to do this is to set I2 = e raised to some power. Regardless of the exponent, I2 will be positive (see Appendix IV for details). So, let I2 = e^Y. Our model, then, is:

P(x) = e^Y / (1 + e^Y)

Now, regardless of the value of Y, P(x) will always be between 0 and 1. This model also has the desirable property that P(x) will approach 0 or 1 asymptotically as Y gets very negative or very positive, respectively.
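In symbols, the bounds and limits claimed above can be checked directly; this is just a restatement of the paragraph above, using only the fact that e^Y > 0 for all Y:

\[
0 < \frac{e^{Y}}{1+e^{Y}} < 1 \quad \text{for all } Y,
\qquad
\lim_{Y\to-\infty}\frac{e^{Y}}{1+e^{Y}} = \frac{0}{1+0} = 0,
\qquad
\lim_{Y\to+\infty}\frac{e^{Y}}{1+e^{Y}} = \lim_{Y\to+\infty}\frac{1}{1+e^{-Y}} = 1 .
\]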
The final step in the development of this model is to acknowledge that we don't know the value of Y - it is a dependent variable (dependent on X, which is our predictor variable). We can estimate Y with the function:

Y = B0 + B1*X

This looks quite a bit like a regression line - we will use this to our advantage, as you'll see in the next section. Here, X is the independent (i.e., "predictor") variable. Y is the dependent variable in this expression, but we really want to estimate P(x). Substituting Y = B0 + B1*X into the equation for P(x) yields:

P(x) = e^(B0 + B1*X) / (1 + e^(B0 + B1*X))

This is the logistic regression equation we saw in the introduction section of this paper. We will use it to model the dichotomous (i.e., "binary") behavior of the SUV analysis presented earlier. There are other types of models that are useful for dichotomous behavior, but the Logistic Regression model has a few important advantages (see Appendix V for details).

Modeling Behavior: Now that we have a model, how do we fit the parameters of this model to our data? In order to estimate our parameters, we need a "logit transformation". This essentially allows us to use a "linear regression - type" model. If P(x) = the probability of some event occurring, then 1 - P(x) = the probability that the event did not occur. Now consider the ratio P(x) / [1 - P(x)]. Mathematicians refer to this as the "Odds Ratio", because it denotes the "odds" of a certain event. For example, if the probability of an event occurring is 25%, then the "odds" of this event occurring are (.25 / .75), or "1 to 3" (any gamblers reading this will know that racetracks use "odds" instead of probabilities). Now consider the Odds Ratio in terms of our model for P(x):

Odds Ratio = [e^(B0 + B1*X) / (1 + e^(B0 + B1*X))] / [1 - e^(B0 + B1*X) / (1 + e^(B0 + B1*X))]

We can simplify this messy equation to e^(B0 + B1*X). Why would we want to do that? Because now we can take the natural log of this expression to obtain B0 + B1*X. This natural log of the Odds Ratio is called the logit - now our transformation is complete. The logit can take on ANY value, and it is linear in X! Hooray!
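The simplification is worth writing out once. Noting that 1 - P(x) = 1 / (1 + e^(B0 + B1*X)), the messy ratio collapses as follows:

\[
\frac{P(x)}{1-P(x)}
= \frac{\dfrac{e^{B_0+B_1X}}{1+e^{B_0+B_1X}}}{\dfrac{1}{1+e^{B_0+B_1X}}}
= e^{B_0+B_1X},
\qquad
\ln\!\left(\frac{P(x)}{1-P(x)}\right) = B_0 + B_1X .
\]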
SUV Analysis: Now that we have derived the logistic regression model, and completed the logit transformation, let's use this to model SUV purchase behavior. We have a dataset that contains 1500 observations of new car buyers, indicating their annual household income, and whether or not they purchased an SUV (0=No, 1=Yes). The following is a brief SAS program that fits a logit model to our SUV test data:

   DATA TMP;
      INFILE 'C:\DRR\SASPAPER\LOGISTIC_DATA_N.PRN' TRUNCOVER LRECL=...;
      INPUT ... INC ... SUV ;
      OBS = _N_;
   RUN;

   PROC LOGISTIC DATA=TMP DESCENDING;
      TITLE 'LOGISTIC REGRESSION';
      MODEL SUV = INC;
   RUN;

A portion of the output of this program follows:

                        The LOGISTIC Procedure

                           Response Profile

                     Ordered
                       Value      SUV      Count
                           1        1        645
                           2        0        855

                Analysis of Maximum Likelihood Estimates

                     Parameter    Standard                  Pr >
   Variable    DF     Estimate       Error    Chi-Square    Chi-Square
   INTERCPT     1      -3.3481         ...           ...       ...
   INC          1       0.0405     0.00271           ...       0.0001

      Association of Predicted Probabilities and Observed Responses

                     Concordant = 74.4%
                     Discordant = ...
                     Tied       = ...
                     (551475 pairs)

There is much more to the output than is shown here, but this much will demonstrate how to interpret a logit model. The parameter estimates are:

B0 = -3.3481
B1 = 0.0405

The logit (L), then, is L = B0 + B1*X. So what is the probability that a new car buyer will buy an SUV? Plugging the above values into our original model, we obtain:

P(x) = e^(-3.3481 + 0.0405*X) / (1 + e^(-3.3481 + 0.0405*X))

Graphically, this model looks like:

[Figure: Logistic Regression of Pr(SUV Purchase) on Household Income - the SUV (0/1) observations and the fitted logistic curve (Logistic_Reg_curve)]

This plot shows the "S-shape" that is required for our predictions to approach 0 and 1 asymptotically. Note: the S-shape does not appear to be very pronounced, partly because the plot covers only the range of predictors given in the dataset. The "fit" from a linear regression and that from a logistic regression (as measured by the SSE) are virtually identical over the range of predictors. The linear model just becomes unreasonable as we try to predict an outcome for any predictor whose value is higher or lower than any we've already observed. For example, if we want to predict the probability that a new car buyer with an income of $300,000 will purchase an SUV, we get 237% with a linear model, and 99.98% from our logistic model. Not only does the logistic model yield results that are more reasonable, it also yields results with lower prediction error. It is important to remember that a good "fit" does not necessarily lead to high "predictive power". This is especially true when we are making predictions for values of X that are outside the range of the sample data. The logistic model, for example, recognizes that in terms of probability of SUV purchase, there is very little difference between an income of $200,000 and an income of $400,000 (not so with a standard linear model). Finally, the logistic model is more robust in the sense that it is less sensitive to input variations. For example, adding one single observation to our data set of 1500 observations, with an income of $400,000 and SUV=1, would change the parameters of the linear model substantially, while those of the logistic model would be unchanged to four decimal places!

So, how "good" is this model? There are a few different ways to answer this question. One way is to compare the estimated coefficient to an estimate of its standard deviation. In our example, 0.0405 / 0.00271 = 14.94. If B1 is normally distributed, we can reject the hypothesis that B1 = 0 with near-certain confidence. Current research indicates that, in a logistic regression, such a test can sometimes overly favor the null hypothesis (not true in a standard linear regression). The mathematics behind a logistic regression is much more complex than for a standard linear regression (see Appendix VI for details). However, in our case, we rejected the null hypothesis, so a bias in favor of the null hypothesis is not relevant to us. Another test for significance is given by the p-values in the output. The p-value for INC is 0.0001, which clearly indicates significance. Finally, we can construct a 95% Confidence Interval (C.I.) for the Odds Ratio (detailed below). Since this interval does not contain one, we have further indication that INC is a significant predictor.
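It is also easy to verify the fitted curve at specific incomes, including the 99.98% figure quoted above for a $300,000 income. The following data step is a minimal sketch (not from the original program; the dataset and variable names are illustrative):

   DATA CHECK;
      /* evaluate P(x) from the fitted model at selected incomes (X in $1,000s) */
      DO INC = 100, 150, 300;
         LOGIT = -3.3481 + 0.0405*INC;
         P_SUV = EXP(LOGIT) / (1 + EXP(LOGIT));   /* 0.67, 0.94, 0.9998 */
         OUTPUT;
      END;
   RUN;

   PROC PRINT DATA=CHECK;
   RUN;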
Now let's take a closer look at the meaning of the Odds Ratio. From the output, we can see that B1 = 0.0405. This means that for every 1-unit increase in X, the log of the Odds Ratio increases by 0.0405. Now let's select a more meaningful increment in X, say 20 (remember X is measured in thousands of dollars). As income increases by $20,000, we expect the log of the Odds Ratio to increase by 20*0.0405, or 0.81. Once we have a point estimate for the log of the Odds Ratio, we can simply exponentiate to determine an estimate of the Odds Ratio itself. Applying the exponent, we can expect the Odds Ratio to increase by a factor of e^(0.81), or 2.25, for every $20,000 increase in income. To construct a 95% C.I. for the Odds Ratio, we simply exponentiate the limits of a C.I. for the log of the Odds Ratio (following the logic stated above), as follows:

95% C.I. = e^(20*0.0405 +/- 1.96*20*0.00271), or
95% C.I. = (2.02, 2.50)

Note that this interval does NOT contain one, again indicating that we have a statistically significant model. Does it make sense that the Odds Ratio should increase by the same amount for EVERY $20,000 increase in income (i.e., should the effect from 30 to 50 be the same as the effect from 100 to 120)? On the surface, one would think NO. However, we need to keep in mind that we're measuring the Odds Ratio, NOT the underlying probabilities. In our SUV example, a constant slope for B1 (i.e., a logit that is linear in X) DOES make sense - see Appendix VII for details.

Sanity Check: One other piece of output that I've found useful as a sanity check is the concordant / discordant information. Consider every possible pair of observations where one observation has SUV=0 and the other has SUV=1. In our dataset of 1500 observations, there are 855 with SUV=0 and 645 with SUV=1. Therefore, there are 551,475 (=645*855) "pairs" that contain one of each. Now for each pair, you need to guess which one has SUV=1, based only on the model and the value of the predictor variable, INC. The observation that shows a higher calculated probability would be our guess for the one with SUV=1. If we guess correctly, the pair is labeled "concordant", otherwise "discordant". In our simple model, a higher INC leads to a higher calculated probability of SUV purchase. Based on the concordant results, we can say that if we randomly selected one SUV purchaser and one non-SUV purchaser, the SUV purchaser would have a higher income 74.4% of the time. This concordant / discordant information serves as a very useful sanity check, particularly as our models become more complex (e.g., more predictors).

Conclusions: Logistic Regression is a very useful technique for modeling dichotomous results from continuous predictors. Though we used only one predictor in our SUV analysis, a logistic regression model can handle multiple predictors of varying types (continuous, categorical, or dummy). For example, in our SUV analysis, we could probably increase the predictive power of the model by including a continuous predictor variable for Age, and perhaps a dummy variable (0/1) indicating whether the purchaser had young children. Logistic regression is relatively more common in medicine than in the business world, since medical results are more likely to have a binary nature (e.g., heart attack or not, pregnant or not, etc.). However, the technique also has applicability in business - modeling customer choice is just one example.

Contact Information: The author may be contacted at [email protected].

APPENDIX I – Hypothesis test on income

In our sample of 1500 buyers, the average income of SUV buyers is $86,401. The average income of non-SUV buyers in our sample is $66,133. Taking the difference of the sample means, and dividing by the "pooled" standard deviation, yields a test statistic of 17.6. This means that the difference of the sample means is 17.6 standard deviations away from 0, so we can reject the null hypothesis with near-certain confidence.
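For reference, this statistic has the standard pooled two-sample form shown below. This is a textbook restatement, not a reproduction of the paper's calculation: the sample standard deviations are not reported, so the pooled estimate s_p is left symbolic, and the group sizes (645 SUV, 855 non-SUV) are taken from the Sanity Check section:

\[
t = \frac{\bar{x}_{\text{SUV}} - \bar{x}_{\text{non-SUV}}}
         {s_p\sqrt{\tfrac{1}{n_{\text{SUV}}}+\tfrac{1}{n_{\text{non-SUV}}}}}
  = \frac{86{,}401 - 66{,}133}{s_p\sqrt{\tfrac{1}{645}+\tfrac{1}{855}}}
  \approx 17.6 .
\]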
APPENDIX II – Chi-Square test

We can construct a 2 x 2 table, as follows:

             Non-SUV     SUV     Total
   <$80K
   >$80K
   Total

If we make one such table for "actuals" and one such table for "expected values" (calculated from row and column totals), we can compute a chi-square statistic. In our example, the chi-square test statistic is 177 (calculations omitted). Since the chi-square statistic with one degree of freedom and 99% confidence is only 6.6, we can conclude that the association between these measures is NOT due to sampling.

APPENDIX III – Modeling dichotomous results from discrete predictors

Modeling dichotomous results from discrete predictors does not translate well to continuous predictors. The reason is that the discrete predictors will be highly dependent on sampling. For example, how would I estimate the probability of SUV purchase for someone with an income of $95,000 if there were no observations near $95,000? I'd need to take some observations on either side of $95,000. When I start choosing how far on either side to go, or how many observations to use, I introduce some subjectivity into my model.

APPENDIX IV – Forcing I2 to be positive with an exponent

If I2 = e^Y, I2 will always be positive. For example, e^(-1) = 0.368, e^0 = 1.000, and e^1 = 2.718.

APPENDIX V – Advantages of a Logistic Regression model

Some cumulative distribution functions can be useful for modeling dichotomous results from continuous predictors. The logistic distribution is generally favored because it is very flexible, and it has a "meaningful" interpretation (as opposed to a function that simply fits the data).

APPENDIX VI – Logistic math is more complex

In a linear regression, the likelihood equations (used to calculate the parameter estimates) are linear in B0 and B1, and are easily solved. In a logistic regression, these equations are non-linear in B0 and B1, and require fairly elaborate, iterative procedures to solve. In a linear regression, the error term is normally distributed (with mean 0 and constant variance). In a logistic regression, the error term is NOT normal. In fact, it can take on only two values: -P(x) or 1 - P(x), depending on whether the actual result is 0 or 1, respectively. This represents a binomial distribution with mean 0 and variance P(x) * (1 - P(x)). Even the Odds Ratio is not quite "normal" - it tends to be skewed positive, since it cannot cross zero. The same is true, to a lesser degree, of the log of the Odds Ratio. As sample sizes get very large, these distributions become more normal.
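To make the "non-linear likelihood equations" point concrete, here is the textbook form of the logistic log-likelihood and its score equations (a standard statement, not reproduced from the paper). Because P(x_i) depends non-linearly on B0 and B1, setting the derivatives to zero yields no closed-form solution, and iterative methods (e.g., Newton-Raphson) are required:

\[
\ell(B_0,B_1) = \sum_{i=1}^{n}\Big[\,y_i\ln P(x_i) + (1-y_i)\ln\big(1-P(x_i)\big)\Big],
\qquad
P(x_i)=\frac{e^{B_0+B_1x_i}}{1+e^{B_0+B_1x_i}},
\]
\[
\frac{\partial \ell}{\partial B_0}=\sum_{i=1}^{n}\big(y_i-P(x_i)\big)=0,
\qquad
\frac{\partial \ell}{\partial B_1}=\sum_{i=1}^{n}x_i\big(y_i-P(x_i)\big)=0 .
\]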
APPENDIX VII – Does a logit that is linear in X make sense?

Should the effect from 30 to 50 really be the same as the effect from 100 to 120? This seems counter-intuitive. It turns out that the effect on the Odds Ratio is the same, while the effect on the underlying probabilities is drastically different. Let's look at some specific numbers. At an income of $30K, P(x) = 11%, while 1 - P(x) = 89%. The Odds Ratio, then, is 12% (=11%/89%). At an income of $50K, P(x) = 21%, while 1 - P(x) = 79%. The Odds Ratio, then, is 27% (=21%/79%). As we increased income by $20K, the Odds Ratio increased by a factor of 2.25. Is this really valid for ANY $20K increase? Let's look at another example. At an income of $100K, P(x) = 67%, while 1 - P(x) = 33%. The Odds Ratio, then, is 202% (=67%/33%). At an income of $120K, P(x) = 82%, while 1 - P(x) = 18%. The Odds Ratio, then, is 453% (=82%/18%). As we saw before, we increased income by $20K, and the Odds Ratio increased by a factor of 2.25. So, IN OUR EXAMPLE, a logit that is linear in X makes sense (as it does for most models). However, if you determine that a linear logit is inappropriate for your data, you can use categorical variables, dummy variables, and/or some non-linear transformation of X. This is beyond the scope of this paper, but may be my topic for next year's conference!
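As a worked check, the four probabilities quoted above follow directly from the fitted model P(x) = e^(-3.3481 + 0.0405*X) / (1 + e^(-3.3481 + 0.0405*X)), with X in thousands of dollars:

\[
P(30)=\frac{e^{-2.133}}{1+e^{-2.133}}\approx 0.11,\quad
P(50)=\frac{e^{-1.323}}{1+e^{-1.323}}\approx 0.21,\quad
P(100)=\frac{e^{0.702}}{1+e^{0.702}}\approx 0.67,\quad
P(120)=\frac{e^{1.512}}{1+e^{1.512}}\approx 0.82,
\]

and each $20K step multiplies the odds by the same factor, e^(20*0.0405) = e^(0.81), or about 2.25.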