Business Analytics II - Winter 2016
Managerial Economics & Decision Sciences Department
session five
▌key concepts
hypotheses, tests and confidence intervals
linear regression: estimation and interpretation 
linear regression: the dummy case 
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
learning objectives
► linear regression
• definition and assumptions of the linear model
• estimate the model, interpret coefficients
• understand the regression table when provided without data
• statistical significance, p-value and confidence intervals
► confidence and prediction intervals
• klincom and kpredint commands
• interpret and use output from klincom and kpredint commands
• klincom and kpredint commands: use for change in y and levels of y
► dummy variables
• definition and interpretation of dummy and slope dummy variables
• use of dummy and slope dummy regressions in hypothesis testing
readings
► (MSN)
• Chapters 2-5
linear regression – general overview
► A high-level description of the typical steps in a regression analysis:
Step I (Model specification): choice of dependent and independent variables
Step II (Coefficients estimation): run the regression, obtain estimated coefficients
Step III (Coefficients interpretation): sensitivity analysis (change in E[y] vs. change in xi)
Step IV (Tests of results): test coefficient significance
Step V (Confidence intervals): ranges for parameters (βi) and for the dependent variable
Step VI (Additional analysis): test of combinations of parameters
► Sometimes the steps above were fairly obvious; other times the analysis was not explicitly set forward in terms of these steps. But everything that we have done so far (and will continue to do) falls somewhere along this general roadmap.
► As a simple example: while the dummy regression seems to be a topic by itself, it is really nothing but a standard linear regression model; if you think that dummy vs. slope dummy is not captured above, it is really a matter of Step I – a correct and purpose-driven specification of the regression model.
► The intention of this review is to give an integrated view of the topics covered so far, not to substitute for the notes. As you read the detailed notes and solve practice problems, have this roadmap handy and try to figure out where your specific analysis/topic/problem is located.
linear regression: model specification
► The mechanics of running a linear regression are straightforward: no matter which variables you decide to include as independent variables (x1, …, xk), the model gets estimated. The most important aspect at the model-specification stage is to consider as independent variables only those for which there exist ex-ante (before estimation) reasons to believe that the variable under consideration really affects the dependent variable in a meaningful way. Can you argue why the x-variable might affect the y-variable?
► Ingredients of the linear regression model:
i. the assumption that the true mean of the dependent variable is explained by k independent variables through a linear relation
E[y] = β0 + β1·x1 + β2·x2 + … + βk·xk
ii. what is observed in practice is a sample of n observations for (y, x1, x2, …, xk), where, in terms of observations, we write
y = β0 + β1·x1 + β2·x2 + … + βk·xk + noise
Here the term “noise” indicates the part of the variation in y (around its mean) not explained by the deterministic part of the regression (β0 + β1·x1 + β2·x2 + … + βk·xk)
iii. the OLS (ordinary least squares) method delivers estimates b0, b1, b2, …, bk for the true parameters β0, β1, β2, …, βk respectively
linear regression: coefficient estimation
► This is always the easiest step. Again, it is worth emphasizing that it makes no difference whether the independent variables are continuous, discrete (dummy) or slope dummies: the same OLS method is applied.
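► A minimal sketch in Stata (y, x1 and x2 are placeholder variable names, not from a specific data set):
regress y x1 x2
The same regress command estimates the model whether the independent variables are continuous, dummies or slope dummies; the output is the regression table with the estimates b0, b1, …, bk, their standard errors, t-values, p-values and 95% confidence intervals.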
linear regression: coefficient interpretation
► For a regression without slope dummy variables, all the coefficients have the same interpretation:
bi (the coefficient of independent variable xi) represents the change in y when xi changes by one unit, holding all other independent variables at a fixed level (no matter what level)
► Why does a regression with slope dummy variables deserve a different treatment at this step? Notice in the statement above that the interpretation of bi requires that only xi changes, all other independent variables being held fixed. For a simple slope dummy regression
Est.E[y] = b0 + b1·dummy + b2·x + b3·(dummy·x)
if we attempt to interpret b2 as the change in y when x changes by one unit holding all other independent variables fixed, we run into a problem when dummy = 1, because in this case dummy·x = x and therefore this variable will also change when x changes. This is the reason why for dummy regressions we have to pay extra attention to how we interpret the coefficients; however, in doing this what we end up doing is simply a grouping of coefficients (b0 and b0 + b1, and b2 and b2 + b3) that allows us to apply the “holding everything else constant” condition properly.
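► To see the grouping concretely, write the estimated regression separately for the two groups:
dummy = 0: Est.E[y] = b0 + b2·x
dummy = 1: Est.E[y] = (b0 + b1) + (b2 + b3)·x
so b2 is the change in Est.E[y] per unit change in x within the dummy = 0 group, while b2 + b3 is the corresponding change within the dummy = 1 group.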
linear regression: tests of results
► While it seems that each time you solve a problem there’s a new test you have to apply and remember, there’s really a standard “testing procedure” that is applied again and again. The impression that there is no standardized strategy to perform a test is probably due to the fact that, for lack of time, we sometimes skip over some of the steps involved in testing.
► The standard (and very general) steps in any testing procedure are given below:
• determine the function/combination of coefficients that is tested, call it f(b0, b1, …, bk)
• determine the benchmark B* with which f(b0, b1, …, bk) is compared
• set the null hypothesis H0 – gather negative evidence against this statement
• set the alternative hypothesis (contrary of the null) Ha
• calculate the t-test = [f(b0, b1, …, bk) – B*]/std.err[f(b0, b1, …, bk)]
• calculate the p-value
- if Ha: f(b0, b1, …, bk) > B* then p-value = Pr[t > t-test] (right tail)
- if Ha: f(b0, b1, …, bk) < B* then p-value = Pr[t < t-test] (left tail)
- if Ha: f(b0, b1, …, bk) ≠ B* then p-value = Pr[t < –|t-test| or t > +|t-test|] (two-tail)
• set the significance level α ∈ (0,1)
• compare the p-value with your chosen significance level α; if p-value < α then reject the null
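► A sketch of the calculation steps in Stata for a single coefficient, using the numbers from the pizzasales example later in these notes (B* = 0, n – k – 1 = 48):
display (2.697244 - 0)/0.2777973
display 2*ttail(48, (2.697244 - 0)/0.2777973)
The first line computes the t-test, the second the two-tail p-value; for the right-tail alternative the p-value would be ttail(48, t-test).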
► For the simple case when the test involves a single coefficient of the regression you can just use the estimate and standard error of
that coefficient from the regression table.
► When the test involves a combination of coefficients (as in the case of testing whether the average of y is above a certain benchmark b*) for particular, given values of the independent variables x1, x2, …, xk:
• f(b0, b1, …, bk) = b0 + b1·x1 + … + bk·xk and B* = b*
■ null hypothesis H0: b0 + b1·x1 + … + bk·xk ≤ b*
■ alternative hypothesis Ha: b0 + b1·x1 + … + bk·xk > b*
► You will use the klincom or kpredint commands. Choose the p-value that corresponds to your specific alternative hypothesis, set the significance level α ∈ (0,1) and compare the p-value with α; if p-value < α then reject the null.
linear regression: confidence intervals
► The central idea behind any confidence interval is the following:
• The true value of your variable of interest is not known, but you do have a sample based on which you obtain an estimate (for your variable of interest you might consider just a coefficient, or a combination of coefficients, or the value of the dependent variable for given levels of the independent variables – this is context driven).
• Based on the sample, determine two numbers, call them bL and bU, as lower and upper bounds, such that you can say that with a certain chosen probability 1 – α, the true value of your variable of interest is above bL and below bU. You call [bL, bU] a 1 – α confidence interval for your variable of interest. The bounds will depend on the chosen significance: the lower the α, the wider the confidence interval.
• To see that indeed all these steps are related to each other: the variable of interest is the function/combination of coefficients f(b0, b1, …, bk) and the general form of any confidence interval is
bL = estimated value of f(b0, b1, …, bk) – std.error[f(b0, b1, …, bk)]·tdf,α/2
bU = estimated value of f(b0, b1, …, bk) + std.error[f(b0, b1, …, bk)]·tdf,α/2
• Depending on how complicated the combination of coefficients is (see the previous discussion on tests), you can perform these calculations “manually”, i.e. you can calculate or obtain the standard error fairly simply, or you need to run the klincom or kpredint commands.
► What can you calculate yourself? You can always determine
• the estimate of the variable of interest
• tdf,α/2 = invttail(n – k – 1, α/2)
• The third term (the standard error) is given directly in the regression table if you calculate a confidence interval for a coefficient. Otherwise you have to use the klincom or kpredint commands.
linear regression: example
► A simple example: let’s use the pizzasales.dta file and run the regression
Sales = b0 + b1·Income
------------------------------------------------------------------------------
       Sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      Income |   2.697244   .2777973     9.71   0.000     2.138695    3.255793
       _cons |   48.72079   87.57192     0.56   0.581    -127.3544    224.7959
------------------------------------------------------------------------------
► The table already provides you with the 95% confidence intervals for b0 and b1. Let’s confirm the calculations for b1 (the sample consists of n = 50 observations):
- the estimated value: b1 = 2.697244
- the t-value: t48,0.025 = invttail(48, 0.025) = 2.0106348
- the standard error: std.err.(b1) = 0.2777973
► The 95% confidence interval bounds:
bL = b1 – tdf,α/2·std.error[b1] = 2.697244 – 0.2777973·2.010634 = 2.138695
bU = b1 + tdf,α/2·std.error[b1] = 2.697244 + 0.2777973·2.010634 = 3.255793
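► You can confirm these bounds directly in Stata:
display 2.697244 - invttail(48, 0.025)*0.2777973
display 2.697244 + invttail(48, 0.025)*0.2777973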
► Let’s continue with the pizzasales.dta file
Sales = b0 + b1·Income
------------------------------------------------------------------------------
       Sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      Income |   2.697244   .2777973     9.71   0.000     2.138695    3.255793
       _cons |   48.72079   87.57192     0.56   0.581    -127.3544    224.7959
------------------------------------------------------------------------------
Remark: One issue that sometimes raises questions is why not use the t = 9.71 from the table for the confidence interval. That t is the t-test calculated for the null hypothesis that b1 = 0. Let’s confirm that:
■ null hypothesis H0: b1 = 0
■ alternative hypothesis Ha: b1 ≠ 0
• calculate t-test = [b1 – 0]/std.err[b1] = [2.697244 – 0]/0.2777973 = 9.71
• calculate p-value = Pr[t < –|t-test| or t > +|t-test|] = 2·ttail(48, 9.71) ≈ 0
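► You can check the p-value directly in Stata:
display 2*ttail(48, 9.71)
which returns a value that is zero for all practical purposes.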
[figure: the t-distribution for the _cons row, with the two p-value tail areas of 0.2905 each and the two confidence-interval tail areas of 0.025 each marked]
► The reported t and p-value are related to the testing part: the two tail areas sum to exactly the p-value (for _cons, 0.2905 + 0.2905 = 0.581).
► The reported conf. interval is the confidence interval for α = 0.05 (or 95% confidence): the two tail areas of 0.025 each sum to exactly the confidence level 0.05 for the intervals (for _cons, the bounds are –127.3544 and 224.7959 around the estimate 48.72079).
------------------------------------------------------------------------------
       Sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      Income |   2.697244   .2777973     9.71   0.000     2.138695    3.255793
       _cons |   48.72079   87.57192     0.56   0.581    -127.3544    224.7959
------------------------------------------------------------------------------
linear regression: klincom and kpredint
► The confidence interval is the interval that contains the true mean of y for x = x0 with probability 1 – α. The confidence interval draws inference about E[y | x = x0] based on Est.E[y | x = x0]. The confidence interval’s form is:
Est.E[y | x = x0] – std.err.ci·tdf,α/2 ≤ E[y | x = x0] ≤ Est.E[y | x = x0] + std.err.ci·tdf,α/2
► The prediction interval is the interval that contains any level of y for x = x0 with probability 1 – α. The prediction interval draws inference about y | x = x0 based on Est.E[y | x = x0]. The prediction interval’s form is:
Est.E[y | x = x0] – std.err.pi·tdf,α/2 ≤ y | x = x0 ≤ Est.E[y | x = x0] + std.err.pi·tdf,α/2
► Understanding the difference between klincom and kpredint:
• klincom is used in the context of providing inference for the level of the true mean/average of the dependent variable
• kpredint is used in the context of providing inference for the level of an individual value of the dependent variable
► Both klincom and kpredint can be used to obtain an interval (set the benchmark equal to zero) or to perform tests comparing the average or level of the dependent variable with a given benchmark (in this case you will include the benchmark in the command).
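► For intuition, a sketch with base Stata analogues (klincom and kpredint are the course-provided wrappers; Income = 500 is a hypothetical level):
regress Sales Income
lincom _cons + 500*Income
predict sef, stdf
The lincom output gives the estimate, standard error and 95% interval for the true mean E[Sales | Income = 500]; the stdf option of predict stores, at each observation, the standard error of the forecast (std.err.pi), which includes the noise variance and is therefore larger than std.err.ci. This is why prediction intervals are always wider than confidence intervals at the same x0.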
linear regression: dummy variables
► Whenever a dummy variable and a continuous variable are involved, there are three possible regressions you can set up:
• y = β0 + β1·dummy
For this setup you are only interested in finding the mean of y for dummy = 1 vs. dummy = 0, without controlling for anything else. As an example: test whether the Sales for pizza are different for neighborhoods with competitors vs. neighborhoods without competition (test β1 = 0).
• y = β0 + β1·dummy + β2·x
For this setup you force the regression to assume that the dummy variable has only a level effect on y.
• y = β0 + β1·dummy + β2·x + β3·(dummy·x)
For this setup you allow the regression to pick up the slope effect; you assume that the dummy variable and the continuous variable might interact. Useful for testing whether the change in y vs. the change in x is different when dummy = 1 vs. dummy = 0 (test β3 = 0); a sketch of this setup follows below.
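► Setting up the third regression in Stata (y, x and dummy stand for your variable names):
generate slopedummy = dummy*x
regress y dummy x slopedummy
The t and P>|t| reported on the slopedummy row give directly the two-tail test of β3 = 0.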