Business Analytics II - Winter 2016
Managerial Economics & Decision Sciences Department
session five
▌key concepts
hypotheses, tests and confidence intervals
linear regression: estimation and interpretation 
linear regression: the dummy case 
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
learning objectives
► linear regression
• definition and assumptions of the linear model
• estimate the model, interpret coefficients
• understand the regression table when provided without data
• statistical significance, p-value and confidence intervals
► confidence and prediction intervals
• klincom and kpredint commands
• interpret and use output from klincom and kpredint commands
• klincom and kpredint commands: use for change in y and levels of y
► dummy variables
• definition and interpretation of dummy and slope dummy variables
• use of dummy and slope dummy regressions in hypothesis testing
readings
► (MSN)
• Chapters 2-5
linear regression – general overview
► A high-level description of the typical steps in a regression analysis:
Step I (Model specification): choice of dependent and independent variables
Step II (Coefficients estimation): run the regression, obtain estimated coefficients
Step III (Coefficients interpretation): sensitivity analysis (change in E[y] vs. change in xi)
Step IV (Tests of results): test coefficient significance
Step V (Confidence intervals): ranges for parameters (βi) and for the dependent variable
Step VI (Additional analysis): test of combinations of parameters
► Sometimes the steps above were fairly obvious; other times the analysis was not explicitly set forward in terms of these steps. But everything that we have done so far (and will continue to do) falls somewhere along this general roadmap.
► As a simple example: while the dummy regression seems to be a topic by itself, it is really nothing but a standard linear regression model; if you think that dummy vs. slope dummy is not captured above, it is really a matter of Step I – a correct and purpose-driven specification of the regression model.
► The intention of this review is to give an integrated view of the topics covered so far, not to substitute for the notes. As you read the detailed notes and solve practice problems, have this roadmap handy and try to figure out where your specific analysis/topic/problem is located.
linear regression: model specification
► The mechanics of running a linear regression are straightforward: no matter which variables you decide to include as independent variables (x1, …, xk), the model gets estimated. The most important aspect at the model-specification stage is to consider as independent variables only those for which there exist ex-ante (before estimation) reasons to believe that the variable under consideration really affects the dependent variable in a meaningful way. Can you argue why the x-variable might affect the y-variable?
► Ingredients of the linear regression model:
i. the assumption that the true mean of the dependent variable is explained by k independent variables through a linear relation
E[y] = β0 + β1·x1 + β2·x2 + … + βk·xk
ii. what is observed in practice is a sample of n observations for (y, x1, x2, …, xk), where, in terms of observations, we write
y = β0 + β1·x1 + β2·x2 + … + βk·xk + noise
Here the term “noise” indicates the part of the variation in y (around its mean) not explained by the deterministic part of the regression (β0 + β1·x1 + β2·x2 + … + βk·xk)
iii. the OLS (ordinary least squares) method delivers estimates b0, b1, b2, …, bk for the true parameters β0, β1, β2, …, βk respectively
linear regression: coefficient estimation
► This is always the easiest step. Again, it is worth emphasizing that it makes no difference whether the independent variables are continuous, discrete (dummy) or slope dummies: the same OLS method is applied.
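► A minimal sketch in Stata (y, x1 and x2 are placeholder variable names, not from a specific data set):
regress y x1 x2
The same regress command estimates the model whether the independent variables are continuous, dummies or slope dummies; the output is the regression table with the estimates b0, b1, …, bk, their standard errors, t-values, p-values and 95% confidence intervals.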
linear regression: coefficient interpretation
► For a regression without slope dummy variables, all the coefficients have the same interpretation:
bi (the coefficient of independent variable xi) represents the change in y when xi changes by one unit, holding all other independent variables at a fixed level (no matter what level)
► Why does a regression with slope dummy variables deserve a different treatment at this step? Notice in the statement above that the interpretation of bi requires that only xi changes, all other independent variables being held fixed. For a simple slope dummy regression
Est.E[y] = b0 + b1·dummy + b2·x + b3·(dummy·x)
if we attempt to interpret b2 as the change in y when x changes by one unit holding all other independent variables fixed, we run into a problem when dummy = 1, because in this case dummy·x = x and therefore this variable will also change when x changes. This is the reason why for dummy regressions we have to pay extra attention to how we interpret the coefficients; however, in doing this what we end up doing is simply a grouping of coefficients (b0 and b0 + b1, and b2 and b2 + b3) that allows us to apply the “holding everything else constant” condition properly.
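► To see the grouping concretely, write the estimated regression separately for the two groups:
dummy = 0: Est.E[y] = b0 + b2·x
dummy = 1: Est.E[y] = (b0 + b1) + (b2 + b3)·x
so b2 is the change in Est.E[y] per unit change in x within the dummy = 0 group, while b2 + b3 is the corresponding change within the dummy = 1 group.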
linear regression: tests of results
► While it seems that each time you solve a problem there’s a new test you have to apply and remember, there’s really a standard “testing procedure” that is applied again and again. The impression that there is no standardized strategy to perform a test is probably due to the fact that, for lack of time, we sometimes skip over some of the steps involved in testing.
► The standard (and very general) steps in any testing procedure are given below:
• determine the function/combination of coefficients that is tested, call it f(b0, b1, …, bk)
• determine the benchmark B* with which f(b0, b1, …, bk) is compared
• set the null hypothesis H0 – gather negative evidence against this statement
• set the alternative hypothesis (contrary of the null) Ha
• calculate the t-test = [f(b0, b1, …, bk) – B*]/std.err[f(b0, b1, …, bk)]
• calculate the p-value
- if Ha: f(b0, b1, …, bk) > B* then p-value = Pr[t > t-test] (right tail)
- if Ha: f(b0, b1, …, bk) < B* then p-value = Pr[t < t-test] (left tail)
- if Ha: f(b0, b1, …, bk) ≠ B* then p-value = Pr[t < –|t-test| or t > +|t-test|] (two-tail)
• set the significance level α ∈ (0,1)
• compare the p-value with your chosen significance level α; if p-value < α then reject the null
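► A sketch of the calculation steps in Stata for a single coefficient, using the numbers from the pizzasales example later in these notes (B* = 0, n – k – 1 = 48):
display (2.697244 - 0)/0.2777973
display 2*ttail(48, (2.697244 - 0)/0.2777973)
The first line computes the t-test, the second the two-tail p-value; for the right-tail alternative the p-value would be ttail(48, t-test).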
► For the simple case when the test involves a single coefficient of the regression you can just use the estimate and standard error of
that coefficient from the regression table.
► When the test involves a combination of coefficients (as in the case of testing whether the average of y is above a certain benchmark b*) for particular, given values of the independent variables x1, x2, …, xk:
• f(b0, b1, …, bk) = b0 + b1·x1 + … + bk·xk and B* = b*
■ null hypothesis H0: b0 + b1·x1 + … + bk·xk ≤ b*
■ alternative hypothesis Ha: b0 + b1·x1 + … + bk·xk > b*
► You will use the klincom or kpredint commands. Choose the p-value that corresponds to your specific alternative hypothesis, set the significance level α ∈ (0,1) and compare the p-value with α; if p-value < α then reject the null.
linear regression: confidence intervals
► The central idea behind any confidence interval is the following:
• The true value of your variable of interest is not known, but you do have a sample based on which you obtain an estimate (for your variable of interest you might consider just a coefficient, or a combination of coefficients, or the value of the dependent variable for given levels of the independent variables – this is context driven).
• Based on the sample, determine two numbers, call them bL and bU, as lower and upper bounds, such that you can say that with a certain chosen probability 1 – α, the true value of your variable of interest is above bL and below bU. You call [bL, bU] a 1 – α confidence interval for your variable of interest. The bounds will depend on the chosen significance: the lower the α, the wider the confidence interval.
• To see that indeed all these steps are related to each other: the variable of interest is the function/combination of coefficients f(b0, b1, …, bk) and the general form of any confidence interval is
bL = estimated value of f(b0, b1, …, bk) – std.error[f(b0, b1, …, bk)]·tdf,α/2
bU = estimated value of f(b0, b1, …, bk) + std.error[f(b0, b1, …, bk)]·tdf,α/2
• Depending on how complicated the combination of coefficients is (see the previous discussion on tests), you can perform these calculations “manually”, i.e. you can calculate or obtain the standard error fairly simply, or you need to run the klincom or kpredint commands.
► What can you calculate yourself? You can always determine
• the estimate of the variable of interest
• tdf,α/2 = invttail(n – k – 1, α/2)
• The third term (the standard error) is given directly in the regression table if you calculate a confidence interval for a coefficient. Otherwise you have to use the klincom or kpredint commands.
linear regression: example
► A simple example: let’s use the pizzasales.dta file and run the regression
Sales = b0 + b1·Income
------------------------------------------------------------------------------
       Sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      Income |   2.697244   .2777973     9.71   0.000     2.138695    3.255793
       _cons |   48.72079   87.57192     0.56   0.581    -127.3544    224.7959
------------------------------------------------------------------------------
► The table already provides you with the 95% confidence intervals for b0 and b1. Let’s confirm the calculations for b1 (the sample consists of n = 50 observations):
- the estimated value: b1 = 2.697244
- the t-value: t48,0.025 = invttail(48, 0.025) = 2.0106348
- the standard error: std.err.(b1) = 0.2777973
► The 95% confidence interval bounds:
bL = b1 – tdf,α/2·std.error[b1] = 2.697244 – 0.2777973·2.010634 = 2.138695
bU = b1 + tdf,α/2·std.error[b1] = 2.697244 + 0.2777973·2.010634 = 3.255793
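► You can confirm these bounds directly in Stata:
display 2.697244 - invttail(48, 0.025)*0.2777973
display 2.697244 + invttail(48, 0.025)*0.2777973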
► Let’s continue with the pizzasales.dta file
Sales = b0 + b1·Income
------------------------------------------------------------------------------
       Sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      Income |   2.697244   .2777973     9.71   0.000     2.138695    3.255793
       _cons |   48.72079   87.57192     0.56   0.581    -127.3544    224.7959
------------------------------------------------------------------------------
Remark: One issue that sometimes raises questions is why not use the t = 9.71 from the table for the confidence interval. That t is the t-test calculated for the null hypothesis that b1 = 0. Let’s confirm that:
■ null hypothesis H0: b1 = 0
■ alternative hypothesis Ha: b1 ≠ 0
• calculate t-test = [b1 – 0]/std.err[b1] = [2.697244 – 0]/0.2777973 = 9.71
• calculate p-value = Pr[t < –|t-test| or t > +|t-test|] = 2·ttail(48, 9.71) ≈ 0
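► You can check the p-value directly in Stata:
display 2*ttail(48, 9.71)
which returns a value that is zero for all practical purposes.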
[figure: the t-distribution for the _cons row, with the two p-value tail areas of 0.2905 each and the two confidence-interval tail areas of 0.025 each marked]
► The reported t and p-value are related to the testing part: the two tail areas sum to exactly the p-value (for _cons, 0.2905 + 0.2905 = 0.581).
► The reported conf. interval is the confidence interval for α = 0.05 (or 95% confidence): the two tail areas of 0.025 each sum to exactly the confidence level 0.05 for the intervals (for _cons, the bounds are –127.3544 and 224.7959 around the estimate 48.72079).
------------------------------------------------------------------------------
       Sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      Income |   2.697244   .2777973     9.71   0.000     2.138695    3.255793
       _cons |   48.72079   87.57192     0.56   0.581    -127.3544    224.7959
------------------------------------------------------------------------------
linear regression: klincom and kpredint
► The confidence interval is the interval that contains the true mean of y for x = x0 with probability 1 – α. The confidence interval draws inference about E[y | x = x0] based on Est.E[y | x = x0]. The confidence interval’s form is:
Est.E[y | x = x0] – std.err.ci·tdf,α/2 ≤ E[y | x = x0] ≤ Est.E[y | x = x0] + std.err.ci·tdf,α/2
► The prediction interval is the interval that contains any level of y for x = x0 with probability 1 – α. The prediction interval draws inference about y | x = x0 based on Est.E[y | x = x0]. The prediction interval’s form is:
Est.E[y | x = x0] – std.err.pi·tdf,α/2 ≤ y | x = x0 ≤ Est.E[y | x = x0] + std.err.pi·tdf,α/2
► Understanding the difference between klincom and kpredint:
• klincom is used in the context of providing inference for the level of the true mean/average of the dependent variable
• kpredint is used in the context of providing inference for the level of an individual value of the dependent variable
► Both klincom and kpredint can be used to obtain an interval (set the benchmark equal to zero) or to perform tests comparing the average or level of the dependent variable with a given benchmark (in this case you will include the benchmark in the command).
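► For intuition, a sketch with base Stata analogues (klincom and kpredint are the course-provided wrappers; Income = 500 is a hypothetical level):
regress Sales Income
lincom _cons + 500*Income
predict sef, stdf
The lincom output gives the estimate, standard error and 95% interval for the true mean E[Sales | Income = 500]; the stdf option of predict stores, at each observation, the standard error of the forecast (std.err.pi), which includes the noise variance and is therefore larger than std.err.ci. This is why prediction intervals are always wider than confidence intervals at the same x0.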
linear regression: dummy variables
► Whenever a dummy variable and a continuous variable are involved, there are three possible regressions you can set up:
• y = β0 + β1·dummy
For this setup you are only interested in finding the mean of y for dummy = 1 vs. dummy = 0, without controlling for anything else. As an example: test whether the Sales for pizza are different for neighborhoods with competitors vs. neighborhoods without competition (test β1 = 0).
• y = β0 + β1·dummy + β2·x
For this setup you force the regression to assume that the dummy variable has only a level effect on y.
• y = β0 + β1·dummy + β2·x + β3·(dummy·x)
For this setup you allow the regression to pick up the slope effect; you assume that the dummy variable and the continuous variable might interact. Useful for testing whether the change in y vs. the change in x is different when dummy = 1 vs. dummy = 0 (test β3 = 0); a sketch of this setup follows below.
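► Setting up the third regression in Stata (y, x and dummy stand for your variable names):
generate slopedummy = dummy*x
regress y dummy x slopedummy
The t and P>|t| reported on the slopedummy row give directly the two-tail test of β3 = 0.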