Download Chap. 11: Multiple Regression

Document related concepts
no text concepts found
Transcript
Statistics for Business and
Economics
Chapter 11
Multiple Regression and Model
Building
Learning Objectives
1.
2.
3.
4.
5.
6.
7.
8.
Explain the Linear Multiple Regression Model
Describe Inference About Individual Parameters
Test Overall Significance
Explain Estimation and Prediction
Describe Various Types of Models
Describe Model Building
Explain Residual Analysis
Describe Regression Pitfalls
Types of
Regression Models
1 Explanatory
Variable
Regression
Models
2+ Explanatory
Variables
Multiple
Simple
Linear
NonLinear
Linear
NonLinear
Models With Two or More
Quantitative Variables
Types of
Regression Models
1 Explanatory
Variable
Regression
Models
2+ Explanatory
Variables
Multiple
Simple
Linear
NonLinear
Linear
NonLinear
Multiple Regression Model
• General form:
y  0  1 x1  2 x2 
 k xk  
• k independent variables
• x1, x2, …, xk may be functions of variables
— e.g. x2 = (x1)2
Regression Modeling
Steps
1. Hypothesize deterministic component
2. Estimate unknown model parameters
3. Specify probability distribution of random
error term
•
Estimate standard deviation of error
4. Evaluate model
5. Use model for prediction and estimation
Probability Distribution
of Random Error
Regression Modeling
Steps
1. Hypothesize deterministic component
2. Estimate unknown model parameters
3. Specify probability distribution of random
error term
•
Estimate standard deviation of error
4. Evaluate model
5. Use model for prediction and estimation
Assumptions for Probability
Distribution of ε
1.
2.
3.
4.
Mean is 0
Constant variance, σ2
Normally Distributed
Errors are independent
Linear Multiple Regression
Model
Types of
Regression Models
Explanatory
Variable
1
Quantitative
Variable
2 or More
Quantitative
Variables
1
Qualitative
Variable
1st
2nd
3rd
Order Order Order
Model Model Model
1st
Inter2nd
Order Action Order
Model Model Model
Dummy
Variable
Model
Regression Modeling
Steps
1. Hypothesize deterministic component
2. Estimate unknown model parameters
3. Specify probability distribution of random
error term
•
Estimate standard deviation of error
4. Evaluate model
5. Use model for prediction and estimation
First–Order Multiple
Regression Model
Relationship between 1 dependent and 2 or
more independent variables is a linear function
Population
Y-intercept
Population
slopes
y  0  1 x1  2 x2 
Dependent
(response)
variable
Random
error
 k xk  
Independent
(explanatory)
variables
First-Order Model With
2 Independent Variables
• Relationship between 1 dependent and 2
independent variables is a linear function
• Model
E ( y)  0  1 x1  2 x2
• Assumes no interaction between x1 and x2
— Effect of x1 on E(y) is the same regardless of
x2 values
Population Multiple
Regression Model
Bivariate model:
y
Response
Plane
yi  0  1 x1i  2 x2i   i
(Observed y)
0
i
x2
x1
(x1i , x2i)
E ( y)  0  1 x1i  2 x2i
Sample Multiple
Regression Model
Bivariate model:
y
Response
Plane
yi  ˆ0  ˆ1 x1i  ˆ2 x2i   i
(Observed y)
^0
^
i
x2
x1
(x1i , x2i)
yˆi  ˆ0  ˆ1 x1i  ˆ2 x2i
No Interaction
E(y) = 1 + 2x1 + 3x2
E(y)
E(y) = 1 + 2x1 + 3(3) = 10 + 2x1
12
E(y) = 1 + 2x1 + 3(2) = 7 + 2x1
8
E(y) = 1 + 2x1 + 3(1) = 4 + 2x1
4
E(y) = 1 + 2x1 + 3(0) = 1 + 2x1
0
x1
0
0.5
1
1.5
Effect (slope) of x1 on E(y) does not depend on x2 value
Parameter Estimation
Regression Modeling
Steps
1. Hypothesize Deterministic Component
2. Estimate Unknown Model Parameters
3. Specify Probability Distribution of Random
Error Term
•
Estimate Standard Deviation of Error
4. Evaluate Model
5. Use Model for Prediction & Estimation
First-Order Model
Worksheet
Case, i
yi
x1i
x2i
1
2
3
4
:
1
4
1
3
:
1
8
3
5
:
3
5
2
6
:
Run regression with y, x1, x2
Multiple Linear
Regression Equations
Too
complicated
by hand!
Ouch!
Interpretation of Estimated
Coefficients
^
1. Slope (k)
• Estimated y changes by ^k for each 1 unit
increase in xk holding all other variables
constant
– Example: if ^ = 2, then sales (y) is expected to
1
increase by 2 for each 1 unit increase in advertising
(x1) given the number of sales rep’s (x2)
2. Y-Intercept (^0)
• Average value of y when xk = 0
1st Order Model Example
You work in advertising for
the New York Times. You
want to find the effect of ad
size (sq. in.) and newspaper
circulation (000) on the
number of ad responses (00).
Estimate the unknown
parameters.
You’ve collected the
following data:
(y)
(x1) (x2)
Resp Size Circ
1
1
2
4
8
8
1
3
1
3
5
7
2
6
4
4
10
6
Parameter Estimation
Computer Output
^0
Parameter Estimates
Parameter Standard T for H0:
Variable DF Estimate
Error Param=0 Prob>|T|
INTERCEP 1
0.0640
0.2599 0.246
0.8214
ADSIZE
1
0.2049
0.0588 3.656
0.0399
CIRC
1
0.2805
0.0686 4.089
0.0264
^1
^2
yˆ  .0640  .2049x1  .2805x2
Interpretation of Coefficients
Solution
^
1. Slope (1)
• Number of responses to ad is expected to
increase by .2049 (20.49) for each 1 sq. in.
increase in ad size holding circulation constant
^
2. Slope (2)
• Number of responses to ad is expected to increase
by .2805 (28.05) for each 1 unit (1,000) increase in
circulation holding ad size constant
Estimation of σ2
Regression Modeling
Steps
1. Hypothesize Deterministic Component
2. Estimate Unknown Model Parameters
3. Specify Probability Distribution of Random
Error Term
•
Estimate Standard Deviation of Error
4. Evaluate Model
5. Use Model for Prediction & Estimation
Estimation of σ2
For a model with k independent variables
SSE
s 
n  (k  1)
where SSE    yi  yˆi 
2
SSE
s s 
n  (k  1)
2
2
Calculating s2 and s
Example
You work in advertising for the
New York Times. You want to
find the effect of ad size (sq.
in.), x1, and newspaper
circulation (000), x2, on the
number of ad responses (00), y.
Find SSE, s2, and s.
Analysis of Variance
Computer Output
Analysis of Variance
Source
DF
Regression
2
Residual Error 3
Total
5
SS
9.249736
.250264
9.5
MS
4.624868
.083421
SSE
S2
.250264
s 
 .083421
63
2
s  .083421  .2888
F
55.44
P
.0043
Evaluating the Model
Regression Modeling
Steps
1. Hypothesize Deterministic Component
2. Estimate Unknown Model Parameters
3. Specify Probability Distribution of Random
Error Term
•
Estimate Standard Deviation of Error
4. Evaluate Model
5. Use Model for Prediction & Estimation
Evaluating Multiple
Regression Model Steps
1. Examine variation measures
2. Test parameter significance
•
•
Individual coefficients
Overall model
3. Do residual analysis
Variation Measures
Evaluating Multiple
Regression Model Steps
1. Examine variation measures
2. Test parameter significance
•
•
Individual coefficients
Overall model
3. Do residual analysis
Multiple Coefficient of
Determination
•
Proportion of variation in y ‘explained’ by all
x variables taken together
Explained Variation
SSE
• R =
= 1Total Variation
SSyy
2
•
Never decreases when new x variable is
added to model
— Only y values determine SSyy
— Disadvantage when comparing models
Adjusted Multiple Coefficient
of Determination
•
•
Takes into account n and number of
parameters
Similar interpretation to R2
R
2
a
 n-1 
2
= 1- 
1 R 


 n-(k+1) 
Estimation of R2 and Ra2
Example
You work in advertising for the
New York Times. You want to
find the effect of ad size (sq. in.),
x1, and newspaper circulation
(000), x2, on the number of ad
responses (00), y. Find R2 and
Ra2.
Excel Computer Output
Solution
R2
Ra2
Testing Parameters
Evaluating Multiple
Regression Model Steps
1. Examine variation measures
2. Test parameter significance
•
•
Individual coefficients
Overall model
3. Do residual analysis
Inference for an Individual β
Parameter
• Confidence Interval
ˆi  t 2 sˆ
i
• Hypothesis Test
Ho: βi = 0
Ha: βi ≠ 0 (or < or > )
• Test Statistic
ˆi
t
sˆ
i
df = n – (k + 1)
Confidence Interval
Example
You work in advertising for the
New York Times. You want to
find the effect of ad size (sq. in.),
x1, and newspaper circulation
(000), x2, on the number of ad
responses (00), y. Find a 95%
confidence interval for β1.
Excel Computer Output
Solution
ˆ1
sˆ
1
Confidence Interval
Solution
.204921  3.182(.058822)
.0177  1  .3921
Hypothesis Test Example
You work in advertising for the
New York Times. You want to
find the effect of ad size (sq. in.),
x1, and newspaper circulation
(000), x2, on the number of ad
responses (00), y. Test the
hypothesis that the mean ad
response increases as
circulation increases (ad size
constant). Use α = .05.
Hypothesis Test
Solution
•
•
•
•
•
H0: 2 = 0
Ha: 2 > 0
  .05
df  6 - 3 = 3
Critical Value(s):
Test Statistic:
Decision:
Reject H0
.05
0 2.353
Conclusion:
t
Excel Computer Output
Solution
ˆ2
sˆ
2
Hypothesis Test
Solution
•
•
•
•
•
H0: 2 = 0
Ha: 2 > 0
  .05
df  6 - 3 = 3
Critical Value(s):
Reject H0
.05
0 2.353
t
Test Statistic:
ˆ2 .280492
t

 4.089
S ˆ .068602
2
Decision:
Reject at  = .05
Conclusion:
There is evidence the mean
ad response increases as
circulation increases
Excel Computer Output
Solution
ˆ2
t
sˆ
2
P–Value
Evaluating Multiple
Regression Model Steps
1. Examine variation measures
2. Test parameter significance
•
•
Individual coefficients
Overall model
3. Do residual analysis
Testing Overall Significance
•
Shows if there is a linear relationship
between all x variables together and y
•
Hypotheses
— H0: 1 = 2 = ... = k = 0
 No linear relationship
— Ha: At least one coefficient is not 0
 At least one x variable affects y
Testing Overall Significance
•
Test Statistic
MS ( Model )
F
MS ( Error )
•
Degrees of Freedom
1 = k
2 = n – (k + 1)
— k = Number of independent variables
— n = Sample size
Testing Overall
Significance Example
You work in advertising for the
New York Times. You want to
find the effect of ad size (sq. in.),
x1, and newspaper circulation
(000), x2, on the number of ad
responses (00), y. Conduct the
global F–test of model
usefulness. Use α = .05.
Testing Overall Significance
Solution
H0: β1 = β2 = 0
Ha: At least 1 not zero
 = .05
1 = 2 2 = 3
Critical Value(s):
•
•
•
•
•
Test Statistic:
Decision:
 = .05
Conclusion:
0
9.55
F
Testing Overall Significance
Computer Output
k
Source DF
Model
2
Error
3
C Total 5
Analysis of Variance
Sum of
Squares
9.2497
0.2503
9.5000
n – (k + 1)
Mean
Square
4.6249
0.0834
F Value
55.440
Prob>F
0.0043
MS(Model)
MS(Error)
Testing Overall Significance
Solution
H0: β1 = β2 = 0
Ha: At least 1 not zero
 = .05
1 = 2 2 = 3
Critical Value(s):
•
•
•
•
•
 = .05
0
9.55
F
Test Statistic:
4.6249
F
 55.44
.0834
Decision:
Reject at  = .05
Conclusion:
There is evidence at least 1
of the coefficients is not zero
Testing Overall Significance
Computer Output Solution
Analysis of Variance
Source DF
Model
2
Error
3
C Total 5
Sum of
Squares
9.2497
0.2503
9.5000
Mean
Square
4.6249
0.0834
F Value
55.440
Prob>F
0.0043
MS(Model)
MS(Error)
P-Value
Interaction Models
Types of
Regression Models
Explanatory
Variable
1
Quantitative
Variable
2 or More
Quantitative
Variables
1
Qualitative
Variable
1st
2nd
3rd
Order Order Order
Model Model Model
1st
Inter2nd
Order Action Order
Model Model Model
Dummy
Variable
Model
Interaction Model With
2 Independent Variables
•
Hypothesizes interaction between pairs of x
variables
— Response to one x variable varies at different
levels of another x variable
• Contains two-way cross product terms
E ( y)  0  1 x1  2 x2  3 x1 x2
• Can be combined with other models
— Example: dummy-variable model
Effect of Interaction
Given:
E ( y)  0  1 x1  2 x2  3 x1 x2
• Without interaction term, effect of x1 on y is
measured by 1
• With interaction term, effect of x1 on y is
measured by 1 + 3x2
— Effect increases as x2 increases
Interaction Model Relationships
E(y) = 1 + 2x1 + 3x2 + 4x1x2
E(y)
E(y) = 1 + 2x1 + 3(1) + 4x1(1) = 4 + 6x1
12
8
E(y) = 1 + 2x1 + 3(0) + 4x1(0) = 1 + 2x1
4
x1
0
0
0.5
1
1.5
Effect (slope) of x1 on E(y) depends on x2 value
Interaction Model Worksheet
Case, i
yi
x1i
x2i
1
2
3
4
:
1
4
1
3
:
1
8
3
5
:
3
5
2
6
:
x1i x2i
3
40
6
30
:
Multiply x1 by x2 to get x1x2.
Run regression with y, x1, x2 , x1x2
Interaction Example
You work in advertising for the
New York Times. You want to
find the effect of ad size (sq. in.),
x1, and newspaper circulation
(000), x2, on the number of ad
responses (00), y. Conduct a test
for interaction. Use α = .05.
Interaction Model Worksheet
yi
x1i
x2i
1
4
1
3
2
4
1
8
3
5
6
10
2
8
1
7
4
6
x1i x2i
2
64
3
35
24
60
Multiply x1 by x2 to get x1x2.
Run regression with y, x1, x2 , x1x2
Excel Computer Output
Solution
Global F–test indicates at least one parameter is not zero
F
P-Value
Interaction Test
Solution
•
•
•
•
•
H0: 3 = 0
Ha: 3 ≠ 0
  .05
df  6 - 2 = 4
Critical Value(s):
Reject H0
.025
-2.776
Test Statistic:
Decision:
Reject H0
.025
0 2.776
Conclusion:
t
Excel Computer Output
Solution
ˆ3
t
sˆ
3
Interaction Test
Solution
•
•
•
•
•
H0: 3 = 0
Ha: 3 ≠ 0
  .05
df  6 - 2 = 4
Critical Value(s):
Reject H0
.025
-2.776
Reject H0
.025
0 2.776
t
Test Statistic:
t = 1.8528
Decision:
Do no reject at  = .05
Conclusion:
There is no evidence of
interaction
Second–Order Models
Types of
Regression Models
Explanatory
Variable
1
Quantitative
Variable
2 or More
Quantitative
Variables
1
Qualitative
Variable
1st
2nd
3rd
Order Order Order
Model Model Model
1st
Inter2nd
Order Action Order
Model Model Model
Dummy
Variable
Model
Second-Order Model With
1 Independent Variable
•
•
•
Relationship between 1 dependent and 1
independent variable is a quadratic function
Useful 1st model if non-linear relationship
suspected
Curvilinear
effect
Model
E ( y )  0  1 x   2 x
Linear effect
2
Second-Order Model
Relationships
y
2 > 0
y
2 > 0
x1
y
2 < 0
x1
y
x1
2 < 0
x1
Second-Order Model
Worksheet
Case, i
yi
xi
2
xi
1
2
3
4
:
1
4
1
3
:
1
8
3
5
:
1
64
9
25
:
Create x2 column.
Run regression with y, x, x2.
2nd Order Model Example
The data shows the number of
weeks employed and the number
of errors made per day for a
sample of assembly line
workers. Find a 2nd order model,
conduct the global F–test, and
test if β2 ≠ 0. Use α = .05 for all
tests.
Errors (y) Weeks (x)
20
1
18
1
16
2
10
4
8
4
4
5
3
6
1
8
2
10
1
11
0
12
1
12
Second-Order Model
Worksheet
yi
xi
2
xi
20
1
1
18
1
1
16
2
4
10
4
16
:
:
:
Create x2 column.
Run regression with y, x, x2.
Excel Computer Output
Solution
yˆ  23.728  4.784 x  .242 x 2
Overall Model Test Solution
Global F–test indicates at least one parameter is not zero
F
P-Value
β2 Parameter Test Solution
β2 test indicates curvilinear relationship exists
t
P-Value
Types of
Regression Models
Explanatory
Variable
1
Quantitative
Variable
2 or More
Quantitative
Variables
1
Qualitative
Variable
1st
2nd
3rd
Order Order Order
Model Model Model
1st
Inter2nd
Order Action Order
Model Model Model
Dummy
Variable
Model
Second-Order Model With
2 Independent Variables
•
•
•
Relationship between 1 dependent and 2
independent variables is a quadratic
function
Useful 1st model if non-linear relationship
suspected
Model
E ( y)   0  1 x1i   2 x2i   3 x1i x2i
 x   x
2
4 1i
2
5 2i
Second-Order Model
Relationships
y
4 + 5 > 0
x2
x1
y
 32 > 4  4  5
x2
x1
4 + 5 < 0
y
x2
x1
E ( y )   0  1 x1i   2 x2i
  3 x1i x2i   x
2
4 1i
 x
2
5 2i
Second-Order Model
Worksheet
Case, i
yi
x1i
x2i
x1ix2i
2
x1i
1
2
3
4
:
1
4
1
3
:
1
8
3
5
:
3
5
2
6
:
3
40
6
30
:
1
64
9
25
:
2
x2i
9
25
4
36
:
Multiply x1 by x2 to get x1x2; then create x12, x22.
Run regression with y, x1, x2 , x1x2, x12, x22.
Models With One Qualitative
Independent Variable
Types of
Regression Models
Explanatory
Variable
1
Quantitative
Variable
2 or More
Quantitative
Variables
1
Qualitative
Variable
1st
2nd
3rd
Order Order Order
Model Model Model
1st
Inter2nd
Order Action Order
Model Model Model
Dummy
Variable
Model
Dummy-Variable Model
•
Involves categorical x variable with 2 levels
— e.g., male-female; college-no college
•
•
•
Variable levels coded 0 and 1
Number of dummy variables is 1 less than
number of levels of variable
May be combined with quantitative variable
(1st order or 2nd order model)
Dummy-Variable Model
Worksheet
Case, i
yi
x1i
x2i
1
2
3
4
:
1
4
1
3
:
1
8
3
5
:
1
0
1
1
:
x2 levels: 0 = Group 1; 1 = Group 2.
Run regression with y, x1, x2
Interpreting DummyVariable Model Equation
Given:
yˆi  ˆ0  ˆ1 x1i  ˆ2 x2i
y = Starting salary of college graduates
x1 = GPA
0 if Male
x2 =
1 if Female
Same slopes
Male ( x2 = 0 ):
yˆi  ˆ0  ˆ1 x1i  ˆ2 (0)  ˆ0  ˆ1 x1i
Female ( x2 = 1 ):


yˆi  ˆ0  ˆ1 x1i  ˆ2 (1)  ˆ0  ˆ2  ˆ1 x1i
Dummy-Variable Model
Example
Computer Output: yˆi  3  5 x1i  7 x2i
0 if Male
x2 =
1 if Female
Same slopes
Male ( x2 = 0 ):
yˆi  3  51 x1i  7(0)  3  5 x1i
Female ( x2 = 1 ):
yˆi  3  5x1i  7(1)   3  7   5x1i
Dummy-Variable Model
Relationships
y
Same Slopes ^
1
Female
^ + ^
0 2
Male
^
0
x1
0
0
Nested Models
Comparing Nested Models
•
Contains a subset of terms in the complete (full)
model
•
Tests the contribution of a set of x variables to the
relationship with y
•
Null hypothesis H0: g+1 = ... = k = 0
— Variables in set do not improve significantly
the model when all other variables are
included
•
Used in selecting x variables or models
•
Part of most computer programs
Selecting Variables
in Model Building
Selecting Variables in Model
Building
A butterfly flaps its wings in Japan, which causes it
to rain in Nebraska. -- Anonymous
Use Theory Only!
Use Computer Search!
Model Building with
Computer Searches
• Rule: Use as few x variables as possible
• Stepwise Regression
— Computer selects x variable most highly correlated
with y
— Continues to add or remove variables depending on SSE
• Best subset approach
— Computer examines all possible sets
Residual Analysis
Evaluating Multiple
Regression Model Steps
1. Examine variation measures
2. Test parameter significance
•
•
Individual coefficients
Overall model
3. Do residual analysis
Residual Analysis
• Graphical analysis of residuals
— Plot estimated errors versus xi values
 Difference between actual yi and predicted yi
 Estimated errors are called residuals
— Plot histogram or stem-&-leaf of residuals
• Purposes
— Examine functional form (linear v. non-linear
model)
— Evaluate violations of assumptions
Residual Plot
for Functional Form
Add x2 Term
Correct Specification
^
e
^e
x
x
Residual Plot
for Equal Variance
Unequal Variance
Correct Specification
^
e
^
e
x
Fan-shaped.
Standardized residuals used typically.
x
Residual Plot
for Independence
Not Independent
Correct Specification
^
e
^
e
x
Plots reflect sequence data were collected.
x
Residual Analysis
Computer Output
Dep Var Predict
Student
Obs SALES
Value Residual Residual -2-1-0 1 2
1 1.0000 0.6000
0.4000
1.044 |
|**
2 1.0000 1.3000 -0.3000
-0.592 |
*|
3 2.0000 2.0000
0
0.000 |
|
4 2.0000 2.7000 -0.7000
-1.382 |
**|
5 4.0000 3.4000
0.6000
1.567 |
|***
Plot of standardized
(student) residuals
|
|
|
|
|
Regression Pitfalls
Regression Pitfalls
•
Parameter Estimability
— Number of different x–values must be at least
one more than order of model
•
Multicollinearity
— Two or more x–variables in the model are
correlated
•
Extrapolation
— Predicting y–values outside sampled range
•
Correlated Errors
Multicollinearity
•
•
•
•
•
High correlation between x variables
Coefficients measure combined effect
Leads to unstable coefficients depending on
x variables in model
Always exists – matter of degree
Example: using both age and height as
explanatory variables in same model
Detecting Multicollinearity
•
•
•
Significant correlations between pairs of x
variables are more than with y variable
Non–significant t–tests for most of the
individual parameters, but overall model test
is significant
Estimated parameters have wrong sign
Solutions to Multicollinearity
•
•
•
Eliminate one or more of the correlated x
variables
Avoid inference on individual parameters
Do not extrapolate
Extrapolation
y
Interpolation
Extrapolation
Extrapolation
Sampled Range
x
Conclusion
1.
2.
3.
4.
5.
6.
7.
8.
Explained the Linear Multiple Regression Model
Described Inference About Individual Parameters
Tested Overall Significance
Explained Estimation and Prediction
Described Various Types of Models
Described Model Building
Explained Residual Analysis
Described Regression Pitfalls
Related documents