Applied Regression
By Nitiphong Songsrirote, Ph.D.
(www.kaset51.com)
Simple Linear Regression
Learning Objectives

1. Describe the Linear Regression Model

2. State the Regression Modeling Steps

3. Explain Ordinary Least Squares
   Understand and Check Model Assumptions

4. Compute Regression Coefficients

5. Predict Response Variable

6. Interpret Computer Output
Models

1. Representation of Some Phenomenon
2. Mathematical Model Is a Mathematical Expression of Some Phenomenon
3. Often Describe Relationships Between Variables
4. Types
   Deterministic Models
   Probabilistic Models
Deterministic Models

1. Hypothesize Exact Relationships
2. Suitable When Prediction Error Is Negligible
3. Example: Force Is Exactly Mass Times Acceleration
   F = m·a
Probabilistic Models

1. Hypothesize 2 Components
   Deterministic
   Random Error
2. Example: Sales Volume Is 10 Times Advertising Spending + Random Error
   \( Y = 10X + \varepsilon \)
   Random Error May Be Due to Factors Other Than Advertising
Types of Probabilistic Models
Probabilistic Models
   Regression Models
   Correlation Models
   Other Models
Regression Models

1. Answer ‘What Is the Relationship Between the Variables?’
2. Equation Used
   1 Numerical Dependent (Response) Variable
      What Is to Be Predicted
   1 or More Numerical or Categorical Independent (Explanatory) Variables
3. Used Mainly for Prediction & Estimation
Regression Modeling Steps

1. Hypothesize Deterministic Component

2. Estimate Unknown Model Parameters

3. Specify Probability Distribution of
Random Error Term

Estimate Standard Deviation of Error

4. Evaluate Model

5. Use Model for Prediction & Estimation
Model Specification
Specifying the Model


1. Define Variables
2. Hypothesize Nature of Relationship



Expected Effects (i.e., Coefficients’ Signs)
Functional Form (Linear or Non-Linear)
Interactions
Model Specification Is Based on Theory

1. Theory of Field (e.g., Management)
2. Mathematical Theory
3. Previous Research
4. ‘Common Sense’
Thinking Challenge: Which Is More Logical?
[Figure: four candidate plots of Sales against Advertising]
Types of Regression Models
Regression Models
   Simple (1 Explanatory Variable)
      Linear
      NonLinear
   Multiple (2+ Explanatory Variables)
      Linear
      NonLinear
Linear Regression Model
Linear Equations
Y = mX + b
m = Slope = (Change in Y) / (Change in X)
b = Y-intercept
[Figure: straight line on X-Y axes]
Linear Regression Model

1. Relationship Between Variables Is a Linear Function

\( Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \)

\( Y_i \): Dependent (Response) Variable (e.g., income)
\( X_i \): Independent (Explanatory) Variable (e.g., education)
\( \beta_0 \): Population Y-Intercept
\( \beta_1 \): Population Slope
\( \varepsilon_i \): Random Error
Population & Sample Regression Models
Population: Unknown Relationship \( Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \)
Random Sample: \( Y_i = \hat\beta_0 + \hat\beta_1 X_i + \hat\varepsilon_i \)
Population Linear Regression Model
\( Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \) (Observed value)
\( \varepsilon_i \) = Random error
\( E(Y) = \beta_0 + \beta_1 X_i \)
[Figure: observed values scattered around the population regression line on X-Y axes]
Sample Linear Regression Model
\( Y_i = \hat\beta_0 + \hat\beta_1 X_i + \hat\varepsilon_i \) (Observed value)
\( \hat\varepsilon_i \) = Random error
\( \hat Y_i = \hat\beta_0 + \hat\beta_1 X_i \)
[Figure: observed values and an unsampled observation around the sample regression line on X-Y axes]
Estimating Parameters:
Least Squares Method
Scatter Diagram


1. Plot of All (Xi, Yi) Pairs
2. Suggests How Well Model Will Fit
[Figure: scatter plot of Y against X; both axes run from 0 to 60]
Thinking Challenge
How would you draw a line through the points? How do you determine which line ‘fits best’?
[Figure: the same scatter plot, repeated over several slides]
Least Squares

1. ‘Best Fit’ Means Difference Between Actual Y Values & Predicted Y Values Are a Minimum
   But Positive Differences Off-Set Negative

   \( \sum_{i=1}^{n} (Y_i - \hat Y_i)^2 = \sum_{i=1}^{n} \hat\varepsilon_i^2 \)

2. LS Minimizes the Sum of the Squared Differences (SSE)
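As a minimal sketch of the least squares criterion, the Python snippet below compares SSE for two candidate lines. The data are the five advertising/sales pairs from the Hasbro example in this deck; the function name `sse` and the "by eye" line with intercept 0 and slope 0.6 are illustrative assumptions, not part of the original slides.

```python
# Advertising/sales pairs from the Hasbro example in this deck.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

def sse(b0, b1, xs, ys):
    """Sum of squared differences between actual Y and predicted Y = b0 + b1*X."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(xs, ys))

# A hypothetical line drawn "by eye" versus the least squares line from the deck.
sse_guess = sse(0.0, 0.6, x, y)
sse_ls = sse(-0.1, 0.7, x, y)
print(round(sse_guess, 2), round(sse_ls, 2))
```

Any line other than the least squares line gives a larger SSE on the same data, which is what 'best fit' means here.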
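Least Squares Graphically
LS minimizes \( \sum_{i=1}^{n} \hat\varepsilon_i^2 = \hat\varepsilon_1^2 + \hat\varepsilon_2^2 + \hat\varepsilon_3^2 + \hat\varepsilon_4^2 \)
\( Y_2 = \hat\beta_0 + \hat\beta_1 X_2 + \hat\varepsilon_2 \)
\( \hat Y_i = \hat\beta_0 + \hat\beta_1 X_i \)
[Figure: four observations and their residuals \( \hat\varepsilon_1 \) to \( \hat\varepsilon_4 \) around the fitted line on X-Y axes]
Coefficient Equations
Prediction Equation
\( \hat y_i = \hat\beta_0 + \hat\beta_1 x_i \)
Sample Slope
\( \hat\beta_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2} \)
Sample Y-intercept
\( \hat\beta_0 = \bar y - \hat\beta_1 \bar x \)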
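The coefficient equations can be sketched in a few lines of Python. This is an illustration only: the function name `least_squares` is an assumption, and the data passed to it are the Hasbro advertising/sales pairs used later in this deck.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the line y = b0 + b1*x using b1 = SSxy / SSxx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    ss_xx = sum((xi - x_bar) ** 2 for xi in xs)
    b1 = ss_xy / ss_xx          # sample slope
    b0 = y_bar - b1 * x_bar     # sample Y-intercept
    return b0, b1

b0, b1 = least_squares([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
print(round(b0, 4), round(b1, 4))
```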
Computation Table
Xi      Yi      Xi^2      Yi^2      XiYi
X1      Y1      X1^2      Y1^2      X1Y1
X2      Y2      X2^2      Y2^2      X2Y2
:       :       :         :         :
Xn      Yn      Xn^2      Yn^2      XnYn
ΣXi     ΣYi     ΣXi^2     ΣYi^2     ΣXiYi
Interpretation of Coefficients

1. Slope (\( \hat\beta_1 \))
   Estimated Y Changes by \( \hat\beta_1 \) for Each 1 Unit Increase in X
   If \( \hat\beta_1 = 2 \), then Sales (Y) Is Expected to Increase by 2 for Each 1 Unit Increase in Advertising (X)

2. Y-Intercept (\( \hat\beta_0 \))
   Average Value of Y When X = 0
   If \( \hat\beta_0 = 4 \), then Average Sales (Y) Is Expected to Be 4 When Advertising (X) Is 0
Parameter Estimation Example



You’re a marketing analyst for Hasbro
Toys. You gather the following data:
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
What is the relationship between sales & advertising?
Scatter Diagram
Sales vs. Advertising
[Figure: scatter plot of Sales (0 to 4) against Advertising (0 to 5)]
Parameter Estimation Solution Table
        Xi      Yi      Xi^2    Yi^2    XiYi
        1       1       1       1       1
        2       1       4       1       2
        3       2       9       4       6
        4       2       16      4       8
        5       4       25      16      20
Total   15      10      55      26      37
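Parameter Estimation Solution

\( \hat\beta_1 = \frac{\sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n}}{\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}} = \frac{37 - \frac{(15)(10)}{5}}{55 - \frac{(15)^2}{5}} = 0.70 \)

\( \hat\beta_0 = \bar Y - \hat\beta_1 \bar X = 2 - (0.70)(3) = -0.10 \)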
Coefficient Interpretation Solution

1. Slope (\( \hat\beta_1 \))
   Sales Volume (Y) Is Expected to Increase by .7 Units for Each $1 Increase in Advertising (X)

2. Y-Intercept (\( \hat\beta_0 \))
   Average Value of Sales Volume (Y) Is -.10 Units When Advertising (X) Is 0
   Difficult to Explain to Marketing Manager
   Expect Some Sales Without Advertising
Parameter Estimation Computer Output

Parameter Estimates
Variable    DF    Parameter Estimate    Standard Error    T for H0: Param=0    Prob>|T|
INTERCEP    1     -0.1000               0.6350            -0.157               0.8849
ADVERT      1      0.7000               0.1914             3.656               0.0354

Parameter Estimate column = \( \hat\beta_k \): INTERCEP is \( \hat\beta_0 \), ADVERT is \( \hat\beta_1 \)
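Derivation of Parameter Equations

Goal: Minimize squared error

\( \sum \hat\varepsilon_i^2 = \sum (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 \)

\( 0 = \frac{\partial \sum \hat\varepsilon_i^2}{\partial \hat\beta_0} = \frac{\partial \sum (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}{\partial \hat\beta_0} \)
\( = -2 \sum (y_i - \hat\beta_0 - \hat\beta_1 x_i) \)
\( = -2 (n\bar y - n\hat\beta_0 - n\hat\beta_1 \bar x) \)
\( \hat\beta_0 = \bar y - \hat\beta_1 \bar x \)

\( 0 = \frac{\partial \sum \hat\varepsilon_i^2}{\partial \hat\beta_1} = \frac{\partial \sum (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}{\partial \hat\beta_1} \)
\( = -2 \sum x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) \)
\( = -2 \sum x_i (y_i - \bar y + \hat\beta_1 \bar x - \hat\beta_1 x_i) \)
\( \hat\beta_1 \sum x_i (x_i - \bar x) = \sum x_i (y_i - \bar y) \)
\( \hat\beta_1 \sum (x_i - \bar x)(x_i - \bar x) = \sum (x_i - \bar x)(y_i - \bar y) \)
\( \hat\beta_1 = \frac{SS_{xy}}{SS_{xx}} \)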
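As a quick numerical check of the derivation, the sketch below evaluates both partial derivatives of the squared error at the least squares estimates for the Hasbro data (b0 = -0.1, b1 = 0.7, as reported in this deck). The variable names are illustrative assumptions; both derivatives should be zero up to floating-point rounding.

```python
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
b0, b1 = -0.1, 0.7  # least squares estimates reported in the deck

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Partial derivative with respect to b0: -2 * sum of residuals
grad_b0 = -2 * sum(residuals)
# Partial derivative with respect to b1: -2 * sum of x_i * residual_i
grad_b1 = -2 * sum(xi * ri for xi, ri in zip(x, residuals))

print(abs(grad_b0) < 1e-9, abs(grad_b1) < 1e-9)
```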
Parameter Estimation Thinking Challenge



You’re an economist for the county cooperative.
You gather the following data:
Fertilizer (lb.)    Yield (lb.)
4                   3.0
6                   5.5
10                  6.5
12                  9.0
What is the relationship between fertilizer & crop yield?
Scatter Diagram
Crop Yield vs. Fertilizer*
[Figure: scatter plot of Yield (lb.), 0 to 10, against Fertilizer (lb.), 0 to 15]
Parameter Estimation Solution Table*
        Xi      Yi      Xi^2    Yi^2      XiYi
        4       3.0     16      9.00      12
        6       5.5     36      30.25     33
        10      6.5     100     42.25     65
        12      9.0     144     81.00     108
Total   32      24.0    296     162.50    218
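Parameter Estimation Solution*

\( \hat\beta_1 = \frac{\sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n}}{\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}} = \frac{218 - \frac{(32)(24)}{4}}{296 - \frac{(32)^2}{4}} = 0.65 \)

\( \hat\beta_0 = \bar Y - \hat\beta_1 \bar X = 6 - (0.65)(8) = 0.80 \)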
Coefficient Interpretation Solution*

1. Slope (\( \hat\beta_1 \))
   Crop Yield (Y) Is Expected to Increase by .65 lb. for Each 1 lb. Increase in Fertilizer (X)

2. Y-Intercept (\( \hat\beta_0 \))
   Average Crop Yield (Y) Is Expected to Be 0.8 lb. When No Fertilizer (X) Is Used
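The fertilizer estimates can be reproduced from the column totals with the computational (sums) form of the slope formula. The sketch below is an illustration under that assumption; the variable names are hypothetical and the data are the four fertilizer/yield pairs from the thinking challenge.

```python
fertilizer = [4, 6, 10, 12]        # X, lb.
crop_yield = [3.0, 5.5, 6.5, 9.0]  # Y, lb.
n = len(fertilizer)

sum_x = sum(fertilizer)
sum_y = sum(crop_yield)
sum_x2 = sum(xi ** 2 for xi in fertilizer)
sum_xy = sum(xi * yi for xi, yi in zip(fertilizer, crop_yield))

# Slope from the sums; intercept from the means.
b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
b0 = sum_y / n - b1 * sum_x / n
print(round(b0, 2), round(b1, 2))
```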
Linear Regression Assumptions




1. Mean of Probability Distribution of Error Is 0
2. Probability Distribution of Error Has Constant Variance
3. Probability Distribution of Error Is Normal
4. Errors Are Independent
Error Probability Distribution
[Figure: distribution of the error, \( f(\varepsilon) \), drawn about the regression line at X1 and X2 on X-Y axes]
Random Error Variation

1. Variation of Actual Y from Predicted Y
2. Measured by Standard Error of Regression Model
   Sample Standard Deviation of \( \hat\varepsilon \), \( s_{\hat\varepsilon} \)
3. RV Affects Several Factors
   Parameter Significance
   Prediction Accuracy
Evaluating the Model
Testing for Significance
Test of Slope Coefficient

1. Shows If There Is a Linear Relationship Between X & Y

2. Involves Population Slope \( \beta_1 \)

3. Hypotheses
   H0: \( \beta_1 = 0 \) (No Linear Relationship)
   Ha: \( \beta_1 \ne 0 \) (Linear Relationship)

4. Theoretical Basis Is Sampling Distribution of Slope
Sampling Distribution of Sample Slopes
All Possible Sample Slopes
   Sample 1: 2.5
   Sample 2: 1.6
   Sample 3: 1.8
   Sample 4: 2.1
   :
   Very large number of sample slopes
[Figure: Sample 1 line, Sample 2 line and population line on X-Y axes; sampling distribution of \( \hat\beta_1 \) centered at \( \beta_1 \) with standard error \( S_{\hat\beta_1} \)]
Slope Coefficient Test Statistic
\( t_{n-2} = \frac{\hat\beta_1 - \beta_1}{S_{\hat\beta_1}} \)
where
\( S_{\hat\beta_1} = \frac{S}{\sqrt{\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}} \)
Test of Slope Coefficient Example



You’re a marketing analyst for Hasbro
Toys. You find b0 = -.1, b1 = .7 & s =
.60553.
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
Is the relationship significant at the .05 level?
Solution Table
        Xi      Yi      Xi^2    Yi^2    XiYi
        1       1       1       1       1
        2       1       4       1       2
        3       2       9       4       6
        4       2       16      4       8
        5       4       25      16      20
Total   15      10      55      26      37
Test of Slope Parameter Solution
H0: \( \beta_1 = 0 \)
Ha: \( \beta_1 \ne 0 \)
\( \alpha = .05 \)
df = 5 - 2 = 3
Critical Value(s): t = ±3.1824 (.025 in each tail; reject H0 beyond either value)
Test Statistic:
\( t = \frac{\hat\beta_1 - \beta_1}{S_{\hat\beta_1}} = \frac{0.70 - 0}{0.1915} = 3.656 \)
Decision: Reject at \( \alpha = .05 \)
Conclusion: There is evidence of a relationship
Test Statistic Solution
\( t_{n-2} = \frac{\hat\beta_1 - \beta_1}{S_{\hat\beta_1}} = \frac{0.70 - 0}{0.1915} = 3.656 \)
where
\( S_{\hat\beta_1} = \frac{S}{\sqrt{\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}} = \frac{0.60553}{\sqrt{55 - \frac{(15)^2}{5}}} = 0.1915 \)
Test of Slope Parameter Computer Output
Parameter Estimates
Variable    DF    Parameter Estimate    Standard Error    T for H0: Param=0    Prob>|T|
INTERCEP    1     -0.1000               0.6350            -0.157               0.8849
ADVERT      1      0.7000               0.1914             3.656               0.0354

Parameter Estimate = \( \hat\beta_k \)
Standard Error = \( S_{\hat\beta_k} \)
T for H0 = \( \hat\beta_k / S_{\hat\beta_k} \)
Prob>|T| = P-Value
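The slope test for the Hasbro data can be reproduced step by step. The sketch below is illustrative: the variable names are assumptions, and the critical value 3.1824 is copied from the slide rather than computed, so no statistics library is needed.

```python
import math

x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
b0, b1 = -0.1, 0.7   # estimates from the deck
n = len(x)

# Standard error of the regression model, s = sqrt(SSE / (n - 2))
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

# Standard error of the slope
ss_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
s_b1 = s / math.sqrt(ss_xx)

# Test statistic under H0: beta1 = 0
t_stat = (b1 - 0) / s_b1

t_crit = 3.1824  # two-tailed critical value for df = 3, alpha = .05 (from the slide)
reject_h0 = abs(t_stat) > t_crit
print(round(s, 5), round(s_b1, 4), round(t_stat, 3), reject_h0)
```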
Measures of Variation in Regression

1. Total Sum of Squares (SSyy)
   Measures Variation of Observed Yi Around the Mean \( \bar Y \)

2. Explained Variation (SSR)
   Variation Due to Relationship Between X & Y

3. Unexplained Variation (SSE)
   Variation Due to Other Factors
Variation Measures
Total sum of squares: \( (Y_i - \bar Y)^2 \)
Unexplained sum of squares: \( (Y_i - \hat Y_i)^2 \)
Explained sum of squares: \( (\hat Y_i - \bar Y)^2 \)
\( \hat Y_i = \hat\beta_0 + \hat\beta_1 X_i \)
[Figure: one observation \( Y_i \) at \( X_i \), with its distances to the fitted line and to \( \bar Y \) marked on X-Y axes]
Coefficient of Determination

1. Proportion of Variation ‘Explained’ by Relationship Between X & Y

\( 0 \le r^2 \le 1 \)

\( r^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{\sum_{i=1}^{n} (Y_i - \bar Y)^2 - \sum_{i=1}^{n} (Y_i - \hat Y_i)^2}{\sum_{i=1}^{n} (Y_i - \bar Y)^2} \)
Coefficient of Determination Examples
[Figure: four scatter plots of Y against X, labeled r2 = 1, r2 = 1, r2 = .8 and r2 = 0]
Coefficient of Determination Example

You’re a marketing analyst for Hasbro Toys. You find \( \hat\beta_0 = -0.1 \) & \( \hat\beta_1 = 0.7 \).
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
Interpret a coefficient of determination of 0.8167.
r2 Computer Output
Root MSE    0.60553     R-square    0.8167
Dep Mean    2.00000     Adj R-sq    0.7556
C.V.        30.27650

Root MSE = S
R-square = r2
Adj R-sq = r2 adjusted for number of explanatory variables & sample size
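The R-square and Adj R-sq values in the output can be recomputed from the data. This is a sketch with assumed variable names; the adjusted value uses the formula \( 1 - \frac{n-1}{n-k-1} \cdot \frac{SSE}{SS_{yy}} \) with k = 1 explanatory variable.

```python
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
b0, b1 = -0.1, 0.7   # estimates from the deck
n, k = len(y), 1     # k = number of explanatory variables

y_bar = sum(y) / n
ss_yy = sum((yi - y_bar) ** 2 for yi in y)                      # total variation
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))   # unexplained variation

r2 = (ss_yy - sse) / ss_yy
adj_r2 = 1 - (n - 1) / (n - k - 1) * sse / ss_yy
print(round(r2, 4), round(adj_r2, 4))
```

An r2 of about 0.82 means roughly 82% of the variation in sales around its mean is explained by the linear relationship with advertising.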
Using the Model for Prediction
& Estimation
Prediction With Regression Models

1. Types of Predictions
   Point Estimates
   Interval Estimates
2. What Is Predicted
   Population Mean Response E(Y) for Given X
      Point on Population Regression Line
   Individual Response (Yi) for Given X
What Is Predicted
[Figure: at \( X_P \), the individual Y, the mean Y on the population line \( E(Y) = \beta_0 + \beta_1 X \), and the prediction \( \hat Y \) on the sample line \( \hat Y_i = \hat\beta_0 + \hat\beta_1 X_i \)]
Confidence Interval Estimate of Mean Y
\( \hat Y - t_{n-2,\alpha/2} \cdot S_{\hat Y} \le E(Y) \le \hat Y + t_{n-2,\alpha/2} \cdot S_{\hat Y} \)
where
\( S_{\hat Y} = S \sqrt{\frac{1}{n} + \frac{(X_p - \bar X)^2}{\sum_{i=1}^{n} (X_i - \bar X)^2}} \)
Factors Affecting Interval Width

1. Level of Confidence (\( 1 - \alpha \))
   Width Increases as Confidence Increases
2. Data Dispersion (s)
   Width Increases as Variation Increases
3. Sample Size
   Width Decreases as Sample Size Increases
4. Distance of Xp from Mean \( \bar X \)
   Width Increases as Distance Increases
Why Distance from Mean?
[Figure: Sample 1 line and Sample 2 line crossing near (\( \bar X \), \( \bar Y \)); the two lines show greater dispersion at X2 than at X1]
Confidence Interval Estimate
Example



You’re a marketing analyst for Hasbro
Toys. You find b0 = -.1, b1 = .7 & s =
.60553.
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
Estimate the mean sales when advertising is $4 at the .05 level.
Solution Table
        Xi      Yi      Xi^2    Yi^2    XiYi
        1       1       1       1       1
        2       1       4       1       2
        3       2       9       4       6
        4       2       16      4       8
        5       4       25      16      20
Total   15      10      55      26      37
Confidence Interval Estimate Solution
\( \hat Y - t_{n-2,\alpha/2} \cdot S_{\hat Y} \le E(Y) \le \hat Y + t_{n-2,\alpha/2} \cdot S_{\hat Y} \)
\( \hat Y = -0.1 + (0.7)(4) = 2.7 \) (X to be predicted = 4)
\( S_{\hat Y} = .60553 \sqrt{\frac{1}{5} + \frac{(4 - 3)^2}{10}} = 0.3316 \)
\( 2.7 - (3.1824)(0.3316) \le E(Y) \le 2.7 + (3.1824)(0.3316) \)
\( 1.6445 \le E(Y) \le 3.7553 \)
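The interval above can be recomputed as follows. This is a sketch with assumed variable names; s and the critical t value are taken from the slides rather than derived, and small differences in the last digit come from rounding \( S_{\hat Y} \) to 0.3316 on the slide.

```python
import math

x = [1, 2, 3, 4, 5]
b0, b1, s = -0.1, 0.7, 0.60553   # values reported in the deck
t_crit = 3.1824                  # t for df = 3, alpha/2 = .025 (from the slide)
x_p = 4                          # advertising level to estimate at

n = len(x)
x_bar = sum(x) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)

y_hat = b0 + b1 * x_p
s_yhat = s * math.sqrt(1 / n + (x_p - x_bar) ** 2 / ss_xx)

ci_low = y_hat - t_crit * s_yhat
ci_high = y_hat + t_crit * s_yhat
print(round(y_hat, 2), round(s_yhat, 4), round(ci_low, 3), round(ci_high, 3))
```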
Prediction Interval of Individual Response
\( \hat Y - t_{n-2,\alpha/2} \cdot S_{(Y - \hat Y)} \le Y_P \le \hat Y + t_{n-2,\alpha/2} \cdot S_{(Y - \hat Y)} \)
where
\( S_{(Y - \hat Y)} = S \sqrt{1 + \frac{1}{n} + \frac{(X_P - \bar X)^2}{\sum_{i=1}^{n} (X_i - \bar X)^2}} \)
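For comparison with the confidence interval for mean Y, the sketch below applies the prediction interval formula to the Hasbro data at X = 4. Variable names are assumptions; s and the critical t value are copied from the slides. The result should agree with the Low95%/Upp95% Predict values for that observation in the deck's interval estimate computer output.

```python
import math

x = [1, 2, 3, 4, 5]
b0, b1, s = -0.1, 0.7, 0.60553   # values reported in the deck
t_crit = 3.1824                  # t for df = 3, alpha/2 = .025 (from the slide)
x_p = 4

n = len(x)
x_bar = sum(x) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)

y_hat = b0 + b1 * x_p
# The extra "1 +" under the root accounts for the individual's own random error.
s_ind = s * math.sqrt(1 + 1 / n + (x_p - x_bar) ** 2 / ss_xx)

pi_low = y_hat - t_crit * s_ind
pi_high = y_hat + t_crit * s_ind
print(round(pi_low, 3), round(pi_high, 3))
```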
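Why the Extra ‘S’?
[Figure: at \( X_P \), the Y we're trying to predict varies about the expected (mean) Y on the line \( E(Y) = \beta_0 + \beta_1 X \); the prediction \( \hat Y \) lies on the sample line \( \hat Y_i = \hat\beta_0 + \hat\beta_1 X_i \)]
Interval Estimate Computer Output
       Dep Var    Pred     Std Err    Low95%    Upp95%    Low95%     Upp95%
Obs    SALES      Value    Predict    Mean      Mean      Predict    Predict
1      1.000      0.600    0.469      -0.892    2.092     -1.837     3.037
2      1.000      1.300    0.332       0.244    2.355     -0.897     3.497
3      2.000      2.000    0.271       1.138    2.861     -0.111     4.111
4      2.000      2.700    0.332       1.644    3.755      0.502     4.897
5      4.000      3.400    0.469       1.907    4.892      0.962     5.837

Obs 4, Pred Value = Predicted Y when X = 4
Std Err Predict = \( S_{\hat Y} \)
Low95% Mean, Upp95% Mean = Confidence Interval
Low95% Predict, Upp95% Predict = Prediction Interval
Hyperbolic Interval Bands
[Figure: interval bands around the line \( \hat Y_i = \hat\beta_0 + \hat\beta_1 X_i \), narrowest at \( \bar X \) and wider at an \( X_P \) away from \( \bar X \)]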
Correlation Models
Correlation Models

1. Answer ‘How Strong Is the Linear Relationship Between 2 Variables?’
2. Coefficient of Correlation Used
   Population Correlation Coefficient Denoted \( \rho \) (Rho)
   Values Range from -1 to +1
   Measures Degree of Association
3. Used Mainly for Understanding
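Sample Coefficient of Correlation

1. Pearson Product Moment Coefficient of Correlation, r:

\( r = \pm\sqrt{\text{Coefficient of Determination}} = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i=1}^{n} (X_i - \bar X)^2 \sum_{i=1}^{n} (Y_i - \bar Y)^2}} \)

(r takes the sign of the slope)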
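A minimal sketch of the Pearson formula, applied to the Hasbro advertising/sales data from this deck (variable names are illustrative assumptions). Squaring r should recover the coefficient of determination reported in the deck, 0.8167.

```python
import math

x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_yy = sum((yi - y_bar) ** 2 for yi in y)

r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(round(r, 4), round(r ** 2, 4))
```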
Coefficient of Correlation Values
-1.0      -.5      0      +.5      +1.0
-1.0: Perfect Negative Correlation
0: No Correlation
+1.0: Perfect Positive Correlation
From 0 toward -1.0: Increasing degree of negative correlation
From 0 toward +1.0: Increasing degree of positive correlation
Coefficient of Correlation Examples
[Figure: four scatter plots of Y against X, labeled r = 1, r = -1, r = .89 and r = 0]
Test of Coefficient of Correlation

1. Shows If There Is a Linear Relationship Between 2 Numerical Variables
2. Same Conclusion as Testing Population Slope \( \beta_1 \)
3. Hypotheses
   H0: \( \rho = 0 \) (No Correlation)
   Ha: \( \rho \ne 0 \) (Correlation)
Conclusion

1. Described the Linear Regression Model

2. Stated the Regression Modeling Steps

3. Explained Ordinary Least Squares

4. Computed Regression Coefficients

5. Predicted Response Variable

6. Interpreted Computer Output
Multiple Regression and
Model Building
Learning Objectives

1. Explain the Linear Multiple Regression Model

2. Test Overall Significance

3. Describe Various Types of Models

4. Evaluate Portions of a Regression Model

5. Interpret Linear Multiple Regression Computer Output

6. Explain Residual Analysis

7. Describe Regression Pitfalls
Types of Regression Models
Regression Models
   Simple (1 Explanatory Variable)
      Linear
      NonLinear
   Multiple (2+ Explanatory Variables)
      Linear
      NonLinear
Regression Modeling Steps

1. Hypothesize Deterministic Component

2. Estimate Unknown Model Parameters

3. Specify Probability Distribution of
Random Error Term

Estimate Standard Deviation of Error

4. Evaluate Model

5. Use Model for Prediction & Estimation
Linear Multiple Regression Model
Hypothesizing the
Deterministic Component
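Linear Multiple Regression Model

1. Relationship between 1 dependent & 2 or more independent variables is a linear function

\( Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i \)

\( Y_i \): Dependent (response) variable
\( X_{1i}, \ldots, X_{ki} \): Independent (explanatory) variables
\( \beta_0 \): Population Y-intercept
\( \beta_1, \ldots, \beta_k \): Population slopes
\( \varepsilon_i \): Random error
Population Multiple Regression Model
Bivariate model
\( Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i \) (Observed Y)
\( E(Y) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} \)
[Figure: response plane over the (X1, X2) axes with intercept \( \beta_0 \); the observed Y at (X1i, X2i) lies \( \varepsilon_i \) from the plane]
Sample Multiple Regression Model
Bivariate model
\( Y_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 X_{2i} + \hat\varepsilon_i \) (Observed Y)
\( \hat Y_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 X_{2i} \)
[Figure: response plane over the (X1, X2) axes with intercept \( \hat\beta_0 \); the observed Y at (X1i, X2i) lies \( \hat\varepsilon_i \) from the plane]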
Parameter Estimation
Multiple Linear Regression
Equations
Too
complicated
by hand!
Ouch!
Interpretation of Estimated Coefficients

1. Slope (\( \hat\beta_k \))
   Estimated Y Changes by \( \hat\beta_k \) for Each 1 Unit Increase in Xk Holding All Other Variables Constant
   Example: If \( \hat\beta_1 = 2 \), then Sales (Y) Is Expected to Increase by 2 for Each 1 Unit Increase in Advertising (X1) Given the Number of Sales Rep’s (X2)

2. Y-Intercept (\( \hat\beta_0 \))
   Average Value of Y When Xk = 0
Parameter Estimation Example

You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.) & newspaper circulation (000) on the number of ad responses (00).
You’ve collected the following data:
Resp    Size    Circ
1       1       2
4       8       8
1       3       1
3       5       7
2       6       4
4       10      6
Parameter Estimation Computer Output

Parameter Estimates
Variable    DF    Parameter Estimate    Standard Error    T for H0: Param=0    Prob>|T|
INTERCEP    1     0.0640                0.2599            0.246                0.8214
ADSIZE      1     0.2049                0.0588            3.656                0.0399
CIRC        1     0.2805                0.0686            4.089                0.0264

Parameter Estimate column = \( \hat\beta_P \): INTERCEP is \( \hat\beta_0 \), ADSIZE is \( \hat\beta_1 \), CIRC is \( \hat\beta_2 \)
Interpretation of Coefficients Solution

1. Slope (\( \hat\beta_1 \))
   # Responses to Ad Is Expected to Increase by .2049 (20.49) for Each 1 Sq. In. Increase in Ad Size Holding Circulation Constant

2. Slope (\( \hat\beta_2 \))
   # Responses to Ad Is Expected to Increase by .2805 (28.05) for Each 1 Unit (1,000) Increase in Circulation Holding Ad Size Constant
Evaluating the Model
Evaluating Multiple Regression Model Steps

1. Examine Variation Measures
2. Do Residual Analysis
3. Test Parameter Significance
   Overall Model
   Individual Coefficients
4. Test for Multicollinearity
Variation Measures
Coefficient of Multiple Determination
Proportion of Variation in Y ‘Explained’ by All X Variables Taken Together

\( R^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}} \)
Check Your Understanding

If you add a variable to the model

How will that affect the R-squared value for
the model?
Adjusted R2

R2 Never Decreases When New X Variable Is Added to Model
   Only Y Values Determine SSyy
   Disadvantage When Comparing Models
Solution: Adjusted R2
   Each additional variable reduces adjusted R2, unless SSE goes down enough to compensate

\( R_a^2 = 1 - \left( \frac{n-1}{n-k-1} \right) \frac{SSE}{SS_{yy}} \le 1 - \frac{SSE}{SS_{yy}} = R^2 \)
Variance of Error

Assuming model is correctly specified…
Best (unbiased) estimator of \( \sigma^2 = Var(\varepsilon) = E(\varepsilon_i^2) \) is

\( s^2 = \frac{SSE}{n-k-1} = \frac{\sum \hat\varepsilon_i^2}{n-k-1} \)

Used in formula for computing \( s_{\hat\beta_i} \)
   Exact formula is too complicated to show
   But higher value for s leads to higher \( s_{\hat\beta_i} \)
Individual Coefficients
Parameter Estimation Computer Output
Parameter Estimates
Variable    DF    Parameter Estimate    Standard Error    T for H0: Param=0    Prob>|T|
INTERCEP    1     0.0640                0.2599            0.246                0.8214
ADSIZE      1     0.2049                0.0588            3.656                0.0399
CIRC        1     0.2805                0.0686            4.089                0.0264

Parameter Estimate column = \( \hat\beta_P \): \( \hat\beta_0 \), \( \hat\beta_1 \), \( \hat\beta_2 \)
\( t = \frac{\hat\beta_i}{s_{\hat\beta_i}} \)
Testing Overall Significance

1. Shows If There Is a Linear Relationship Between All X Variables Together & Y

2. Uses F Test Statistic

3. Hypotheses
   H0: \( \beta_1 = \beta_2 = \cdots = \beta_k = 0 \)
      No Linear Relationship
   Ha: At Least One Coefficient Is Not 0
      At Least One X Variable Affects Y
Testing Overall Significance Computer Output
Analysis of Variance
Source     DF    Sum of Squares    Mean Square    F Value    Prob>F
Model      2     9.2497            4.6249         55.440     0.0043
Error      3     0.2503            0.0834
C Total    5     9.5000

DF: Model = k, Error = n - k - 1, C Total = n - 1
F Value = MS(Model) / MS(Error)
Prob>F = P-Value
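The arithmetic behind the analysis of variance table can be sketched as below. The sums of squares are copied from the output (four decimals), so the recomputed F differs slightly from the printed 55.440; variable names are assumptions. The same quantities also give R2 and adjusted R2 for this model.

```python
n, k = 6, 2                          # observations, explanatory variables
ss_model, ss_error = 9.2497, 0.2503  # sums of squares from the output
ss_total = ss_model + ss_error       # C Total

ms_model = ss_model / k              # mean square, model (df = k)
ms_error = ss_error / (n - k - 1)    # mean square, error (df = n - k - 1)
f_value = ms_model / ms_error

r2 = 1 - ss_error / ss_total
adj_r2 = 1 - (n - 1) / (n - k - 1) * ss_error / ss_total
print(round(ms_model, 4), round(ms_error, 4), round(f_value, 2))
print(round(r2, 4), round(adj_r2, 4))
```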
Types of Regression Models
Explanatory Variable
   1 Quantitative Variable
      1st Order Model
      2nd Order Model
      3rd Order Model
   2 or More Quantitative Variables
      1st Order Model
      Interaction Model
      2nd Order Model
   1 Qualitative Variable
      Dummy Variable Model
Models With a Single
Quantitative Variable
First-Order Model With 1 Independent Variable

1. Relationship Between 1 Dependent & 1 Independent Variable Is Linear
   \( E(Y) = \beta_0 + \beta_1 X_{1i} \)
2. Used When Expected Rate of Change in Y Per Unit Change in X Is Stable
3. Used With Curvilinear Relationships If Relevant Range Is Linear
First-Order Model Relationships
\( E(Y) = \beta_0 + \beta_1 X_{1i} \)
[Figure: two plots of Y against X1, one with \( \beta_1 > 0 \) and one with \( \beta_1 < 0 \)]
First-Order Model Worksheet
Case, i    Yi    X1i    X1i^2
1          1     1      1
2          4     8      64
3          1     3      9
4          3     5      25
:          :     :      :
Run regression with Y, X1
Second-Order Model With 1 Independent Variable

1. Relationship Between 1 Dependent & 1 Independent Variables Is a Quadratic Function
2. Useful 1st Model If Non-Linear Relationship Suspected
3. Model
   \( E(Y) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 \)
   Linear effect: \( \beta_1 X_{1i} \)
   Curvilinear effect: \( \beta_2 X_{1i}^2 \)
Second-Order Model Relationships
[Figure: four curves of Y against X1, two with \( \beta_2 > 0 \) and two with \( \beta_2 < 0 \)]
Second-Order Model Worksheet
Case, i    Yi    X1i    X1i^2
1          1     1      1
2          4     8      64
3          1     3      9
4          3     5      25
:          :     :      :
Create X1^2 column.
Run regression with Y, X1, X1^2.
Third-Order Model With 1 Independent Variable

1. Relationship Between 1 Dependent & 1 Independent Variable Has a ‘Wave’
2. Used If 1 Reversal in Curvature
3. Model
   \( E(Y) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \beta_3 X_{1i}^3 \)
   Linear effect: \( \beta_1 X_{1i} \)
   Curvilinear effects: \( \beta_2 X_{1i}^2 \), \( \beta_3 X_{1i}^3 \)
Third-Order Model Relationships
\( E(Y) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \beta_3 X_{1i}^3 \)
[Figure: two curves of Y against X1, one with \( \beta_3 > 0 \) and one with \( \beta_3 < 0 \)]
Third-Order Model Worksheet
Case, i    Yi    X1i    X1i^2    X1i^3
1          1     1      1        1
2          4     8      64       512
3          1     3      9        27
4          3     5      25       125
:          :     :      :        :
Multiply X1 by X1 to get X1^2.
Multiply X1 by X1 by X1 to get X1^3.
Run regression with Y, X1, X1^2, X1^3.
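The worksheet step of building the squared and cubed columns can be sketched as follows, using the four X1 values shown in the worksheet (the list names are illustrative assumptions). The resulting columns would then be supplied to a regression routine alongside Y.

```python
x1 = [1, 8, 3, 5]                 # X1 values from the worksheet
x1_sq = [v ** 2 for v in x1]      # X1 multiplied by X1
x1_cu = [v ** 3 for v in x1]      # X1 multiplied by X1 by X1

# One worksheet row per case: (X1, X1^2, X1^3)
for row in zip(x1, x1_sq, x1_cu):
    print(row)
```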
Models With Two or More
Quantitative Variables
First-Order Model With 2 Independent Variables

1. Relationship Between 1 Dependent & 2 Independent Variables Is a Linear Function
2. Assumes No Interaction Between X1 & X2
   Effect of X1 on E(Y) Is the Same Regardless of X2 Values
3. Model
   \( E(Y) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} \)
No Interaction
E(Y) = 1 + 2X1 + 3X2
E(Y) = 1 + 2X1 + 3(3) = 10 + 2X1
E(Y) = 1 + 2X1 + 3(2) = 7 + 2X1
E(Y) = 1 + 2X1 + 3(1) = 4 + 2X1
E(Y) = 1 + 2X1 + 3(0) = 1 + 2X1
[Figure: four parallel lines of E(Y), 0 to 12, against X1, 0 to 1.5, one for each X2 value]
Effect (slope) of X1 on E(Y) does not depend on X2 value
First-Order Model Relationships
[Figure: response surface of Y over the X1 and X2 axes]
First-Order Model Worksheet
Case, i    Yi    X1i    X2i
1          1     1      3
2          4     8      5
3          1     3      2
4          3     5      6
:          :     :      :
Run regression with Y, X1, X2
Interaction Model With
2 Independent Variables

1. Hypothesizes Interaction Between Pairs
of X Variables
Response to One X Variable Varies at
Different Levels of Another X Variable
2. Contains Two-Way Cross Product Terms
$E(Y) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i}$
3. Can Be Combined With Other Models
Example: Dummy-Variable Model
Effect of Interaction
1. Given:
$E(Y) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i}$
2. Without Interaction Term, Effect of
X1 on E(Y) Is Measured by $\beta_1$
3. With Interaction Term, Effect of X1
on E(Y) Is Measured by $\beta_1 + \beta_3 X_{2i}$
Effect Increases As X2i Increases (When $\beta_3 > 0$)
Interaction Model Relationships
[Figure: E(Y) plotted against X1 (0 to 1.5) for E(Y) = 1 + 2X1 + 3X2 + 4X1X2, one line per value of X2]
X2 = 0: E(Y) = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1
X2 = 1: E(Y) = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
Effect (slope) of X1 on E(Y) does depend on X2 value
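The contrast between the two example equations can be checked numerically. The sketch below uses only the two equations from the slides, E(Y) = 1 + 2X1 + 3X2 and E(Y) = 1 + 2X1 + 3X2 + 4X1X2; the function names are illustrative and not part of the original material.

```python
def no_interaction(x1, x2):
    """E(Y) = 1 + 2*X1 + 3*X2 (first-order example)."""
    return 1 + 2 * x1 + 3 * x2


def with_interaction(x1, x2):
    """E(Y) = 1 + 2*X1 + 3*X2 + 4*X1*X2 (interaction example)."""
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2


def slope_in_x1(model, x2):
    """Change in E(Y) when X1 rises from 0 to 1 with X2 held fixed."""
    return model(1, x2) - model(0, x2)


for x2 in (0, 1, 2, 3):
    print(x2, slope_in_x1(no_interaction, x2), slope_in_x1(with_interaction, x2))
```

Without the cross-product term the slope in X1 is 2 at every X2; with it the slope is 2 + 4X2, which is the pattern β1 + β3X2 from the slide.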
Interaction Model Worksheet

Case, i   Yi   X1i   X2i   X1i X2i
1         1    1     3     3
2         4    8     5     40
3         1    3     2     6
4         3    5     6     30
:         :    :     :     :

Multiply X1 by X2 to get X1X2.
Run regression with Y, X1, X2, X1X2
Second-Order Model With
2 Independent Variables

1. Relationship Between 1 Dependent
& 2 or More Independent Variables Is a
Quadratic Function
2. Useful 1st Model If Non-Linear
Relationship Suspected
3. Model
$E(Y) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \beta_4 X_{1i}^2 + \beta_5 X_{2i}^2$
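Second-Order Model Relationships
[Figure: three response surfaces of Y over X1 and X2, labeled $\beta_4 + \beta_5 > 0$, $\beta_3^2 > 4\beta_4\beta_5$, and $\beta_4 + \beta_5 < 0$]
$E(Y) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \beta_4 X_{1i}^2 + \beta_5 X_{2i}^2$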
Second-Order Model Worksheet

Case, i   Yi   X1i   X2i   X1i X2i   X1i^2   X2i^2
1         1    1     3     3         1       9
2         4    8     5     40        64      25
3         1    3     2     6         9       4
4         3    5     6     30        25      36
:         :    :     :     :         :       :

Multiply X1 by X2 to get X1X2; then square X1 and X2 to get X1^2 and X2^2.
Run regression with Y, X1, X2, X1X2, X1^2, X2^2.
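The worksheet steps can be sketched in a few lines of Python. This is only an illustration: it uses the four cases visible on the worksheet (the remaining cases are elided there), and the variable names are hypothetical.

```python
# The four cases shown on the worksheet; further cases are elided in the source.
y = [1, 4, 1, 3]
x1 = [1, 8, 3, 5]
x2 = [3, 5, 2, 6]

# Derived columns for the interaction and second-order models.
x1x2 = [a * b for a, b in zip(x1, x2)]
x1_sq = [a ** 2 for a in x1]
x2_sq = [b ** 2 for b in x2]

# One row per case: intercept, X1, X2, X1*X2, X1^2, X2^2.
design = [
    [1, a, b, ab, a2, b2]
    for a, b, ab, a2, b2 in zip(x1, x2, x1x2, x1_sq, x2_sq)
]

for row in design:
    print(row)
```

The rows of `design`, together with `y`, are what a regression routine would receive for the second-order model; dropping the last two columns gives the interaction model.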
Models With One Qualitative
Independent Variable
Dummy-Variable Model

1. Involves Categorical X Variable With 2
Levels
e.g., Male-Female; College-No College
2. Variable Levels Coded 0 & 1
3. Number of Dummy Variables Is 1 Less
Than Number of Levels of Variable
4. May Be Combined With Quantitative
Variable (1st Order or 2nd Order Model)
Dummy-Variable Model
Worksheet

Case, i   Yi   X1i   X2i
1         1    1     1
2         4    8     0
3         1    3     1
4         3    5     1
:         :    :     :

X2 levels: 0 = Group 1; 1 = Group 2.
Run regression with Y, X1, X2
Interpreting Dummy-Variable
Model Equation

Given: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i}$
Y = Starting salary of college grads
X1 = GPA
X2 = 0 if Male, 1 if Female

Males (X2 = 0):
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 (0) = \hat{\beta}_0 + \hat{\beta}_1 X_{1i}$
Females (X2 = 1):
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 (1) = (\hat{\beta}_0 + \hat{\beta}_2) + \hat{\beta}_1 X_{1i}$
Same slopes
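Dummy-Variable Model Relationships
[Figure: Y against X1; two parallel lines with the same slope $\hat{\beta}_1$: Males with intercept $\hat{\beta}_0$ and Females with intercept $\hat{\beta}_0 + \hat{\beta}_2$]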
Dummy-Variable Model Example
Computer Output: $\hat{Y}_i = 3 + 5 X_{1i} + 7 X_{2i}$
X2 = 0 if Male, 1 if Female

Males (X2 = 0):
$\hat{Y}_i = 3 + 5 X_{1i} + 7(0) = 3 + 5 X_{1i}$
Females (X2 = 1):
$\hat{Y}_i = 3 + 5 X_{1i} + 7(1) = (3 + 7) + 5 X_{1i}$
Same slopes
Selecting Variables in Model
Building
A Butterfly Flaps its Wings in Japan, Which
Causes It to Rain in Nebraska. -- Anonymous
Use Theory Only!
Use Computer Search!
Model Building with Computer
Searches

1. Rule: Use as Few X Variables As Possible

2. Stepwise Regression



Computer Selects X Variable Most Highly
Correlated With Y
Continues to Add or Remove Variables
Depending on SSE
3. Best Subset Approach

Computer Examines All Possible Sets
Residual Analysis
Evaluating Multiple Regression
Model Steps



1. Examine Variation Measures
2. Do Residual Analysis
3. Test Parameter Significance



Overall Model
Individual Coefficients
4. Test for Multicollinearity
Residual Analysis

1. Graphical Analysis of Residuals
Plot Estimated Errors vs. Xi Values
• Estimated Errors Are Called Residuals
• Difference Between Actual Yi & Predicted Yi
Plot Histogram or Stem-&-Leaf of Residuals
2. Purposes
Examine Functional Form (Linear vs.
Non-Linear Model)
Evaluate Violations of Assumptions
Linear Regression
Assumptions




1. Mean of Probability Distribution of
Error Is 0
2. Probability Distribution of Error Has
Constant Variance
3. Probability Distribution of Error is
Normal
4. Errors Are Independent
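Residual Plot
for Functional Form
[Figure: two plots of residuals (ê) against X, labeled "Add X^2 Term" and "Correct Specification"]
Residual Plot
for Equal Variance
[Figure: two plots of standardized residuals (SR) against X, labeled "Unequal Variance" (fan-shaped) and "Correct Specification"]
Standardized residuals used typically (residual
divided by standard error of prediction)
Residual Plot
for Independence
[Figure: two plots of standardized residuals (SR) against X, labeled "Not Independent" and "Correct Specification"]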
Residual Analysis
Computer Output

Obs   Dep Var SALES   Predict Value   Residual   Student Residual   -2-1-0 1 2
1     1.0000          0.6000          0.4000     1.044              |  |**  |
2     1.0000          1.3000          -0.3000    -0.592             | *|    |
3     2.0000          2.0000          0          0.000              |  |    |
4     2.0000          2.7000          -0.7000    -1.382             |**|    |
5     4.0000          3.4000          0.6000     1.567              |  |*** |

The last column is a plot of standardized
(student) residuals
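The residual columns in this output can be reproduced with a short calculation. The sketch below makes one loud assumption: the predictor values are X = 1, 2, 3, 4, 5. They are not shown in the output, but they do reproduce the printed predicted values. The studentized residual is computed as the residual divided by s·sqrt(1 − h), where h is the leverage of the observation.

```python
import math

x = [1, 2, 3, 4, 5]            # ASSUMED predictor values (not shown in the output)
y = [1.0, 1.0, 2.0, 2.0, 4.0]  # SALES column from the output

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Least squares slope and intercept for simple linear regression.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Root mean squared error with n - 2 degrees of freedom.
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Leverage of each observation, then studentized residuals.
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
student = [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]

for row in zip(y, predicted, residuals, student):
    print("%.4f  %.4f  %8.4f  %7.3f" % row)
```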
Multiple Regression Models
Multiple Regression Models
Linear: Linear, Dummy Variable, Interaction
Non-Linear: Polynomial, Square Root, Log, Reciprocal, Exponential
Polynomial (Curvilinear)
Regression Model
Curvilinear Regression Model

Relationship between 1 response
variable and 2 or more explanatory
variables is a polynomial function
Useful when scatter diagram indicates
non-linear relationship
Curvilinear model:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$
The second explanatory variable is the
square of the 1st.

Curvilinear Regression Model
Curvilinear models may be considered when
scatter diagram takes on the following shapes:
[Figure: four scatter-diagram shapes of Y against X1, two labeled $\beta_2 > 0$ and two labeled $\beta_2 < 0$]
$\beta_2$ = the coefficient of the quadratic term
Testing for Significance:
Curvilinear Model

Testing for Overall Relationship
Similar to test for linear model
F test statistic = MSR / MSE
Testing the Curvilinear Effect
Compare curvilinear model
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$
with the linear model
$Y_i = \beta_0 + \beta_1 X_{1i} + \varepsilon_i$

Testing for Significance:
Curvilinear Model

May require testing a portion of the
model (e.g. the linear and squared
terms) when there are other variables
in the model
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \beta_3 X_{2i} + \varepsilon_i$
Here we must test $\beta_1 = \beta_2 = 0$ to test for
the significance of X1: an F-test for
these two “variables”
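Testing a portion of the model compares a full fit with a reduced fit that drops the terms under test. A minimal sketch of the usual partial F statistic follows, F = [(SSE_reduced − SSE_full) / q] / [SSE_full / df_full], where q is the number of coefficients set to zero and df_full is the error degrees of freedom of the full model. The function name and the numbers are hypothetical and only illustrate the arithmetic.

```python
def partial_f(sse_reduced, sse_full, num_restrictions, df_error_full):
    """Partial F statistic for nested least squares fits.

    sse_reduced: SSE of the model without the tested terms
    sse_full: SSE of the model with all terms
    num_restrictions: number of coefficients tested (q)
    df_error_full: error degrees of freedom of the full model
    """
    numerator = (sse_reduced - sse_full) / num_restrictions
    denominator = sse_full / df_error_full
    return numerator / denominator


# Hypothetical sums of squared errors, for illustration only.
f_stat = partial_f(sse_reduced=10.0, sse_full=6.0,
                   num_restrictions=2, df_error_full=12)
print(f_stat)
```

The result is compared with an F critical value on (q, df_full) degrees of freedom; for the slide's case q = 2, one restriction for the linear term and one for the squared term.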
Inherently Linear Models

Non-linear models that can be
expressed in linear form
Can be estimated by LS in linear form
Require data transformation
Multiplicative model example
$Y_i = \beta_0 X_{1i}^{\beta_1} X_{2i}^{\beta_2} \varepsilon_i$
$\ln Y_i = \ln \beta_0 + \beta_1 \ln X_{1i} + \beta_2 \ln X_{2i} + \ln \varepsilon_i$
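The log form of the multiplicative model is an exact identity, which a few numbers can confirm. In this sketch the parameter values, the X values and the multiplicative error factors are all hypothetical.

```python
import math

# Hypothetical parameter values for Y = b0 * X1^b1 * X2^b2 * error.
beta0, beta1, beta2 = 2.0, 0.5, 1.5


def multiplicative(x1, x2, error=1.0):
    """Original (non-linear) form of the model."""
    return beta0 * x1 ** beta1 * x2 ** beta2 * error


def log_linear(x1, x2, error=1.0):
    """Linear-in-parameters form: ln(b0) + b1*ln(X1) + b2*ln(X2) + ln(error)."""
    return (math.log(beta0) + beta1 * math.log(x1)
            + beta2 * math.log(x2) + math.log(error))


cases = [(1.0, 1.0, 1.0), (4.0, 9.0, 1.1), (2.5, 0.5, 0.9)]
for x1, x2, err in cases:
    lhs = math.log(multiplicative(x1, x2, err))
    rhs = log_linear(x1, x2, err)
    print(round(lhs, 6), round(rhs, 6))
```

Because the two sides agree, least squares on ln(Y), ln(X1) and ln(X2) estimates β1 and β2 directly, and the fitted intercept estimates ln(β0).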
Using Transformations



Requires Data Transformation
Either or Both Independent and
Dependent Variables May be
Transformed
Can be based on theory, logic or scatter
diagrams
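Square Root Transformation
$Y_i = \beta_0 + \beta_1 \sqrt{X_{1i}} + \beta_2 \sqrt{X_{2i}} + \varepsilon_i$
[Figure: Y against X1, one curve for $\beta_1 > 0$ and one for $\beta_1 < 0$; similarly for X2]
Transforms the above model into one that appears
linear. Often used to overcome heteroscedasticity.
Logarithmic Transformation
$Y_i = \beta_0 + \beta_1 \ln(X_{1i}) + \beta_2 \ln(X_{2i}) + \varepsilon_i$
[Figure: Y against X1, one curve for $\beta_1 > 0$ and one for $\beta_1 < 0$; similarly for X2]
Exponential Transformation
Original Model
$Y_i = e^{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}} \varepsilon_i$
[Figure: Y against X1, one curve for $\beta_1 > 0$ and one for $\beta_1 < 0$; similarly for X2]
Transformed into: $\ln Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ln \varepsilon_i$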
Interpretation of coefficients

The dependent variable is logged.
The coefficient b on the independent variable can
be approximately interpreted as: a 1 unit
change in X leads to a 100·b percent change in Y.
The independent variable is logged.
The coefficient b on the independent variable can
be approximately interpreted as: a 1 percent
change in X leads to a b/100 unit change in Y.

Interpretation of coefficients

Both dependent and independent variables
are logged.
The coefficient b on the independent variable can
be approximately interpreted as: a 1 percent
change in X leads to a b percent change in
Y. Therefore b is the elasticity of Y with respect
to X.
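These readings are approximations that work well for small changes. The sketch below computes the exact changes implied by each logged specification so they can be compared with the rules of thumb; here b is the raw regression coefficient, so the rule of thumb for a logged Y is 100·b percent. The coefficient value and the function names are hypothetical.

```python
import math


def pct_change_log_y(b, dx=1.0):
    """Exact % change in Y when X rises by dx in ln(Y) = a + b*X."""
    return 100 * (math.exp(b * dx) - 1)


def unit_change_log_x(b, pct):
    """Exact change in Y when X rises by pct percent in Y = a + b*ln(X)."""
    return b * math.log(1 + pct / 100)


def pct_change_double_log(b, pct):
    """Exact % change in Y when X rises by pct percent in ln(Y) = a + b*ln(X)."""
    return 100 * ((1 + pct / 100) ** b - 1)


b = 0.05  # hypothetical coefficient
print(pct_change_log_y(b))          # compare with 100*b = 5
print(unit_change_log_x(b, 1))      # compare with b/100 = 0.0005
print(pct_change_double_log(b, 1))  # compare with b = 0.05
```

The approximations degrade as the change grows: for example, a 100 percent increase in X moves Y by b·ln(2), about 0.69·b units, not b units.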
Income and Experience: Scatter Plot [figure]
Income and Experience: Linear [figure: linear model]
Income and Experience: Log Independent Variable [figure: log independent variable]
Income and Experience: Income Logged [figure: Log(Y)]
Income and Experience: Double Log [figure: double log, the elasticity model. Note: LFEXP is already logged in this example]
Income and Experience: Quadratic [figure: quadratic]
Income and Experience: Log plus Quadratic [figure: Log(Y) plus quadratic]
Income and Experience: All Specifications [figure: many specifications]
Standardized and Unstandardized

Many disciplines report ONLY standardized
coefficients
The usual coefficients are then referred to
as “unstandardized coefficients”
The “standardized” coefficients are often
referred to as “beta weights”
The t-tests for significance of the slopes are
identical for either of these two.
Interpretation of coefficients

If both Y and X are measured in
standardized form,
$y_i = \dfrac{Y_i - \bar{Y}}{s_Y}$ and $x_i = \dfrac{X_i - \bar{X}}{s_X}$
Then the b’s are called standardized
coefficients. They indicate the number
of standard deviations Y will change
when X changes by one standard
deviation
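For a single X, the standardized slope equals the unstandardized slope rescaled by the ratio of standard deviations, b·(s_X / s_Y). The sketch below checks this on a small hypothetical data set using only the Python standard library; the data and function names are illustrative.

```python
import statistics as st

x = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data
y = [1.0, 1.0, 2.0, 2.0, 4.0]


def slope(xs, ys):
    """Least squares slope of ys on xs (simple linear regression)."""
    mx, my = st.mean(xs), st.mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = st.mean(values), st.stdev(values)
    return [(v - m) / s for v in values]


b = slope(x, y)                               # unstandardized slope
beta = slope(standardize(x), standardize(y))  # standardized slope ("beta weight")
print(b, beta, b * st.stdev(x) / st.stdev(y))
```

With several X variables the same rescaling applies to each slope separately, using that variable's own standard deviation.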
BETA Coefficients Example
Comparison of coefficients

In general, we should NOT compare
coefficients unless they are measured in
the same units (e.g. dollars or inches)
Two “unit free” measures are
sometimes used to compare
coefficients:
• elasticities (percentage changes)
• standardized coefficients (standard deviation
changes)
Violation of Assumptions
Omitted Variables

This problem occurs if a variable is omitted
from the specification either due to an error
by the researcher or lack of data.
If the variable is uncorrelated with the
included variables:
• The estimated slopes are inefficient (their
variance is too large).
• The estimated slopes are unbiased.
If the variable is correlated with the
included variables:
• The t-tests are biased (the estimated
variance of the slopes is too small).
• The estimated slopes are biased.
• This is a serious problem because it
leads us to reject true null hypotheses
too often.
This suggests that great care be taken
in model building. It is generally not
good procedure to allow the sample to
dictate the model. It is better to
include a variable that should not be
there than exclude a variable that
should.
EXAMPLE
Effect of Omitted Variable [figure]
Measurement Error

In the dependent variable
• With random measurement error the slopes
remain unbiased but are estimated less
precisely (larger standard errors), so null
hypotheses that are false are more difficult
to reject.
In an independent variable
• Slope is biased toward zero. Slopes of
other variables that are correlated with this
variable can also be biased. Measurement
error can lead to rejecting true nulls.

Measurement Error: Implications

Your dependent variable is hard to
measure (e.g., product satisfaction or quality of
work): if you do find results, they would be
even stronger if you could measure the
variable accurately. A significant result
with a variable that is difficult to measure
should not be dismissed!
Your independent variable is hard to
measure (e.g., product satisfaction or quality of
work): same as dependent variable (a
significant result would be even more
significant). HOWEVER, poor
measurement can lead you to give MORE
credit than is due to another variable.

Measurement Error: Conclusions

Measure your variables as accurately as
possible to improve the power of your tests.
If your independent variable is difficult to
measure, you must worry about the results
for other variables in the model.
Heteroscedasticity

Typically a problem in cross sectional
data
Slopes are unbiased, but inefficient.
However, this is often an indication of an
omitted variable problem, in which case
the slopes are potentially biased.
Usually occurs due to a few outliers.
Possible cures:
• Drop the outliers.
• Use a transformation like a log
transformation that eliminates the problem.
• Use advanced procedures to correct the
problem (Weighted Least Squares;
Generalized Least Squares)

Heteroscedasticity

Examples:
Data on firms of different sizes - there is
likely to be more heterogeneity in
management for small firms
• Small firms -> big errors
• Large firms -> small errors
Data on proportions gathered from groups
of different sizes
• Large groups likely to give better estimates
• Example: College graduation rates
Autocorrelation

This occurs when the error in one
observation is correlated with the error
in another observation.
This is generally a time series problem.
This correlation can be quite simple, or
very complicated.
If the correlation is with the previous
observation error, this is called 1st order
autocorrelation.
Example Plots of Residuals
[Figure: three residual plots, labeled Positive Autocorrelation, Negative Autocorrelation, and None]
The Durbin-Watson Statistic
•Used when data is collected over time to detect
autocorrelation (Residuals in one time period
are related to residuals in another period)
•Measures Violation of independence assumption
$D = \dfrac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$
•Approximately 0 = positive autocorrelation
•Approximately 2 = none
•Approximately 4 = negative autocorrelation
The Durbin-Watson Statistic
Durbin-Watson table (one-tailed critical values), α = .05

        P=1            P=2
N       DL     DH      DL     DH
15      1.08   1.36    0.95   1.54
16      1.10   1.37    0.98   1.54
17      1.13   1.38    1.02   1.54
18      1.16   1.39    1.05   1.53

d > DH indicates ACCEPT NULL (do not reject)
d < DL indicates REJECT NULL
d > DL and d < DH is inconclusive
Test for NEGATIVE autocorrelation: USE 4-d
Example: d = 3.5, n = 15, p = 2: use 4 - d = .5; since .5 < DL = 0.95, reject null
Durbin-Watson Example
Relationship between sales
and customers

Regression Statistics
Multiple R          0.810829997
R Square            0.657445284
Adjusted R Square   0.631094922
Standard Error      0.936036681
Observations        15

            Coefficients   Standard Error   t Stat         P-value
Intercept   -16.0321936    5.310167093      -3.019150493   0.009868641
Customers   0.030760228    0.006158189      4.995011683    0.000245105

Week   Customers   Sales
1      794         9.33
2      799         8.26
3      837         7.48
4      855         9.08
5      845         9.83
6      844         10.09
7      863         11.01
8      875         11.49
9      880         12.07
10     905         12.55
11     886         11.92
12     843         10.27
13     904         11.80
14     950         12.15
15     841         9.64
Durbin-Watson Example
DW=.88
p=1 (# of variables)
n=15 (# of observations)
dl=1.08 dh=1.36
Conclusion: Reject null of no positive
autocorrelation (DW< dl)
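The statistic for this example can be recomputed from the 15 weeks of data. The sketch below refits the line by ordinary least squares and applies the Durbin-Watson formula to the residuals; the function names are illustrative, and the result should land close to the reported DW = .88.

```python
customers = [794, 799, 837, 855, 845, 844, 863, 875, 880, 905,
             886, 843, 904, 950, 841]
sales = [9.33, 8.26, 7.48, 9.08, 9.83, 10.09, 11.01, 11.49, 12.07, 12.55,
         11.92, 10.27, 11.80, 12.15, 9.64]


def fit_line(x, y):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


b0, b1 = fit_line(customers, sales)
residuals = [s - (b0 + b1 * c) for c, s in zip(customers, sales)]
d = durbin_watson(residuals)
print(round(b0, 4), round(b1, 6), round(d, 2))
```

The residuals must be kept in time order (week 1 to week 15), since the numerator compares each residual with the one from the previous week.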
Problem and Cure
•Autocorrelation present
•t-tests are biased - estimated standard error too small
•Degree of autocorrelation known (or estimated)
•Remove by differencing the data.
•Special case: correlation +1 -> first difference the data
$Y_t^* = Y_t - Y_{t-1}$
$X_t^* = X_t - X_{t-1}$
We then run the regression using Y* and X* instead
of Y and X
EXAMPLE:
How is birth rate related
to wars and women in
the labor force?
WLF = labor force participation of women
Divorce = divorce rate
returnyr = 3 years following a war

Birth Rate   WLF     UE     Divorce   waryear   returnyr
2.03         25.40   9.90   17        1         0
2.22         26.70   4.70   18        1         0
2.27         29.10   1.90   23        1         0
2.12         29.20   3.20   28        1         0
2.04         29.20   1.90   30        1         0
2.41         27.80   3.90   27        0         0
2.66         27.40   3.90   24        0         1
2.49         28.00   3.80   23        0         1
2.45         28.30   5.90   25        0         1
2.41         28.80   5.30   23        0         0
2.49         29.30   3.30   24        0         0
2.51         29.40   3.00   25        0         0
2.50         29.20   2.90   25        0         0
2.53         29.40   5.50   25        0         0
2.50         30.20   4.40   25        0         1
2.52         31.00   4.10   24        0         1
2.53         31.20   4.30   25        0         1
2.45         31.50   6.80   25        0         0
2.40         31.70   5.50   26        0         0
2.37         32.30   5.50   27        0         0
2.33         32.60   6.70   26        0         0
2.24         32.70   5.50   26        0         0
2.17         33.20   5.70   26        0         0
2.10         33.60   5.20   26        0         0
1.94         34.00   4.50   27        1         0
1.84         34.60   3.80   27        1         0
1.78         35.10   3.80   27        1         0
1.75         35.50   3.60   28        1         0
1.78         36.30   3.50   30        1         0
1.84         36.70   4.90   33        1         0
Results

Regression Statistics
Multiple R          0.859549368
R Square            0.738825117
Adjusted R Square   0.684413683
Standard Error      0.152709042
Observations        30

ANOVA
             df   SS            MS            F             Significance F
Regression   5    1.583255433   0.316651087   13.57849007   2.41579E-06
Residual     24   0.559681234   0.023320051
Total        29   2.142936667

            Coefficients   Standard Error   t Stat         P-value       Lower 95%      Upper 95%
Intercept   4.374896708    0.381602139      11.46454976    3.1963E-11    3.587308764    5.162484651
WLF         -0.04910246    0.014827648      -3.311547559   0.002928173   -0.079705215   -0.018499706
UE          -0.045616967   0.023630251      -1.930447837   0.065449464   -0.094387398   0.003153464
Divorce     -0.008953659   0.015631931      -0.572780128   0.572121264   -0.041216371   0.023309053
waryear     -0.327043186   0.074959758      -4.362916791   0.0002099     -0.48175249    -0.172333881
returnyr    0.009847529    0.089477525      0.110055896    0.913280166   -0.174824968   0.194520026
Residuals
[Figure: residuals (vertical axis, -0.2 to 0.2) plotted against observation number (horizontal axis, 0 to 35)]
DW = .55: reject null of no autocorrelation
Estimate rho as r = 1 - d/2 = 1 - .55/2 = .725
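The estimate of rho and the differencing step it feeds can be sketched as follows. The function names are hypothetical; the birth-rate values are the first four observations of the example, and with rho = 1 the same transformation reduces to ordinary first differences.

```python
def rho_from_dw(d):
    """Rough estimate of first-order autocorrelation from the Durbin-Watson d."""
    return 1 - d / 2


def quasi_difference(series, rho):
    """Return series[t] - rho * series[t-1]; the first observation is lost."""
    return [series[t] - rho * series[t - 1] for t in range(1, len(series))]


rho = rho_from_dw(0.55)
birth_rate = [2.03, 2.22, 2.27, 2.12]  # first four birth-rate values from the example
transformed = quasi_difference(birth_rate, rho)
print(rho, [round(v, 5) for v in transformed])

# Special case rho = 1: plain first differences.
print(quasi_difference([1, 3, 6], 1))
```

Every variable in the regression, including the dummies, is transformed with the same rho before the model is re-estimated on the remaining n − 1 observations.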
Differencing Data
The original data (Birth Rate, WLF, UE, Divorce, waryear, returnyr; 30 observations) are the same as in the example above.
Each variable is transformed as $Z_t^* = Z_t - 0.725\, Z_{t-1}$, which leaves 29 observations:

Birthrate   WLF       UE        Divorce   waryear   returnyr
0.74825     8.285     -2.4775   5.675     0.275     0
0.6605      9.7425    -1.5075   9.95      0.275     0
0.47425     8.1025    1.8225    11.325    0.275     0
0.503       8.03      -0.42     9.7       0.275     0
0.931       6.63      2.5225    5.25      -0.725    0
0.91275     7.245     1.0725    4.425     0         1
0.5615      8.135     0.9725    5.6       0         0.275
0.64475     8         3.145     8.325     0         0.275
0.63375     8.2825    1.0225    4.875     0         -0.725
0.74275     8.42      -0.5425   7.325     0         0
0.70475     8.1575    0.6075    7.6       0         0
0.68025     7.885     0.725     6.875     0         0
0.7175      8.23      3.3975    6.875     0         0
0.66575     8.885     0.4125    6.875     0         1
0.7075      9.105     0.91      5.875     0         0.275
0.703       8.725     1.3275    7.6       0         0.275
0.61575     8.88      3.6825    6.875     0         -0.725
0.62375     8.8625    0.57      7.875     0         0
0.63        9.3175    1.5125    8.15      0         0
0.61175     9.1825    2.7125    6.425     0         0
0.55075     9.065     0.6425    7.15      0         0
0.546       9.4925    1.7125    7.15      0         0
0.52675     9.53      1.0675    7.15      0         0
0.4175      9.64      0.73      8.15      1         0
0.4335      9.95      0.5375    7.425     0.275     0
0.446       10.015    1.045     7.425     0.275     0
0.4595      10.0525   0.845     8.425     0.275     0
0.51125     10.5625   0.89      9.7       0.275     0
0.5495      10.3825   2.3625    11.25     0.275     0
Results

Regression Statistics
Multiple R          0.812694974
R Square            0.660473121
Adjusted R Square   0.58666293
Standard Error      0.083098593
Observations        29

ANOVA
             df   SS            MS            F             Significance F
Regression   5    0.308955666   0.061791133   8.948264624   7.86547E-05
Residual     23   0.158823652   0.006905376
Total        28   0.467779317

            Coefficients   Standard Error   t Stat         P-value
Intercept   1.411080007    0.15976456       8.832246672    7.53745E-09
WLF         -0.0646903     0.019521702      -3.313763265   0.003028155
UE          -0.014449922   0.013867407      -1.042006025   0.308238603
Divorce     -0.019019342   0.010589345      -1.79608287    0.085630564
waryear     -0.116753859   0.056676884      -2.059990798   0.050889678
returnyr    0.028674254    0.050850908      0.563888726    0.578287347
Multicollinearity

Does not violate assumptions of least
squares (unless it is perfectly collinear)
Estimates have low ability to reject false
null hypotheses (low power).
A post hoc problem.
Little that can be done - eliminating a
variable could cause omitted variable
bias.

Multicollinearity

May require testing groups of variables
instead of individual slopes.
Use F-test for a group of variables that are
measuring a similar idea rather than
testing the idea by looking at individual t-tests
Example of Collinearity
How is MPG influenced by car characteristics?
Variables: Model, MPG, Type of Drive, Fuel Type, Fuel Capacity, Length, Wheelbase, Width, Turning Circle, Weight, Luggage Capacity, Front Leg Room, Front Head Room
Regression Results

Regression Statistics
Multiple R          0.916170699
R Square            0.839368749
Adjusted R Square   0.816421427
Standard Error      1.943378831
Observations        89

ANOVA
             df   SS            MS            F             Significance F
Regression   11   1519.596956   138.1451778   36.57807062   4.3792E-26
Residual     77   290.8075387   3.776721282
Total        88   1810.404494

                   Coefficients   Standard Error   t Stat         P-value
Intercept          47.84639077    16.00522579      2.989423041    0.003750637
Type of Drive      -0.990467814   0.804947293      -1.230475366   0.222264941
Fuel Type          -0.448181315   0.669007482      -0.669919736   0.504913062
Fuel Capacity      -0.033480953   0.205431049      -0.162979028   0.870961928
Length             0.035032065    0.0609802        0.574482609    0.567315883
Wheelbase          0.051601109    0.103653825      0.497821566    0.620028615
Width              -0.005035985   0.163180417      -0.030861452   0.975459886
Turning Circle     -0.194077967   0.160875083      -1.206389226   0.23136098
Weight             -0.009083907   0.001760789      -5.158997869   1.87855E-06
Luggage Capacity   0.085785172    0.09438463       0.908889218    0.366244683
Front Leg Room     0.028830301    0.326150787      0.088395621    0.929791742
Front Head Room    -0.226076981   0.256358435      -0.881878454   0.380587216
Correlations of Independent Variables

                   Type of Drive   Fuel Type      Fuel Capacity   Length        Wheelbase
Type of Drive      1
Fuel Type          0.268681117     1
Fuel Capacity      -0.428951231    -0.473951823   1
Length             -0.341685414    -0.222986645   0.804874478     1
Wheelbase          -0.320081091    -0.289068849   0.76811138      0.90800949    1
Width              -0.441548037    -0.181405593   0.669241997     0.87697456    0.775996773
Turning Circle     -0.143483852    -0.10837901    0.572058456     0.789886571   0.663126584
Weight             -0.520865176    -0.399893445   0.894722586     0.906826507   0.856277889
Luggage Capacity   0.04879301      0.149137156    0.397945037     0.510775619   0.488358101
Front Leg Room     -0.282796832    -0.319408961   0.642973059     0.539896618   0.519103235
Front Head Room    0.150372499     0.496620764    -0.263388897    0.06588231    0.114062218

                   Width         Turning Circle   Weight
Width              1
Turning Circle     0.771254595   1
Weight             0.855494824   0.719981675      1
Luggage Capacity   0.443142211   0.317445699      0.429944841
Front Leg Room     0.508740454   0.329045265      0.633639181
Front Head Room    0.102180504   0.090580546      -0.123170049

                   Luggage Capacity   Front Leg Room   Front Head Room
Luggage Capacity   1
Front Leg Room     0.280372614        1
Front Head Room    0.113971709        -0.132931814     1