Regression Continued:
Functional Form
LIR 832
Topics for the Evening
1. Qualitative Variables
2. Non-linear Estimation
Functional Form
- Not all relations among variables are linear.
- Our basic linear model:
  y = b0 + b1*X1 + b2*X2 + … + bk*Xk + e
Functional Form
- Q: Given that we are using OLS, can we mimic these non-linear forms?
- A: We have a small bag of tricks which we can use with OLS.
Functional Form
- A first point about functional form: you must have an intercept.
- Consider the following case: we estimate a model and test the intercept to determine whether it is significantly different from zero. We are not able to reject the null in a hypothesis test, and we decide to re-estimate the model without an intercept. What is really going on?
- Return to our basic model:
  y = b0 + b1*X1 + b2*X2 + … + bk*Xk + e
- What are we doing when we remove the intercept?
  y = 0 + b1*X1 + b2*X2 + … + bk*Xk + e
Functional Form
/* Regression without an intercept */

Regression Analysis: weekearn versus years ed

The regression equation is
weekearn = 57.3 years ed

47576 cases used, 7582 cases contain missing values

Predictor        Coef   SE Coef        T      P
Noconstant
years ed      57.3005    0.1541   371.96  0.000

S = 534.450
Functional Form
/* Regression with an intercept */

Regression Analysis: weekearn versus years ed

The regression equation is
weekearn = - 485 + 87.5 years ed

47576 cases used, 7582 cases contain missing values

Predictor        Coef   SE Coef        T      P
Constant      -484.57     18.18   -26.65  0.000
years ed       87.492     1.143    76.54  0.000

S = 530.510   R-Sq = 11.0%   R-Sq(adj) = 11.0%
Functional Form
- Consequences of forcing the line through zero:
  - Unless the intercept is really zero, we are going to bias both the intercept and the slope coefficients.
  - Remember that we calculate the intercept so that the line passes through the point of means; this assures that Σε = 0.
  - If we impose 0 as the intercept, the line may not pass through the point of means and the sum of the errors may not equal zero. This biases the coefficients and leads to incorrect estimates of the standard errors of the βs.
- Never suppress the intercept, even if your theory suggests that it is not necessary.
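A small simulation, not from the slides, illustrates the point; it is a sketch using plain numpy least squares, with made-up data whose true intercept is 50.

import numpy as np

rng = np.random.default_rng(0)

# Simulate data with a true intercept of 50 and slope of 2
x = rng.uniform(10, 20, size=1000)
y = 50 + 2 * x + rng.normal(0, 5, size=1000)

# OLS with an intercept: regress y on [1, x]
X_with = np.column_stack([np.ones_like(x), x])
b0_hat, b1_hat = np.linalg.lstsq(X_with, y, rcond=None)[0]

# OLS forced through the origin: regress y on x alone
b1_no_int = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)[0][0]

print(f"with intercept: b0 = {b0_hat:.2f}, b1 = {b1_hat:.2f}")
print(f"through origin: b1 = {b1_no_int:.2f}  (slope absorbs the suppressed intercept)")
print(f"mean residual, no-intercept model: {np.mean(y - b1_no_int * x):.2f}")

Forcing the line through the origin biases the slope and leaves residuals that do not average to zero, exactly the pattern in the MINITAB output that follows.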
Functional Form
/* What About Those Residuals? */

Descriptive Statistics: RESI1, RESI2

Variable      N     N*    Mean  SE Mean   StDev   Minimum       Q1   Median      Q3  Maximum
RESI1     47576   7582   -8.67     2.45  534.38  -1180.31  -359.12  -122.21  218.59  2311.61
RESI2     47576   7582    0.00     2.43  530.50  -1329.77  -340.32  -107.62  237.69  2494.26

(RESI1 comes from the no-intercept regression, RESI2 from the regression with an intercept. Note that the residuals from the no-intercept model do not average to zero.)
Functional Form
- Returning to the issue of non-linearity…
- In our basic model:
  b = ΔY/ΔX = change in Y for a one-unit change in X
- Consider the effect of education on weekly earnings…
Functional Form
Descriptive Statistics: years ed, Exp

Variable      N    N*    Mean  SE Mean   StDev  Minimum      Q1  Median      Q3  Maximum
years ed  55158     0  15.734  0.00941   2.211    1.000  14.000  16.000  18.000   21.000
Exp       55107    51  21.644   0.0496  11.640   0.0000  13.000  22.000  30.000   76.000

Regression Analysis: weekearn versus years ed

The regression equation is
weekearn = - 485 + 87.5 years ed

47576 cases used, 7582 cases contain missing values

Predictor       Coef  SE Coef       T      P
Constant     -484.57    18.18  -26.65  0.000
years ed      87.492    1.143   76.54  0.000

S = 530.510   R-Sq = 11.0%   R-Sq(adj) = 11.0%
Functional Form
- Now create a graph in MINITAB:
  - Work in a new worksheet.
  - Create values for years of education 0 - 21.
  - Use the calculator to create the predicted weekly earnings.
  - Use the scatterplot graphing function (a rough Python equivalent is sketched below).
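A rough Python equivalent of those MINITAB steps; it simply plugs years of education into the fitted equation from the slide.

import numpy as np
import matplotlib.pyplot as plt

# Predicted weekly earnings from the fitted line on the slide:
# weekearn-hat = -485 + 87.5 * years of education
years_ed = np.arange(0, 22)            # years of education 0 - 21
pred_weekearn = -485 + 87.5 * years_ed

plt.scatter(years_ed, pred_weekearn)
plt.xlabel("Years of education")
plt.ylabel("Predicted weekly earnings ($)")
plt.title("Linear model: each year of education adds about $87.49")
plt.show()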
Functional Form
Every year of education
increases earnings by $87.49!
Functional Form
- Q: How do we estimate non-linear relations?
- A: We can use log transforms of variables to measure relations between variables as percentages rather than units.
- What is a log? What is a log transform?
  - Take any number, let's take 10.
  - Then calculate b such that 10 = 2.71828^b. Then b is the log of 10. In this case b = 2.302585.
  - You can do this on your calculator, in a spreadsheet, or in MINITAB.
Functional Form
- As your text shows:
  100 = 2.71828^b        →  ln(100) = 4.605
  1000 = 2.71828^b       →  ln(1000) = 6.908
  10,000 = 2.71828^b     →  ln(10,000) = 9.210
  1,000,000 = 2.71828^b  →  ln(1,000,000) = 13.816
- We typically do not write 2.71828; rather we substitute e, the natural base (there are also base-10 logs). So…
  10 = e^2.302585
- Some nice properties of log functions:
  ln(X*Y) = ln(X) + ln(Y)
  ln(X^2) = 2*ln(X)
Functional Form
- This property made it possible to manipulate very large numbers very easily and provides the foundation for slide rules and many modern computer calculations.
- Consider: 1,212,345 * 375,282
  - A real mess to do by hand.
- Now consider the following transformation of this problem:
  ln(1,212,345 * 375,282)
  = ln(1,212,345) + ln(375,282)
  = 14.008067 + 12.83543
  = 26.8435
  antilog(26.8435) = e^26.8435 ≈ 454,971,256,290 (the exact product)
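A quick check of that arithmetic in Python (not part of the original slides):

import math

a, b = 1_212_345, 375_282

# Multiply by adding logs and exponentiating -- the slide-rule trick
log_sum = math.log(a) + math.log(b)
product_via_logs = math.exp(log_sum)

print(f"ln(a) + ln(b)  = {log_sum:.4f}")          # about 26.8435
print(f"exp(sum)       = {product_via_logs:,.0f}")
print(f"direct product = {a * b:,}")               # 454,971,256,290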
Functional Form
- The Shell presentation has an equation associated with an upward curve of:
  Earnings = 62988 * X^0.2676
  or, in general, y = b0 * X^b1
- We cannot estimate this in its current form using regression, but think about taking the log of each side:
  ln(y) = ln(b0 * X^b1)
  ln(y) = ln(b0) + ln(X^b1)
  ln(y) = ln(b0) + b1*ln(X)
- So, if we take the log of each side, we get a linear equation that we can estimate!
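As a sanity check (a sketch, not from the slides), we can generate data from a curve of this form and recover the exponent by fitting a straight line to the logs:

import numpy as np

rng = np.random.default_rng(1)

# Data generated from an upward curve y = b0 * x^b1, using the values on the slide
x = rng.uniform(1, 30, size=500)
y = 62988 * x**0.2676 * np.exp(rng.normal(0, 0.1, size=500))   # multiplicative noise

# Regress ln(y) on ln(x): the slope recovers b1
slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
print(f"estimated b1 = {slope:.4f}")               # close to 0.2676
print(f"estimated b0 = {np.exp(intercept):,.0f}")  # close to 62,988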
Functional Form
- Consider the following equation (a single log equation):
  ln(weekearn) = b0 + b1*YearsEd + e
- The interpretation of the coefficient on years of education is now the % change in weekly earnings for a one-year change in education.
- How to do this in MINITAB:
  - Calculate the log of weekly earnings.
  - Estimate the regression as…
Functional Form
Regression Analysis: ln week earn versus years ed

The regression equation is
ln week earn = 4.87 + 0.109 years ed

47576 cases used, 7582 cases contain missing values

Predictor       Coef   SE Coef       T      P
Constant     4.86646   0.02382  204.33  0.000
years ed    0.108980  0.001497   72.78  0.000

S = 0.694967   R-Sq = 10.0%   R-Sq(adj) = 10.0%

Analysis of Variance

Source             DF       SS      MS        F      P
Regression          1   2558.4  2558.4  5297.03  0.000
Residual Error  47574  22977.3     0.5
Total           47575  25535.6
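For reference, a minimal Python sketch of the same single log regression using statsmodels; the file name cps_extract.csv and the column names weekearn and years_ed are placeholders, not from the slides.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical CPS extract with weekly earnings and years of education
cps = pd.read_csv("cps_extract.csv").dropna(subset=["weekearn", "years_ed"])

y = np.log(cps["weekearn"])             # single log (log-level) model
X = sm.add_constant(cps["years_ed"])

model = sm.OLS(y, X).fit()
print(model.summary())                  # the slide's coefficient on years ed is about 0.109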
Functional Form
- Now we find that an additional year of education results in about a 10.9% increase in weekly earnings.
  - The interpretation is different from the linear model.
- r² is different between the linear and log models:
  - Linear: r² = 11.0%
  - Log: r² = 10.0%
  - Does this mean the fit of the log model is worse than the linear model?
  - No, we cannot compare the two, because transforming the equation fundamentally altered the variance of the dependent variable.
Functional Form
Descriptive Statistics: weekearn, ln week earn

Variable          N     N*    Mean  SE Mean   StDev  Minimum       Q1   Median       Q3  Maximum
weekearn      47576   7582  894.53     2.58  562.22     0.01   519.00   769.23  1153.00  2884.61
ln week earn  47576   7582  6.5843  0.00336  0.7326  -4.6052   6.2519   6.6454   7.0501    7.967
What Does the Log Model Look Like? -- How to create a prediction in MINITAB & graph:
- Use the regression equation to create the estimated log wage from the years of education data.
- Exponentiate the predicted value using the MINITAB calculator.
- Graph the predicted wage against years of education (a Python version of these steps is sketched below).
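A rough Python version of those steps, using the fitted equation ln(week earn) = 4.87 + 0.109 years ed from the slide:

import numpy as np
import matplotlib.pyplot as plt

# Predicted log earnings from the fitted log-level equation on the slide
years_ed = np.arange(0, 22)
ln_pred = 4.87 + 0.109 * years_ed

# Exponentiate to get predicted weekly earnings, then graph against education
pred_weekearn = np.exp(ln_pred)

plt.plot(years_ed, pred_weekearn, marker="o")
plt.xlabel("Years of education")
plt.ylabel("Predicted weekly earnings ($)")
plt.title("Log model: earnings grow by about 10.9% per year of education")
plt.show()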
Functional Form
- What is the equation underlying this model?
  weekearn = e^(b0 + b1*YearsEd) = e^b0 * e^(b1*YearsEd)
- This is a model of growth (such as compound interest)…
Functional Form
- Now let's try another approach, taking the log of both sides (a double log equation):
  ln(weekearn) = b0 + b1*ln(YearsEd) + e
- The interpretation of the coefficient b1 is now the % change in weekly earnings for a 1% change in years of education.
- Note that this is an elasticity (which you will discuss in 809 in talking about supply and demand – the elasticity of labor demand with respect to the wage is the % change in the demand for labor for a 1% change in the wage).
Functional Form
Regression Analysis: ln week earn versus ln ed

The regression equation is
ln week earn = 2.13 + 1.62 ln ed

47576 cases used, 7582 cases contain missing values

Predictor      Coef  SE Coef      T      P
Constant    2.12844  0.06203  34.32  0.000
ln ed       1.62142  0.02254  71.93  0.000

S = 0.695775   R-Sq = 9.8%   R-Sq(adj) = 9.8%
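A minimal Python sketch of the double log regression; the data file and column names are placeholders, as before.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical CPS extract with weekly earnings and years of education
cps = pd.read_csv("cps_extract.csv").dropna(subset=["weekearn", "years_ed"])

# Double log (log-log) model: the slope is an elasticity --
# the % change in weekly earnings for a 1% change in years of education
y = np.log(cps["weekearn"])
X = sm.add_constant(np.log(cps["years_ed"]))

loglog = sm.OLS(y, X).fit()
print(loglog.params)     # on the slides the slope is about 1.62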
Functional Form
- What is going on graphically? What are we really doing?
Functional Form
- Q: How do we choose?
- A: Prior work and theory.
  - Is it sensible to measure as a linear model, or does one of these non-linear forms make better sense?
- Example: thinking of the relationship between education and wages:
  wage = β0 + β1*Years_of_Education
  ln(wage) = β0 + β1*Years_of_Education
  ln(wage) = β0 + β1*ln(Years_of_Education)
Functional Form
- What does prior work indicate?
- We typically use a log wage equation rather than a wage equation because:
  - It turns out the error term is normally distributed in a log wage equation.
  - Percentage effects are more readily compared across models, as they do not depend on the scaling of the variable.
  - Comparing the effect of education in percentage terms frees us from the effect of inflation and alternative currencies.
Functional Form
- A more general non-linear form: the polynomial form.
- Problem: do we really believe that you get an additional 0.723% in weekly earnings for each year you get older? Hardly makes it worth getting older.
Functional Form
Regression Analysis: ln(wkern) versus age, gender, edattain

The regression equation is
ln(wkern) = 2.41 + 0.00723 age - 0.368 gender + 0.105 edattain

47576 cases used, 7582 cases contain missing values

Predictor        Coef    SE Coef       T      P
Constant      2.41075    0.06470   37.26  0.000
age         0.0072344  0.0002669   27.11  0.000
gender      -0.368278   0.006115  -60.22  0.000
edattain     0.105032   0.001491   70.45  0.000

S = 0.6626   R-Sq = 18.2%   R-Sq(adj) = 18.2%
This model remains linear in ln(weekly earnings): each one-year increase in age raises predicted earnings by about 0.7%.
Functional Form
- It would be more reasonable to believe we will get a relationship which looks like the curve on the following graph: earnings rising with age early on, then flattening and eventually declining. Why?
Functional Form
- How do we mimic this? Consider estimating the following linear regression:
  ln(wkern) = b0 + b1*age + b2*age^2 + b3*gender + b4*edattain + e
- Notice that age enters twice, first as a linear term and then as a square. What does this model look like with real data?
Functional Form
Regression Analysis: ln(wkern) versus age, age2, gender, edattain

The regression equation is
ln(wkern) = 0.927 + 0.104 age - 0.00113 age2 - 0.376 gender + 0.0948 edattain

47576 cases used, 7582 cases contain missing values

Predictor         Coef     SE Coef       T      P
Constant       0.92706     0.06640   13.96  0.000
age           0.103919    0.001547   67.17  0.000
age2       -0.00112565  0.00001776  -63.37  0.000
gender       -0.376012    0.005874  -64.01  0.000
edattain      0.094822    0.001441   65.82  0.000

S = 0.6363   R-Sq = 24.6%   R-Sq(adj) = 24.6%
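A sketch of the same quadratic specification in Python; again, the data file and the column names wkern, age, gender, and edattain are placeholders.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical CPS extract; column names are assumptions, not from the slides
cps = pd.read_csv("cps_extract.csv").dropna(subset=["wkern", "age", "gender", "edattain"])
cps["age2"] = cps["age"] ** 2       # age enters twice: the linear term and its square

y = np.log(cps["wkern"])
X = sm.add_constant(cps[["age", "age2", "gender", "edattain"]])

quad = sm.OLS(y, X).fit()
print(quad.params)    # on the slides: age ≈ 0.104, age2 ≈ -0.00113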
Functional Form
- Note that we now have two coefficients on age:
  age    0.103919
  age2  -0.00112565
- We know that the first term indicates that for each additional year our weekly earnings rise by 10.39%. But how do we chart out the second term so that we have the full effect of age on earnings?
Functional Form
- The effect of an additional year on earnings (the formula for a polynomial model):
  If our model is y = b0 + b1*X + b2*X^2 + …,
  then ΔY/ΔX = b1 + 2*b2*X
- First, look at the prediction of ln weekly earnings based on age, leaving all other variables at their mean (a Python sketch of this prediction follows).
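A sketch of that prediction in Python, using the coefficients from the slide; the mean values plugged in for gender and edattain are placeholders, not the actual sample means.

import numpy as np
import matplotlib.pyplot as plt

# Coefficients from the quadratic regression on the slides
b0, b_age, b_age2 = 0.92706, 0.103919, -0.00112565
b_gender, b_edattain = -0.376012, 0.094822

# Hold the other regressors at their sample means (placeholder values --
# in practice compute these from the data, e.g. cps["gender"].mean())
mean_gender, mean_edattain = 0.5, 13.0

age = np.arange(18, 70)
ln_earn_hat = (b0 + b_age * age + b_age2 * age**2
               + b_gender * mean_gender + b_edattain * mean_edattain)

plt.plot(age, ln_earn_hat)
plt.xlabel("Age")
plt.ylabel("Predicted ln(weekly earnings)")
plt.title("Quadratic age profile (other variables held at their means)")
plt.show()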
Functional Form
- What about the 'marginal effect' of age? What is the effect on income of getting an additional year older?
  - It obviously varies with how old you are. Things are pretty good when you are young.
- Two ways of obtaining this:
- 1. Calculate the difference in the total effect of age for any two years:
     Age 22:  1.741
     Age 21:  1.686
     Diff:    0.055, or +5.5%
Functional Form
- 2. Alternatively, use the polynomial formula: ΔY/ΔX = b1 + 2*b2*age.
Functional Form
- What is the increase in earnings at age 21?
  0.103919 - 0.0022513*21 = 0.056642
- What about age 25?
  0.103919 - 0.0022513*25 = 0.0476365
- What about age 50? (Class work)
- Note that the effect of an additional year of age is no longer constant; it depends on how old you are. A short Python sketch of this arithmetic follows.
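The same arithmetic in a short Python sketch, using the coefficients from the slide:

# Marginal effect of one more year of age in the quadratic model:
# d ln(earnings) / d age = b_age + 2 * b_age2 * age
b_age, b_age2 = 0.103919, -0.00112565

def marginal_effect(age):
    return b_age + 2 * b_age2 * age

for age in (21, 25, 50):
    print(f"age {age}: {marginal_effect(age):+.4f}  (about {100 * marginal_effect(age):+.1f}% per extra year)")

# The age profile peaks where the marginal effect hits zero: age = -b_age / (2 * b_age2)
print(f"predicted earnings peak at about age {-b_age / (2 * b_age2):.1f}")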
Functional Form
- The gains to aging are greatest when you are youngest:
  - They decline steadily as you age.
  - By age fifty your earnings are falling as you get older (oops!).
- A couple of points about polynomial and functional forms:
  - Polynomial forms have the strength of letting the data tell you whether the relationship is linear or not. If it is, the coefficient on X^2 will be 0 or very close to it.
  - You cannot compare r² across log and non-log forms, because the transformation changes the dependent variable and the sum of squares. You can compare it across forms that keep the same dependent variable, such as the linear and polynomial models.
Recap on Functional Form
- Not all relationships are linear.
- Regression allows us to estimate non-linear models and to let the data tell us whether we should be using a non-linear form:
  - Single and double log transforms
  - Polynomial form
MultiCollinearity
- Issue: What happens when two variables contain the same, or almost the same, information?
- This condition is called multicollinearity.
Perfect MultiCollinearity Is Not a Problem
- Try putting both a Male and a Female dummy variable in a wage equation.

Base Regression: Earnings = F(age, Education)
Regression Analysis: weekearn versus years ed, age

The regression equation is
weekearn = - 707 + 83.5 years ed + 6.87 age

Predictor       Coef  SE Coef       T      P
Constant     -706.63    19.24  -36.73  0.000
years ed      83.463    1.137   73.38  0.000
age           6.8717   0.2118   32.45  0.000

S = 524.739   R-Sq = 12.9%   R-Sq(adj) = 12.9%
Now Put Male & Female Into Model

Regression Analysis: weekearn versus years ed, age, Male, Female

* Female is highly correlated with other X variables
* Female has been removed from the equation.

The Regression


The regression equation is
weekearn = - 720 + 76.4 years ed + 6.29 age + 319
Male

Predictor
Constant
years ed
age
Male

S = 500.391




Coef
-720.28
76.432
6.2874
318.522
SE Coef
18.35
1.089
0.2021
4.625
R-Sq = 20.8%
T
-39.25
70.16
31.11
68.87
P
0.000
0.000
0.000
0.000
R-Sq(adj) = 20.8%
Male & Female Contain the Same Information

Correlations: Male, Female

Pearson correlation of Male and Female = -1.000
P-Value = *
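A sketch, with made-up data, of why perfect multicollinearity forces the software's hand: with a constant, Male, and Female all included, the design matrix is rank-deficient and the normal equations have no unique solution.

import numpy as np

rng = np.random.default_rng(2)
n = 1000

male = rng.integers(0, 2, size=n)     # 1 = male, 0 = female
female = 1 - male                      # perfectly collinear with Male (and the constant)

# Design matrix with a constant, Male, and Female
X = np.column_stack([np.ones(n), male, female])

# The matrix is rank-deficient: constant = Male + Female exactly
print("columns:", X.shape[1], " rank:", np.linalg.matrix_rank(X))   # 3 columns, rank 2

# X'X cannot be inverted, so OLS coefficients are not uniquely defined;
# packages like MINITAB respond by removing one of the offending variables, as on the slide.
print("det(X'X) =", np.linalg.det(X.T @ X))   # (numerically) zero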

What If Several Variables Contain the Same Information?

Regression Analysis: weekearn versus age, years ed, Female, NE, MW, S, W

* W is highly correlated with other X variables
* W has been removed from the equation.

The regression equation is
weekearn = - 392 + 6.25 age + 75.9 years ed - 318 Female + 47.7 NE - 18.2 MW - 20.3 S

47576 cases used, 7582 cases contain missing values

Predictor       Coef  SE Coef       T      P
Constant     -392.10    19.21  -20.42  0.000
age           6.2532   0.2019   30.98  0.000
years ed      75.895    1.089   69.67  0.000
Female      -318.406    4.619  -68.93  0.000
NE            47.658    6.768    7.04  0.000
MW           -18.155    6.594   -2.75  0.006
S            -20.323    6.317   -3.22  0.001

S = 499.701   R-Sq = 21.0%   R-Sq(adj) = 21.0%
What Are the Regional Dummies Correlated With?

Descriptive Statistics: NE, MW, S, W

Variable      N   N*     Mean  SE Mean    StDev  Minimum       Q1
NE        55158    0  0.22310  0.00177  0.41633  0.00000  0.00000
MW        55158    0  0.23873  0.00182  0.42631  0.00000  0.00000
S         55158    0  0.29211  0.00194  0.45474  0.00000  0.00000
W         55158    0  0.24606  0.00183  0.43072  0.00000  0.00000

(Note that the means sum to 1.000: every observation falls in exactly one region.)
Imperfect MultiCollinearity
- Two or more variables contain similar, but not identical, information.
Log Wage Regression

      Source |       SS       df       MS              Number of obs = 156130
-------------+------------------------------           F( 11,156118) = 4227.42
       Model |  11630.4798     11  1057.31635          Prob > F      = 0.0000
    Residual |  39046.5066 156118  .250108934          R-squared     = 0.2295
-------------+------------------------------           Adj R-squared = 0.2294
       Total |  50676.9864 156129  .324584071          Root MSE      = .50011

------------------------------------------------------------------------------
     lnwage3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0712402   .0005528   128.87   0.000     .0701567    .0723237
        age2 |  -.0007535   6.58e-06  -114.54   0.000    -.0007664   -.0007406
      female |  -.1999096   .0025452   -78.54   0.000    -.2048982   -.1949211
     married |   .0947973   .0028481    33.28   0.000      .089215    .1003796
       black |  -.1314511   .0043814   -30.00   0.000    -.1400385   -.1228637
       other |  -.0063689   .0057833    -1.10   0.271    -.0177041    .0049663
          NE |   .0328108   .0038223     8.58   0.000     .0253191    .0403024
     Midwest |    .007487   .0036482     2.05   0.040     .0003367    .0146373
       South |  -.0204817   .0035696    -5.74   0.000     -.027478   -.0134854
    city1mil |   .1440377   .0026054    55.28   0.000     .1389312    .1491443
      union2 |   .1358151   .0037783    35.95   0.000     .1284097    .1432205
       _cons |   .9784856   .0107005    91.44   0.000     .9575129     .999458
------------------------------------------------------------------------------
Switch CBC for Union

      Source |       SS       df       MS              Number of obs = 156130
-------------+------------------------------           F( 11,156118) = 4242.43
       Model |  11662.2696     11  1060.20633          Prob > F      = 0.0000
    Residual |  39014.7168 156118  .249905307          R-squared     = 0.2301
-------------+------------------------------           Adj R-squared = 0.2301
       Total |  50676.9864 156129  .324584071          Root MSE      = .49991

------------------------------------------------------------------------------
     lnwage3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0710808   .0005528   128.59   0.000     .0699974    .0721642
        age2 |   -.000752   6.58e-06  -114.34   0.000    -.0007649   -.0007391
      female |  -.2003086   .0025431   -78.77   0.000     -.205293   -.1953242
     married |   .0946468    .002847    33.24   0.000     .0890668    .1002269
       black |  -.1321203   .0043799   -30.17   0.000    -.1407048   -.1235358
       other |  -.0061873    .005781    -1.07   0.284    -.0175179    .0051434
          NE |    .033546   .0038197     8.78   0.000     .0260595    .0410324
     Midwest |   .0079032   .0036465     2.17   0.030      .000756    .0150503
       South |  -.0200437    .003568    -5.62   0.000    -.0270369   -.0130504
    city1mil |   .1442921   .0026043    55.41   0.000     .1391878    .1493965
        cbc2 |   .1363582   .0036181    37.69   0.000     .1292668    .1434495
       _cons |   .9799436   .0106968    91.61   0.000     .9589782    1.000909
------------------------------------------------------------------------------
Use Union & CBC

      Source |       SS       df       MS              Number of obs = 156130
-------------+------------------------------           F( 12,156117) = 3889.14
       Model |  11662.8996     12  971.908303          Prob > F      = 0.0000
    Residual |  39014.0867 156117  .249902872          R-squared     = 0.2301
-------------+------------------------------           Adj R-squared = 0.2301
       Total |  50676.9864 156129  .324584071          Root MSE      = .4999

------------------------------------------------------------------------------
     lnwage3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0710741   .0005528   128.58   0.000     .0699907    .0721575
        age2 |  -.0007519   6.58e-06  -114.32   0.000    -.0007648    -.000739
      female |  -.2001837   .0025443   -78.68   0.000    -.2051704   -.1951969
     married |   .0946413    .002847    33.24   0.000     .0890612    .1002213
       black |  -.1321795     .00438   -30.18   0.000    -.1407643   -.1235947
       other |  -.0061938    .005781    -1.07   0.284    -.0175244    .0051367
          NE |   .0333811   .0038211     8.74   0.000     .0258919    .0408703
     Midwest |   .0078341   .0036468     2.15   0.032     .0006864    .0149817
       South |  -.0199589   .0035684    -5.59   0.000    -.0269529   -.0129649
    city1mil |   .1442482   .0026044    55.39   0.000     .1391436    .1493528
      union2 |   .0175444   .0110493     1.59   0.112    -.0041121    .0392008
        cbc2 |   .1205632   .0105851    11.39   0.000     .0998166    .1413098
       _cons |   .9800641    .010697    91.62   0.000     .9590982     1.00103
------------------------------------------------------------------------------
Consequences of MultiCollinearity
- Estimates remain unbiased.
- Variances and standard errors increase:
  - Computed t-scores fall.
  - Estimates will be very sensitive to specification.
- Overall fit of the model (r-square) will be unaffected.
- Predictions are also unaffected.
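A small simulation, not from the slides, that illustrates these consequences with two regressors carrying nearly the same information (statsmodels is assumed):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)     # x2 nearly duplicates x1
y = 1.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)

collinear = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
alone = sm.OLS(y, sm.add_constant(x1)).fit()

print("x1 and x2:  coefs =", np.round(collinear.params, 2),
      " SEs =", np.round(collinear.bse, 2), " R-sq =", round(collinear.rsquared, 3))
print("x1 alone:   coefs =", np.round(alone.params, 2),
      " SEs =", np.round(alone.bse, 2), " R-sq =", round(alone.rsquared, 3))
# Typical pattern: the individual standard errors balloon and the two coefficients
# bounce around when both variables are included, but R-squared barely moves.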
What Is the Issue?
- Where there is multicollinearity, we need to be careful about interpreting results.
  - It can be misleading about the effect of individual variables.
Detecting Collinearity
- High correlation between variables.
  - Issue: multiple variables may be collectively collinear (the region example).
- Variance Inflation Factor:
  - Regress each explanatory variable on all other explanatory variables.
  - Calculate VIF_i = 1 / (1 - R_i^2)
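A sketch of the VIF computation in Python; statsmodels provides a helper that performs the regress-each-X-on-the-rest step. The data file and column names are placeholders.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical CPS extract with the regressors used on the slides
cps = pd.read_csv("cps_extract.csv").dropna()
X = sm.add_constant(cps[["age", "years_ed", "Female", "NE", "MW", "S"]])

# VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing column i on the rest
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    print(f"{name:10s} VIF = {variance_inflation_factor(X.values, i):.2f}")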
How Do We Calculate the VIF?

Regression Analysis: age versus years ed, Female, NE, MW, S, W

* W is highly correlated with other X variables
* W has been removed from the equation.

The regression equation is
age = 35.8 + 0.480 years ed - 1.59 Female + 0.098 NE - 0.617 MW - 0.204 S

Predictor       Coef  SE Coef       T      P
Constant     35.7977   0.3712   96.43  0.000
years ed     0.47978  0.02241   21.41  0.000
Female      -1.59360  0.09896  -16.10  0.000
NE            0.0979   0.1443    0.68  0.498
MW           -0.6174   0.1416   -4.36  0.000
S            -0.2044   0.1349   -1.52  0.130

S = 11.5764   R-Sq = 1.5%   R-Sq(adj) = 1.5%

With R-Sq = 1.5%, the VIF for age is 1/(1 - 0.015) ≈ 1.02: age is not collinear with the other explanatory variables.
It's a Different Story with Regional Variables

Regression Analysis: NE versus age, years ed, Female, MW, S, W

The regression equation is
NE = 1.00 + 0.000000 age + 0.000000 years ed + 0.000000 Female - 1.00 MW - 1.00 S - 1.00 W

Predictor         Coef     SE Coef   T   P
Constant       1.00000     0.00000   *   *
age         0.00000000  0.00000000   *   *
years ed    0.00000000  0.00000000   *   *
Female      0.00000000  0.00000000   *   *
MW            -1.00000     0.00000   *   *
S             -1.00000     0.00000   *   *
W             -1.00000     0.00000   *   *

S = 0   R-Sq = 100.0%   R-Sq(adj) = 100.0%
CBC Has A High VIF

. reg cbc2 age age2 female married black other NE Midwest South city1mil union2

      Source |       SS       df       MS              Number of obs = 161792
-------------+------------------------------           F( 11,161780) =      .
       Model |  18165.9762     11  1651.45238          Prob > F      = 0.0000
    Residual |  2301.31742 161780  .014224981          R-squared     = 0.8876
-------------+------------------------------           Adj R-squared = 0.8876
       Total |  20467.2936 161791  .126504525          Root MSE      = .11927

------------------------------------------------------------------------------
        cbc2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0013903   .0001288    10.80   0.000     .0011379    .0016426
        age2 |  -.0000133   1.53e-06    -8.72   0.000    -.0000163   -.0000103
      female |   .0025409   .0005963     4.26   0.000     .0013722    .0037096
     married |   .0013089   .0006676     1.96   0.050     4.52e-07    .0026174
       black |   .0063441    .001032     6.15   0.000     .0043214    .0083668
       other |  -.0016395   .0013597    -1.21   0.228    -.0043046    .0010255
          NE |  -.0043777    .000895    -4.89   0.000    -.0061319   -.0026234
     Midwest |  -.0027157   .0008563    -3.17   0.002    -.0043941   -.0010374
       South |  -.0041338   .0008356    -4.95   0.000    -.0057716   -.0024961
    city1mil |  -.0018596   .0006102    -3.05   0.002    -.0030555   -.0006636
      union2 |   .9811512   .0008888  1103.92   0.000     .9794092    .9828932
       _cons |   -.013585   .0025048    -5.42   0.000    -.0184943   -.0086757
------------------------------------------------------------------------------

With R-squared = 0.8876, the VIF for cbc2 is 1/(1 - 0.8876) ≈ 8.9.
What To Do About MultiCollinearity
- Do nothing.
- Get more data.
  - We had 156,000 observations for the wage regressions.
- Drop the redundant variable.
  - Care is needed in interpretation.
Compare Specification Issues

                     Omitted variable        Extraneous variable    MultiCollinearity
Added variable       Right-signed &          Coefficient close      Right- or wrong-
                     large in magnitude      to zero                signed
Its significance     Highly significant      Non-significant        Weak or n.s.
Other coefficients   Change sign             Little change          Possibly change sign
Their significance   Remain significant      Little change          Become weak or n.s.
R-square             Increases a lot         Little change          Little change
New sample           Little difference       Little difference      Unstable estimates