Chapter 14: Omitted Explanatory Variables, Multicollinearity,
and Irrelevant Explanatory Variables
Chapter 14 Outline
• Review
o Unbiased Estimation Procedures
▪ Estimates and Random Variables
▪ Mean of the Estimate’s Probability Distribution
▪ Variance of the Estimate’s Probability Distribution
o Correlated and Independent (Uncorrelated) Variables
▪ Scatter Diagrams
▪ Correlation Coefficient
• Omitted Explanatory Variables
o A Puzzle: Baseball Attendance
o Goal of Multiple Regression Analysis
o Omitted Explanatory Variables and Bias
o Resolving the Baseball Attendance Puzzle
o Omitted Variable Summary
• Multicollinearity
o Perfectly Correlated Explanatory Variables
o Highly Correlated Explanatory Variables
o Earmarks of Multicollinearity
• Irrelevant Explanatory Variables
Chapter 14 Prep Questions
1. Review the goal of multiple regression analysis. In words, explain what
multiple regression analysis attempts to do.
2. Recall that the presence of a random variable brings forth both bad news and
good news.
a. What is the bad news?
b. What is the good news?
3. Consider an estimate’s probability distribution. Review the importance of its
mean and variance:
a. Why is the mean of the probability distribution important? Explain.
b. Why is the variance of the probability distribution important? Explain.
4. Suppose that two variables are positively correlated.
a. In words, what does this mean?
b. What type of graph do we use to illustrate their correlation? What does
the graph look like?
c. What can we say about their correlation coefficient?
d. When two variables are perfectly positively correlated, what will their
correlation coefficient equal?
5. Suppose that two variables are independent (uncorrelated).
a. In words, what does this mean?
b. What type of graph do we use to illustrate their correlation? What does
the graph look like?
c. What can we say about their correlation coefficient?
Baseball Data: Panel data of baseball statistics for the 588 American League
games played during the summer of 1996.

Attendancet         Paid attendance for game t
DateDayt            Day of game t
DateMontht          Month of game t
DateYeart           Year of game t
DayOfWeekt          Day of the week for game t (Sunday=0, Monday=1, etc.)
DHt                 Designated hitter for game t (1 if DH permitted; 0 otherwise)
HomeGamesBehindt    Games behind of the home team before game t
HomeIncomet         Per capita income in home team's city for game t
HomeLossest         Season losses of the home team before game t
HomeNetWinst        Net wins (wins less losses) of the home team before game t
HomeSalaryt         Player salaries of the home team for game t (millions of dollars)
HomeWinst           Season wins of the home team before game t
PriceTickett        Average price of tickets sold for game t’s home team (dollars)
VisitGamesBehindt   Games behind of the visiting team before game t
VisitLossest        Season losses of the visiting team before game t
VisitNetWinst       Net wins (wins less losses) of the visiting team before game t
VisitSalaryt        Player salaries of the visiting team for game t (millions of dollars)
VisitWinst          Season wins of the visiting team before game t

6. Focus on the baseball data.
a. Consider the following simple model:
Attendancet = βConst + βPricePriceTickett + et
Attendance depends only on the ticket price.
1) What does the economist’s downward sloping demand curve
theory suggest about the sign of the PriceTicket coefficient,
βPrice?
2) Use the ordinary least squares (OLS) estimation procedure to
estimate the model’s parameters. Interpret the regression
results.
[Link to MIT-ALSummer-1996.wf1 goes here.]
b. Consider a second model:
Attendancet = βConst + βPricePriceTickett + βHomeSalaryHomeSalaryt +
et
Attendance depends not only on the ticket price, but also on the salary
of the home team.
1) Devise a theory explaining the effect that home team salary
should have on attendance. What does your theory suggest
about the sign of the HomeSalary coefficient, βHomeSalary?
2) Use the ordinary least squares (OLS) estimation procedure to
estimate both of the model’s coefficients. Interpret the
regression results.
c. What do you observe about the estimates for the PriceTicket
coefficients in the two models?
7. Again, focus on the baseball data and consider the following two variables:
Attendancet Paid attendance at the game t
PriceTickett Average ticket price in terms of dollars for game t
You can access these data by clicking the following link:
[Link to MIT-ALSummer-1996.wf1 goes here.]
Generate a new variable, PriceCents, to express the price in terms of cents
rather than dollars:
PriceCents = 100×PriceTicket
a. What is the correlation coefficient for PriceTicket and PriceCents?
b. Consider the following model:
Attendancet = βConst + βPriceTicketPriceTickett + βPriceCentsPriceCentst
+ et
Run the regression to estimate the parameters of this model. You will
get an “unusual” result. Explain this by considering what multiple
regression analysis attempts to do.
8. The following are excerpts from an article appearing in the New York Times on
September 1, 2008:
Doubts Grow Over Flu Vaccine in Elderly
By Brenda Goodman
The influenza vaccine, which has been strongly recommended for people
over 65 for more than four decades, is losing its reputation as an effective
way to ward off the virus in the elderly.
A growing number of immunologists and epidemiologists say the vaccine
probably does not work very well for people over 70 …
The latest blow was a study in The Lancet last month that called into
question much of the statistical evidence for the vaccine’s effectiveness.
The study found that people who were healthy and conscientious about
staying well were the most likely to get an annual flu shot. … [others] are
less likely to get to their doctor’s office or a clinic to receive the vaccine.
Dr. David K. Shay of the Centers for Disease Control and Prevention, a
co-author of a commentary that accompanied Dr. Jackson’s study, agreed
that these measures of health … “were not incorporated into early
estimations of the vaccine’s effectiveness” and could well have skewed
the findings.
a. Does being healthy and conscientious about staying well increase or
decrease the chances of getting flu?
b. According to the article, are those who are healthy and conscientious
about staying well more or less likely to get a flu shot?
c. The article alleges that previous studies did not incorporate health and
conscientiousness in judging the effectiveness of flu shots. If the allegation
is true, have previous studies overestimated or underestimated the
effectiveness of flu shots?
d. Suppose that you were the director of your community’s health
department. You are considering whether or not to subsidize flu
vaccines for the elderly. Would you find the previous studies useful?
That is, would a study that did not incorporate health and
conscientiousness in judging the effectiveness of flu shots help you decide
if your department should spend your limited budget to subsidize flu
vaccines? Explain.
Review
Unbiased Estimation Procedures
Estimates and Random Variables
Estimates are random variables. Consequently, there is both good news and bad
news. Before the data are collected and the parameters are estimated:
• Bad news: We cannot determine the numerical values of the estimates
with certainty (even if we knew the actual values).
• Good news: On the other hand, we can often describe the probability
distribution of the estimate, telling us how likely it is for the estimate to
equal each of its possible numerical values.
Mean (Center) of the Estimate’s Probability Distribution
An unbiased estimation procedure does not systematically underestimate or
overestimate the actual value. The mean (center) of the estimate’s probability
distribution equals the actual value. Applying the relative frequency interpretation
of probability, when the experiment is repeated many, many times, the average of
the numerical values of the estimates equals the actual value.
[Figure: probability distribution of the estimate, centered at the actual value]
Figure 14.1: Probability Distribution of an Estimate – Unbiased Estimation
Procedure
If the distribution is symmetric, we can provide an interpretation that is
perhaps even more intuitive. When the experiment is repeated many, many
times:
• half the time the estimate is greater than the actual value;
• half the time the estimate is less than the actual value.
Accordingly, we can apply the relative frequency interpretation of probability. In
one repetition, the chances that the estimate will be greater than the actual value
equal the chances that the estimate will be less.
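This relative frequency interpretation can be illustrated with a short simulation. The sketch below assumes a hypothetical model, y = 5 + 2x + e, so that 2 plays the role of the actual value; none of these numbers come from the chapter's data.

```python
# Monte Carlo sketch of an unbiased estimation procedure for the
# hypothetical model y = 5 + 2x + e (the actual slope value is 2).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 30)                  # explanatory variable, fixed across repetitions
estimates = []
for _ in range(5000):                       # many, many repetitions
    e = rng.normal(0, 4, size=x.size)       # new error terms each repetition
    y = 5 + 2 * x + e
    slope, intercept = np.polyfit(x, y, 1)  # OLS slope and intercept estimates
    estimates.append(slope)

estimates = np.array(estimates)
print(round(estimates.mean(), 1))           # average of the estimates: about 2.0
print(round((estimates > 2).mean(), 1))     # about half the estimates lie above 2
```

Each individual estimate misses the actual value, but the average settles at 2 and roughly half the estimates land above it, half below: the procedure is unbiased.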
Variance (Spread) of the Estimate’s Probability Distribution
Figure 14.2: Probability Distribution of an Estimate – Importance of Variance
When the estimation procedure is unbiased, the distribution variance (spread)
indicates the estimate’s reliability, the likelihood that the numerical value of the
estimate will be close to the actual value.
Correlated and Independent (Uncorrelated) Variables
Two variables are
• correlated whenever the value of one variable does help us predict the
value of the other;
• independent (uncorrelated) whenever the value of one variable does not
help us predict the value of the other.
Scatter Diagrams
Figure 14.3: Scatter Diagrams, Correlation, and Independence
The Dow Jones and Nasdaq growth rates are positively correlated. Most of the
scatter diagram points lie in the first and third quadrants. When the Dow Jones
growth rate is high, the Nasdaq growth rate is usually high also. Similarly, when
the Dow Jones growth rate is low, the Nasdaq growth rate is usually low also.
Knowing one growth rate helps us predict the other. On the other hand, Amherst
precipitation and the Nasdaq growth rate are independent, uncorrelated. The
scatter diagram points are spread rather evenly across the graph. Knowing the
Nasdaq growth rate does not help us predict Amherst precipitation and vice versa.
Correlation Coefficient
The correlation coefficient indicates the degree to which two variables are
correlated; the correlation coefficient ranges from −1 to +1:
• = 0: Independent (uncorrelated). Knowing the value of one variable does
not help us predict the value of the other.
• > 0: Positive correlation. Typically, when the value of one variable is
high, the value of the other variable will be high.
• < 0: Negative correlation. Typically, when the value of one variable is
high, the value of the other variable will be low.
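These cases can be checked numerically with NumPy's corrcoef. The series below are simulated stand-ins for the variables in Figure 14.3, not the actual Dow Jones, Nasdaq, or Amherst precipitation data:

```python
# Sketch of the correlation coefficient for positively correlated,
# independent, and perfectly correlated pairs (simulated data).
import numpy as np

rng = np.random.default_rng(1)
dow = rng.normal(0, 1, 500)                    # stand-in for the Dow Jones growth rate
nasdaq = 0.8 * dow + rng.normal(0, 0.5, 500)   # moves with dow: positively correlated
rain = rng.normal(0, 1, 500)                   # generated independently of dow

r_pos = np.corrcoef(dow, nasdaq)[0, 1]         # well above 0
r_zero = np.corrcoef(dow, rain)[0, 1]          # near 0
r_perfect = np.corrcoef(dow, 100 * dow)[0, 1]  # a rescaled copy: equals +1
print(round(r_pos, 2), round(r_zero, 2), round(r_perfect, 2))
```

The last line previews a point used later in the chapter: rescaling a variable (dollars to cents, say) produces a correlation coefficient of exactly +1.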
Omitted Explanatory Variables
We shall consider baseball attendance data to study the omitted variable
phenomenon.
Project: Assess the determinants of baseball attendance.
Baseball Data: Panel data of baseball statistics for the 588 American League
games played during the summer of 1996.

Attendancet         Paid attendance for game t
DateDayt            Day of game t
DateMontht          Month of game t
DateYeart           Year of game t
DayOfWeekt          Day of the week for game t (Sunday=0, Monday=1, etc.)
DHt                 Designated hitter for game t (1 if DH permitted; 0 otherwise)
HomeGamesBehindt    Games behind of the home team before game t
HomeIncomet         Per capita income in home team's city for game t
HomeLossest         Season losses of the home team before game t
HomeNetWinst        Net wins (wins less losses) of the home team before game t
HomeSalaryt         Player salaries of the home team for game t (millions of dollars)
HomeWinst           Season wins of the home team before game t
PriceTickett        Average price of tickets sold for game t’s home team (dollars)
VisitGamesBehindt   Games behind of the visiting team before game t
VisitLossest        Season losses of the visiting team before game t
VisitNetWinst       Net wins (wins less losses) of the visiting team before game t
VisitSalaryt        Player salaries of the visiting team for game t (millions of dollars)
VisitWinst          Season wins of the visiting team before game t
A Puzzle: Baseball Attendance
Let us begin our analysis by focusing on the price of tickets. Consider the
following two models that attempt to explain game attendance:
Model 1: Attendance depends on ticket price only.
The first model has a single explanatory variable, ticket price, PriceTicket:
Attendancet = βConst + βPricePriceTickett + et
Downward Sloping Demand Theory: This model is based on the economist’s
downward sloping demand theory. An increase in the price of a good decreases
the quantity demanded. Higher ticket prices should reduce attendance; hence,
the PriceTicket coefficient should be negative:
βPrice < 0
We shall use the ordinary least squares (OLS) estimation procedure to estimate
the model’s parameters:
[Link to MIT-ALSummer-1996.wf1 goes here.]
Ordinary Least Squares (OLS)
Dependent Variable: Attendance
Explanatory Variable(s)   Estimate   SE         t-Statistic   Prob
PriceTicket               1896.611   142.7238   13.28868      0.0000
Const                     3688.911   1839.117   2.005805      0.0453
Number of Observations: 585
Estimated Equation: EstAttendance = 3,688 + 1,897PriceTicket
Interpretation of Estimates:
bPriceTicket = 1,897. We estimate that a $1.00 increase in the price of tickets
increases attendance by 1,897 per game.
Table 14.1: Baseball Attendance Regression Results – Ticket Price Only
The estimated coefficient for the ticket price is positive, suggesting that higher
prices lead to an increase in quantity demanded. This contradicts the downward
sloping demand theory, does it not?
Model 2: Attendance depends on ticket price and salary of home team.
In the second model, we include not only the price of tickets, PriceTicket, as an
explanatory variable, but also the salary of the home team, HomeSalary:
Attendancet = βConst + βPricePriceTickett + βHomeSalaryHomeSalaryt + et
We can justify the salary explanatory variable on the grounds that fans like to
watch good players. We shall call this the star theory. Presumably, a high salary
team has better players, more stars, on its roster and accordingly will draw more
fans.
Star Theory: Teams with higher salaries will have better players which will
increase attendance. The HomeSalary coefficient should be positive:
βHomeSalary > 0
Now, use the ordinary least squares (OLS) estimation procedure to estimate the
parameters.
Ordinary Least Squares (OLS)
Dependent Variable: Attendance
Explanatory Variable(s)   Estimate    SE         t-Statistic   Prob
PriceTicket               −590.7836   184.7231   −3.198211     0.0015
HomeSalary                783.0394    45.23955   17.30874      0.0000
Const                     9246.429    1529.658   6.044767      0.0000
Number of Observations: 585
Estimated Equation: EstAttendance = 9,246 − 591PriceTicket
+ 783HomeSalary
Interpretation of Estimates:
bPriceTicket = −591. We estimate that a $1.00 increase in the price of tickets
decreases attendance by 591 per game.
bHomeSalary = 783. We estimate that a $1 million increase in the home team
salary increases attendance by 783 per game.
Table 14.2: Baseball Attendance Regression Results – Ticket Price and Home
Team Salary
These coefficient estimates lend support to our theories.
The two models produce very different results concerning the effect of the
ticket price on attendance. More specifically, the coefficient estimate for ticket
price changes drastically from 1,897 to −591 when we add home team salary as
an explanatory variable. This is a disquieting puzzle. We shall resolve this puzzle
by reviewing the goal of multiple regression analysis and then explaining when
omitting an explanatory variable will prevent us from achieving the goal.
Goal of Multiple Regression Analysis
Multiple regression analysis attempts to sort out the individual effect of each
explanatory variable. The estimate of an explanatory variable’s coefficient allows
us to assess the effect that an individual explanatory variable itself has on the
dependent variable. An explanatory variable’s coefficient estimate estimates the
change in the dependent variable resulting from a change in that particular
explanatory variable while all other explanatory variables remain constant.
In Model 1 we estimate that a $1.00 increase in the ticket price increases
attendance by nearly 2,000 per game, whereas in Model 2 we estimate that a $1.00
increase decreases attendance by about 600 per game. The two models suggest
that the individual effect of the ticket price is very different. The omitted variable
phenomenon allows us to resolve this puzzle.
Omitted Explanatory Variables and Bias
Claim: Omitting an explanatory variable from a regression will bias the
estimation procedure whenever two conditions are met. Bias results if the omitted
explanatory variable
• influences the dependent variable;
• is correlated with an included explanatory variable.
When these two conditions are met, the coefficient estimate of the included
explanatory variable is a composite of two effects, the influence that the
• included explanatory variable itself has on the dependent variable (direct
effect);
• omitted explanatory variable has on the dependent variable because the
included explanatory variable also acts as a proxy for the omitted
explanatory variable (proxy effect).
Since the goal of multiple regression analysis is to sort out the individual effect of
each explanatory variable we want to capture only the direct effect.
Econometrics Lab 14.1: Omitted Variable Proxy Effect
We can now use the Econometrics Lab to justify our claims concerning omitted
explanatory variables. The following regression model including two explanatory
variables is used:
Model: yt = βConst + βx1x1t + βx2x2t + et
[Link to MIT-Lab 14.1 goes here.]
Figure 14.4: Omitted Variable Simulation
The simulation provides us with two options; we can either include both
explanatory variables in the regression, “Both Xs,” or just one, “Only X1.” By
default the “Only X1” option is selected; consequently, the second explanatory
variable is omitted. That is, x1t is the included explanatory variable and x2t is the
omitted explanatory variable. For simplicity, assume that x1’s coefficient, βx1, is
positive. We shall consider three cases to illustrate when bias does and does not
result:
• Case 1. The coefficient of the omitted explanatory variable is positive and
the two explanatory variables are independent (uncorrelated).
• Case 2. The coefficient of the omitted explanatory variable equals zero
and the two explanatory variables are positively correlated.
• Case 3. The coefficient of the omitted explanatory variable is positive and
the two explanatory variables are positively correlated.
We shall now show that only in the last case does bias result, because only in the
last case is the proxy effect present.
Case 1: The coefficient of the omitted explanatory variable is positive and the two
explanatory variables are independent (uncorrelated).
Will bias result in this case? Since the two explanatory variables are independent
(uncorrelated), an increase in the included explanatory variable, x1t, typically will
not affect the omitted explanatory variable, x2t. Consequently, the included
explanatory variable, x1t, will not act as a proxy for the omitted explanatory
variable, x2t. Bias should not result.
Typically,
Included variable       Independence       Omitted variable
x1t up                  ⎯⎯⎯⎯⎯⎯→       x2t unaffected
↓ βx1 > 0                                  ↓ βx2 > 0
yt up                                      yt unaffected
↓                                          ↓
Direct Effect                              No Proxy Effect
We shall use our lab to confirm this logic. By default, the actual
coefficient for the included explanatory variable, x1t, equals 2 and the actual
coefficient for the omitted explanatory variable, x2t, is nonzero; it equals 5. Their
correlation coefficient, Corr X1&X2, equals .00; hence, the two explanatory
variables are independent (uncorrelated). Be certain that the Pause checkbox is
cleared. Click Start and after many, many repetitions, click Stop. Table 14.3
reports that the average value of the coefficient estimates for the included
explanatory variable equals its actual value. Both equal 2.0. The ordinary least
squares (OLS) estimation procedure is unbiased.
Actual   Actual   Corr   Mean (Average)       Percent of Coef1 Estimates
Coef1    Coef2    Coef   of Coef1 Estimates   Below Actual Value   Above Actual Value
2        5        .00    ≈2.0                 ≈50                  ≈50
Table 14.3: Omitted Variables Simulation Results
The ordinary least squares (OLS) estimation procedure captures the individual
influence that the included explanatory variable itself has on the dependent
variable. This is precisely the effect that we wish to capture. The ordinary least
squares (OLS) estimation procedure is unbiased; it is doing what we want it to do.
Case 2: The coefficient of the omitted explanatory variable equals zero and the
two explanatory variables are positively correlated.
In the second case, the two explanatory variables are positively correlated; when
the included explanatory variable, x1t, increases, the omitted explanatory variable,
x2t, will typically increase also. But the actual coefficient of the omitted
explanatory variable, βx2, equals 0; hence, the dependent variable, yt, is unaffected
by the increase in x2t. There is no proxy effect because the omitted variable, x2t,
does not affect the dependent variable; hence, bias should not result.
Typically,
Included variable       Positive Correlation       Omitted variable
x1t up                  ⎯⎯⎯⎯⎯⎯→               x2t up
↓ βx1 > 0                                          ↓ βx2 = 0
yt up                                              yt unaffected
↓                                                  ↓
Direct Effect                                      No Proxy Effect
To confirm our logic with the simulation, be certain that the actual
coefficient for the omitted explanatory variable equals 0 and the correlation
coefficient equals .30. Click Start and then after many, many repetitions, click
Stop. Table 14.4 reports that the average value of the coefficient estimates for the
included explanatory variable equals its actual value. Both equal 2.0. The ordinary
least squares (OLS) estimation procedure is unbiased.
Actual   Actual   Corr   Mean (Average)       Percent of Coef1 Estimates
Coef1    Coef2    Coef   of Coef1 Estimates   Below Actual Value   Above Actual Value
2        5        .00    ≈2.0                 ≈50                  ≈50
2        0        .30    ≈2.0                 ≈50                  ≈50
Table 14.4: Omitted Variables Simulation Results
Again, the ordinary least squares (OLS) estimation procedure captures the
influence that the included explanatory variable itself has on the dependent
variable. Again, there is no proxy effect and all is well.
Case 3: The coefficient of the omitted explanatory variable is positive and the two
explanatory variables are positively correlated.
As with Case 2, the two explanatory variables are positively correlated; when the
included explanatory variable, x1t, increases, the omitted explanatory variable, x2t,
will typically increase also. But now, the actual coefficient of the omitted
explanatory variable, βx2, is no longer 0; it is positive; hence, an increase in the
omitted explanatory variable, x2t, increases the dependent variable. In addition
to having a direct effect on the dependent variable, the included explanatory
variable, x1t, also acts as a proxy for the omitted explanatory variable, x2t. There
is a proxy effect.
Typically,
Included variable       Positive Correlation       Omitted variable
x1t up                  ⎯⎯⎯⎯⎯⎯→               x2t up
↓ βx1 > 0                                          ↓ βx2 > 0
yt up                                              yt up
↓                                                  ↓
Direct Effect                                      Proxy Effect
In the simulation, the actual coefficient of the omitted explanatory variable,
βx2, once again equals 5. The two explanatory variables are still positively
correlated; the correlation coefficient equals .30. Click Start and then after many,
many repetitions, click Stop. Table 14.5 reports that the average value of the
coefficient estimates for the included explanatory variable, 3.5, exceeds its actual
value, 2.0. The ordinary least squares (OLS) estimation procedure is biased
upward.
Actual   Actual   Corr   Mean (Average)       Percent of Coef1 Estimates
Coef1    Coef2    Coef   of Coef1 Estimates   Below Actual Value   Above Actual Value
2        5        .00    ≈2.0                 ≈50                  ≈50
2        0        .30    ≈2.0                 ≈50                  ≈50
2        5        .30    ≈3.5                 ≈28                  ≈72
Table 14.5: Omitted Variables Simulation Results
Now we have a problem. The ordinary least squares (OLS) estimation procedure
overstates the influence of the included explanatory variable, the effect that the
included explanatory variable itself has on the dependent variable.
Let us now take a brief aside. Case 3 provides us with the opportunity to
illustrate what bias does and does not mean.
[Figure: probability distribution of the Coef1 estimate, bCoef1, centered above the
actual value of 2]
Figure 14.5: Probability Distribution of an Estimate – Upward Bias
• What bias does mean: Bias means that the estimation procedure
systematically overestimates or underestimates the actual value. In this
case, upward bias is present. The average of the estimates is greater than
the actual value after many, many repetitions.
• What bias does not mean: Bias does not mean that the value of the
estimate in a single repetition must be less than the actual value in the case
of downward bias or greater than the actual value in the case of upward
bias. Focus on the last simulation. The ordinary least squares (OLS)
estimation procedure is biased upward as a consequence of the proxy
effect. Despite the upward bias, however, the estimate of the included
explanatory variable is less than the actual value in about 28 percent of the
repetitions.
Upward bias does not guarantee that in any one repetition the estimate will be
greater than the actual value. It just means that it will be greater “on average.” If
the probability distribution is symmetric, the chances of the estimate being greater
than the actual value exceed the chances of being less.
Now, we return to our three omitted variable cases by summarizing them:

Case   Does the omitted variable           Is the omitted variable correlated   Estimation procedure for
       influence the dependent variable?   with an included variable?           the included variable is
1      Yes                                 No                                   Unbiased
2      No                                  Yes                                  Unbiased
3      Yes                                 Yes                                  Biased
Table 14.6: Omitted Variables Simulation Summary
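The three cases can also be reproduced outside the lab with a short Monte Carlo sketch. The constant (10), error standard deviation (5), and sample size (50) below are illustrative assumptions; the values of βx1 = 2, βx2, and the correlation coefficient mirror the lab's settings:

```python
# Monte Carlo sketch of the three omitted-variable cases.  In every
# repetition y depends on x1 and (possibly) x2, but y is regressed on
# x1 alone, so x2 is the omitted explanatory variable.
import numpy as np

rng = np.random.default_rng(2)

def mean_slope_on_x1(beta_x2, corr, reps=3000, n=50):
    """Average OLS slope from the short regression of y on x1 only."""
    ests = []
    for _ in range(reps):
        x1 = rng.normal(0, 1, n)
        # build x2 with the requested correlation with x1
        x2 = corr * x1 + np.sqrt(1 - corr**2) * rng.normal(0, 1, n)
        y = 10 + 2 * x1 + beta_x2 * x2 + rng.normal(0, 5, n)
        slope, intercept = np.polyfit(x1, y, 1)   # x2 omitted
        ests.append(slope)
    return np.mean(ests)

case1 = mean_slope_on_x1(beta_x2=5, corr=0.0)  # influences y, uncorrelated with x1
case2 = mean_slope_on_x1(beta_x2=0, corr=0.3)  # correlated with x1, no influence on y
case3 = mean_slope_on_x1(beta_x2=5, corr=0.3)  # both conditions met: proxy effect
print(round(case1, 1), round(case2, 1), round(case3, 1))
```

In expectation the short-regression slope equals βx1 + βx2 × Cov(x1, x2)/Var(x1); here that is 2 in Cases 1 and 2 but 2 + 5 × .30 = 3.5 in Case 3, matching the ≈3.5 reported in Table 14.5.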
Econometrics Lab 14.2: Avoiding Omitted Variable Bias
Question: Is the estimation procedure biased or unbiased when both explanatory
variables are included in the regression?
[Link to MIT-Lab 14.2 goes here.]
To address this question, “Both Xs” is now selected. This means that both
explanatory variables, x1t and x2t, will be included in the regression. Both
explanatory variables affect the dependent variable and they are correlated. As we
saw in Case 3, if one of the explanatory variables is omitted, bias will result. To
see what occurs when both explanatory variables are included, click Start and
after many, many repetitions, click Stop. When both variables are included,
however, the ordinary least squares (OLS) estimation procedure is unbiased:
Actual   Actual   Correlation   Mean of Coef1
Coef1    Coef2    Parameter     Estimates
2        5        .3            ≈2.0
Table 14.7: Omitted Variables Simulation Results – No Omitted Variables
Conclusion: To avoid omitted variable bias, all relevant explanatory variables
should be included in a regression.
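This conclusion can be checked numerically. The sketch below uses hypothetical values (βConst = 10, βx1 = 2, βx2 = 5, correlation .3) and includes both explanatory variables in the regression:

```python
# Sketch: when both x1 and x2 enter the regression, OLS recovers each
# actual coefficient on average (hypothetical parameter values).
import numpy as np

rng = np.random.default_rng(3)
n, reps = 50, 3000
b1_ests, b2_ests = [], []
for _ in range(reps):
    x1 = rng.normal(0, 1, n)
    x2 = 0.3 * x1 + np.sqrt(1 - 0.3**2) * rng.normal(0, 1, n)  # correlated with x1
    y = 10 + 2 * x1 + 5 * x2 + rng.normal(0, 5, n)
    X = np.column_stack([np.ones(n), x1, x2])      # constant, x1, and x2 all included
    const, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
    b1_ests.append(b1)
    b2_ests.append(b2)

print(round(np.mean(b1_ests), 1), round(np.mean(b2_ests), 1))  # near 2.0 and 5.0
```

With x2 included there is nothing for x1 to proxy for; the proxy effect, and the bias it caused, disappear.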
Resolving the Baseball Attendance Puzzle
We begin by reviewing the baseball attendance models:
• Model 1: Attendance depends on ticket price only.
Attendancet = βConst + βPricePriceTickett + et
Estimated Equation: EstAttendance = 3,688 + 1,897PriceTicket
Interpretation: We estimate that a $1.00 increase in the price of tickets
increases attendance by 1,897 per game.
• Model 2: Attendance depends on ticket price and salary of home team.
Attendancet = βConst + βPricePriceTickett + βHomeSalaryHomeSalaryt + et
Estimated Equation: EstAttendance = 9,246 − 591PriceTicket +
783HomeSalary
Interpretation: We estimate that a
• $1.00 increase in the price of tickets decreases attendance by 591
per game.
• $1 million increase in the home team salary increases attendance
by 783 per game.
The ticket price coefficient estimate is affected dramatically by the presence of
home team salary; in Model 1 the estimate is much higher: 1,897 versus −591.
Why?
We shall now argue that when ticket price is included in the regression
and home team salary is omitted, as in Model 1, there is reason to believe that the
estimation procedure for the ticket price coefficient will be biased. We just
learned that omitted variable bias results when the following two conditions
are met; when an omitted explanatory variable:
• influences the dependent variable
and
• is correlated with an included explanatory variable.
Now focus on Model 1.
Attendancet = βConst + βPricePriceTickett + et
Model 1 omits home team salary, HomeSalaryt. Are the two omitted variable bias
conditions met?
• It certainly appears reasonable to believe that the omitted explanatory
variable, HomeSalaryt, affects the dependent variable, Attendancet. The
club owner who is paying the high salaries certainly believes so. The
owner certainly hopes that by hiring better players more fans will attend
the games. Consequently, it appears that the first condition required for
omitted variable bias is met.
• We can confirm the correlation by using statistical software to calculate
the correlation matrix:
Correlation Matrix
              PriceTicket   HomeSalary
PriceTicket   1.000000      0.777728
HomeSalary    0.777728      1.000000
Table 14.8: Ticket Price and Home Team Salary Correlation Matrix
The correlation coefficient between PriceTickett and HomeSalaryt is .78;
the variables are positively correlated. The second condition required for
omitted variable bias is met.
We have reason to suspect bias in Model 1. When the included variable,
PriceTickett, increases the omitted variable, HomeSalaryt, typically increases
also. An increase in the omitted variable, HomeSalaryt, increases the dependent
variable, Attendancet:
Typically,
Included variable       Positive Correlation       Omitted variable
PriceTickett up         ⎯⎯⎯⎯⎯⎯→               HomeSalaryt up
↓ βPrice < 0                                       ↓ βHomeSalary > 0
Attendancet down                                   Attendancet up
↓                                                  ↓
Direct Effect                                      Proxy Effect
In addition to having a direct effect on the dependent variable, the included
explanatory variable, PriceTickett, also acts as a proxy for the omitted explanatory
variable, HomeSalaryt. There is a proxy effect and upward bias results. This
provides us with an explanation of why the ticket price coefficient estimate in
Model 1 is greater than the estimate in Model 2.
Omitted Variable Summary
Omitting an explanatory variable from a regression biases the estimation
procedure whenever two conditions are met. Bias results if the omitted
explanatory variable:
• influences the dependent variable;
• is correlated with an included explanatory variable.
When these two conditions are met, the coefficient estimate of the included
explanatory variable is a composite of two effects; the coefficient estimate of the
included explanatory variable reflects the influence that the
• included explanatory variable itself has on the dependent variable (direct
effect);
• omitted explanatory variable has on the dependent variable because the
included explanatory variable also acts as a proxy for the omitted
explanatory variable (proxy effect).
The bad news is that the proxy effect leads to bias. The good news is that we can
eliminate the proxy effect and its accompanying bias by including the omitted
explanatory variable. But now, we shall learn that if two explanatory variables are
highly correlated a different problem can emerge.
Multicollinearity
The phenomenon of multicollinearity occurs when two explanatory variables are
highly correlated. Recall that multiple regression analysis attempts to sort out the
influence of each individual explanatory variable. But what happens when we
include two explanatory variables in a single regression that are perfectly
correlated? Let us see.
Perfectly Correlated Explanatory Variables
In our baseball attendance workfile, ticket prices, PriceTickett, are reported in
terms of dollars. Generate a new variable, PriceCentst, reporting ticket prices in
terms of cents rather than dollars:
PriceCentst = 100 × PriceTickett
Note that the variables PriceTickett and PriceCentst are perfectly correlated. If we
know one, we can predict the value of the other with complete accuracy. Just to
confirm this, use statistical software to calculate the correlation matrix:
Correlation Matrix
                 PriceTicket   PriceCents
PriceTicket         1.000000     1.000000
PriceCents          1.000000     1.000000
Table 14.9: EViews Dollar and Cent Ticket Price Correlation Matrix
The correlation coefficient of PriceTickett and PriceCentst equals 1.00. The
variables are indeed perfectly correlated.
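The text computes the correlation matrix in EViews; the same check can be sketched in Python with NumPy. The price values below are hypothetical, but the conclusion holds for any data: multiplying a variable by a constant never changes its correlation with the original.

```python
# Sketch: a dollar series and its cent-denominated rescaling are perfectly correlated.
import numpy as np

price_ticket = np.array([12.0, 15.5, 9.75, 22.0, 18.25])  # dollars (made-up values)
price_cents = 100 * price_ticket                          # same prices in cents

corr = np.corrcoef(price_ticket, price_cents)
print(corr)
# The off-diagonal entries equal 1.0: a linear rescaling preserves correlation.
```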
Now, run a regression with Attendance as the dependent variable and both
PriceTicket and PriceCents as explanatory variables.
Dependent variable: Attendance
Explanatory variables: PriceTicket and PriceCents
Your statistical software will report a diagnostic. Different software packages
provide different messages, but basically the software is telling us that it cannot
run the regression.
Why does this occur? The reason is that the two variables are perfectly
correlated. Knowing the value of one allows us to predict the value of the other
with complete accuracy. Both explanatory variables contain precisely the same
information. Multiple regression analysis attempts to sort out the influence of
each individual explanatory variable, but when both variables contain precisely
the same information, this is impossible. How could we possibly separate out each
variable's individual effect when the two variables contain identical
information? We are asking the statistical software to do the impossible.
Explanatory variables perfectly correlated
↓
Knowing the value of one explanatory variable allows us to
predict perfectly the value of the other
↓
Both variables contain precisely the same information
↓
Impossible to separate out the individual effect of each variable
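The impossibility has a concrete algebraic face: with a perfectly redundant column, the design matrix loses rank, so the normal equations have no unique solution. This is what the software's diagnostic is detecting. A sketch in Python with NumPy (data values hypothetical):

```python
# Sketch: with PriceCents = 100 * PriceTicket, the design matrix has linearly
# dependent columns, so no unique OLS solution exists.
import numpy as np

price_ticket = np.array([12.0, 15.5, 9.75, 22.0, 18.25])  # made-up dollar prices
price_cents = 100 * price_ticket
X = np.column_stack([np.ones(5), price_ticket, price_cents])

# Rank 2, not 3: one column adds no new information. Regression packages either
# refuse to run or drop one of the redundant columns.
print(np.linalg.matrix_rank(X))
```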
Next, we consider a case in which the explanatory variables are highly, although
not perfectly, correlated.
Highly Correlated Explanatory Variables
To investigate the problems created by highly correlated explanatory variables we
shall use our baseball data to investigate a model that includes four explanatory
variables:
Attendancet = βConst + βPricePriceTickett + βHomeSalaryHomeSalaryt +
βHomeNWHomeNetWinst + βHomeGBHomeGamesBehindt + et
where
Attendancet = Paid attendance for game t
PriceTickett = Average price of tickets sold for game t's home team (dollars)
HomeSalaryt = Player salaries of the home team for game t (millions of dollars)
HomeNetWinst = The difference between the number of wins and losses of the
home team before game t
HomeGamesBehindt = Games behind of the home team before game t
The variable HomeNetWinst equals the difference between the number of
wins and losses of the home team. It attempts to capture the quality of the team.
HomeNetWinst will be positive and large for a high quality team, a team that wins
many more games than it loses. On the other hand, HomeNetWinst will be
negative for a low quality team. Since baseball fans enjoy watching high
quality teams, we would expect high quality teams to be rewarded with greater
attendance.
The variable HomeGamesBehindt captures the home team's standing in its
divisional race. For those who are not baseball fans, note that all teams that win
their division automatically qualify for the baseball playoffs. Ultimately, the two
teams that win the American and National League playoffs meet in the World
Series. Since it is the goal of every team to win the World Series, each team
strives to win its division. Games behind indicates how close a team is to winning
its division. To explain how games behind is calculated, consider the final
standings of the American League Eastern Division in 2009:
Team                Wins   Losses   Home Net Wins   Games Behind
New York Yankees     103       59              44              0
Boston Red Sox        95       67              28              8
Tampa Bay Rays        84       78               6             19
Toronto Blue Jays     75       87             −12             28
Baltimore Orioles     64       98             −34             39
Table 14.10: 2009 Final Season Standings – AL East
The Yankees had the best record; the games behind value for the Yankees equals
0. The Red Sox won eight fewer games than the Yankees; hence, the Red Sox
were 8 games behind. The Rays won 19 fewer games than the Yankees; hence, the
Rays were 19 games behind. Similarly, the Blue Jays were 28 games behind and
the Orioles 39 games behind.¹ During the season, as a team's games behind
grows, it becomes less likely that the team will win its division, less likely that it
will qualify for the playoffs, and less likely that it will eventually win the World
Series. Consequently, as a team's games behind grows, we would expect home
team fans to become discouraged, resulting in lower attendance.
We use the terms team quality and division race to summarize our theories
regarding home net wins and home team games behind:
• Team Quality Theory: More net wins increase attendance. βHomeNW > 0.
• Division Race Theory: More games behind decreases attendance. βHomeGB
< 0.
We would expect HomeNetWinst and HomeGamesBehindt to be negatively
correlated. As HomeNetWins decreases, a team moves farther from the top of its
division and consequently HomeGamesBehindt increases. We would expect the
correlation coefficient for HomeNetWinst and HomeGamesBehindt to be negative.
Let us check by computing their correlation matrix:
[Link to MIT-ALSummer-1996.wf1 goes here.]
Correlation Matrix
                    HomeNetWins   HomeGamesBehind
HomeNetWins            1.000000         −0.962037
HomeGamesBehind       −0.962037          1.000000
Table 14.11: HomeNetWins and HomeGamesBehind Correlation Matrix
Table 14.11 reports that the correlation coefficient for HomeGamesBehindt and
HomeNetWinst equals −.962. Recall that the correlation coefficient must lie
between −1 and +1. When two variables are perfectly negatively correlated their
correlation coefficient equals −1. While HomeGamesBehindt and HomeNetWinst
are not perfectly negatively correlated, they come close; they are highly
negatively correlated.
We use the ordinary least squares (OLS) estimation procedure to estimate
the model’s parameters:
Ordinary Least Squares (OLS)
Dependent Variable: Attendance
Explanatory Variable(s):    Estimate         SE    t-Statistic    Prob
PriceTicket                −437.1603   190.4236      −2.295725   0.0220
HomeSalary                  667.5796   57.89922       11.53003   0.0000
HomeNetWins                 60.53364   85.21918       0.710329   0.4778
HomeGamesBehind            −84.38767   167.1067      −0.504993   0.6138
Const                       11868.58   2220.425       5.345184   0.0000
Number of Observations: 585
Estimated Equation: EstAttendance = 11,869 − 437PriceTicket
                                    + 668HomeSalary + 61HomeNetWins
                                    − 84HomeGamesBehind
Interpretation of Estimates:
bPriceTicket = −437. We estimate that a $1.00 increase in the price of tickets
decreases attendance by 437 per game.
bHomeSalary = 668. We estimate that a $1 million increase in the home team
salary increases attendance by 668 per game.
bHomeNetWins = 61. We estimate that 1 additional home net win increases
attendance by 61 per game.
bHomeGamesBehind = −84. We estimate that 1 additional game behind decreases
attendance by 84 per game.
Table 14.12: Attendance Regression Results
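The text produces these estimates in EViews, but the mechanics behind any OLS package are the same: the coefficient estimates solve the normal equations, b = (X'X)⁻¹X'y. A sketch in Python with NumPy, using made-up data (the variable names and parameter values below are hypothetical, not the baseball workfile):

```python
# Sketch of the OLS mechanics behind Table 14.12, on simulated data.
import numpy as np

rng = np.random.default_rng(1)
n = 50
price = rng.uniform(5, 25, n)                 # hypothetical ticket prices
salary = rng.uniform(10, 100, n)              # hypothetical salaries ($ millions)
attendance = 12000 - 400*price + 650*salary + rng.normal(0, 1500, n)

# Solve the normal equations: b = (X'X)^(-1) X'y
X = np.column_stack([np.ones(n), price, salary])
b = np.linalg.solve(X.T @ X, X.T @ attendance)
print(b)   # [constant, price coefficient, salary coefficient]
```

Any OLS routine (EViews, `numpy.linalg.lstsq`, or otherwise) returns these same values; the packages differ only in the diagnostics they report alongside them.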
The sign of each estimate supports the theories. Focus on the two new variables
included in the model: HomeNetWinst and HomeGamesBehindt. Construct the null
and alternative hypotheses.
Team Quality Theory:
H0: βHomeNW = 0   Team quality has no effect on attendance
H1: βHomeNW > 0   Team quality increases attendance

Division Race Theory:
H0: βHomeGB = 0   Games behind has no effect on attendance
H1: βHomeGB < 0   Games behind decreases attendance
While the signs of the coefficient estimates are encouraging, some of the results
are disappointing:
• The coefficient estimate for HomeNetWinst is positive supporting our
theory, but what about the Prob[Results IF H0 True]?
What is the probability that the estimate from one regression would
equal 60.53 or more, if the H0 were true (that is, if the actual
coefficient, βHomeNW, equals 0, if home team quality has no effect on
attendance)? Using the tails probability:
Prob[Results IF H0 True] = .4778 / 2 ≈ .24
We cannot reject the null hypothesis at the traditional significance
levels of 1, 5, or 10 percent, suggesting that it is quite possible for the
null hypothesis to be true, quite possible that home team quality has no
effect on attendance.
• Similarly, the coefficient estimate for HomeGamesBehindt is negative,
supporting our theory, but what about the Prob[Results IF H0 True]?
What is the probability that the estimate from one regression would
equal −84.39 or less, if the H0 were true (that is, if the actual
coefficient, βHomeGB, equals 0, if games behind has no effect on
attendance)? Using the tails probability:
Prob[Results IF H0 True] = .6138 / 2 ≈ .31
Again, we cannot reject the null hypothesis at the traditional
significance levels of 1, 5, or 10 percent, suggesting that it is quite
possible for the null hypothesis to be true, quite possible that games
behind has no effect on attendance.
Should we abandon our "theory" as a consequence of these regression results?
Let us perform a Wald test to assess the proposition that both coefficients
equal 0:
H0: βHomeNW = 0 and βHomeGB = 0       Neither team quality nor games behind
                                      has an effect on attendance
H1: βHomeNW ≠ 0 and/or βHomeGB ≠ 0    Either team quality and/or games behind
                                      has an effect on attendance
Wald Test
F-statistic Value: 5.046779
Degrees of Freedom: Num = 2, Dem = 580
Prob: 0.0067
Table 14.13: EViews Wald Test Results
Prob[Results IF H0 True]: What is the probability that the F-statistic would be
5.05 or more, if the H0 were true (that is, if both βHomeNW and βHomeGB equal 0, if
both team quality and games behind have no effect on attendance)?
Prob[Results IF H0 True] = .0067
We can reject the null hypothesis at a 1 percent significance level; it is unlikely
that both team quality and games behind have no effect on attendance.
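The F-statistic that EViews reports can be built by hand from two regressions: the unrestricted model and a restricted model that imposes the null hypothesis. With q restrictions, n observations, and k unrestricted parameters, F = ((SSRrestricted − SSRunrestricted)/q) / (SSRunrestricted/(n − k)). A sketch in Python on simulated data (the model and parameter values below are hypothetical, not the baseball workfile):

```python
# Sketch: computing a Wald/F statistic from restricted vs. unrestricted SSRs.
import numpy as np

rng = np.random.default_rng(2)
n = 585
x1 = rng.normal(size=n)
x2 = -0.9 * x1 + rng.normal(scale=0.3, size=n)   # a highly correlated pair
y = 1.0 + 0.5*x1 - 0.5*x2 + rng.normal(size=n)

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e

X_u = np.column_stack([np.ones(n), x1, x2])      # unrestricted: k = 3
X_r = np.ones((n, 1))                            # restricted: both slopes forced to 0
q, k = 2, 3
F = ((ssr(X_r, y) - ssr(X_u, y)) / q) / (ssr(X_u, y) / (n - k))
print(F)
```

The statistic is then compared against the F distribution with (q, n − k) degrees of freedom, which is where the Prob value in Table 14.13 comes from.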
There appears to be a paradox when we compare the t-tests and the Wald
test:
t-tests: We cannot reject the null hypothesis that team quality has no effect
on attendance, and we cannot reject the null hypothesis that games behind
has no effect on attendance.
↓
Individually, neither team quality nor games behind appears to influence
attendance.

Wald test: We can reject the null hypothesis that both team quality and
games behind have no effect on attendance.
↓
Team quality and/or games behind appear to influence attendance.

Individually, neither team quality nor games behind appears to influence
attendance significantly; but taken together, by asking whether team quality
and/or games behind influence attendance, we conclude that they do.
Next, let us run two regressions each of which includes only one of the
two troublesome explanatory variables:
Ordinary Least Squares (OLS)
Dependent Variable: Attendance
Explanatory Variable(s):    Estimate         SE    t-Statistic    Prob
PriceTicket                −449.2097   188.8016      −2.379268   0.0177
HomeSalary                  672.2967   57.10413       11.77317   0.0000
HomeNetWins                 100.4166   31.99348       3.138658   0.0018
Const                       11107.66   1629.863       6.815087   0.0000
Number of Observations: 585
Estimated Equation: EstAttendance = 11,108 − 449PriceTicket
                                    + 672HomeSalary + 100HomeNetWins
Interpretation of Estimates:
bPriceTicket = −449. We estimate that a $1.00 increase in the price of tickets
decreases attendance by 449 per game.
bHomeSalary = 672. We estimate that a $1 million increase in the home team
salary increases attendance by 672 per game.
bHomeNetWins = 100. We estimate that 1 additional home net win increases
attendance by 100 per game.
Table 14.14: EViews Attendance Regression Results – HomeGamesBehind
Omitted
Ordinary Least Squares (OLS)
Dependent Variable: Attendance
Explanatory Variable(s):    Estimate         SE    t-Statistic    Prob
PriceTicket                −433.4971   190.2726      −2.278295   0.0231
HomeSalary                  670.8518   57.69106       11.62835   0.0000
HomeGamesBehind            −194.3941   62.74967      −3.097931   0.0020
Const                       12702.16   1884.178       6.741486   0.0000
Number of Observations: 585
Estimated Equation: EstAttendance = 12,702 − 433PriceTicket
                                    + 671HomeSalary − 194HomeGamesBehind
Interpretation of Estimates:
bPriceTicket = −433. We estimate that a $1.00 increase in the price of tickets
decreases attendance by 433 per game.
bHomeSalary = 671. We estimate that a $1 million increase in the home team
salary increases attendance by 671 per game.
bHomeGamesBehind = −194. We estimate that 1 additional game behind decreases
attendance by 194 per game.
Table 14.15: EViews Attendance Regression Results – HomeNetWins Omitted
When only a single one of these explanatory variables is included, its coefficient is significant.
“Earmarks” of Multicollinearity
We are observing what we shall call the earmarks of multicollinearity:
• Explanatory variables are highly correlated.
• Regression with both explanatory variables:
o t-tests do not allow us to reject the null hypothesis that the
coefficient of each individual variable equals 0; when considering
each explanatory variable individually, we cannot reject the
hypothesis that each individually has no influence.
o a Wald test allows us to reject the null hypothesis that the
coefficients of both explanatory variables equal 0; when
considering both explanatory variables together, we can reject the
hypothesis that they have no influence.
• Regressions with only one explanatory variable appear to produce “good”
results.
How can we explain this? Recall that multiple regression analysis attempts to sort
out the influence of each individual explanatory variable. When two explanatory
variables are perfectly correlated, it is impossible for the ordinary least squares
(OLS) estimation procedure to separate out the individual influences of each
variable. Consequently, if two variables are highly correlated, as team quality and
games behind are, it may be very difficult for the ordinary least squares (OLS)
estimation procedure to separate out the individual influence of each explanatory
variable. This difficulty evidences itself in the variances of the coefficient
estimates' probability distributions. When two highly correlated variables are
included in the same regression, the variance of each estimate's probability
distribution is large. This explains our t-test results.
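Although the text does not derive it, a standard result makes this precise: with two regressors, the variance of each slope estimate is inflated by the factor 1/(1 − r²), where r is the correlation between the regressors. This "variance inflation factor" can be computed directly for our baseball variables using the correlation from Table 14.11:

```python
# Variance inflation factor for the two-regressor case: 1/(1 - r^2).
r = -0.962037            # correlation of HomeNetWins and HomeGamesBehind (Table 14.11)
vif = 1.0 / (1.0 - r**2)
print(round(vif, 1))     # ≈ 13.4
```

A variance inflated roughly 13-fold means standard errors inflated roughly 3.7-fold, which is why the individual t-statistics in Table 14.12 are so small even though the joint Wald test is decisive.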
Explanatory variables perfectly correlated
↓
Knowing the value of one variable allows us to predict the other perfectly
↓
Both variables contain the same information
↓
Impossible to separate out their individual effects

Explanatory variables highly correlated
↓
Knowing the value of one variable allows us to predict the other very accurately
↓
In some sense, both variables contain nearly the same information
↓
Difficult to separate out their individual effects
↓
Large variance of each coefficient estimate's probability distribution
We use a simulation to justify our explanation.
Econometrics Lab 14.3: Multicollinearity
Figure 14.6: Multicollinearity Simulation
Our model includes two explanatory variables, x1t and x2t:
Model: y = βConst + βx1x1t + βx2x2t + et
[Link to MIT-Lab 14.3 goes here.]
By default, the actual value of the coefficient for the first explanatory variable
equals 2 and the actual value for the second equals 5. Note that the "Both Xs"
option is selected; both explanatory variables are included in the regression.
Initially, the correlation coefficient is specified as .00; that is, initially the
explanatory variables are independent. Be certain that the Pause checkbox is
cleared and click Start. After many, many repetitions click Stop. Next, repeat this
process for correlation coefficients of .30, .60, and .90.
                                Mean of      Variance of
Correlation      Actual         Coef 1       Coef 1
Coefficient      Coef 1         Estimates    Estimates
.00              2              ≈2.0         ≈6.5
.30              2              ≈2.0         ≈7.2
.60              2              ≈2.0         ≈10.1
.90              2              ≈2.0         ≈34.2
Table 14.16: Multicollinearity Simulation Results
The simulation reveals both good news and bad news:
• Good news: The ordinary least squares (OLS) estimation procedure is
unbiased. The mean of the estimate's probability distribution equals the
actual value. The estimation procedure does not systematically
underestimate or overestimate the actual value.
• Bad news: As the two explanatory variables become more correlated, the
variance of the coefficient estimate's probability distribution increases.
Consequently, the estimate from one repetition becomes less reliable.
The simulation illustrates the phenomenon of multicollinearity.
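The lab's simulation can be sketched in code. This Python version is not the lab itself: it follows the lab's default coefficient values (2 and 5), but the sample size, error scale, and number of repetitions are choices made here for illustration. The pattern matches Table 14.16: the mean of the estimates stays at the actual value while the variance grows with the correlation.

```python
# Sketch of a multicollinearity Monte Carlo (parameters chosen for illustration).
import numpy as np

def simulate(rho, reps=2000, n=30, seed=0):
    """Return mean and variance of the OLS estimate of x1's coefficient."""
    rng = np.random.default_rng(seed)
    b1 = np.empty(reps)
    for i in range(reps):
        x1 = rng.normal(size=n)
        # Construct x2 so that Corr(x1, x2) = rho
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = 2.0*x1 + 5.0*x2 + rng.normal(scale=5.0, size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        b1[i] = np.linalg.lstsq(X, y, rcond=None)[0][1]
    return b1.mean(), b1.var()

for rho in (0.0, 0.3, 0.6, 0.9):
    print(rho, simulate(rho))   # mean stays near 2; variance rises with rho
```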
Irrelevant Explanatory Variables
An irrelevant explanatory variable is a variable that does not influence the
dependent variable. Including an irrelevant explanatory variable can be viewed as
adding “noise,” an additional element of uncertainty, into the mix. An irrelevant
explanatory variable adds a new random influence to the model. If our logic is
correct, irrelevant explanatory variables should lead to both good news and bad
news:
• Good news: Random influences do not cause the ordinary least squares
(OLS) estimation procedure to be biased. Consequently, the inclusion of
an irrelevant explanatory variable does not lead to bias.
• Bad news: The additional uncertainty added by the new random influence
means that the coefficient estimate is less reliable; the variance of the
coefficient estimate’s probability distribution rises when an irrelevant
explanatory variable is present.
We shall use our Econometrics Lab to justify our intuition.
Econometrics Lab 14.4: Irrelevant Explanatory Variables
Figure 14.7: Irrelevant Explanatory Variable Simulation
[Link to MIT-Lab 14.4 goes here.]
Once again we use a two explanatory variable model:
Model: y = βConst + βx1x1t + βx2x2t + et
By default the first explanatory variable, x1t, is the relevant explanatory variable;
the default value of its coefficient is 2. The second explanatory variable, x2t, is the
irrelevant one. An irrelevant explanatory variable has no effect on the dependent
variable; consequently, the actual value of its coefficient, βx2, equals 0.
Initially, the “Only X1” option is selected indicating that only the relevant
explanatory variable, x1t, is included in the regression; the irrelevant explanatory
variable, x2t, is not included. Click Start and then after many, many repetitions
click Stop. Since the irrelevant explanatory variable is not included in the
regression, correlation between the two explanatory variables should have no
impact on the results. Confirm this by changing correlation coefficients from .00
to .30 in the “Corr X1&X2” list. Click Start and then after many, many repetitions
click Stop. Similarly, show that the results are unaffected when the correlation
coefficient is .60 and .90.
Subsequently, investigate what happens when the irrelevant explanatory
variable is included by selecting the "Both Xs" option; the irrelevant explanatory
variable, x2t, will now be included in the regression. Be certain that the
correlation coefficient for the relevant and irrelevant explanatory variables
initially equals .00. Click Start and then after many, many repetitions click Stop.
correlation between the two explanatory variables affects the results when the
irrelevant explanatory variable is included by selecting correlation coefficient
values of .30, .60, and .90. For each case, click Start and then after many, many
repetitions click Stop. Table 14.17 reports the results of the lab.
                           Only Variable 1           Variables 1 and 2
                           Included                  Included
Corr Coef                  Mean of     Variance      Mean of     Variance
for Variables   Actual     Coef 1      of Coef 1     Coef 1      of Coef 1
1 and 2         Coef 1     Estimates   Estimates     Estimates   Estimates
.00             2.0        ≈2.0        ≈6.4          ≈2.0        ≈6.5
.30             2.0        ≈2.0        ≈6.4          ≈2.0        ≈7.2
.60             2.0        ≈2.0        ≈6.4          ≈2.0        ≈10.1
.90             2.0        ≈2.0        ≈6.4          ≈2.0        ≈34.2
Table 14.17: Irrelevant Explanatory Variable Simulation Results
The results reported in Table 14.17 are not surprising; they support our
intuition:
• Only Relevant Variable (Variable 1) Included:
o The mean of the coefficient estimates for the relevant explanatory
variable, x1t, equals 2, the actual value; consequently, the ordinary
least squares (OLS) estimation procedure for the coefficient
estimate is unbiased.
o Naturally, the variance of the coefficient estimates is not affected by
correlation between the relevant and irrelevant explanatory
variables because the irrelevant explanatory variable is not
included in the regression.
• Both Relevant and Irrelevant Variables (Variables 1 and 2) Included:
o The mean of the coefficient estimates for the relevant explanatory
variable, x1t, still equals 2, the actual value; consequently, the
ordinary least squares (OLS) estimation procedure for the
coefficient estimate is unbiased.
o The variance of the coefficient estimate is greater whenever the
irrelevant explanatory variable is included even when the two
explanatory variables are independent (when the correlation
coefficient equals .00). This occurs because the irrelevant
explanatory variable is adding a new random influence to the
model.
o As the correlation between the relevant and irrelevant explanatory
variables increases it becomes more difficult for the ordinary least
squares (OLS) estimation procedure to separate out the individual
influence of each explanatory variable. As we saw with
multicollinearity, this difficulty evidences itself in the variance of
the coefficient estimate's probability distribution. As the two
explanatory variables become more correlated, the variance of the
coefficient estimate's probability distribution increases.
The simulation illustrates the effect of including an irrelevant explanatory
variable in a model. While it does not cause bias, it does make the coefficient
estimate of the relevant explanatory variable less reliable by increasing the
variance of its probability distribution.
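As with the multicollinearity lab, this simulation can be sketched in code. The Python version below is not Lab 14.4 itself: it keeps the lab's setup (βx1 = 2, βx2 = 0, so x2 is irrelevant) but chooses its own sample size, error scale, and repetition count. Comparing the two runs reproduces the pattern of Table 14.17: no bias either way, but a larger variance when the irrelevant variable is included.

```python
# Sketch of an irrelevant-variable Monte Carlo (parameters chosen for illustration).
import numpy as np

def simulate(include_x2, rho, reps=2000, n=30, seed=0):
    """Return mean and variance of the OLS estimate of x1's coefficient."""
    rng = np.random.default_rng(seed)
    b1 = np.empty(reps)
    for i in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = 2.0*x1 + rng.normal(scale=5.0, size=n)   # beta_x2 = 0: x2 is irrelevant
        cols = [np.ones(n), x1] + ([x2] if include_x2 else [])
        X = np.column_stack(cols)
        b1[i] = np.linalg.lstsq(X, y, rcond=None)[0][1]
    return b1.mean(), b1.var()

print(simulate(include_x2=False, rho=0.9))   # mean near 2; smaller variance
print(simulate(include_x2=True, rho=0.9))    # mean near 2; inflated variance
```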
¹ In this example all teams have played the same number of games. When teams
have played different numbers of games the calculation becomes a little more
complicated. Games behind for a non-first-place team equals
[(Wins of First − Wins of Trailing) + (Losses of Trailing − Losses of First)] / 2
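The footnote's formula can be written as a small function; the test values below come from the 2009 AL East records in Table 14.10.

```python
# The games-behind formula from the footnote, as a function.
def games_behind(first_wins, first_losses, wins, losses):
    """GB = [(wins of first - wins of trailing) + (losses of trailing - losses of first)] / 2"""
    return ((first_wins - wins) + (losses - first_losses)) / 2

print(games_behind(103, 59, 95, 67))   # Red Sox vs. Yankees: 8.0
print(games_behind(103, 59, 75, 87))   # Blue Jays vs. Yankees: 28.0
```

When every team has played the same number of games, as here, the two halves of the numerator are equal and the formula reduces to the difference in wins.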