Econ107 Applied Econometrics
Topic 10: Dummy Dependent Variable
(Studenmund, Chapter 13)
I. The Linear Probability Model
Suppose we have a cross section of 18-24 year-olds. We specify a simple
2-variable regression model. The probability of enrolling in tertiary study can be
written:
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
where Yi = 1 if enrolled in university;
0 otherwise.
Xi = performance in secondary school, family background, income or
wealth of parents, gender, ethnicity, etc.
We write this as a 2-variable regression for simplicity. It could be a multiple
regression, where some or all of these independent variables are included. Define
$\mathrm{Prob}_i \equiv P(Y_i = 1 \mid X_i)$.
This is known as a Linear Probability Model (LPM), because the conditional
expectation of Yi given Xi is the conditional probability of this event occurring:
$E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i$
$\mathrm{Prob}_i = \beta_0 + \beta_1 X_i$
where $E(\varepsilon_i \mid X_i) = 0$. Although we can treat this model like any other regression and
use OLS to estimate the parameters, one restriction is that:
$0 \le \mathrm{Prob}_i \le 1$
Only probabilities within the 0,1 interval make sense.
Two undesirable characteristics of LPM (i.e., the use of OLS where the dependent
variable is discrete):
1. Nonnormal/heteroskedastic disturbances
In general,
$\varepsilon_i = Y_i - \beta_0 - \beta_1 X_i = Y_i - \mathrm{Prob}_i$
For the purposes of statistical inference, we assume that these disturbances are
normally distributed with a constant variance. These are both violated under the
LPM.
We know that Yi can take on only one of two values: 0 and 1. Therefore:
If $Y_i = 1$, then $\varepsilon_i = 1 - \mathrm{Prob}_i$
If $Y_i = 0$, then $\varepsilon_i = -\mathrm{Prob}_i$
Since the probability lies between zero and one, the disturbance is positive
whenever Yi = 1 and negative whenever Yi = 0. It takes only these two values, so it
follows a two-point ('binomial') distribution rather than a normal one. It can be
shown that the resulting variance of the disturbance term is:
$\mathrm{Var}(\varepsilon_i) = \sigma_i^2 = \mathrm{Prob}_i (1 - \mathrm{Prob}_i)$
The result is that the disturbances are not normally distributed, and are
heteroskedastic. If they were homoskedastic, $\mathrm{Var}(\varepsilon_i)$ would be a constant.
Under the LPM, this variance is a function of Xi (i.e., it depends on Probi).
With heteroskedasticity, the coefficient estimates are still unbiased, but inefficient
(i.e., no longer minimum variance or BLUE).
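The variance formula can be checked numerically. The sketch below treats the disturbance as a two-point variable (1 - Probi with probability Probi, and -Probi otherwise) and computes its mean and variance directly. The function name and the probabilities tried are illustrative choices, not part of the notes.

```python
def disturbance_moments(p):
    """Mean and variance of the LPM disturbance when Prob_i = p."""
    # (value of the disturbance, probability of that value)
    outcomes = [(1.0 - p, p), (-p, 1.0 - p)]
    mean = sum(value * weight for value, weight in outcomes)
    var = sum((value - mean) ** 2 * weight for value, weight in outcomes)
    return mean, var

for p in (0.1, 0.3, 0.5, 0.9):
    mean, var = disturbance_moments(p)
    # The variance changes with p, which is the heteroskedasticity problem.
    print(f"Prob = {p:.1f}: Var = {var:.4f}, Prob(1 - Prob) = {p * (1 - p):.4f}")
```

The variance peaks at Probi = 0.5 and shrinks toward zero at either extreme, so it cannot be constant across observations with different Xi.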
This can be overcome by running Weighted Least Squares (WLS). Transform the
data and use OLS.
2-Step Procedure:
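1. Run OLS. Retain the fitted values and compute the following 'weight':
$\hat{W}_i = \widehat{\mathrm{Prob}}_i (1 - \widehat{\mathrm{Prob}}_i)$
2. Transform the data in the following way, and run OLS:
$\frac{Y_i}{\sqrt{\hat{W}_i}} = \beta_0 \frac{1}{\sqrt{\hat{W}_i}} + \beta_1 \frac{X_i}{\sqrt{\hat{W}_i}} + \frac{\varepsilon_i}{\sqrt{\hat{W}_i}}$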
This eliminates heteroskedasticity, since we now have unit variance for the
composite disturbance term. The resulting coefficient estimates are now BLUE.
However, this procedure eliminates only one of the two problems.
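The 2-step procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the data are simulated (true probability 0.3 + 0.08 X, chosen so that it stays inside the 0,1 interval), the helper `fit_two_regressors` is a hypothetical name, and the fitted values are clipped to [0.01, 0.99] so that every weight is positive. The notes do not say how to handle fitted values outside the 0,1 interval, so the clipping is an added safeguard.

```python
import math
import random

def fit_two_regressors(z1, z2, y):
    """OLS of y on z1 and z2 with no separate intercept (2x2 normal equations)."""
    a11 = sum(a * a for a in z1)
    a12 = sum(a * b for a, b in zip(z1, z2))
    a22 = sum(b * b for b in z2)
    c1 = sum(a * v for a, v in zip(z1, y))
    c2 = sum(b * v for b, v in zip(z2, y))
    det = a11 * a22 - a12 * a12
    return (c1 * a22 - c2 * a12) / det, (a11 * c2 - a12 * c1) / det

# Simulated data (an assumption for illustration): true Prob = 0.3 + 0.08 * X.
random.seed(0)
n = 2000
x = [random.uniform(-3, 5) for _ in range(n)]
y = [1.0 if random.random() < 0.3 + 0.08 * xi else 0.0 for xi in x]

# Step 1: plain OLS (regressors are a column of ones and X), then the weights.
b0_ols, b1_ols = fit_two_regressors([1.0] * n, x, y)
# Clipping keeps every weight positive; this safeguard is not in the notes.
p_hat = [min(max(b0_ols + b1_ols * xi, 0.01), 0.99) for xi in x]
w_hat = [p * (1.0 - p) for p in p_hat]

# Step 2: divide Y, the constant and X by sqrt(W), then run OLS again.
root_w = [math.sqrt(w) for w in w_hat]
b0_wls, b1_wls = fit_two_regressors(
    [1.0 / r for r in root_w],
    [xi / r for xi, r in zip(x, root_w)],
    [yi / r for yi, r in zip(y, root_w)],
)

print(f"OLS: b0 = {b0_ols:.3f}, b1 = {b1_ols:.3f}")
print(f"WLS: b0 = {b0_wls:.3f}, b1 = {b1_wls:.3f}")
```

Both sets of estimates should land near the simulated values, because OLS is still unbiased under heteroskedasticity. The gain from WLS is in efficiency, not in the point estimates.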
Page-3
2. Unrestricted Range of Probi
We said earlier that this probability must be restricted to the 0,1 interval.
The problem is that nothing in LPM 'restricts the range' of Probi.
Consider the following numerical example:
Suppose we estimate:
$\mathrm{Prob}_i = \hat{Y}_i = 0.197 + 0.141 X_i$
where Xi is defined as father's 'years of education minus 12'.
For example, if the father completed secondary education, we say that he has 12
years of schooling (e.g., School Certificate). In this case, Xi=0. Leaving without
a qualification would make Xi negative. Any post-SC qualification would make Xi
positive (e.g., if he has a PhD, Xi=7).
Show this in the following diagram.
The data points lie on the 2 horizontal lines, where y = 0 and 1. Either the
individual is enrolled in tertiary study or he or she isn’t. The dependent variable
is dichotomous, although the independent variable is more or less continuous.
This is the scatter diagram for a dummy dependent variable model.
Page-4
OLS tries to fit a regression line through these data points that minimises the sum
of the squared residuals. Suppose we get the upward-sloping regression function
in the diagram. Enrolment in tertiary education is positively related to the father’s
education.
The intercept term (0.197) is the intersection of the regression function with the
vertical axis. The slope is the estimated coefficient (0.141). Each year of
education by the father raises the probability that the offspring will be enrolled in
tertiary study by 14.1 percentage points.
We can predict the probability that a given individual will be enrolled by plugging
his or her father’s education into this conditional expectation. For example, two
years of post-SC education would give us:
$\hat{Y}_i = \mathrm{Prob}_i = 0.197 + 0.141(2) = 0.479$
The problem is that a regression function with any slope will eventually pass
outside the horizontal lines defined by the data points.
For example, someone with a father who dropped out of school at age 15 (i.e.,
Xi=-2), will have a negative probability of tertiary study:
$\hat{Y}_i = \mathrm{Prob}_i = 0.197 + 0.141(-2) = -0.085$
This isn’t possible. Someone with a father who has a PhD (i.e., Xi=7), will have
a probability of tertiary study in excess of one:
$\hat{Y}_i = \mathrm{Prob}_i = 0.197 + 0.141(7) = 1.184$
This isn’t possible either. Thus, we have a fundamental problem with the LPM
and forecasts.
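The three forecasts above can be reproduced with a short script. The coefficients are the ones estimated in the notes; the function name `lpm_prob` is only an illustrative choice.

```python
# Estimated LPM from the notes: Prob_i = 0.197 + 0.141 * X_i,
# where X_i is the father's years of education minus 12.
B0, B1 = 0.197, 0.141

def lpm_prob(x):
    """Fitted 'probability' from the linear probability model."""
    return B0 + B1 * x

for x in (2, -2, 7):
    p = lpm_prob(x)
    flag = "" if 0.0 <= p <= 1.0 else "  <- outside the 0,1 interval"
    print(f"X = {x:+d}: Prob = {p:.3f}{flag}")
```

Only the first forecast (X = 2) is a valid probability. The other two fall below zero and above one.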
As a consequence, we need to explore alternatives to the LPM.
We want a technique that estimates a 'regression curve' bounded by zero and one
(i.e., it asymptotically approaches these two horizontal lines.)
Note also that the R² statistic isn't very useful in the LPM as a measure of
the 'goodness of fit' of this regression function. It's difficult to fit a 'regression
line' through two horizontal lines of data points. The intuition is that we're trying
to determine the relationship between the 'probability' of this event and some
independent variable. But we never observe the true probability. All we see is the
eventual outcome of zero or one.
II. The Logit Model
Under the LPM the probability of an event occurring is written:
$\hat{Y}_i = \mathrm{Prob}_i = \beta_0 + \beta_1 X_i$
Under the logit model this probability is written:
$\mathrm{Prob}_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_i)}}$
$\mathrm{Prob}_i = \frac{1}{1 + e^{-\hat{Y}_i}}$
Note that there is now a difference between the fitted value and the estimated
probability. This probability is now a nonlinear function of X. This is the
cumulative distribution function (CDF) for the logistic distribution.
We need to verify that the probability range is now restricted to lie within the 0,1
interval.
If $\hat{Y}_i \to +\infty$, then $\mathrm{Prob}_i \to 1$
When e is raised to a large negative power, the exponential term vanishes and this
probability approaches one.
If $\hat{Y}_i \to -\infty$, then $\mathrm{Prob}_i \to 0$
When e is raised to a large positive power, the denominator grows without bound and
this probability approaches zero.
Thus, this logistic regression function asymptotically approaches one and zero.
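A small sketch makes the bounds concrete by evaluating the logistic function at a few values of the fitted index. The function name and the index values are illustrative choices.

```python
import math

def logit_prob(y_hat):
    """Logistic CDF: the probability implied by the linear index y_hat."""
    return 1.0 / (1.0 + math.exp(-y_hat))

for y_hat in (-10, -2, 0, 2, 10):
    # Every value printed lies strictly between zero and one.
    print(f"Y_hat = {y_hat:+3d}: Prob = {logit_prob(y_hat):.5f}")
```

An index of zero gives a probability of exactly one half, and the probabilities are symmetric around that point.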
In between these two extremes, we can show this logit model relative to the LPM
in the following diagram.
Page-6
Note that the marginal or incremental effect of X on the probability declines at the extremes.
This is the slope of the curve at a given point. Contrast this with the constant slope
of the LPM. The largest slope of the logit model occurs at the inflection point,
where we go from increasing at an increasing rate to increasing at a decreasing
rate. This doesn’t have to correspond to X=0.
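The slope of the logit curve has a simple closed form: differentiating the logistic function gives β1 · Probi · (1 - Probi), which is largest where Probi = 0.5. The sketch below evaluates it at a few points. The coefficients b0 = -1.0 and b1 = 0.5 are hypothetical, chosen so that the inflection point falls at X = 2 rather than at X = 0.

```python
import math

def logit_prob(y_hat):
    """Logistic CDF evaluated at the linear index y_hat."""
    return 1.0 / (1.0 + math.exp(-y_hat))

def logit_slope(x, b0, b1):
    """dProb/dX for the logit model: b1 * Prob * (1 - Prob)."""
    p = logit_prob(b0 + b1 * x)
    return b1 * p * (1.0 - p)

b0, b1 = -1.0, 0.5  # hypothetical coefficients, not estimates from the notes
for x in (-6, -2, 2, 6, 10):
    print(f"X = {x:+3d}: slope = {logit_slope(x, b0, b1):.4f}")
```

The slope reaches its maximum of b1/4 at X = -b0/b1, where the index is zero, and falls away symmetrically on either side.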
Log-Odds Ratio
How do we estimate this logit regression model? One possibility is to convert this
nonlinear function into a linear regression function and apply OLS.
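Begin by writing the probability of not enrolling in tertiary study as:
$1 - \mathrm{Prob}_i = 1 - \frac{1}{1 + e^{-\hat{Y}_i}} = \frac{1 + e^{-\hat{Y}_i}}{1 + e^{-\hat{Y}_i}} - \frac{1}{1 + e^{-\hat{Y}_i}} = \frac{e^{-\hat{Y}_i}}{1 + e^{-\hat{Y}_i}} = \frac{1}{1 + e^{\hat{Y}_i}}$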
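We can now write the 'odds ratio' as:
$\frac{\mathrm{Prob}_i}{1 - \mathrm{Prob}_i} = \frac{1 + e^{\hat{Y}_i}}{1 + e^{-\hat{Y}_i}} = e^{\hat{Y}_i}$
The trick is to realize that:
$\frac{1}{1 + e^{-\hat{Y}_i}} = \frac{e^{\hat{Y}_i}}{1 + e^{\hat{Y}_i}}$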
This odds ratio is the probability that an event will occur over the probability that
it will not occur.
For example, if Probi = 0.75, the odds ratio is 3, or 3:1. If Probi = 0.8, the
odds ratio is 4, or 4:1.
By taking the natural log of the odds ratio we get:
$\ln\left(\frac{\mathrm{Prob}_i}{1 - \mathrm{Prob}_i}\right) = \beta_0 + \beta_1 X_i$
so that the 'log-odds ratio' is a linear function of Xi, but the probability is still a
nonlinear function of Xi. For example, β1 tells us how the log of the odds ratio will
change with a one unit change in Xi.
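The conversions between probabilities, odds and log-odds can be sketched as follows. The function names are illustrative. The two probabilities are the ones used in the example above.

```python
import math

def odds(p):
    """Probability that the event occurs over the probability that it does not."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds: the left-hand side of the logit regression."""
    return math.log(odds(p))

def prob_from_log_odds(z):
    """Invert the log-odds to recover the probability."""
    return 1.0 / (1.0 + math.exp(-z))

for p in (0.75, 0.8):
    z = log_odds(p)
    print(f"Prob = {p}: odds = {odds(p):.2f}, log-odds = {z:.3f}, "
          f"back to Prob = {prob_from_log_odds(z):.2f}")
```

The round trip shows that the log-odds transformation loses no information as long as the probability is strictly between zero and one.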
Estimation
Imagine that we try to use the log odds ratio to estimate the earlier regression
model on tertiary enrolments. Plug in observed values for Yi or Probi and run
OLS. What's wrong with this approach?
It doesn't work with our cross section of individuals because we don't observe
probabilities, just actual outcomes.
If Probi=1, then $\ln(1/0)$ is undefined
If Probi=0, then $\ln(0/1)$ is undefined
One way to estimate the model is to use the Maximum likelihood (ML) method,
which is beyond the scope of this course. Alternatively, we can use a method
called Grouped Logit. Suppose we have 'group' rather than 'individual' data (e.g.,
a cross section of secondary schools).
We could estimate the probabilities or frequencies of tertiary enrolments for the
graduates of each school:
Page-8
mi
Pˆ robi =
ni
mi = number who attend universities or polytechnics by some age.
ni = number who completed secondary school in that class.
Assuming that this estimated probability is not 0 or 1, we can run the following
with OLS.
$\ln\left(\frac{\widehat{\mathrm{Prob}}_i}{1 - \widehat{\mathrm{Prob}}_i}\right) = \hat{\beta}_0 + \hat{\beta}_1 X_i + \varepsilon_i$
where the 'hats' on the coefficients indicate that these are estimated with 'grouped'
data, and that we lose information in aggregating.
Since the disturbances are heteroskedastic:
$\mathrm{Var}(\varepsilon_i) = \frac{1}{n_i \widehat{\mathrm{Prob}}_i (1 - \widehat{\mathrm{Prob}}_i)}$
We can transform the data by multiplying through by the square root of the
weighting variable:
$\hat{W}_i = n_i \widehat{\mathrm{Prob}}_i (1 - \widehat{\mathrm{Prob}}_i)$
This WLS procedure will yield more efficient estimators.
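A minimal sketch of grouped logit with the WLS correction is given below. The school-level figures are hypothetical (made up for illustration, not data from the course), and `fit_two_regressors` is an illustrative helper that solves the normal equations for two regressors without a separate intercept.

```python
import math

def fit_two_regressors(z1, z2, y):
    """OLS of y on z1 and z2 with no separate intercept (2x2 normal equations)."""
    a11 = sum(a * a for a in z1)
    a12 = sum(a * b for a, b in zip(z1, z2))
    a22 = sum(b * b for b in z2)
    c1 = sum(a * v for a, v in zip(z1, y))
    c2 = sum(b * v for b, v in zip(z2, y))
    det = a11 * a22 - a12 * a12
    return (c1 * a22 - c2 * a12) / det, (a11 * c2 - a12 * c1) / det

# Hypothetical school-level data: (X_i, n_i graduates, m_i who enrol).
schools = [(-2, 120, 18), (-1, 150, 36), (0, 200, 70),
           (1, 180, 90), (2, 160, 104), (3, 90, 72)]

const_t, x_t, lhs_t = [], [], []
for x, n, m in schools:
    p_hat = m / n                                   # estimated probability
    log_odds = math.log(p_hat / (1.0 - p_hat))      # dependent variable
    root_w = math.sqrt(n * p_hat * (1.0 - p_hat))   # square root of the weight
    const_t.append(root_w)          # transformed constant term
    x_t.append(root_w * x)          # transformed X
    lhs_t.append(root_w * log_odds) # transformed log-odds

b0_hat, b1_hat = fit_two_regressors(const_t, x_t, lhs_t)
print(f"b0_hat = {b0_hat:.3f}, b1_hat = {b1_hat:.3f}")
```

Larger schools, and schools whose enrolment rate is closer to one half, receive more weight because their log-odds are estimated more precisely.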
III. The Probit Model
The probit model is nothing more than an alternative regression function that also
asymptotically approaches the zero and one horizontal lines. The difference is
that it is based on a 'normal' rather than a 'logistic' distribution function.
Recall that under the Logit model the probability that an event will occur is
written:
$\mathrm{Prob}_i = \frac{1}{1 + e^{-\hat{Y}_i}}$
Under probit, we let Probi be the CDF of a normal distribution:
$\mathrm{Prob}_i = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\hat{Y}_i} \exp\left(-\frac{t^2}{2}\right) dt$
where t is a standardised normal variable, with zero mean and unit variance. For
this reason, the model is sometimes called 'normit'.
In general, there is no reason to prefer logit over probit or vice versa.
Probit does have a slightly different regression function (although it asymptotically
approaches zero and one like logit).
It approaches the extreme values faster than logit.
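The difference in the tails can be seen by evaluating the two distribution functions at the same index values. This sketch uses only the Python standard library; the standard normal CDF is written in terms of `math.erf`, and the function names are illustrative.

```python
import math

def logistic_cdf(z):
    """CDF of the logistic distribution (the logit regression function)."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    """CDF of the standard normal distribution (the probit regression function)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in (-3, -2, -1, 0, 1, 2, 3):
    print(f"index = {z:+d}: logit = {logistic_cdf(z):.4f}, probit = {normal_cdf(z):.4f}")
```

For the same index value the normal CDF is closer to zero or one than the logistic CDF. This is one reason the coefficient magnitudes from the two models are not directly comparable.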
Numerical example. The regression model:
$LF_i = \beta_0 + \beta_1 M_i + \beta_2 S_i + \varepsilon_i$
where LFi = 1 if woman in labour force; 0 otherwise.
Mi = 1 if woman is married; 0 otherwise.
Si = number of years of schooling.
1. LPM (OLS). No correction for heteroskedasticity.
$\widehat{LF}_i = -0.28 - 0.38 M_i + 0.09 S_i$
Standard errors: (0.15) on Mi, (0.03) on Si.
We can interpret the effect of marital status on labour force participation:
$E(LF_i \mid M_i = 0, S_i = 12) = -0.28 + 0.09(12) = 0.80$
$E(LF_i \mid M_i = 1, S_i = 12) = -0.28 - 0.38 + 0.09(12) = 0.42$
2. LPM (Weighted Least Squares).
$\frac{\widehat{LF}_i}{\sqrt{\hat{W}_i}} = -0.21 \frac{1}{\sqrt{\hat{W}_i}} - 0.39 \frac{M_i}{\sqrt{\hat{W}_i}} + 0.08 \frac{S_i}{\sqrt{\hat{W}_i}}$
Standard errors: (0.15) on Mi, (0.02) on Si.
where the relevant weight is the product of the probabilities of being in and out of
the labour force estimated from the previous regression.
It's easy to verify the problem of 'unrestricted range' of estimated probabilities
under LPM.
$E(LF_i \mid M_i = 0, S_i = 16) = -0.21 + 0.08(16) = 1.07$
The estimated probability of an unmarried woman with 16 years of education
being in the labour force is 107%.
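The LPM predictions above can be reproduced with the reported coefficients. The function names are illustrative; the numbers are the OLS and WLS estimates given in the notes.

```python
def lpm_ols(m, s):
    """Fitted probability from the uncorrected LPM (OLS) estimates."""
    return -0.28 - 0.38 * m + 0.09 * s

def lpm_wls(m, s):
    """Fitted probability from the LPM estimated by weighted least squares."""
    return -0.21 - 0.39 * m + 0.08 * s

print(f"OLS, unmarried, 12 years: {lpm_ols(0, 12):.2f}")
print(f"OLS, married,   12 years: {lpm_ols(1, 12):.2f}")
print(f"WLS, unmarried, 16 years: {lpm_wls(0, 16):.2f}")
```

The last figure exceeds one, which shows that correcting for heteroskedasticity does nothing about the unrestricted range.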
3. Logit (Maximum likelihood).
We use the same individual data to estimate the equation with logit.
$\ln\left(\frac{LF_i}{1 - LF_i}\right) = -5.89 - 2.59 M_i + 0.69 S_i$
Standard errors: (1.18) on Mi, (0.31) on Si.
The results are represented in terms of the log of the odds ratio (even though
maximum likelihood estimation on the individual data was used).
4. Probit (Maximum likelihood).
Again, the individual data are used with maximum likelihood probit.
$F^{-1}(LF_i) = -3.44 - 1.44 M_i + 0.40 S_i$
Standard errors: (0.62) on Mi, (0.17) on Si.
where $F^{-1}$ is the inverse of the normal CDF.
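To see what the logit and probit estimates imply, the fitted index can be converted back into a probability. The sketch below does this for the same cases used with the LPM. The function names are illustrative, and the normal CDF is computed with `math.erf` from the standard library.

```python
import math

def logit_prob(m, s):
    """Probability of labour force participation implied by the logit estimates."""
    index = -5.89 - 2.59 * m + 0.69 * s
    return 1.0 / (1.0 + math.exp(-index))

def probit_prob(m, s):
    """Probability of labour force participation implied by the probit estimates."""
    index = -3.44 - 1.44 * m + 0.40 * s
    return 0.5 * (1.0 + math.erf(index / math.sqrt(2.0)))

for m, s in ((0, 12), (1, 12), (0, 16)):
    print(f"M = {m}, S = {s}: logit = {logit_prob(m, s):.3f}, "
          f"probit = {probit_prob(m, s):.3f}")
```

Although the coefficients differ in size, the two models give similar probabilities, and the prediction for an unmarried woman with 16 years of schooling now stays below one.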
Page-11
We can't directly compare the magnitudes of the coefficient estimates from the logit
and probit, but the t tests are performed in the traditional manner. The t ratios on
Mi are around 2.2 and 2.3 in absolute value, respectively, and better than 2 on Si in
both regressions. The magnitudes of the estimated coefficients have no direct
economic interpretation, because they're related to the probability of labour force
participation in a nonlinear way.
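The t ratios quoted above are simply each coefficient divided by its standard error. A minimal sketch, using the logit and probit estimates reported in the notes (the dictionary layout is an illustrative choice):

```python
# (coefficient, standard error) pairs as reported for the logit and probit models.
estimates = {
    ("logit", "M"): (-2.59, 1.18),
    ("logit", "S"): (0.69, 0.31),
    ("probit", "M"): (-1.44, 0.62),
    ("probit", "S"): (0.40, 0.17),
}

t_ratios = {key: coef / se for key, (coef, se) in estimates.items()}

for (model, var), t in t_ratios.items():
    print(f"{model:6s} {var}: t = {t:+.2f}")
```

The ratios on Mi are negative because marriage lowers the estimated probability of participation; it is their absolute values that are compared with the usual critical value of about 2.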
IV. Questions for Discussion: Q13.12
V. Computing Exercise: Johnson, Ch13