Economics 140A
Qualitative Dependent Variables
With each of the classic assumptions covered, we turn our attention to extensions of the classic linear regression model. Our first extension is to data in which the dependent variable is limited. Dependent variables that are limited in some way are common in economics, but not all require special treatment. We have seen examples with wages, income, or consumption as the dependent variable; all of these must be positive. As these strictly positive variables take numerous values, we found the log transform to be sufficient.
Yet not all restrictions on the dependent variable can be handled so easily. If we model individual choice, the optimal behavior of individuals often results in a sizable fraction of the population at a corner solution. For example, a sizable fraction of working-age adults do not work outside the home, so the distribution of hours worked has a sizable pile-up at zero. If we fit a linear conditional mean, we will likely predict negative hours worked for some individuals. The log transform used for wages will not work, as the log of zero is undefined. Another issue arises with sample selection: it may well be the case that E(Y|X) is linear, but nonrandom sampling requires more detailed inference. Finally, a host of other data issues may arise: linear conditional mean functions that switch over regimes, data recorded as counts, or analysis of durations between events. As we will see, even if only a finite number of values are possible, a linear model for E(Y|X) may still be appropriate.
While all these issues may arise, we focus on perhaps the most common restriction in which the dependent variable is qualitative in nature and so takes discrete
values. For this reason, such models are also termed discrete dependent variable
models or (less frequently) dummy dependent variable models.
As we recall from our discussion of qualitative regressors, qualitative variables
capture the presence or absence of some non-numeric quantity. For example, in
studying home ownership the dependent variable is often
\[
Y_t = \begin{cases} 1 & \text{if household } t \text{ owns their home} \\ 0 & \text{otherwise.} \end{cases}
\]
Many qualitative variables take more than two values. For example, in studies of
employment dynamics the dependent variable can take three values
\[
Y_t = \begin{cases} 1 & \text{if individual } t \text{ is employed} \\ 0 & \text{if individual } t \text{ is unemployed but seeking employment} \\ -1 & \text{if individual } t \text{ is not in the labor force (i.e., not seeking employment).} \end{cases}
\]
We focus attention on qualitative dependent variables that take only two values
and for ease set these values to 0 and 1. In binary response models, interest is
primarily in
\[
p(X) \equiv P(Y = 1 \mid X) = P(Y = 1 \mid X_1, \ldots, X_K),
\]
for various values of X. For a continuous regressor Xj, the partial effect of Xj on the response probability is simply ∂P(Y = 1|X)/∂Xj. When multiplied by ΔXj (for small ΔXj), the partial effect yields the approximate change in P(Y = 1|X) when Xj increases by ΔXj, holding all other regressors constant. For a discrete regressor XK, the partial effect is
\[
P(Y = 1 \mid x_1, \ldots, x_{K-1}, X_K = 1) - P(Y = 1 \mid x_1, \ldots, x_{K-1}, X_K = 0).
\]
Perhaps the most natural extension of the classic linear regression model is to
leave the structure of the population model unchanged, so that
\[
Y_t = \beta_0 + \beta_1 X_t + U_t.
\]
How does the presence of a qualitative dependent variable affect our analysis? In quite substantial ways. Consider the familiar relation
\[
E(Y_t \mid X) = \beta_0 + \beta_1 X_t.
\]
Because Yt takes only the values 0 and 1,
\[
E(Y_t \mid X) = P(Y_t = 1 \mid X),
\]
hence
\[
p(X) = \beta_0 + \beta_1 X_t,
\]
and so the conditional mean is a probability and the model is termed the linear probability model. The coefficient β1 is interpreted as the effect of a one-unit change in Xt on the probability that Yt = 1. Similarly, if Xt is a binary regressor, then β1 captures the effect of moving from Xt = 0 to Xt = 1. As there is no reason to believe that as Xt varies the conditional mean will remain between 0 and 1, equality between the conditional mean and P(Yt = 1) is a substantial drawback of the linear probability model. As one can deduce, it is hard to fit points that are clustered at 0 and 1 (on the y-axis) with a straight line, so the R-squared measure is not reliable.
(Draw a graph with the points clustered at 0 and 1 on the y-axis and a straight line attempting to fit them.)
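To make the drawback concrete, here is a minimal sketch (not part of the original notes) that fits a linear probability model by OLS to simulated binary data; the variable names and the simulated design are illustrative assumptions. The fitted values are not constrained to the unit interval.

```python
# Minimal sketch: linear probability model by OLS on simulated data.
# All names and the data-generating process are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(0.0, 2.0, size=n)
p_true = 1.0 / (1.0 + np.exp(-(0.25 + 0.8 * x)))   # true response probability
y = (rng.uniform(size=n) < p_true).astype(float)   # binary dependent variable

X = np.column_stack([np.ones(n), x])               # intercept and regressor
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]    # OLS estimate of (b0, b1)
fitted = X @ beta_ols

print("OLS coefficients:", beta_ols)
print("fitted values outside [0, 1]:", int(np.sum((fitted < 0) | (fitted > 1))))
```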
Several other features of the linear probability model are easily obtained. Because Yt is a Bernoulli random variable, Ut is a binomial random variable:
\[
U_t = \begin{cases} 1 - (\beta_0 + \beta_1 X_t) & \text{with probability } \beta_0 + \beta_1 X_t \\ -(\beta_0 + \beta_1 X_t) & \text{with probability } 1 - (\beta_0 + \beta_1 X_t). \end{cases}
\]
From the definition of Ut it is clear that the error has mean 0 and variance
\[
\begin{aligned}
E U_t^2 &= (\beta_0 + \beta_1 X_t)^2 \left[1 - (\beta_0 + \beta_1 X_t)\right] + \left[1 - (\beta_0 + \beta_1 X_t)\right]^2 (\beta_0 + \beta_1 X_t) \\
        &= (\beta_0 + \beta_1 X_t)\left[1 - (\beta_0 + \beta_1 X_t)\right].
\end{aligned}
\]
Thus the error term is heteroskedastic and binomial, violating two of the classic
assumptions.
The OLSE is unbiased and consistent for the linear probability model, although
robust standard errors are needed to account for the heteroskedasticity. As an
aside, the test statistic for H0: β1 = · · · = βK = 0 can be accurately constructed from the OLSE, as under H0 the error is homoskedastic, E Ut² = β0(1 − β0).
To improve the efficiency of the OLSE, construct the weighted least squares estimator. Let YtP denote the predicted value of the dependent variable constructed from the OLSE. If 0 < YtP < 1 for all t, then form the estimate of the error standard deviation
\[
S_t = \sqrt{Y_t^P \left(1 - Y_t^P\right)}.
\]
The weighted least squares estimator is obtained from the model
\[
\frac{Y_t}{S_t} = \beta_0 \frac{1}{S_t} + \beta_1 \frac{X_t}{S_t} + \frac{U_t}{S_t}.
\]
Again, the reported standard errors are valid, as follows from our earlier treatment of weighted least squares. If YtP ∉ (0, 1) for some t, then WLS is infeasible without an ad hoc adjustment and should not be done.
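Continuing the simulated example above, here is a sketch of the two estimators just described: OLS with heteroskedasticity-robust (sandwich) standard errors, and the WLS estimator with weights St = sqrt(YtP(1 − YtP)), applied only when every fitted value lies strictly inside (0, 1). The helper names are illustrative.

```python
# Sketch of robust OLS and WLS for the linear probability model.
# Function names are illustrative; X is an n-by-k regressor matrix with a constant.
import numpy as np

def lpm_ols_robust(X, y):
    """OLS estimates and heteroskedasticity-robust (HC0) standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    u = y - X @ beta
    robust_cov = XtX_inv @ (X.T * u**2) @ X @ XtX_inv     # sandwich formula
    return beta, np.sqrt(np.diag(robust_cov))

def lpm_wls(X, y, fitted):
    """WLS with S_t = sqrt(fitted*(1 - fitted)); requires 0 < fitted < 1."""
    if np.any((fitted <= 0) | (fitted >= 1)):
        raise ValueError("some fitted values lie outside (0, 1); WLS is infeasible")
    s = np.sqrt(fitted * (1 - fitted))
    return np.linalg.lstsq(X / s[:, None], y / s, rcond=None)[0]
```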
The linear probability model is a convenient approximation and generally gives good estimates of the partial effects on the response probability near the center of the regressor distribution. If one wishes to know the partial effect, averaged over the values of X, then the linear probability model may work well even if it gives poor estimates of the partial effects for extreme values of X.
Example (Married Women’s Labor Force Participation)
In a survey of 753 women, 428 report working more than zero hours. Also,
606 have no young children while 118 have exactly one young child. The variables
in play are
inlf       binary, value 1 indicates non-zero working hours
nonwfinc   non-wife income, in thousands of dollars
ed         education
ex         experience
k6         number of children less than 6 years old
k+         number of children between 6 and 18, inclusive
The estimated regression (dependent variable inlf), with OLS standard errors in parentheses and robust standard errors in brackets, is

  regressor    coefficient   (ols se)    [robust se]
  constant        .586       (.154)      [.151]
  nonwfinc       -.0034      (.0014)     [.0015]
  ed              .038       (.007)      [.007]
  ex              .039       (.006)      [.006]
  ex²            -.0006      (.00018)    [.00019]
  age            -.016       (.002)      [.002]
  k6             -.262       (.034)      [.032]
  k+              .013       (.013)      [.013]

  R² = .264
Except for k+, all regressor coefficients have sensible signs and are statistically significant. The regressor k+ is neither statistically significant nor practically important. Also, the OLS and robust standard errors are almost identical! Interpretation: an increase in non-wife income of $10,000 reduces participation in the labor force by only .034 (3.4 percent). As the sample mean of non-wife income is only $20,129 with a standard deviation of $11,635, a $10,000 increase is quite substantial. Having one more small child seems a first-order effect, reducing the probability of being in the labor force by 26.2 percent. Finally, of the 753 fitted values, 33 lie outside the unit interval (hence we do not construct WLS estimators).
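A hedged replication sketch of this regression follows. It assumes the survey data sit in a hypothetical file participation.csv whose columns match the variable list above (with k6 and k+ stored as k6 and kplus); the file name and column names are illustrative, not part of the notes.

```python
# Illustrative replication of the linear probability model for labor force
# participation; file and column names are assumptions, not from the notes.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("participation.csv")
lpm = smf.ols("inlf ~ nonwfinc + ed + ex + I(ex**2) + age + k6 + kplus",
              data=df).fit(cov_type="HC1")        # heteroskedasticity-robust SEs
print(lpm.summary())

fitted = lpm.fittedvalues
print("fitted values outside [0, 1]:", int(((fitted < 0) | (fitted > 1)).sum()))
```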
The case for linear probability models grows stronger if most regressors are discrete and take only a few values, so that there are no “extreme” values. To understand how to construct a discrete regressor from a continuous regressor, return to the preceding example. Partition the variable k6 into three indicator variables: I0 = 1(0 young children), I1 = 1(1 young child), and I2 = 1(2 or more young children). We replace k6 with (I1, I2) to allow the impact of the first young child to differ and obtain the estimated coefficients -.263 for I1 and -.274 for I2. It appears that the key impact is having one young child; additional young children do not change labor force participation much. The use of discrete regressors is familiar to us from our discussion of regressor specification. For the model with union-gender interactions, the predicted values correspond to cell averages. When a model has enough indicator variables and interactions to fit cell averages,
the model is saturated. A new fact about saturated models emerges here: If
a model is saturated, then the predicted probabilities must lie between 0 and 1
(because they mimic cell averages).
For estimates of the partial effects for extreme values of the regressors, we must develop a new framework. To do so, observe that the linear probability model falls within a broader class of models, termed single index models. The term single index arises because the various regressors affect Yt through the scalar Xt'β, which is the single index. The class of single index models is
\[
Y_t = F(X_t'\beta) + U_t,
\]
where the linear probability model is given by F(X_t'\beta) = X_t'\beta. To overcome the problem that the predictions for Yt can lie outside the unit interval, we constrain F so that
\[
0 < F(z) < 1 \quad \text{for all } z.
\]
Given this constraint, a natural choice for F is a cumulative distribution function (although it is not necessary to use a CDF).
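A small sketch of the single index structure for the two CDF choices used below (Gaussian for probit, logistic for logit); the function name and arguments are illustrative.

```python
# Sketch: response probability p(X) = F(X'beta) in a single index model.
import numpy as np
from scipy.stats import norm

def response_prob(X, beta, link="probit"):
    """Return F(X @ beta) for a Gaussian (probit) or logistic (logit) CDF."""
    index = X @ beta                      # the single index X_t' beta
    if link == "probit":
        return norm.cdf(index)
    if link == "logit":
        return np.exp(index) / (1.0 + np.exp(index))
    raise ValueError("link must be 'probit' or 'logit'")
```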
Index models in which F is a CDF are derived from a latent model. The latent model concerns a variable that underlies the decision and cannot be observed. If Yt measures whether or not household t owns a home, then the latent variable Yt* captures the desire of household t to own a home. If the desire is high enough, then household t owns a home. Or, put another way, Yt* captures the difference in utility between the two options, namely owning and renting, in which case if Yt* is positive, then the utility from owning exceeds that from renting and household t purchases a home. The (latent) population model that explains the latent variable is
\[
Y_t^* = \beta_0 + \beta_1 X_t + V_t,
\]
where {Vt}, t = 1, ..., n, is a sequence of i.i.d. random variables that are symmetric about their mean of 0, with variance σ² and distribution F. (Note, Vt does not have to be symmetric about zero.)
The latent variable and the observed variable are linked as
\[
Y_t = \begin{cases} 1 & \text{if } Y_t^* > 0 \\ 0 & \text{otherwise.} \end{cases}
\]
From the measurement rule we can see that if we multiply Yt* by any positive constant, then Yt is unchanged. As a result, we can only estimate β0 and β1 up to a positive multiple (that is, relative to scale). To identify the coefficients, we set σ = 1. We also see that if the threshold is c ≠ 0, then we return to a zero threshold simply by subtracting c from β0. To identify the intercept, we set c = 0.
To construct an estimator of the coefficients, note
\[
\begin{aligned}
p(X) = P(Y_t = 1 \mid X) &= P(Y_t^* > 0 \mid X) = P(\beta_0 + \beta_1 X_t + V_t > 0) \\
&= P(\beta_0 + \beta_1 X_t > -V_t) \\
&= P\left(-(\beta_0 + \beta_1 X_t) \le V_t\right) = F(\beta_0 + \beta_1 X_t),
\end{aligned}
\]
where the third displayed line follows because of symmetry about 0. The main criticism of the linear probability model has been addressed; because the distribution function is contained in [0, 1], so too is the probability that Yt equals 1.
The presence of the latent model can give one the impression that we are interested in the effect of X on Y*, which is given by β1. Yet Y* rarely has sensible units of measurement (desire to own a home, or differences in utility), so the magnitude of β1 is not generally important. Rather, our goal is to explain the effect of X on the response probability p(X). To understand the link, note that if X is a continuous regressor
\[
\frac{\partial p(X)}{\partial X} = f(\beta_0 + \beta_1 X)\,\beta_1, \qquad \text{where } f(z) = \frac{\partial F(z)}{\partial z}.
\]
If F is a strictly increasing function (as is true for Gaussian and logistic CDFs), then f(z) > 0 for all z and the sign of β1 determines the direction of the effect on the response probability. Observe that the magnitude of the effect depends on the value of the regressor, through f(β0 + β1 X). If the underlying density is unimodal and symmetric about 0, the maximum value of f(z) occurs at z = 0.
For the leading cases,
\[
\text{Probit: } f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \qquad f(0) = \frac{1}{\sqrt{2\pi}} \approx .399;
\]
\[
\text{Logit: } f(z) = \frac{e^{z}}{(1 + e^{z})^{2}}, \qquad f(0) = .25.
\]
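A short sketch of the partial effect f(β0 + β1 x) β1 for the two leading cases, which also confirms the peak densities quoted above; the function and argument names are illustrative.

```python
# Sketch: partial effect of a continuous regressor under probit and logit.
import numpy as np
from scipy.stats import norm

def partial_effect(x, b0, b1, link="probit"):
    """Return f(b0 + b1*x) * b1 for the probit or logit density f."""
    z = b0 + b1 * x
    if link == "probit":
        dens = norm.pdf(z)                       # standard Gaussian density
    else:
        dens = np.exp(z) / (1.0 + np.exp(z))**2  # logistic density
    return dens * b1

print(norm.pdf(0.0))                             # about .399
print(np.exp(0.0) / (1.0 + np.exp(0.0))**2)      # exactly .25
```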
For the multiple regressor model, relative effects do not vary with X, as
\[
\frac{\partial p(X) / \partial X_i}{\partial p(X) / \partial X_j} = \frac{\beta_i}{\beta_j}.
\]
If X is a discrete regressor, then the impact of an increase from c to c + 1 in
X on the response probability is
\[
F\left(\beta_0 + \beta_1 (c + 1)\right) - F\left(\beta_0 + \beta_1 c\right).
\]
If X is an indicator regressor, then c = 0.
Example (Effect of Job Training)
p(X) = probability of employment
Xk = indicator of participation in job training
The direction of the job training effect is the sign of βk, while the magnitude of the effect differs depending on age, education, and experience (the other included regressors).
Finally, consider the model
\[
X_t'\beta = \beta_0 + \beta_1 X_{1t} + \beta_2 X_{1t}^2 + \beta_3 \ln X_{2t}.
\]
The partial effect of X1 on the response probability is
\[
f(X_t'\beta)\,(\beta_1 + 2\beta_2 X_{1t}),
\]
so the direction of the partial effect potentially changes at X1 = −β1/(2β2). The partial effect of ln X2 on the response probability is
\[
f(X_t'\beta)\,\beta_3.
\]
Because d ln X2 = dX2/X2, a 1 percent change in X2 is a .01 change in ln X2. Therefore the partial effect of a 1 percent change in X2 on the response probability is
\[
f(X_t'\beta)\,\frac{\beta_3}{100}.
\]
Given the distributional assumptions on Vt, the maximum likelihood estimator arises naturally. The distribution of Y1 is used to form the likelihood as
\[
L(\beta_0, \beta_1 \mid Y_1 = y_1, X_1 = x_1) = \left[F(\beta_0 + \beta_1 x_1)\right]^{y_1} \left[1 - F(\beta_0 + \beta_1 x_1)\right]^{1 - y_1}.
\]
The likelihood for the sample is
\[
L\left[\beta_0, \beta_1 \mid (Y_1, X_1) = (y_1, x_1), \ldots, (Y_n, X_n) = (y_n, x_n)\right] = \prod_{t=1}^{n} \left[F(\beta_0 + \beta_1 x_t)\right]^{y_t} \left[1 - F(\beta_0 + \beta_1 x_t)\right]^{1 - y_t}.
\]
The log-likelihood is
\[
\ln L(\beta_0, \beta_1 \mid \cdot) = \sum_{t=1}^{n} \left( y_t \ln F(\beta_0 + \beta_1 x_t) + (1 - y_t) \ln\left[1 - F(\beta_0 + \beta_1 x_t)\right] \right).
\]
(We are able to construct ln L because of the strict inequality 0 < F(z) < 1.) The first-order condition for the estimator of the coefficients is the partial derivative of the log-likelihood with respect to each of the coefficients and is termed the score. The ML estimators B0 and B1 are the values that set the scores equal to zero:
\[
\left. \frac{\partial \ln L(\beta_0, \beta_1 \mid \cdot)}{\partial \beta_i} \right|_{\beta_i = B_i}
= \sum_{t=1}^{n} \frac{y_t - F(B_0 + B_1 x_t)}{F(B_0 + B_1 x_t)\left[1 - F(B_0 + B_1 x_t)\right]}
\left. \frac{\partial F(\beta_0 + \beta_1 x_t)}{\partial \beta_i} \right|_{\beta_i = B_i} = 0,
\]
for i = 0, 1.
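A minimal sketch of the probit MLE: rather than solving the scores directly, the log-likelihood above is maximized numerically, which is equivalent. Names are illustrative; X is assumed to contain a constant column.

```python
# Sketch: probit maximum likelihood by numerical optimization.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_loglik(beta, X, y):
    """Negative of sum_t [ y ln F(x'b) + (1 - y) ln(1 - F(x'b)) ], F = Gaussian CDF."""
    F = np.clip(norm.cdf(X @ beta), 1e-10, 1 - 1e-10)   # keep 0 < F < 1 numerically
    return -np.sum(y * np.log(F) + (1 - y) * np.log(1 - F))

def probit_mle(X, y):
    start = np.zeros(X.shape[1])
    result = minimize(neg_loglik, start, args=(X, y), method="BFGS")
    return result.x, -result.fun        # coefficient estimates B and maximized ln L
```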
The MLE is consistent and asymptotically Gaussian. To determine the covariance matrix for the estimators, we construct the expected value of the Hessian conditional on X, for which we note ∂F(β0 + β1 xt)/∂βi = x_{i,t} f(β0 + β1 xt). Differentiating the score once more and taking expectations conditional on X, every term that multiplies ut = yt − F(β0 + β1 xt) drops out because E(u|X) = 0, leaving
\[
E\left[ \frac{\partial^2 \ln L(\beta_0, \beta_1 \mid \cdot)}{\partial \beta\, \partial \beta'} \,\Big|\, X \right]
= -\sum_{t=1}^{n} \frac{f^2\, x_t x_t'}{F (1 - F)},
\]
so the information matrix (the negative of this expectation) is positive semi-definite. The estimator of the asymptotic variance of B is
\[
V = \left[ \sum_{t=1}^{n} \frac{f^2\, x_t x_t'}{F (1 - F)} \right]^{-1},
\]
with f and F evaluated at the estimates.
If the inverse exists, the matrix is positive definite. If the inverse does not exist, the problem is likely multicollinear regressors. It does not make sense to compute robust standard errors. The reason: in the latent model we specify all conditional moments of Y|X. Therefore, if we believe the variance is misspecified, then the conditional mean must be misspecified as well.
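A sketch of the asymptotic variance estimator above, evaluated at probit estimates B (for example, those returned by the optimization sketch earlier); names are illustrative.

```python
# Sketch: V = [ sum_t f_t^2 x_t x_t' / (F_t (1 - F_t)) ]^{-1} at the probit estimates.
import numpy as np
from scipy.stats import norm

def probit_avar(X, B):
    index = X @ B
    f, F = norm.pdf(index), norm.cdf(index)
    weights = f**2 / (F * (1 - F))            # one scalar weight per observation
    info = (X.T * weights) @ X                # sum_t weight_t * x_t x_t'
    return np.linalg.inv(info)                # standard errors are sqrt of the diagonal
```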
If we follow the classic regression model and assume that Vt is Gaussian, then F(·) is the distribution function of a standard Gaussian random variable. The resulting ML estimators are termed probit estimators (because Φ⁻¹ is termed the probit function in statistics) and are obtained through nonlinear optimization (as the score function is not a linear function of the coefficient estimators). Because the Gaussian distribution function cannot be expressed in closed form (that is, an integral must be used), many researchers assume that Vt has a logistic distribution. While the logistic density function is similar to the Gaussian and differs only in the tails, the logistic distribution function can be expressed in closed form as
\[
F(\beta_0 + \beta_1 x_t) = \frac{\exp(\beta_0 + \beta_1 x_t)}{1 + \exp(\beta_0 + \beta_1 x_t)}.
\]
The closed form expression for the logistic distribution delivers a simplified likelihood as well:
\[
L(\beta_0, \beta_1 \mid \cdot) = \prod_{t=1}^{n} \left[ \frac{\exp(\beta_0 + \beta_1 x_t)}{1 + \exp(\beta_0 + \beta_1 x_t)} \right]^{y_t} \left[ \frac{1}{1 + \exp(\beta_0 + \beta_1 x_t)} \right]^{1 - y_t}
= \frac{\exp\left(\beta_0 \sum_{t=1}^{n} y_t + \beta_1 \sum_{t=1}^{n} x_t y_t\right)}{\prod_{t=1}^{n} \left[1 + \exp(\beta_0 + \beta_1 x_t)\right]}.
\]
Thus
\[
\ln L(\beta_0, \beta_1 \mid \cdot) = \beta_0 \sum_{t=1}^{n} y_t + \beta_1 \sum_{t=1}^{n} x_t y_t - \sum_{t=1}^{n} \ln\left[1 + \exp(\beta_0 + \beta_1 x_t)\right].
\]
The ML estimators, which are termed logit estimators, are the values B0 and B1 that satisfy
\[
\left. \frac{\partial \ln L(\beta_0, \beta_1 \mid \cdot)}{\partial \beta_i} \right|_{\beta_i = B_i}
= \sum_{t=1}^{n} y_t x_{i,t} - \sum_{t=1}^{n} \frac{\exp(B_0 + B_1 x_t)}{1 + \exp(B_0 + B_1 x_t)}\, x_{i,t} = 0,
\]
for i = 0, 1 with x_{0,t} = 1 and x_{1,t} = xt. (As for the probit estimators, a nonlinear solution technique must be used.) An immediate consequence of the score for β0 is that
\[
\sum_{t=1}^{n} y_t = \sum_{t=1}^{n} \hat P(y_t = 1),
\]
that is, the observed frequency of yt = 1 equals the predicted frequency (here P̂(yt = 1) = exp(B0 + B1 xt)/[1 + exp(B0 + B1 xt)]). (Note, the same feature holds for the linear probability model, because the OLS coefficient estimators satisfy the relation that the sum of observed values of the dependent variable, which yields the observed frequencies, equals the sum of predicted values of the dependent variable, which yields the predicted frequencies.)
One additional advantage of the logistic assumption is that
\[
\ln \frac{F(\beta_0 + \beta_1 x_t)}{1 - F(\beta_0 + \beta_1 x_t)} = \beta_0 + \beta_1 x_t.
\]
The slope coefficient is interpreted as the effect of a one unit change in the regressor on the logarithm of the odds ratio, where the odds ratio yields the probability of a success (Y = 1) divided by the probability of a failure (Y = 0).
For the special case in which we have a number of observations for each value of the regressor, a simpler estimator can be constructed. Suppose that the regressor takes K distinct values and that there are nk observations on each value. For each of the distinct regressor values calculate the observed frequency of success, that is, construct p̂k = (1/nk) Σ yt, where the sum runs over the nk observations at that value. The estimator of the coefficients is then obtained as the OLS coefficient estimator for the regression model
\[
\ln \frac{\hat p_k}{1 - \hat p_k} = \beta_0 + \beta_1 x_k + u_k
\]
for k = 1, ..., K. The method is sensible if nk is large for each k. If nk is not constant across k, then the error is heteroskedastic and weighted least squares should be used.
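A sketch of this grouped-data estimator. The notes only say that WLS should be used when nk varies; the weights nk p̂k(1 − p̂k) used below are the usual choice for the log-odds regression and are an assumption here.

```python
# Sketch: grouped-data (log-odds) estimator for a regressor with few distinct values.
import numpy as np

def grouped_logit(x, y):
    """x: regressor taking K distinct values; y: 0/1 outcomes; returns (b0, b1)."""
    values = np.unique(x)
    p_hat = np.array([y[x == v].mean() for v in values])   # observed success frequencies
    n_k = np.array([(x == v).sum() for v in values])
    log_odds = np.log(p_hat / (1 - p_hat))                  # requires 0 < p_hat < 1
    X = np.column_stack([np.ones(len(values)), values])
    w = np.sqrt(n_k * p_hat * (1 - p_hat))                  # assumed WLS weights
    return np.linalg.lstsq(X * w[:, None], log_odds * w, rcond=None)[0]
```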
To perform hypothesis tests, any of the three test statistics can be used. As the tests are asymptotically equivalent (and the finite sample comparisons are specific to the model) simply choose the statistic that is easiest to compute.
We begin with tests of exclusion restrictions, of which the leading example is the need to include additional regressors Z (perhaps indicators for region or industry). Set
\[
Y_t = F(X_t \beta + Z_t \gamma) + U_t.
\]
(If Zt consists only of functions of Xt, we have a pure functional form test.) The Wald test is computed directly in Stata. To construct an LR test, first estimate the full model via probit and obtain the estimated value of the log-likelihood, L̂U. Next construct the probit estimate for the restricted model, in which the conditional mean is assumed to be F(Xt β), and obtain the estimated value of the log-likelihood, L̂R. The LR test statistic is
\[
2\left(\hat L_U - \hat L_R\right) \xrightarrow{d} \chi^2(Q), \qquad Q = \dim(\gamma).
\]
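A sketch of this LR test using statsmodels' probit estimator; y, X, and Z are assumed to be numpy arrays, with the constant already included in X.

```python
# Sketch: likelihood ratio test of the exclusion restriction gamma = 0.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def lr_exclusion_test(y, X, Z):
    llf_u = sm.Probit(y, np.column_stack([X, Z])).fit(disp=0).llf   # unrestricted ln L
    llf_r = sm.Probit(y, X).fit(disp=0).llf                         # restricted ln L
    stat = 2.0 * (llf_u - llf_r)
    q = Z.shape[1]                                                  # Q = dim(gamma)
    return stat, chi2.sf(stat, q)                                   # statistic and p-value
```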
If Q is large, then probit can be difficult to construct. For large Q, the LM statistic is preferred as only the restricted model is estimated. First, construct the probit estimate for the restricted model, B, and form
\[
\hat F = F(XB), \qquad \hat f = f(XB), \qquad \hat U = Y - \hat F.
\]
We then regress the residuals on both the included and excluded regressors, where we do WLS for efficiency:
\[
\frac{\hat U_t}{w_t} = \frac{\hat f_t}{w_t} X_t \delta_1 + \frac{\hat f_t}{w_t} Z_t \delta_2 + V_t,
\qquad w_t = \left[\hat F_t \left(1 - \hat F_t\right)\right]^{1/2},
\]
where δ1 and δ2 denote the auxiliary regression coefficients. Because the residuals sum to zero, there is no need for an intercept. The explained sum of squares from the regression is identical to the LM statistic. Alternatively, nR² can be used as it is an asymptotically equivalent statistic, although numerically distinct. Both are distributed as χ²(Q) random variables.
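A sketch of this LM procedure; the explained sum of squares from the weighted auxiliary regression is used as the statistic. As above, y, X, and Z are assumed numpy arrays with the constant in X, and the δ names are illustrative.

```python
# Sketch: LM test of exclusion restrictions via the weighted residual regression.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm, chi2

def lm_exclusion_test(y, X, Z):
    B = sm.Probit(y, X).fit(disp=0).params       # restricted probit estimate
    index = X @ B
    F, f = norm.cdf(index), norm.pdf(index)
    u = y - F                                    # residuals U_hat
    w = np.sqrt(F * (1 - F))
    regressors = np.column_stack([X, Z]) * (f / w)[:, None]
    dep = u / w
    delta = np.linalg.lstsq(regressors, dep, rcond=None)[0]
    ess = np.sum((regressors @ delta) ** 2)      # explained sum of squares (no intercept)
    q = Z.shape[1]
    return ess, chi2.sf(ess, q)
```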
Although less common, there are more general restrictions of interest to test. To test for heteroskedastic errors, the latent model becomes
\[
Y_t^* = X_t\beta + U_t, \qquad U_t \mid X \sim N\!\left(0,\; e^{2 Z_t \gamma}\right).
\]
We analyze the leading case, in which Zt consists of all varying regressors (all but the intercept). With heteroskedastic errors, our calculations become
\[
P(Y_t = 1 \mid X) = P\left(U_t > -X_t\beta \mid X\right) = P\left(e^{-Z_t\gamma}\, U_t > -e^{-Z_t\gamma}\, X_t\beta \mid X\right) = \Phi\!\left(e^{-Z_t\gamma}\, X_t\beta\right).
\]
As noted before, if the error to the latent model is heteroskedastic, the specification of the conditional mean is altered (hence we do not construct robust standard errors for the original specification). In particular, the conditional mean is no longer a single index model, as the regressors affect the response probability in two ways. To indicate the absence of a single index model, the response probability is often written as
\[
p(X) = m(X\beta;\, X, \gamma),
\]
where the last two arguments emphasize the fact that regressors affect the response probability through more than the single index Xβ.
The natural null hypothesis is H0: γ = 0, under which the latent model is a standard probit model. As the restricted model is clearly the easiest to estimate, we again use the LM statistic. Again, construct the probit estimate for the restricted model, B, and form
\[
\hat F = F(XB), \qquad \hat f = f(XB), \qquad \hat U = Y - \hat F.
\]
We then regress the residuals on both the included regressors and the score for γ (recall, f̂t multiplied by the excluded regressors forms the score for γ in the test of exclusion restrictions), where we do WLS for efficiency:
\[
\frac{\hat U_t}{w_t} = \frac{\hat f_t}{w_t} X_t \delta_1 + \frac{\left.\nabla_{\gamma} m(X_t\beta;\, X_t, \gamma)\right|_{\gamma = 0}}{w_t}\, \delta_2 + V_t,
\qquad w_t = \left[\hat F_t \left(1 - \hat F_t\right)\right]^{1/2}.
\]
For the heteroskedasticity example (in which γ = 0 under the null),
\[
\left.\nabla_{\gamma} m(X_t\beta;\, X_t, \gamma)\right|_{\gamma = 0}
= \left.\phi\!\left(e^{-Z_t\gamma} X_t\beta\right) X_t\beta\, e^{-Z_t\gamma} (-Z_t)\right|_{\gamma = 0}
= -\phi(X_t\beta)\,(X_t\beta)\, Z_t.
\]
The explained sum of squares from the regression is identical to the LM statistic. Alternatively, nR² can be used as it is an asymptotically equivalent statistic, although numerically distinct. Again, both statistics are χ²(Q) random variables.