Structure of the class
1. The linear probability model
2. Maximum likelihood estimation
3. Binary logit models and some other models
4. Multinomial models
The Linear Probability Model
The linear probability model
When the dependent variable is binary (0/1; for example, Y = 1 if the firm innovates, 0 otherwise), OLS is called the linear probability model:

Y = β0 + β1·x1 + β2·x2 + u

How should one interpret βj? Provided that E(u|X) = 0 holds true, then:

E(Y|X) = β0 + β1·x1 + β2·x2

Since E(Y|X) = Pr(Y = 1|X), β measures the variation of the probability of success for a one-unit variation of X (ΔX = 1):

ΔE(Y|X)/ΔX = ΔPr(Y = 1|X)/ΔX = β
Limits of the linear probability model
1. Non-normality of errors: the assumption u ~ Normal(0, σ²) does not hold
2. Heteroskedastic errors: the assumption Var(u | x1, x2, …, xk) = σ² is violated
3. Fallacious predictions: the condition 0 ≤ E(Y|X) ≤ 1 is not guaranteed
Overcoming the limits of the LPM
1. Non-normality of errors → increase the sample size
2. Heteroskedastic errors → use robust estimators (see the Stata sketch below)
3. Fallacious predictions → perform non-linear or constrained regressions
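As an illustration, here is a minimal Stata sketch of the LPM with robust standard errors, assuming the inno, lrdi, lassets, spe and biotech variables used in the application later in the class:

  * Linear probability model with heteroskedasticity-robust standard errors
  regress inno lrdi lassets spe biotech, vce(robust)
  * Flag fallacious predictions lying outside the unit interval
  predict yhat, xb
  count if yhat < 0 | yhat > 1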
Persistent use of LPM
Although it has limits, the LPM is still used:
1. In the process of data exploration (early stages of the research)
2. As a good indicator of the marginal effect for the representative observation (at the mean)
3. When dealing with very large samples, least squares can avoid the complications imposed by maximum likelihood techniques:
   - time of computation
   - endogeneity and panel data problems
The LOGIT/PROBIT Model
Probability, odds and logit/probit
We need to explain the occurrence of an event: the LHS variable takes two values: y = {0; 1}.
In fact, we need to explain the probability of occurrence of the event, conditional on X: P(Y=y | X) ∈ [0; 1].
OLS estimations are not adequate, because predictions can lie outside the interval [0; 1].
We need to transform a real number z ∈ ]-∞; +∞[ into P(Y=y | X) ∈ [0; 1].
The logit/probit transformation links a real number z ∈ ]-∞; +∞[ to P(Y=y | X) ∈ [0; 1]. It is also called the link function.
Binary Response Models: Logit - Probit
- Link function approach
- Maximum likelihood estimation

Maximum likelihood estimation
OLS cannot be of much help here. We will use Maximum Likelihood Estimation (MLE) instead.
MLE is an alternative to OLS. It consists of finding the parameter values that are most consistent with the data we have.
The likelihood is defined as the joint probability of observing a given sample, given the parameters involved in the generating function.
One way to distinguish between OLS and MLE is as follows: OLS adapts the model to the data you have, so you only have one model derived from your data. MLE instead supposes there is an infinity of candidate models, and chooses the model most likely to explain your data.
Likelihood functions
Let us assume that you have a sample of n random observations, and let f(yi) be the probability that yi = 1 or yi = 0. The joint probability of observing the n values of yi is given by the likelihood function:

f(y1, y2, …, yn) = ∏_{i=1}^n f(yi)

Logit likelihood
Knowing p (as given by the logit transformation) and having defined f(·), we obtain the likelihood function:

L(y) = ∏_{i=1}^n f(yi) = ∏_{i=1}^n p^yi (1 − p)^(1−yi)

L(y, z) = ∏_{i=1}^n f(yi, z) = ∏_{i=1}^n [e^z / (1 + e^z)]^yi [1 / (1 + e^z)]^(1−yi)

L(y, x, β) = ∏_{i=1}^n f(yi, X, β) = ∏_{i=1}^n [e^Xβ / (1 + e^Xβ)]^yi [1 / (1 + e^Xβ)]^(1−yi)
Log likelihood (LL) functions
The log transform of the likelihood function (the log likelihood) is much easier to manipulate, and is written:

LL(y, z) = Σ_{i=1}^n yi·z − Σ_{i=1}^n ln(1 + e^z)

LL(y, x, β) = Σ_{i=1}^n yi·Xβ − Σ_{i=1}^n ln(1 + e^Xβ)

LL(y, x, β) = −Σ_{i=1}^n [ln(1 + e^Xβ) − yi·Xβ]
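This log likelihood can be handed directly to Stata's ml command. A minimal sketch, assuming the inno, lrdi, lassets, spe and biotech variables of the application below (the program name is arbitrary):

  capture program drop mylogit_lf
  program define mylogit_lf
      args lnf xb
      * logit log likelihood for one observation: y*Xb - ln(1 + exp(Xb))
      quietly replace `lnf' = $ML_y1*`xb' - ln(1 + exp(`xb'))
  end
  ml model lf mylogit_lf (inno = lrdi lassets spe biotech)
  ml maximize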
Maximum likelihood estimation
The LL function can be evaluated at an infinity of values for the parameters β.
Given the functional form of f(·) and the n observations at hand, which values of the parameters β maximize the likelihood of my sample?
In other words, what are the most likely values of my unknown parameters β, given the sample I have?
Maximum likelihood estimation
The LL is globally concave and has a maximum. The gradient is used to compute the parameters of interest, and the hessian is used to compute the variance-covariance matrix.

∂LL/∂β = Σ_{i=1}^n (yi − Λi)·xi = 0,  where Λi = e^z / (1 + e^z)

∂²LL/∂β∂β′ = −Σ_{i=1}^n Λi(1 − Λi)·xi·xi′

However, there is no analytical solution to this non-linear problem. Instead, we rely on an optimization algorithm (Newton-Raphson). You can imagine that the computer generates all possible values of β, computes a likelihood value for each (vector of) values, and then chooses the (vector of) β for which the likelihood is highest.
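For reference, the Newton-Raphson iteration uses exactly this gradient and hessian, updating the parameter vector until convergence:

β(t+1) = β(t) − [∂²LL/∂β∂β′]⁻¹ · ∂LL/∂β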
Binary Dependent Variable – Research questions
We want to explore the factors affecting the probability of being a successful innovator (inno = 1).
Logistic Regression with STATA
Stata instruction: logit
  logit y x1 x2 x3 … xk [if] [weight] [, options]
Options:
  noconstant: estimates the model without the constant
  robust: estimates robust variances, valid also in case of heteroskedasticity
  if: selects the observations to include in the analysis
  weight: weights the observations
Interpretation of Coefficients
A positive coefficient indicates that the probability of innovation success increases with the corresponding explanatory variable.
A negative coefficient implies that the probability of innovating decreases with the corresponding explanatory variable.
Warning! One of the problems encountered in interpreting probabilities is their non-linearity: the probabilities do not vary in the same way according to the level of the regressors.
This is why, in practice, it is usual to calculate the probability of the event occurring at the average point of the sample.
Interpretation of Coefficients
Let's run the more complete model:
  logit inno lrdi lassets spe biotech

. logit inno lrdi lassets spe biotech

Iteration 0:  log likelihood = -205.30803
Iteration 1:  log likelihood = -167.71312
Iteration 2:  log likelihood = -163.57746
Iteration 3:  log likelihood = -163.45376
Iteration 4:  log likelihood = -163.45352

Logistic regression                             Number of obs =        431
                                                LR chi2(4)    =      83.71
                                                Prob > chi2   =     0.0000
Log likelihood = -163.45352                     Pseudo R2     =     0.2039

        inno |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
        lrdi |  .7527497   .2110683    3.57   0.000     .3390634    1.166436
     lassets |   .997085   .1368534    7.29   0.000     .7288574    1.265313
         spe |  .4252844   .4204924    1.01   0.312    -.3988654    1.249434
     biotech |  3.799953    .577509    6.58   0.000     2.668056     4.93185
       _cons | -11.63447   1.937191   -6.01   0.000    -15.43129   -7.837643
Interpretation of Coefficients

P = e^(-11.63 + 0.75·lrdi + 0.99·lassets + 0.43·spe + 3.79·biotech) / (1 + e^(-11.63 + 0.75·lrdi + 0.99·lassets + 0.43·spe + 3.79·biotech))

Using the sample mean values of lrdi, lassets, spe and biotech, we compute the conditional probability:

P = e^1.953 / (1 + e^1.953) = 0.8758
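In recent versions of Stata, this probability at the sample means can be obtained directly with the margins postestimation command; a minimal sketch (older texts use mfx instead):

  quietly logit inno lrdi lassets spe biotech
  margins, atmeans        // predicted Pr(inno = 1) at the sample mean values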
Marginal Effects
It is often useful to know the marginal effect of a regressor on the probability that the event occurs (innovation).
As the probability is a non-linear function of the explanatory variables, the change in probability due to a change in one of the explanatory variables is not identical depending on whether the other variables are at their mean, median, first quartile, etc.
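In Stata, both conventions can be computed with margins; a sketch:

  quietly logit inno lrdi lassets spe biotech
  margins, dydx(*) atmeans    // marginal effects at the sample means
  margins, dydx(*)            // average marginal effects over all observations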
Goodness of Fit Measures
In ML estimations, there is no such measure as the R².
But the log likelihood can be used to assess the goodness of fit. Note the following:
- The higher the number of observations, the lower the joint probability, and the more the LL measure goes towards −∞.
- For a given number of observations, the better the fit, the higher the LL measure (since it is always negative, the closer to zero it is).
The philosophy is to compare two models by looking at their LL values. One is the constrained model, the other one is the unconstrained model.
Goodness of Fit Measures
A model is said to be constrained when the observer sets the parameters associated with some variables to zero.
A model is said to be unconstrained when the observer relaxes this assumption and allows the parameters associated with these variables to differ from zero.
For example, we can compare two models, one with no explanatory variables and one with all our explanatory variables. The one with no explanatory variables implicitly assumes that all parameters are equal to zero. Hence it is the constrained model, because we (implicitly) constrain the parameters to be nil.
The likelihood ratio test (LR test)
The most used measure of goodness of fit in ML estimations is the likelihood ratio: twice the difference in log likelihood between the unconstrained model and the constrained model. This difference is distributed χ²:

LR = 2(ln L_unc − ln L_c)

If the difference in the LL values is (not) important, it is because the set of explanatory variables brings in (in)significant information. The null hypothesis H0 is that the additional variables bring no significant information.
High LR values lead the observer to reject H0 and accept the alternative hypothesis Ha that the set of explanatory variables does significantly explain the outcome.
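In Stata the same test can be run with lrtest after storing both fits; a sketch using the models of this class:

  quietly logit inno lrdi lassets spe biotech
  estimates store unconstrained
  quietly logit inno
  estimates store constrained
  lrtest unconstrained constrained    // reports the LR chi2 statistic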
The McFadden Pseudo R²
We also use the McFadden pseudo R² (1973). Its interpretation is analogous to the OLS R². However, it is biased downward and remains generally low.
The pseudo-R² also compares the unconstrained and the constrained models, and it lies between 0 and 1:

Pseudo R²_MF = (ln L_c − ln L_unc) / ln L_c = 1 − ln L_unc / ln L_c
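Using the log likelihood values reported in the Stata outputs below, the pseudo-R² can be checked by hand:

  display 1 - (-163.45352)/(-205.30803)    // ≈ 0.2039, the Pseudo R2 reported by Stata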
Goodness of Fit Measures

Constrained model:
. logit inno

Iteration 0:  log likelihood = -205.30803

Logistic regression                             Number of obs =        431
                                                LR chi2(0)    =       0.00
                                                Prob > chi2   =          .
Log likelihood = -205.30803                     Pseudo R2     =     0.0000

        inno |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
       _cons |  1.494183   .1244955   12.00   0.000     1.250177     1.73819

Unconstrained model:
. logit inno lrdi lassets spe biotech, nolog

Logistic regression                             Number of obs =        431
                                                LR chi2(4)    =      83.71
                                                Prob > chi2   =     0.0000
Log likelihood = -163.45352                     Pseudo R2     =     0.2039

        inno |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
        lrdi |  .7527497   .2110683    3.57   0.000     .3390634    1.166436
     lassets |   .997085   .1368534    7.29   0.000     .7288574    1.265313
         spe |  .4252844   .4204924    1.01   0.312    -.3988654    1.249434
     biotech |  3.799953    .577509    6.58   0.000     2.668056     4.93185
       _cons | -11.63447   1.937191   -6.01   0.000    -15.43129   -7.837643

LR = 2(ln L_unc − ln L_c) = 2 × (−163.5 − (−205.3)) = 83.8

Pseudo R²_MF = 1 − ln L_unc / ln L_c = 1 − (−163.5)/(−205.3) = 0.204
Other Binary Choice models
The logit model is only one way of modeling binary choice.
The probit model is another. It is actually used more often than the logit model, and it assumes a normal distribution (not a logistic one) for the z values.
The complementary log-log model is used when the occurrence of the event is very rare, the distribution of z being asymmetric.
Other Binary Choice models
Probit model:

Pr(Y = 1|X) = Φ(Xβ) = ∫_{−∞}^{Xβ} (1/√(2π)) e^(−z²/2) dz

Complementary log-log model:

Pr(Y = 1|X) = 1 − exp(−exp(Xβ))
Likelihood functions and Stata commands

Logit:   L(y, x, β) = ∏_{i=1}^n f(yi, xi, β) = ∏_{i=1}^n [e^Xβ / (1 + e^Xβ)]^yi [1 / (1 + e^Xβ)]^(1−yi)

Probit:  L(y, x, β) = ∏_{i=1}^n f(yi, xi, β) = ∏_{i=1}^n [Φ(Xβ)]^yi [1 − Φ(Xβ)]^(1−yi)

Comp. log-log:  L(y, x, β) = ∏_{i=1}^n f(yi, xi, β) = ∏_{i=1}^n [1 − exp(−exp(Xβ))]^yi [exp(−exp(Xβ))]^(1−yi)
Example
logit inno rdi lassets spe pharma
probit inno rdi lassets spe pharma
cloglog inno rdi lassets spe pharma
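Since these three models are not nested in one another, a common way to compare their fit is via information criteria; a sketch:

  quietly logit inno rdi lassets spe pharma
  estat ic                 // AIC and BIC for the logit
  quietly probit inno rdi lassets spe pharma
  estat ic                 // compare with the probit
  quietly cloglog inno rdi lassets spe pharma
  estat ic                 // and with the complementary log-log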
[Figure: Probability Density Functions of the logit, probit and complementary log-log transformations]
[Figure: Cumulative Distribution Functions of the logit, probit and complementary log-log transformations]
Comparison of models

                      OLS          Logit        Probit       C log-log
Ln(R&D intensity)     0.110        0.752        0.422        0.354
                      [3.90]***    [3.57]***    [3.46]***    [3.13]***
ln(Assets)            0.125        0.997        0.564        0.493
                      [8.58]***    [7.29]***    [7.53]***    [7.19]***
Spe                   0.056        0.425        0.224        0.151
                      [1.11]       [1.01]       [0.98]       [0.76]
Biotech Dummy         0.442        3.799        2.120        1.817
                      [7.49]***    [6.58]***    [6.77]***    [6.51]***
Constant              -0.843       -11.634      -6.576       -6.086
                      [3.91]**     [6.01]***    [6.12]***    [6.08]***
Observations          431          431          431          431

Absolute t values in brackets for OLS; z values for the other models.
* significant at 10%, ** at 5%, *** at 1%
Comparison of marginal effects

                      OLS      Logit    Probit   C log-log
Ln(R&D intensity)     0.110    0.082    0.090    0.098
ln(Assets)            0.125    0.110    0.121    0.136
Specialisation        0.056    0.046    0.047    0.042
Biotech Dummy         0.442    0.368    0.374    0.379

For the logit, probit and cloglog models, marginal effects have been computed for a one-unit variation (around the mean) of the variable at stake, holding all other variables at the sample mean values.
Multinomial LOGIT Models
Multinomial models
Let us now focus on the case where the dependent variable has several outcomes (i.e., is multinomial). For example, innovative firms may need to collaborate with other organizations. One can code this type of interaction as follows:
- Collaborate with a university (modality 1)
- Collaborate with large incumbent firms (modality 2)
- Collaborate with SMEs (modality 3)
- Do it alone (modality 4)
Or, studying firm survival:
- Survival (modality 1)
- Liquidation (modality 2)
- Merger & acquisition (modality 3)
Multiple alternatives without obvious ordering
Choice of a single alternative out of a number of distinct alternatives, e.g.: which means of transportation do you use to get to work? Bus, car, bicycle, etc.
An example of an ordered structure, by contrast: how do you feel today? Very well, fairly well, not too well, miserably.
Random Utility Model
The RUM underlies the economic interpretation of discrete choice models. It was developed by Daniel McFadden for econometric applications (see JoEL, January 2001, for the Nobel lecture; also Manski (2001), "Daniel McFadden and the Econometric Analysis of Discrete Choice", Scandinavian Journal of Economics, 103(2), 217-229).
Preferences are functions of biological taste templates, experiences, and other personal characteristics. Some of these are observed, others unobserved; this allows for taste heterogeneity.
The discussion below is in terms of individual utility (e.g. migration, transport mode choice), but similar reasoning applies to firm choices.
Random Utility Model
Individual i's utility from a choice j can be decomposed into two components:

U_ij = V_ij + ε_ij

V_ij is deterministic: common to everyone, given the same characteristics and constraints; it captures the representative tastes of the population (e.g. the effects of time and cost on travel mode choice).
ε_ij is random: it reflects the idiosyncratic tastes of i and the unobserved attributes of choice j.

V_ij is a function of the attributes of alternative j (e.g. price and time) and of observed consumer and choice characteristics:

V_ij = α·t_ij + β·p_ij + γ·z_ij

We are interested in finding α, β, γ. Let's forget about z for now, for simplicity.
RUM and binary choices
Consider two choices, e.g. bus or car. We observe whether an individual uses one or the other. Define:
  yi = 1 if i chooses bus
  yi = 0 if i chooses car
What is the probability that we observe an individual choosing to travel by bus?
Assume utility maximisation: the individual chooses bus (y = 1) rather than car (y = 0) if the utility of commuting by bus exceeds the utility of commuting by car.
RUM and binary choices
So i chooses bus if:

U_i1 > U_i0
V_i1 + ε_i1 > V_i0 + ε_i0
ε_i1 − ε_i0 > −(V_i1 − V_i0)

So the probability that we observe an individual choosing bus travel is:

Prob(ε_i1 − ε_i0 > −(V_i1 − V_i0))
= Prob(ε_i1 − ε_i0 > −α(t_i1 − t_i0) − β(p_i1 − p_i0))
The linear probability model
Assume the probability depends linearly on the observed characteristics (price and time):

Prob(i chooses bus) = α(t_i1 − t_i0) + β(p_i1 − p_i0)

Then you can estimate by linear regression:

y_i1 = α(t_i1 − t_i0) + β(p_i1 − p_i0) + ε_i1

where y_i1 is the "dummy variable" for mode choice (1 if bus, 0 if car). Other consumer and choice characteristics can be included (the z's in the first slide of this section).
Probits and logits
Common assumptions for the link function:
- Cumulative normal distribution function: "Probit"
- Logistic function: "Logit", i.e.

Prob(i chooses bus) = exp(V_i) / (1 + exp(V_i))

Estimation is by maximum likelihood, with

Prob(yi = 1) = F(xi·β)
Prob(yi = 0) = 1 − F(xi·β)

ln L = Σ_{i=1}^n [ yi·ln F(xi·β) + (1 − yi)·ln(1 − F(xi·β)) ]
A discrete choice underpinning
Consider the choice between M alternatives. The decision is determined by the utility level U_ij that individual i derives from choosing alternative j. Let:

U_ij = x_ij′·β_j + ε_ij    (1)

where i = 1, …, N indexes individuals and j = 0, …, J indexes alternatives. The alternative providing the highest level of utility will be chosen.
The probability that alternative j will be chosen is:

P(yi = j) = P(U_ij > U_ik | x, ∀ k ≠ j)
          = P(ε_ik − ε_ij < x_ij′·β_j − x_ik′·β_k | x, ∀ k ≠ j)

In general, this requires solving multidimensional integrals: analytical solutions do not exist.
Exception: if the error terms ε_ij are assumed to be independently and identically standard extreme-value distributed, then an analytical solution exists. In this case, similarly to the binary logit, it can be shown that the choice probabilities are:

P(yi = j) = exp(x_ij′·β_j) / Σ_k exp(x_ik′·β_k)
Likelihood functions
Let us assume that you have a sample of n random observations, and let f(yi) be the probability that yi = j. The joint probability of observing the n values of yi is given by the likelihood function:

f(y1, y2, …, yn) = ∏_{i=1}^n f(yi)

We need to specify the function f(·). It comes from the empirical discrete distribution of an event that can have several outcomes: the multinomial distribution. Writing d_ij = 1 if individual i chooses outcome j (and 0 otherwise), we have:

f(yi) = p_0^(d_i0) · p_1^(d_i1) · … · p_j^(d_ij) · … · p_k^(d_ik) = ∏_{j∈K} p_j^(d_ij)
The maximum likelihood function
The maximum likelihood function reads:

L(y) = ∏_{i=1}^n f(yi) = ∏_{i=1}^n ∏_{j=1}^k p_j^(d_ij)

Substituting the multinomial logit probabilities (alternative 0 being the base category):

L(y, x, β^(j|0)) = ∏_{i=1}^n [ 1 / (1 + Σ_{j=1}^k e^(xi·β^(j|0))) ]^(d_i0) · ∏_{j=1}^k [ e^(xi·β^(j|0)) / (1 + Σ_{j=1}^k e^(xi·β^(j|0))) ]^(d_ij)
The maximum likelihood function
The log transform of the likelihood yields:

LL(y, x, β^(j|0)) = Σ_{i=1}^n { d_i0 · ln[ 1 / (1 + Σ_{j=1}^k e^(xi·β^(j|0))) ] + Σ_{j=1}^k d_ij · ln[ e^(xi·β^(j|0)) / (1 + Σ_{j=1}^k e^(xi·β^(j|0))) ] }

Since Σ_{j=0}^k d_ij = 1 for each i, this simplifies to:

LL(y, x, β^(j|0)) = Σ_{i=1}^n [ Σ_{j=1}^k d_ij · xi·β^(j|0) − ln(1 + Σ_{j=1}^k e^(xi·β^(j|0))) ]
Multinomial logit models
Stata instruction: mlogit
  mlogit y x1 x2 x3 … xk [if] [weight] [, options]
Options:
  noconstant: omits the constant
  robust: controls for heteroskedasticity
  if: selects observations
  weight: weights observations

use mlogit.dta, clear
mlogit type_exit log_time log_labour entry_age entry_spin cohort_*
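After mlogit, the predicted probability of each outcome can be recovered with predict; a sketch, assuming the outcomes of type_exit are coded 1 (survival), 2 (liquidation) and 3 (bought-out):

  predict p_survival, outcome(1)       // Pr(type_exit = 1)
  predict p_liquidation, outcome(2)    // Pr(type_exit = 2)
  predict p_buyout, outcome(3)         // Pr(type_exit = 3)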
The output reports the goodness of fit, the parameter estimates with their standard errors and z values, and the base outcome, which Stata chooses as the outcome with the highest empirical frequency.
Interpretation of coefficients
The interpretation of the coefficients always refers to the base category.
Does the probability of being bought out decrease over time? No! Relative to survival, the probability of being bought out decreases over time.
Is the probability of being bought out lower for spinoffs? No! Relative to survival, the probability of being bought out is lower for spinoffs.
Marginal Effects

∂P_ij/∂x_ik = P_ij · ( β_jk − Σ_{m=1}^{J−1} P_im·β_mk ),  j = 1, …, J−1

Elasticities

(∂P_ij/∂x_ik) · (x_ik/P_ij) = x_ik · ( β_jk − Σ_{m=1}^{J−1} P_im·β_mk ),  j = 1, …, J−1

→ the relative change of P_ij if x_ik increases by 1 per cent.
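In Stata, these marginal effects can be computed with margins after mlogit; a sketch for one regressor and one outcome (the outcome code is an assumption):

  quietly mlogit type_exit log_time log_labour entry_age entry_spin cohort_*
  margins, dydx(log_time) predict(outcome(2)) atmeans   // effect of log_time on Pr(outcome 2) at the means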
Independence of irrelevant alternatives - IIA
The model assumes that each pair of outcomes is independent of all other alternatives; in other words, the other alternatives are irrelevant.
From a statistical viewpoint, this is tantamount to assuming independence of the error terms across pairs of alternatives.
A simple way to test the IIA property is to estimate the model leaving out one modality (the restricted model), and to compare the parameters with those of the complete model.
If IIA holds, the parameters should not change significantly.
If IIA does not hold, the parameters will change significantly.
Multinomial logit and "IIA"
There are many applications in economics and geography journals (and other research areas). The multinomial logit model is the workhorse of multiple-choice modelling in all disciplines, and it is easy to compute. But it has a drawback.

Independence of Irrelevant Alternatives
Consider the market shares:
- Red bus 20%
- Blue bus 20%
- Train 60%
IIA assumes that if the red bus company shuts down, the market shares become:
- Blue bus 20% + 5% = 25%
- Train 60% + 15% = 75%
because the ratio of blue bus trips to train trips must stay at 1:3.
Independence of Irrelevant Alternatives
The model assumes that the 'unobserved' attributes of all alternatives are perceived as equally similar.
But will people unable to travel by red bus really switch to travelling by train? The most likely outcome (assuming the supply of bus seats is elastic) is:
- Blue bus: 40%
- Train: 60%
This failure of multinomial/conditional logit models is due to the Independence of Irrelevant Alternatives (IIA) assumption.
Independence of irrelevant alternatives - IIA
H0: The IIA property is valid
H1: The IIA property is not valid

H = (β̂_R − β̂*_C)′ [ Var(β̂_R) − Var(β̂*_C) ]⁻¹ (β̂_R − β̂*_C)

The H statistic (H stands for Hausman) follows a χ² distribution with M degrees of freedom (M being the number of parameters), where β̂_R comes from the restricted model and β̂*_C is the corresponding subset of parameters from the complete model.
STATA application: the IIA test
H0: The IIA property is valid
H1: The IIA property is not valid
  mlogtest, hausman
The test is performed repeatedly, omitting each outcome in turn.
Application of the IIA test
H0: The IIA property is valid
H1: The IIA property is not valid
  mlogtest, hausman
We compare the parameters of the model "liquidation relative to bought-out" estimated simultaneously with "survival relative to bought-out", with the parameters of the model "liquidation relative to bought-out" estimated without "survival relative to bought-out".
Application of the IIA test
H0: The IIA property is valid
H1: The IIA property is not valid
  mlogtest, hausman
The conclusion is that the outcome "survival" significantly alters the choice between liquidation and bought-out. Indeed, for a company, being bought out must be seen as a way to remain active, at the cost of losing control over economic decisions, notably investment.
Cramer-Ridder Test
Often you want to know whether certain alternatives can be merged into one: e.g., do you have to distinguish between employment states such as "unemployment" and "non-employment"?
The Cramer-Ridder test has the null hypothesis that the alternatives can be merged. It takes the form of an LR test:

2(log L_U − log L_R) ~ χ²

Derive the log likelihood value of the restricted model where two alternatives (here, A and N) have been merged:

log L_R = n_A·log n_A + n_N·log n_N − (n_A + n_N)·log(n_A + n_N) + log L_P

where log L_R is the log likelihood of the restricted model, log L_P is the log likelihood of the pooled model, and n_A and n_N are the number of times A and N have been chosen.
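A quick way to apply this formula in Stata, with purely hypothetical numbers (n_A = 120, n_N = 80, pooled log likelihood log L_P = −450):

  * hypothetical counts and pooled log likelihood, for illustration only
  display 120*ln(120) + 80*ln(80) - 200*ln(200) + (-450)   // = log L_R

The LR statistic is then 2(log L_U − log L_R), to be compared with the χ² critical value.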
Exercise
use http://www.statapress.com/data/r8/sysdsn3
tabulate insure
mlogit insure age male nonwhite site2 site3