OLS assumptions (unbiasedness)
An estimator x̂ is an unbiased estimator of an unknown parameter x if E(x̂) = x. This does not mean that x̂ = x; rather, it implies that if we could indefinitely draw random samples on y from the entire population, compute an estimate each time, and then average the estimates over all random samples, we would obtain the unknown parameter.
1. Linear in parameters: the dependent variable is a linear combination of the independent variables and the unobservable error term.
2. Random sampling: a random sample of size n. This assumption is needed for the errors to be randomly distributed.
3. Sample variation in the explanatory variable: the sample outcomes on x must not all take the same value.
4. Zero conditional mean: for the values of x used to explain a randomly chosen sample on y, the expected value of the error is zero, E(u|x) = 0. This holds, in particular, when x is independent of u.
5. Homoskedasticity: the error has the same variance, Var(u|x) = σ², for any value of the explanatory variable.
Proof: Var(u|x) = E(u²|x) − [E(u|x)]². Because E(u|x) = 0, we have σ² = E(u²|x), which implies that σ² is also the unconditional expectation of u². This means σ² = E(u²) = Var(u); essentially, σ² is the unconditional variance of u.
We can write the linear model as
E(y|x) = β0 + β1x, Var(y|x) = σ²,
which implies that the conditional expectation of y given x is linear in x, while the variance of y given x is constant.
Heteroskedasticity is present whenever Var(y|x) is a function of x.
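The contrast above can be illustrated with a minimal simulation sketch (not part of the original notes; the numbers are arbitrary): the conditional variance of the error is flat across x in the homoskedastic case and grows with x in the heteroskedastic case.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(1, 10, size=n)

# Homoskedastic errors: Var(u|x) = sigma^2 for every x
u_homo = rng.normal(0, 2.0, size=n)

# Heteroskedastic errors: Var(u|x) = (0.5 * x)^2 grows with x
u_hetero = rng.normal(0, 0.5 * x)

# Compare the error variance in a low-x and a high-x bin
for label, u in [("homoskedastic", u_homo), ("heteroskedastic", u_hetero)]:
    print(f"{label:16s}  var(u | x<3) = {u[x < 3].var():5.2f}   "
          f"var(u | x>8) = {u[x > 8].var():5.2f}")
```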
Expected value of the OLS estimators
1. Linear in parameters.
2. Random sampling: a random sample of n observations.
3. No perfect collinearity: none of the independent variables is constant, and there are no exact linear relationships among the IVs (no IV can be an exact linear function of another, e.g., a constant multiple).
4. Zero conditional mean: can fail if the functional relationship between y and x is misspecified. Omitted variable bias can also cause this assumption to fail.
Omitted variable bias / underspecified model
Suppose the true model is y = β0 + β1x1 + β2x2 + u, but x2 is excluded from the regression.
Then β̃1 = β̂1 + β̂2·δ̃, where δ̃ is the slope from regressing x2 on x1, so
E(β̃1) = E(β̂1 + β̂2·δ̃) = E(β̂1) + E(β̂2)·δ̃ = β1 + β2·δ̃,
and the bias in β̃1 is
E(β̃1) − β1 = β2·δ̃.
Here δ̃ is the sample covariance between x1 and x2 divided by the sample variance of x1; δ̃ = 0 iff x1 and x2 are uncorrelated in the sample. Otherwise the estimate of β1 is biased.
5. Homoskedasticity: the variance of the unobserved error u does not depend on the values of the explanatory variables. Note: if Var(u|x) is not constant, the usual t statistics and confidence intervals are invalid no matter how large the sample size is.
6. Normality (which implies 4 and 5): u ~ N(0, σ²), iid, and independent of x; y then has a normal distribution with mean linear in x and constant variance. Disadvantage: this assumes all unobserved factors affect y in a separate, additive fashion (and note that u is unobserved).
Because σ̂² is only an estimate of Var(u), when N is not very large its sampling variability is substantial, and the t distribution is a poor approximation to the distribution of the t statistic when u is not normally distributed. Conversely, as n grows, σ̂² converges in probability to the constant σ², and the t and F statistics approach their asymptotic distributions.
Inconsistency
E(u|X1,…,Xn) = 0
If the zero conditional mean assumption is violated (i.e., the error term is correlated with
any of the independent variables), then OLS is biased and inconsistent. And because this
deficiency is inherent in the model specification, this bias will persist even if the sample size
grows.
An estimator is consistent if it converges in probability to the quantity being estimated. For the simple OLS slope,
plim(β̂1 − β1) = Cov(x1, u)/Var(x1),
so plim β̂1 = β1 + Cov(x1, u)/Var(x1), which equals β1 only when Cov(x1, u) = 0. Since u is unobserved, we cannot estimate Cov(x1, u) from the data; OLS is consistent only if we assume zero correlation between x1 and u.
Assume y = β0 + β1x1 + β2x2 + u. Now omit x2, so that the error becomes v = β2x2 + u. The slope in the misspecified model, β̃1, then satisfies
plim β̃1 = β1 + β2·δ, where δ = Cov(x1, x2)/Var(x1).
Since Var(x1) > 0, the sign of β2 together with the direction of the relationship between x1 and x2 determines the direction of the inconsistency (the asymptotic bias).
If β̂ is consistent, its distribution becomes more and more tightly concentrated around β as n increases.
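A small simulation sketch of the result above (the coefficient values are made up for illustration): the short-regression slope recovers β1 + β2·δ, where δ is the slope from regressing x2 on x1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
b0, b1, b2 = 1.0, 2.0, 3.0              # illustrative "true" parameters

x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)      # x2 is correlated with x1
y = b0 + b1 * x1 + b2 * x2 + rng.normal(size=n)

# delta: slope from regressing x2 on x1 (sample covariance / sample variance)
delta = np.cov(x1, x2)[0, 1] / np.var(x1, ddof=1)

# Short regression of y on x1 only (x2 omitted)
b1_tilde = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)

print("short-regression slope:", round(b1_tilde, 3))
print("b1 + b2*delta         :", round(b1 + b2 * delta, 3))
```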
Gauss-Markov theorem (BLUE)
Linear: an estimator β̂ of β is linear iff it can be expressed as a linear function of the data on the dependent variable:
β̂ = (X′X)⁻¹X′Y.
Unbiased: E(β̂) = β.
Best: smallest variance; for any estimator β̃ that is linear and unbiased, Var(β̂) ≤ Var(β̃).
Efficiency (smallest asymptotic variance): OLS is asymptotically efficient among a certain class of estimators under the Gauss-Markov assumptions.
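The "linear in Y" point can be made concrete with a short numpy sketch (simulated data; not from the notes) that computes β̂ = (X′X)⁻¹X′Y directly from the normal equations.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 regressors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

# beta_hat = (X'X)^{-1} X'y: a linear function of the data on y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat.round(3))
```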
Why is the binary DV model different under the OLS assumptions?
Why can't we treat a dichotomous variable as a special form of continuous variable and use OLS regression?
1. The errors are not normally distributed: because Y can only take the values 0 and 1, the error term is also dichotomous, which violates the normality assumption.
2. The errors cannot have constant variance, which violates the homoskedasticity assumption.
3. The predicted values of Y are not constrained to the 0-1 probability range; one can obtain predicted probabilities of Y that are less than 0 or greater than 1.
The logit transformation solves these problems and allows one to model the log of the odds as a linear function of the independent variables; however, logistic regression coefficients cannot be interpreted directly in the same way as OLS regression coefficients.
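A sketch of point 3 (assuming statsmodels is available; the data are simulated): the linear probability model can produce fitted values outside [0, 1], while the logit model cannot.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2_000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))    # true probabilities
y = rng.binomial(1, p)
X = sm.add_constant(x)

lpm = sm.OLS(y, X).fit()                  # linear probability model
logit = sm.Logit(y, X).fit(disp=0)

print("LPM fitted values outside [0, 1]  :",
      int(((lpm.fittedvalues < 0) | (lpm.fittedvalues > 1)).sum()))
print("Logit fitted values outside [0, 1]:",
      int(((logit.predict(X) < 0) | (logit.predict(X) > 1)).sum()))
```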
Difference between multinomial and ordered probit/logit
The multinomial model always has a separate set of estimated slope coefficients for each contrast, whereas ordered probit/logit has only one slope coefficient for each IV but several intercepts (cutpoints). In other words, ordered regression assumes parallel slopes across the outcome categories (the parallel regression assumption), i.e., the effect of a parameter b is assumed to be identical for all J−1 groupings, whereas this assumption is not required in multinomial regression.
Note also that the marginal effect does not indicate the change in the probability that would be observed for a full unit change in x.
Parallel regression assumption: the relationship between each pair of outcome groups is
the same. In other words, ordinal logistic regression assumes that the coefficients that
describe the relationship between, say, the lowest versus all higher categories of the
response variable are the same as those that describe the relationship between the next
lowest category and all higher categories, etc. This is called the proportional odds
assumption or the parallel regression assumption. Because the relationship between all
pairs of groups is the same, there is only one set of coefficients (only one model)
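To make the contrast concrete, here is a sketch (assuming a recent statsmodels with OrderedModel; the data and cutpoints are simulated) that fits both models and shows one slope plus several thresholds for the ordered model versus a separate slope for each contrast in the multinomial model.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(4)
n = 3_000
x = rng.normal(size=n)
y_star = 1.0 * x + rng.logistic(size=n)            # latent continuous scale
y = np.digitize(y_star, bins=[-1.0, 0.5, 2.0])     # 4 ordered categories (0..3)

# Ordered logit: one slope for x plus 3 threshold parameters
ologit = OrderedModel(y, x[:, None], distr="logit").fit(method="bfgs", disp=0)
print("ordered logit params:", np.asarray(ologit.params).round(2))

# Multinomial logit: a separate intercept and slope for each of the J-1 contrasts
mlogit = sm.MNLogit(y, sm.add_constant(x)).fit(disp=0)
print("multinomial logit parameter matrix shape:", np.asarray(mlogit.params).shape)  # (2, 3)
```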
Ordered Probit/Logit
When the response variable, y, is an ordered response, as is often the case in survey
research, then methods for modeling ordered outcomes should be considered. The most
commonly used methods are ordered probit and logit. In ordered probit/logit, responses can
be ordered from low to high by using thresholds to partition a continuous scale into various
regions (usually more than two), which is equivalent to mapping the observed outcomes onto a latent continuous scale, though the distances between adjacent categories are unknown and need not be equal.
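The threshold idea can be sketched directly (the cutpoints below are arbitrary): a latent continuous y* is mapped into ordered categories by comparing it against cutpoints, so only the ordering, not the distances, is preserved.

```python
import numpy as np

rng = np.random.default_rng(5)
y_star = rng.normal(size=10)                # latent continuous scale
cutpoints = np.array([-0.8, 0.0, 1.2])      # unequally spaced thresholds

# Category k is assigned when cutpoints[k-1] <= y* < cutpoints[k]
y_ordinal = np.digitize(y_star, cutpoints)  # values 0, 1, 2, 3

for ystar, cat in zip(y_star, y_ordinal):
    print(f"y* = {ystar:5.2f}  ->  category {cat}")
```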
A few caveats should be addressed before I continue to discuss the situations in which the ordered probit/logit model should be used. First, as much prior research has indicated, that the values of a response variable can be ordered does not mean they should be ordered (Miller and Volker 1985); this depends on researchers' theoretical questions and purposes. Second, though the response values have an ordinal meaning among themselves, we cannot say that a subject scoring 2 has twice as much of the underlying quantity as a subject scoring 1, because the distances between the response values are not equal and the thresholds (or cutpoints) used to partition the responses carry a great deal of substantive (usually more theoretical than statistical) information.
To begin with, researchers' primary concern when they employ an ordered probit/logit model (McKelvey and Zavoina 1975) is how changes in the predictors affect the probability that an observed response, Yj (modeled through a linear combination of the IVs), falls into a particular ordinal category. Ordered probit/logit models employ the same probit/logit link as the binary probit/logit model, but unlike those binary response models, they partition the latent continuous scale into regions corresponding to the ordinal categories by establishing threshold values and mapping the observed response values into these categories. An extension of this difference between ordered probit/logit and binary logit/probit is the identification issue: because ordered probit/logit models have "too many" free parameters (associated with the number of thresholds), there is no way to choose among these sets of parameters with the observed data alone; in contrast to OLS or a binary probit/logit model, they are unidentified without a normalization (a shift in the intercept can be compensated by a corresponding shift in the thresholds).
However, these characteristics make the OLS model inappropriate for estimating ordinal outcomes.
To begin with, neither ordered probit nor ordered logit is a linear model; for the model to be linear, one must transform the dependent variable (in the logistic model, this is done by taking the natural log of the odds) so that the latent outcomes fall along a continuous scale. But this goes against our purpose, because the dependent variable of interest is a non-interval outcome (i.e., the distances between outcomes are not equal); if we proceeded this way, we would lose the information about the ordering and fail to answer the research question.
Second, the residuals of an ordered probit/logit model are not normally distributed; they can take on only a few values for each combination of independent variables, which is to say the errors are heteroskedastic and non-normal (unless the thresholds happen to be equally spaced). As a result, the normality assumption is violated, and this makes the OLS model an inappropriate choice for estimating ordinal outcomes.
Nevertheless, researchers should, as discussed earlier, be cautious about their choice of ordinal outcome. First of all, the number of outcomes and their corresponding cutpoints are determined solely by the researcher; if the response values can be ordered, it follows that they can also be reordered. When a researcher is not satisfied with a large number of observations falling into a particular ambiguous category, or when the number of cases in a particular response category is small (so the model may fail to converge), she can always recode the data by further differentiating the responses with more thresholds or by merging a sparse category into adjacent categories (though this could lead to inefficient estimates).
Following the central limit theorem, as the number of ordinal response categories along the latent continuous scale increases, the distribution of y, our response variable, approaches normality, bringing it closer to the OLS model's normality assumption. This can make OLS an acceptable choice for modeling an ordinal response. For instance, in many comparative politics or IR studies that employ the untransformed Polity IV score (−10 to +10, from full autocracy to full democracy), the 21-point regime score arguably fits the OLS normality assumption well enough.
Additionally, because the ordered probit/logit model is estimated by maximum likelihood, it requires a relatively large data set (compared to the OLS model) for the estimates to converge. Therefore, when the data under consideration do not have a large number of observations (so that ML estimation may struggle), and/or when the researcher has numerous ordinal outcomes in mind so that the distribution of responses across the ordinal categories approaches normality, OLS would be a viable model of choice for the researcher to consider.
Interpretation
For the ordered probit model, coefficients on the latent scale can be read much as in an OLS model: the partial change in y* with respect to x is interpreted as "for a unit increase in x, y* is expected to change by b units, holding all other variables constant." y-standardized and fully standardized coefficients can be interpreted similarly.
When ordinal outcomes are listed in a regression table and their coefficients are obtained through regression analysis, we can match the category of interest with its corresponding variable and find its coefficient. The coefficient b can then be read as: for a unit increase in x, the outcome y is more (or less) likely to fall into a higher category, with the magnitude governed by b, holding all other variables constant.
For a dummy variable, such as gender, the coefficient should be treated as the effect of moving from its minimum value (x = 0) to its maximum (x = 1), holding all other variables constant, and interpreted accordingly.
Finally, the ordered logit model is often interpreted in terms of odds ratios for cumulative probabilities, where we estimate the cumulative probability that the outcome is equal to or less than a particular value. For instance, take a four-point response scale ranging over disagree, neutral, agree, and strongly agree. For a variable x with coefficient B, a one-unit increase should be interpreted as: for an increase of 1 in x, the odds of "disagree" versus all higher categories change by a factor of exp(1·B), or the odds become exp(1·B) times as large as before (equivalently, the odds change by 100·[exp(1·B) − 1] percent), holding all other variables constant.
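A minimal numeric illustration of this wording (the coefficient value is hypothetical):

```python
import math

b = 0.4                        # hypothetical ordered logit coefficient for x
odds_factor = math.exp(1 * b)  # factor change in the cumulative odds for a 1-unit increase in x
print(f"odds change by a factor of {odds_factor:.2f}")
print(f"i.e., the odds increase by {100 * (odds_factor - 1):.1f}%")
```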
####focusing on marginal effect
MLE
Violation of the OLS assumptions
1. Zero conditional mean:
y = β0 + β1x1 + β2x2 + e.
Adding an unknown, nonzero constant δ to the model gives
y = β0* + β1x1 + β2x2 + e*, where β0* = β0 + δ and e* = e − δ.
Thus E(β̂0*) = β0 + δ, and β0 is not identified individually, though we can identify β0 + δ. When a parameter is unidentified, it is impossible to estimate it regardless of the data available, but we may still identify it through a combination with other parameters.
Similarly, write y = β0 + β1x1 + β2x2 + e as
y = β0 + β1x1 + r,
assuming the error absorbs the omitted variable, r = β2x2 + e. If x1 and x2 are correlated, then x1 and r must be correlated, and OLS gives a biased and inconsistent estimate of β1.
Heteroskedasticity in binary outcomes:
Var(y|x) = Pr(y = 1|x)[1 − Pr(y = 1|x)] = xβ(1 − xβ) in the linear probability model, so the OLS estimates are inefficient and the standard errors are biased, resulting in incorrect test statistics.
Marginal effect: the ratio of the change in y to the change in x, when the change in x is infinitesimally small, holding other variables constant. In the linear model the partial derivative is the same for all values of x; in a non-linear model it is not. For a dummy variable d, the relevant quantity is instead the discrete change in y as d changes from 0 to 1. The effect of a unit change in a variable depends on the values of all variables in the model.
ML Estimator: the value of the parameter that makes the observed data most likely and that
value also maximizes the log of the likelihood.
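A toy sketch of this definition (a Bernoulli example, not from the notes; scipy assumed): the ML estimate is the parameter value that maximizes the log-likelihood of the observed data, which here coincides with the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)
y = rng.binomial(1, 0.3, size=500)          # observed 0/1 data

def neg_log_likelihood(p):
    # Bernoulli log-likelihood: sum_i [ y_i*log(p) + (1 - y_i)*log(1 - p) ]
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("MLE of p   :", round(res.x, 3))
print("sample mean:", round(y.mean(), 3))
```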
The odds ratio is a multiplicative coefficient: "positive" effects are greater than 1, while negative effects lie between 0 and 1. A constant factor change in the odds does not correspond to a constant (or constant-factor) change in the probability, so it is important to know what the current level of the odds is.
Empirical Bayes
The OLS estimator, the group mean Ȳj, can be viewed as the best unbiased estimator of β0j in the level-1 model. Similarly, the estimated grand mean, β̂00, can be seen as an alternative estimator of β0j. A Bayesian estimator is an optimal weighted combination of these two:
β*0j = Λj·Ȳj + (1 − Λj)·β̂00,
where Λj is the reliability of the least-squares estimator Ȳj (as an estimator of β0j), or the intraclass correlation (ICC):
Λj = Var(β0j) / Var(Ȳj),
and Λj is bounded between 0 and 1. As Λj approaches 1, the Bayesian estimator is pulled toward the OLS (group-mean) estimator; otherwise, it is pulled toward the estimated grand mean β̂00. Note that because Var(Ȳj) decreases in nj, as the sample size of group j increases, the Bayes estimator gives the sample mean more weight; conversely, if the sample size is small, the Bayes estimator assigns more weight to the estimated grand mean. Similarly, the sample means b0j pool increasingly around the grand mean when the error variance is large (relative to the parameter variance).
Because the variances are unknown, β*0j is computed by substituting an estimate of Λj into the equation; since this pulls Ȳj toward the grand mean β̂00, the estimates are termed empirical Bayes estimates or shrinkage estimators.
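A minimal sketch of the shrinkage formula above (groups are simulated; the variance components are assumed known here, whereas in practice they are estimated): small groups are pulled toward the grand mean, large groups keep their own mean.

```python
import numpy as np

rng = np.random.default_rng(7)
tau2, sigma2 = 1.0, 4.0                 # between-group and within-group variance (illustrative)
grand_mean = 0.0                        # estimated grand mean beta_00 (assumed here)
group_sizes = [3, 10, 100]              # small groups get shrunk harder

true_means = rng.normal(grand_mean, np.sqrt(tau2), size=len(group_sizes))
for mu_j, n_j in zip(true_means, group_sizes):
    y_j = rng.normal(mu_j, np.sqrt(sigma2), size=n_j)
    ybar_j = y_j.mean()
    lam_j = tau2 / (tau2 + sigma2 / n_j)              # reliability of the group mean
    eb_j = lam_j * ybar_j + (1 - lam_j) * grand_mean  # empirical Bayes / shrinkage estimate
    print(f"n_j={n_j:3d}  sample mean={ybar_j:6.2f}  lambda={lam_j:.2f}  EB estimate={eb_j:6.2f}")
```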
Therefore, the random effects (RE) estimator is a weighted average between no pooling (the FE estimator) and complete pooling (the pooled OLS estimator), because it borrows information from the completely pooled mean to generate estimates of the cluster means (which helps with the problem of small cluster sizes).
Pooling is the degree to which the cluster means (the intercepts) are pulled toward the global mean of Y.
Consequences:
If Λ = 0, bGLS reduces to bW (the FE estimator); so as the sample size increases, β*0j becomes more similar to bW.
If Λ = 1, bGLS reduces to bOLS.
Controversial assumption:
We need an accurate estimate of b1. Note that b1 is fixed, so bGLS, bW, and bB are all estimates of the same parameter, b.
Note that bGLS implicitly assumes that the between and within effects are the same (i.e., both equal b). But why might they differ?
– The between effect could differ from the within effect due to omitted level-2 variables, which would then be related to the observed variable at level 1.
– "Cluster confounding": b1 could be confounded by conflicting between and within effects.
– We can address this by estimating both the between and within effects of the variable in the random effects model.
– Fundamental issue: the equality (or lack thereof) of the within and between effects.
Hausman
The Hausman test assesses the equality of the RE and FE estimates.
– The random intercept (RI) and FE estimators are both consistent (for b) if the model is correctly specified. However, if the "controversial assumption" is violated, RI becomes inconsistent while FE remains consistent.
The Hausman test checks a more efficient model against a less efficient but consistent model to make sure that the more efficient model also gives consistent results (first run FE, then RE).
The test's null hypothesis is that the coefficients estimated by the efficient random effects estimator are the same as those estimated by the consistent fixed effects estimator. If they are (an insignificant p-value, Prob > chi2 larger than .05), it is safe to use random effects; if the p-value is significant, fixed effects should be used.
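A sketch of the Hausman statistic itself (the coefficient vectors and covariance matrices below are placeholders that would come from previously fitted FE and RE models): H = (b_FE − b_RE)′[V_FE − V_RE]⁻¹(b_FE − b_RE), compared to a chi-squared distribution with k degrees of freedom.

```python
import numpy as np
from scipy import stats

# Placeholder estimates: in practice these come from fitted FE and RE models
b_fe = np.array([0.52, -0.31])
b_re = np.array([0.47, -0.36])
V_fe = np.array([[0.010, 0.001],
                 [0.001, 0.012]])
V_re = np.array([[0.006, 0.000],
                 [0.000, 0.007]])

diff = b_fe - b_re
H = diff @ np.linalg.inv(V_fe - V_re) @ diff      # Hausman statistic
p_value = 1 - stats.chi2.cdf(H, df=len(diff))
print(f"Hausman H = {H:.2f}, p = {p_value:.3f}")
print("large p-value: RE is safe; small p-value: prefer FE")
```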
Duration Dependence BTSCS
Structure of the Data: N versus T
The natural starting point is to examine the dimensions of the data in order to get an informed sense of where the possible sources of variance come from. A quick diagnostic is to check the number of units (N) and the time span (T) over which the units are observed.
Beck and Katz (1995) first noted the poor finite-sample properties of the fGLS model and its inability to deal with the nuisance of temporal correlation in TSCS data; they later treat that temporal correlation as a nuisance that can be controlled for by adding a lagged dependent variable (LDV) to the model (Beck and Katz 1996). In a subsequent article (Beck and Katz 1998), the authors also pointed out the "discrete" character of TSCS data with a binary outcome (BTSCS) and the possible temporal dependence in such binary data, which may lead to severe inefficiency in estimation. They suggest using time dummies to account for the effect of the length of the prior spell of peace and applying a spline function to characterize the slowly changing hazard rate over time. In a nutshell, Beck and Katz recommend that analysts using pooled TSCS data first correct for autocorrelation within panels and then follow their recommended PCSE procedure to correct for panel heterogeneity, and, when the data at hand are BTSCS, use time dummies to address temporal dependence across discrete time intervals.
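A sketch of the time-dummy fix for BTSCS data (the data, variable names, and spell categories below are all simulated/hypothetical; statsmodels assumed): a logit for a binary outcome with dummies for the length of the prior spell.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
rows = []
for unit in range(50):                       # 50 units observed over 30 periods
    t_since = 0                              # time since the last event ("peace years")
    for year in range(30):
        x = rng.normal()
        p = 1 / (1 + np.exp(-(-2.0 + 0.8 * x - 0.1 * t_since)))  # hazard falls with spell length
        y = rng.binomial(1, p)
        rows.append({"unit": unit, "year": year, "x": x, "y": y, "t_since": t_since})
        t_since = 0 if y == 1 else t_since + 1

df = pd.DataFrame(rows)
# Temporal dummies for the length of the prior spell (coarsened to keep the example small)
df["t_cat"] = pd.cut(df["t_since"], bins=[-1, 0, 2, 5, np.inf], labels=["0", "1-2", "3-5", "6+"])
fit = smf.logit("y ~ x + C(t_cat)", data=df).fit(disp=0)
print(fit.params.round(2))
```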
However, if researchers blindly follow Beck and Katz's PCSE approach without paying attention to the "unobserved unit heterogeneity" and dynamics within their data (whether TSCS or BTSCS), they may obtain biased estimates. To begin with, OLS with PCSE assumes a common error process and a common intercept for all units and, as Wilson and Butler (2007) illustrate, temporal correlation within units can distort regression results estimated using OLS with PCSE (which is essentially a fully pooled model). Also, if some sluggish variables (e.g., political institutions) are of interest, then measuring the variable at a different unit of analysis (e.g., replacing month with year) can have a profound impact on the statistical significance of these sluggish variables "within" units. That said, this all depends on the researchers' underlying theory and the variables they want to examine; if sluggish variables like political institutions are of interest and researchers discover that their data have substantial unit heterogeneity, then the dynamic panel model or simply the fixed effects model (FEM), as proposed by Wilson and Butler, are feasible options.
Wilson and Butler's suggestion leads us to consider the dimensionality of BTSCS data. Recall that in Beck and Katz's original approach, OLS remains the workhorse model, which, according to them, is because of its consistency. But note that OLS, when used to estimate TSCS data, is no longer consistent when T is finite, and it is more efficient than the fGLS model only when the size of N approaches that of T (Chen, Lin, and Reed 2010). Given that social science data, particularly political economy data, seldom have T larger than 30 (because of the difficulty of record-keeping in earlier centuries) while covering over a hundred countries (N), this deficiency might lead us to prefer the FEM. The use of FEM reflects that the researchers' results (as well as their concerns) are largely driven by the potentially large unobserved variance across units (Green, Kim, and Yoon 2001), but this choice does not come without cost: first, FEM is less efficient in estimating sluggish variables that have very little within variance (Beck 2001), and second, it does not allow the incorporation of time-invariant variables. Going back to our previous discussion, if sluggish variables like political institutions are of interest (e.g., the effect of political institutions on dyadic conflict) but their within variance is small relative to their between (cross-unit) variance, then FEM might give us a poor estimate. Plümper and Troeger (2007) propose the fixed effects vector decomposition (FEVD) approach, which separates the effect of the sluggish variable from the unit fixed effect and provides better standard error estimates than the conventional FEM. Nevertheless, researchers should still be aware of the magnitude (and the ratio) of the within and between variance in their data and the properties of the variables of interest: if the within variance dominates the between variance across dyads, it makes much more sense to employ the OLS model with PCSE, or even the fGLS model. Similarly, if the overall effect of a particular variable across all units is the point of concern for the researcher, especially comparativists whose utmost concern is generality, then the within effect obtained from the FEM does not offer much generality to substantiate the researcher's hypothesis.
A related problem with the dimensionality of the data concerns its hierarchical structure: observations (e.g., years) are nested within dyads, and the same variable, such as the political institutions discussed earlier, can exert distinct within and between effects across different levels of analysis, a phenomenon many researchers call "cluster confounding." While progress in multilevel/Bayesian models and the availability of commercial software have facilitated the analysis of hierarchical data, the issue of cluster confounding has seldom been seriously addressed; many researchers simply treat the between and within effects as the same, and this can have important implications for our statistical inference. If the within and between effects point in the same direction, there is little to worry about; but if they are opposite to each other, then, depending on the size of each unit-cluster, the empirical Bayes estimator (a weighted average of the sample-mean and group-mean estimates) will not accurately convey the difference in the variable's effect across the two levels, as shown in Zorn (2001b) and Neuhaus and Kalbfleisch (2007). Bartels (2008), by partitioning and rearranging the within and between effects, constructs a Hausman-like test that helps researchers assess the significance of the difference between the within and between effects.
Nevertheless, researchers should have a good understanding of the structure of their data
before proceeding to employ certain statistical models.
Duration Dependence
Another major issue confronting researchers using BTSCS data is duration dependence, in which the hazard of a current event occurrence is a function of prior event occurrences; this nuisance can affect the validity of our inferences about the relationship between the covariates and the duration time.
The first issue is repeated events. It is possible that a dyad experiences multiple events over the observed period and that a dyad is at risk of experiencing a second dyadic conflict only after it has experienced its first. This violates the proportional hazards assumption, under which the hazard of experiencing the event is constant across events. That is, there is substantial "duration dependence" in the outcome, and the probability that an event occurs is not simply a function of the covariates of interest but is also conditional on the timing and number of prior event occurrences. Ignoring this problem can lead to serious estimation inefficiency.
While many survival models allow for a more flexible (nonconstant) baseline hazard (such as the Cox model, whose baseline hazard is treated as a nuisance and integrated out of the likelihood), they still do not explicitly model the duration dependence that results from prior event occurrences or the non-proportionality of covariates. If the hazard rates change over the rank of survival time, the estimates become less precise.
A second problem, as defined previously, is related to the structure of BTSCS data. BTSCS data are grouped duration data (Beck and Katz 1998) in which event occurrences are recorded in fixed, discrete time intervals (say, dyad-years) and are observed only once per interval. This data structure makes many modeling strategies designed for continuous time inappropriate for BTSCS data.
Finally, as in a hierarchical data structure, units sometimes exhibit unit-specific unobserved heterogeneity, distinct from the aggregate group effect, that makes some units more likely than others to experience certain events.
To correct for duration dependence in BTSCS data, Beck and Katz (1998) propose including temporal dummies in the original probit/logit specification to test whether grouped observations are temporally related. The two authors also suggest using cubic splines, which provide a smooth characterization of the duration dependency. Beck and Katz's method gives analysts an easily accessible way to deal with duration dependence in BTSCS data, but it does not pay proper attention to non-proportional hazards or unit heterogeneity.
If, after careful examination, researchers expect non-proportional hazards and unit heterogeneity to exist in their data, several diagnostic and correction methods can be applied.
Steffensmeier and Zorn (2001) recommend various procedures for treating the non-proportional hazards problem. First, they suggest a residual-based diagnostic that tests the relationship between an observation's Schoenfeld residual and the length of its survival time. If the proportional hazards assumption holds, researchers should find no relationship between the residuals and the rank of the survival time; otherwise, the fitted model may over- or under-estimate the underlying hazard rates.
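A sketch of this residual-based diagnostic (assuming the lifelines package; the data are simulated): fit a Cox model and test whether the scaled Schoenfeld residuals are related to the rank of survival time.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.statistics import proportional_hazard_test

rng = np.random.default_rng(9)
n = 500
x = rng.binomial(1, 0.5, size=n)
t = rng.exponential(scale=np.where(x == 1, 5.0, 8.0))   # covariate shifts the hazard
df = pd.DataFrame({"duration": t, "event": 1, "x": x})

cph = CoxPHFitter().fit(df, duration_col="duration", event_col="event")
# Correlates the scaled Schoenfeld residuals with the rank of survival time
result = proportional_hazard_test(cph, df, time_transform="rank")
print(result.summary)
```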
Steffensmeier and Zorn also suggest a piecewise regression that partitions the observed survival times at their mean to see whether certain covariates exhibit non-proportionality across the two time segments, and then multiplying these "offending covariates" by the log of time to correct for their non-proportionality. This procedure leads to a better-specified model and contributes to greater accuracy, though at the risk of some inefficiency from including covariates in the piecewise regression test that are in fact proportional.
If researchers find non-proportional hazards and unit heterogeneity to be serious problems in their data, and "repeated events" are a common phenomenon across units, then the conditional frailty model (Steffensmeier and De Boef 2006; Steffensmeier, De Boef, and Joyce 2007), measured in gap time with stratification, should be considered. Unlike the unstratified Andersen-Gill method, the model assumes an observation is not at risk for a later event until it has experienced all prior events; it treats each event occurrence as a different failure type and stratifies observations by these failure types, which allows each stratum to have a different baseline hazard function and thereby captures the "non-proportionality" of hazard rates over time. Another flexibility afforded by this model is that it specifies unshared random effects for each unit, thus controlling for the unobserved "frailty" across units that might otherwise bias the estimation.
Nevertheless, researchers using the conditional frailty model should make sure that the amount of variation along the relevant dimensions is sufficient. Variation in the number of events experienced within a unit provides the statistical basis for incorporating random effects to model a unit's frailty; variation in the timing of multiple events provides information on the degree of event dependence. Both sources of variation are required for the proper use of the conditional frailty model on BTSCS data.
The conditional logit model differs from the standard logit model in that the data are grouped according to some attribute and the likelihood is estimated conditional on each grouping; any factor that remains constant within a grouping is perfectly collinear with the group (in the conditional logit model here, the grouping is the country-dyad).
For an underlying continuous-time process, the GLS estimate of B is
B_GLS = (X′Ω⁻¹X)⁻¹X′Ω⁻¹Y,
where Ω is the variance-covariance matrix of the errors and is assumed constant (a common error process) within and across units. This assumption may be violated when errors are contemporaneously correlated because of panel heterogeneity and/or when observations are temporally correlated (i.e., serial correlation exists within units). The variance of the OLS coefficients then takes the sandwich form
Var(B) = (X′X)⁻¹(X′ΩX)(X′X)⁻¹.
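A small numpy sketch of the two expressions above (Ω here is just an illustrative diagonal matrix): the GLS estimator and the sandwich form of the OLS coefficient variance that PCSE-type corrections build on.

```python
import numpy as np

rng = np.random.default_rng(10)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 0.5, -0.2])
Omega = np.diag(rng.uniform(0.5, 2.0, size=n))          # illustrative error covariance
u = rng.normal(0.0, np.sqrt(np.diag(Omega)))
y = X @ beta + u

Omega_inv = np.linalg.inv(Omega)
b_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)   # (X'Ω⁻¹X)⁻¹X'Ω⁻¹Y

XtX_inv = np.linalg.inv(X.T @ X)
var_b_ols = XtX_inv @ (X.T @ Omega @ X) @ XtX_inv       # (X'X)⁻¹(X'ΩX)(X'X)⁻¹

print("GLS estimate :", b_gls.round(2))
print("sandwich SEs :", np.sqrt(np.diag(var_b_ols)).round(3))
```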
Terminology
Probability theory: the quantitative analysis of random phenomena. It concerns how a random variable (a numerical value whose generation follows a stochastic process, is expected to change from draw to draw, and, if measured on an infinite continuous scale, takes any specific value with probability approaching zero) is generated, and how events occur according to a non-deterministic (stochastic) process in which the initial condition is known but the future evolution has many possible paths, some more probable than others.
It assumes that if a random event were repeated many times then, by the law of large numbers and the central limit theorem, the sequence of outcomes would exhibit systematic patterns that can be observed and estimated.
Technically, probability theory makes use of mathematical tools such as density functions, probability distributions, and sample spaces to characterize the occurrence of random variables and events. It is a theory because it covers many systematic concepts that help analysts understand "randomness" and explain the mechanics behind event occurrence, and, most importantly, it provides the mathematical foundation for modern statistics and the prediction of event occurrence.
Random variable: a measurable function that assigns a unique numerical value to each possible outcome of a (usually infinitely repeatable) well-defined experiment.
Continuous rv: a random variable that takes on any particular real value with zero probability.
Probability function: a measure of the probable distribution of some random variable,
which assigns the likelihood of occurrence for all possible outcomes.
PDF: summarizes the information concerning the possible outcomes of X and gives the
probability that a random variable takes on each value.
Parameter: an unknown value that describes a population relationship.
Estimator: a rule for combining data to produce a numerical estimate of a population parameter; the rule itself does not depend on the particular sample obtained (because the sample is randomly chosen).
Sampling: the practice of selecting a subset of population with a purpose to obtain some
knowledge of a statistical population
Central Limit Theorem: an element of probability theory stipulating that the standardized average of a sufficiently large number of independent and identically distributed random variables has a distribution that tends to the standard normal.
Correlation Coefficient: a measure of linear dependence between two random variables
that does not depend on the unit of measurement and is bounded between -1 and 1.
Critical value: In hypothesis testing, it is a value against which a test statistic is compared to
determine whether or not the null hypothesis is rejected.
Interaction effect: the partial effect of an explanatory variable depends on the value of a
different explanatory variable.
Marginal Effect: The effect on the DV that results from changing an IV by a small amount.
MLE: an estimation method in which the parameter estimates are chosen to maximize the log-likelihood function.
Overdispersion: in modeling a count variable, the variance is larger than the mean
Plim: the value to which an estimator converges as the sample size grows.
Prediction: the estimate of an outcome obtained by putting specific values of the IVs into an
estimation model.
Random Sample: a sample obtained by sampling randomly from the specified population
Robustness: the property of a model or model specification that performs well even if its assumptions are somewhat violated by the true data-generating process or under alternative specifications.
Sample selection bias: a bias that is induced by using data that arise from endogenous
sample selection
Sampling distribution: the probability distribution of an estimator over all possible sample
outcomes
Serial correlation: correlation between the errors in different time periods
Variance: a measure of spread in the distribution of a random variable
Counterfactual and Matching
In experimental studies or large-N social science research, researchers often assign subjects to treatment (the treatment group) and non-treatment status (the control group) in order to examine the effect of the treatment on the outcome of interest. But researchers often confront problems arising from 'unobserved heterogeneity' and thus have difficulty ascertaining whether an observed effect is attributable to the treatment. Freedman (1999) provides an account of how to make causal inferences based on association in natural experiments.
Rubin (1990) proposes the SUTVA assumption, stipulating that the potential outcomes for any particular unit i under treatment t are stable, such that for every allowable treatment allocation there is a corresponding set of fixed (non-stochastic) potential outcomes that would be observed. Treatment must be randomly assigned in order to control for the 'unobserved heterogeneity' that might distort our causal inference. However, subjects (usually humans) differ in their personal attributes and initial conditions, so random assignment may not remedy all the heterogeneity that exists between units.
From the perspective of controlling individual-specific heterogeneity, a unit itself is its best control, but during the course of an experiment a unit cannot undergo both treatment and control, so using the unit itself as its own control is infeasible and we will never know what would have been observed had the treatment t been turned off; this counterfactual dilemma is what Holland (1986) calls 'the fundamental problem of causal inference.' Rubin also suggests randomly assigning treatment to the same individual at different time points to gauge the average treatment effect over time, and extending this experiment to multiple individuals to generalize that effect. But this procedure still does not answer the fundamental question of how to find an appropriate 'control,' and the SUTVA assumption can easily be violated if:
(1) Treatment varies in effectiveness: the 'outcome' is then a distribution of potential outcomes rather than a fixed potential outcome, so 'what is the effect of treatment t?' is not a well-posed causal question; in effect the subject receives two treatments.
(2) There is interference between units: this makes our inference spurious because the observed effect did not come from the assigned treatment (Morgan and Winship 2007).
In experimental studies these potential violations pose little problem, because experiments are usually conducted in a controlled environment. But in social science, randomization poses a great challenge: in observational data (as is usually the case with social science data), there is a non-random assignment mechanism and researchers cannot control the assignment of "treatment." Thus, how to minimize unobserved heterogeneity across observations and approximate randomization has become a central problem for social science statistics, and statistical matching techniques are a response to this challenge.
Rosenbaum and Rubin (1983) first proposed propensity score matching (PSM) as a technique for selecting a subset of comparison units similar to the treated units, which can serve as counterfactuals in non-experimental settings. The propensity score measures the probability of a unit being assigned to a particular condition in a study given a set of known characteristics; statistically, it is the conditional probability of a unit being selected into treatment given the independent variables X.
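A sketch of this basic workflow (simulated data; scikit-learn assumed): estimate Pr(treated | X) with a logistic regression, then match each treated unit to the nearest control unit on that score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(11)
n = 2_000
X = rng.normal(size=(n, 3))                               # known characteristics
p_treat = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))    # non-random assignment
treated = rng.binomial(1, p_treat).astype(bool)
y = X[:, 0] + 2.0 * treated + rng.normal(size=n)          # true treatment effect = 2

# Propensity score: Pr(treated = 1 | X)
pscore = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# 1-nearest-neighbor matching of treated units to control units on the score
nn = NearestNeighbors(n_neighbors=1).fit(pscore[~treated].reshape(-1, 1))
_, idx = nn.kneighbors(pscore[treated].reshape(-1, 1))
matched_controls = y[~treated][idx.ravel()]

att = (y[treated] - matched_controls).mean()
print("naive difference in means:", round(y[treated].mean() - y[~treated].mean(), 2))
print("matched ATT estimate     :", round(att, 2))
```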
Alternatively, Morgan and Winship (2007) present the conventional regression model as a conditional-variance-weighted matching procedure. The authors treat regression as executing a stratification, through which observations are distributed across different strata (conditional on observed characteristics) and the conditional variance assigns a weight to each stratum according to its stratum-specific variance. In a nutshell, Morgan and Winship equate the stratification estimator with matching. In addition, they suggest interacting the treatment dummy with all observed characteristics as a way to capture how the effect of treatment status varies with observed characteristics across the whole sample (including the hypothesized average treatment effect on the control group, ATC).
While PSM reduces selection bias by equating groups on many known characteristics and provides an alternative method for estimating treatment effects when treatment assignment is not random, Morgan and Winship's stratification-as-matching technique helps distinguish stratum-specific effects from the treatment effect. Both offer an easy-to-implement procedure and a linear interpretation.
However, the validity of PSM may not hold when the data have only limited observations and there is little overlap in known characteristics between observations (essentially a small-sample problem), making the search for matched pairs extremely difficult (Shadish, Cook, and Campbell 2002). Pearl (2009) suggests instead modeling the qualitative causal relationships between treatment, outcome, and unobserved variables. The same argument also applies to Morgan and Winship's stratification technique. Additionally, the stratification estimator as matching only provides consistent estimates when the true propensity score does not differ by strata (or, equivalently, when the average stratum-specific causal effect does not vary by strata). Conversely, if we stratify the data too finely, we may encounter "overmatching," which leaves little variation to be explained across observations (Manski 1995). Also, if no observations fall within the intersection of certain observed characteristics and treatment status, we cannot use those cells as counterfactual or control groups. Lastly, both approaches only control for observed variables, so unobserved heterogeneity may well persist because of still-unobservable omitted variables.
Recently, Ho et al. (2007) have proposed a non-parametric approach as a general procedure for researchers to implement in multivariate analysis: first balance the data with a conventional matching routine, then estimate a regression model on the balanced data. Ho et al.'s matching procedure can be interpreted as a preprocessing step that prepares the data for subsequent analysis using conventional parametric methods.
In light of this, we can infer that for "matching" to match properly, the data under consideration must, as with all regression models, contain sufficient observations, and the groups must share enough overlap in known characteristics that matched units are close to one another in terms of their propensity scores (which minimizes within-stratum heterogeneity). Only then can "matching" provide a statistically well-behaved estimate and a theoretically sound interpretation of the causal effect.