Download Extended abstract - Conference

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Time series wikipedia , lookup

Interaction (statistics) wikipedia , lookup

Linear regression wikipedia , lookup

Choice modelling wikipedia , lookup

Instrumental variables estimation wikipedia , lookup

Regression analysis wikipedia , lookup

Data assimilation wikipedia , lookup

Coefficient of determination wikipedia , lookup

Transcript
Correcting for misclassification under edit restrictions
in combined survey-register data using Multiple
Imputation Latent Class modelling (MILC)
Laura Boeschoten ([email protected])1 2, Ton de Waal1 2, Daniel Oberski1 3 and Marcel
Croon1
Keywords: misclassification, composite dataset, edit rules, three-step approach.
1.
INTRODUCTION
National Statistical Institutes (NSIs) often use large datasets to estimate population tables
on many different aspects of society. A way to create these rich datasets as efficiently and
cost effectively as possible is by utilizing already available register data. When more
information is required than already available, registers can be supplemented with survey
data [1]. Composite data containing both surveys and registers were for example used when
constructing the different population tables for the 2011 Dutch Census [2].
However, caution is advised as both surveys and registers can contain classification errors.
These can be detected when the separate datasets within the composite dataset contain
variables measuring the same attribute, or when combinations of scores on different
variables are in practice impossible.
To estimate the number of classification errors in a composite dataset and to
simultaneously impute a new variable which takes the uncertainty caused by these
classification errors into account, we developed a method which combines Multiple
Imputation (MI) and Latent Class (LC) analysis [3]. With this method it is possible to
obtain estimates that are consistent and that take edit rules into account, which is especially
useful within official statistics since cells in cross tables that represent a combination of
scores that is in practice impossible, should contain zero observations [1].
However, if a researcher is interested in estimating a cross table between variable that has
been imputed by the MILC method and another variable, this variable should be taken into
account in the LC model as a covariate. However, it is not always possible incorporate this
covariate in the LC model at the moment the model was constructed. Reasons for this can
be that not all necessary information was available at the time the LC model was
constructed, or that the researcher was not interested in the relation with this covariate at
that time. However, not incorporating this covariate in the LC model leads to biased results
when this cross table is estimated.
When LC analysis is applied, a solution to this problem is found by using the β€˜3-step’
approach. With this approach, a variable containing LC assignments is related to covariates
that were not taken into account in the initial LC model. This is done by taking the relation
between the assigned LCs and the `true' LCs into account [4].
In this paper, we illustrate how we incorporated the three-step approach into the MILC
method. In the next section, we discuss the methodology of the three-step approach and
the MILC method. Next, we discuss the setup and first results of a simulation study which
1
Tilburg University
2
Statistics Netherlands
3
Utrecht University
1
investigates the performance of the MILC method in combination with the three-step
approach.
2.
METHODS
When relations between latent variables and other variables are estimated, these variables
need to be included in the LC model as covariates. Edit rules can also be taken into account
in this way. This can be done in the initial LC model, or later on by making use of the
three-step method.
2.1.
The three-step approach
When the three-step method is used, the measurement model for the relationship between
the latent variable and its indicators is built in the first step. Here we have L indicator
variables (Y1,…,YL) of the latent property X, and X has C categories (a specific category is
denoted by x (x=1,…,C)). The LC model for response pattern P(Y=y) is then estimated as:
𝑃(π‘Œ = 𝑦) = βˆ‘πΆπ‘₯=1 𝑃(𝑋 = π‘₯) βˆπΏπ‘™=1 𝑃(π‘Œπ‘™ = 𝑦𝑙 |𝑋 = π‘₯)
(1)
Next, the posterior membership probabilities can be obtained:
𝑃(𝑋 = π‘₯|π‘Œ = 𝑦) = βˆ‘πΆ
𝑃(𝑋=π‘₯) βˆπΏπ‘™=1 𝑃(π‘Œπ‘™
= 𝑦𝑙 |𝑋 = π‘₯ )
π‘Œπ‘™ = 𝑦𝑙 |𝑋 = π‘₯)
𝐿
π‘₯=1 𝑃(𝑋=π‘₯) βˆπ‘™=1 𝑃(
(2)
The posterior membership probabilities give the probability that a unit is member of a
latent class given the combination of scores on the indicators. When drawing from these
probabilities, a new imputed variable can be created, W. Creating imputed variable W can
be considered as the second step.
In the third step, the predicted class membership variable W is used in further analysis with
other variables. Although we investigate the relation between imputed variable W and
covariates (which we denote by Z), we are actually interested in the relation between the
`true' latent variable X and Z.W is not exactly equal to X (there is some classification error)
and this should be taken into account.
This can be done by using information about how the X×Z distribution is related to the
W×Z distribution. Therefore, we specify an LC model where we use W as an indicator of
X and define the form of the X-Z distribution:
𝑃(π‘Š = 𝑀, 𝑍 = 𝑧) = βˆ‘πΆπ‘₯=1 𝑃(𝑋 = π‘₯, 𝑍 = 𝑧)𝑃(π‘Š = 𝑀|𝑋 = π‘₯)
(3)
We assume here that Z is independent of Y given X, and we see that the W and Z distribution
are weighted sums of the entries in the X and Z distribution, where the weights are the
misclassification probabilities P(W=w|X=x). The relationship between W and Z can be
obtained by adjusting the relationship between W and Z for the misclassification
probabilities P(W=w|X=x) [5]. P(W=w|X=x) gives the probability of a certain class
assignment conditional on the true class and is a quantification of the overall quality of the
classification obtained from the LC model in the first step [6]. The larger the probability
for w=x, the better the classification. Using the LC parameters this quantity can be obtained
as follows:
𝑃(π‘Š = 𝑀|𝑋 = π‘₯) = βˆ‘π‘Œπ‘¦=1
𝑃(π‘Œ=𝑦)𝑃(𝑋=π‘₯|π‘Œ=𝑦)𝑃(π‘Š=𝑀|π‘Œ=𝑦)
𝑃(𝑋=π‘₯)
2
(4)
Adjusting the relationship between W and Z in this manner is also known as the Maximum
Likelihood (ML) approach. Another option is to use the Bolck-Croon-Hagenaars (BCH)
approach [6].
2.2.
Incorporating the three step approach into the MILC method
The MILC method starts with a composite dataset in which multiple data sources are linked
on unit level. The different sources contain the same variable. The differences between the
scores on these variables represent classification errors in one or more of the variables. The
first step is to take m bootstrap samples from the original composite dataset. The second
step is to create an LC model for every bootstrap sample. The L variables that measure the
same property but originate from the different data sources are used as indicators of a latent
`true' variable X [3]. After the m LC models have been estimated, the posterior membership
probabilities of the m LC models are used to draw m new imputed variables, W1,…,Wm.
The estimates of interest can now be obtained from the m imputed variables, and can be
pooled by making use of the pooling rules defined by Rubin [8].
To incorporate the three-step approach into the MILC method, we apply β€˜step-three’ on
the m imputed variables W1,…,Wm. This means that we estimate P(W=w|X=x) and
incorporate this when estimating P(X=x,Z=z) with either the ML method or the BCH
method. We can then obtain new posterior membership probabilities for the LCs
(P(W=w|Z=z)) and by sampling from these obtain m new imputed variables, which can be
used to analyse the relation between W and Z. Estimates of interest can again be pooled by
the rules defined by Rubin [7].
3.
SIMULATION STUDY
3.1.
Simulation setup
To empirically evaluate the performance of the three step approach incorporated in the
MILC method, we conducted a simulation study. The main properties of this simulation
study are summarized as follows:
ο‚·
ο‚·
ο‚·
Different classification probabilities of the three dichotomous indicators of X.
Impossible combinations between variables. Specific cell containing the
impossible combination is P(X=1,Z=2)
Strength of relationship between variables: different logit coefficients of X
regressed on Q.
We investigate the bias after only the first step is implemented in the MILC method (as a
baseline) and when the third step is implemented using ML.
3
3.2.
Simulation results
Figure 1: Bias of the logit coefficient of latent variable X regressed on covariate Q in
situations with different values for the classification probabilities, proportions of
P(Z=2) and values for the coefficient itself. Sample size is 1,000.
In figure 1, we see the bias of the logit coefficient of latent variable X regressed on
covariate Q under the different simulation conditions. Here, we compare β€˜step 1’, where
covariate Q was not taken into account by the LC model for X, with β€˜step 3’, where
covariate Q was added to the LC model for X by making use of the ML method of the
three-step approach. When investigating the figure, it is immediately clear that using β€˜step
1’ (so not adding covariate Q to the LC model of X) produces bias when estimating the
relation between X and Q. This is especially the case when the classification probabilities
are low (0.70, 0.80). Furthermore, we see that this bias is not influenced by P(Z=2). We
see that for β€˜step 3’, when covariate Q is added to the LC model of X by using the ML
method of the three-step approach, the bias is much smaller under all conditions. Only with
low classification probabilities, some bias is still detected, under other conditions, it is
approximately 0.
4.
CONCLUSIONS
The first simulation results have shown that approximately no bias is created when the
three-step approach is used when the relation between a latent variable and a covariate is
investigated, while not taking the covariate into account (only using β€˜step 1’), will result
in large amounts of bias.
In the presentation we plan to present whether comparable results are obtained when edit
rules are applied (relationship between Z and X), when a sample with a larger size is used
and when other methods for applying the three-step approach are used (the BCH).
Furthermore, we also plan to describe what results are obtained in terms of coverage of the
95% confidence interval, and of the ratio of the average standard error over the estimate
over the standard deviation of the estimates. At last, we plan to present how the three-step
approach can be incorporated into other methodology used in practice, such as plausible
values or imputation.
REFERENCES
[1] T. de Waal, Obtaining numerically consistent estimates from a mix of administrative
data and surveys. Statistical Journal of the IAOS (2015), 1-13.
4
[2] E. Schulte Nordholt, J. Van Zeijl & L. Hoeksma, Dutch Census 2011, analysis and
methodology. The Hague/Heerlen (2014).
[3] L. Boeschoten, D. Oberski and T. de Waal, Estimating classification error under edit
restrictions in combined survey-register data, CBS discussion paper (2016)
[4] Z. Bakk, F.B. Tekle, and J.K. Vermunt, Estimating the association between latent class
membership and external variables using bias adjusted three-step approaches, Sociological
Methodology, vol.43, 1 (2013) 272-311.
[5] J.K. Vermunt, Latent class modelling with covariates: Two improved three-step
Approaches. Political Analysis, 18 (2010) 450-469.
[6] A. Bolck, M. Croon, and J.A. Hagenaars, Estimating latent structure models with
categorical variables: One-step versus three-step estimators. Political Analysis, 12 (2004)
3-27.
[7] D.B. Rubin, Multiple imputation for nonresponse in surveys, John Wiley and Sons,
Vol. 81 (1987).
5