Multilevel modelling of survey data under two-stage
design: inference for regression parameter and small
domain means by using an empirical likelihood
approach
Melike Oguz-Alper (1) and Yves G. Berger (2)
Keywords: Design-based inference, generalised estimating equation, empirical
likelihood, two-stage sampling, small domain estimation, unequal inclusion probability.
1. INTRODUCTION
Data used in social, behavioural, health or biological sciences may have hierarchical
structure due to the population of interest or the sampling design. Multilevel or marginal
models are often used to analyse such hierarchical data, or to estimate small domain
means. Hierarchical sample data may be selected with unequal probabilities from a
clustered and stratified population. The sample design is informative when the selection
probabilities are associated with the study variable after conditioning on the model
covariates. Ignoring informativeness may lead to invalid inference for the model parameters [1].
We propose using a design-based profile empirical likelihood approach to make
inference for the regression parameters, which are defined as the solutions to generalised
estimating equations, and for small domain means. This approach can be used for point
estimation, hypothesis testing and confidence intervals. It provides asymptotically valid
inference for the finite population parameters. Observed coverages close to the nominal
level are obtained for small domain means. We consider a two-stage sampling design, where
the first-stage units are
selected with unequal probabilities. We assume that the model and the design have the
same hierarchical structure [e.g. 1, 2, 3]. We consider an ultimate cluster approach [4],
where the empirical likelihood function is defined at the ultimate cluster level.
2. METHODS
Let $U$ be a finite population comprised of $N$ disjoint finite primary sampling units
(PSUs) $U_i$ of sizes $N_i$. Let $s$ be the sample of PSUs, selected with replacement with
unequal probabilities $p_i$ [5] from $U$, where $\sum_{i \in U} p_i = 1$. Let $n$ denote the
fixed number of draws from $U$. We assume that the sampling fraction $n/N$ is negligible.
Let $\pi_i = n p_i$. The sample $s$ can also be a without-replacement set of units, because
sampling with and without replacement are asymptotically equivalent when $n/N$ is
negligible [6, p.112]. Under sampling without replacement, $\pi_i$ denotes the inclusion
probability of unit $i$. Let $s_i$ be the sample of secondary sampling units (SSUs), of
size $n_i$, selected with conditional probabilities $\pi_{j|i}$ within the $i$th PSU
selected at the first stage. We assume that the sizes of the PSUs, $N_i$, are
asymptotically bounded. Let $y_{ij}$ be the values of a variable of interest and $x_{ij}$
the vector of values of explanatory variables. The variables $y_{ij}$ and $x_{ij}$ are
associated with the $j$th unit within the $i$th cluster, where $i = 1, \dots, N$ and
$j = 1, \dots, N_i$. Consider the multilevel model

$$y_{ij} = x_{ij}^\top \beta + u_i + e_{ij}, \qquad (1)$$

where $u_i$ and $e_{ij}$ are independent random variables with zero means and variances
$\sigma_u^2$ and $\sigma_e^2$ respectively.

1 Statistics Norway, Postboks 8131 Dep, 0033, Oslo, Norway. This research is funded by the
Economic and Social Research Council, United Kingdom.
2 University of Southampton, Southampton Statistical Sciences Research Institute,
Southampton, SO17 1BJ, United Kingdom.
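The setup in this section can be illustrated with a small simulation. The sketch below is only an illustration under assumptions that are not taken from the paper: a two-level random-intercept model with one covariate, PSU sizes drawn uniformly between 5 and 20, draw probabilities proportional to PSU size, and a fixed number of SSUs per sampled PSU. The function names `simulate_population` and `two_stage_sample` are hypothetical.

```python
import numpy as np

def simulate_population(n_clusters, beta, sigma_u, sigma_e, rng):
    """Generate a two-level population y_ij = x_ij' beta + u_i + e_ij (assumed form)."""
    sizes = rng.integers(5, 21, size=n_clusters)        # illustrative PSU sizes N_i
    cluster = np.repeat(np.arange(n_clusters), sizes)   # PSU label of every unit
    x = np.column_stack([np.ones(cluster.size), rng.normal(size=cluster.size)])
    u = rng.normal(0.0, sigma_u, size=n_clusters)       # PSU-level random effects
    e = rng.normal(0.0, sigma_e, size=cluster.size)     # unit-level errors
    y = x @ beta + u[cluster] + e
    return cluster, x, y, sizes

def two_stage_sample(cluster, sizes, n_draws, m_per_psu, rng):
    """Draw PSUs with replacement with unequal probabilities, then SSUs by SRS."""
    p = sizes / sizes.sum()                             # draw probabilities, sum to 1
    draws = rng.choice(sizes.size, size=n_draws, replace=True, p=p)
    pi = n_draws * p                                    # pi_i = n * p_i
    sample = []
    for i in draws:
        units = np.flatnonzero(cluster == i)            # population units of PSU i
        take = min(m_per_psu, units.size)
        sample.append((int(i), rng.choice(units, size=take, replace=False)))
    return sample, pi

rng = np.random.default_rng(0)
cluster, x, y, sizes = simulate_population(50, np.array([1.0, 2.0]), 1.0, 1.0, rng)
sample, pi = two_stage_sample(cluster, sizes, n_draws=10, m_per_psu=3, rng=rng)
```

Because the draws are made with replacement, the same PSU can appear more than once in `sample`; with a negligible sampling fraction this happens rarely.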
2.1. The regression parameter of interest
The finite population parameter of interest is the solution to the population
generalised estimating equation [e.g. 7] given by

where and , where [e.g. 8, p.174], where is the identity matrix and is the column vector
of ones. The sample weighted estimator is defined as the solution to the sample estimating
equation (2) [2, p.4], where and are the sample-based sub-matrices of and , which contain
the observations of the sample , and is a sample-based estimator of [e.g. 8, p.174].
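A weighted estimating equation of this kind can be solved in closed form for a linear model. The sketch below is a simplified illustration, not the authors' estimator: it assumes an exchangeable working covariance $\sigma_e^2 I + \sigma_u^2 1 1^\top$ with known variance components, and cluster-level weights $w_i$ (for example $1/\pi_i$) with no second-stage weighting. The names `exchangeable_cov` and `weighted_gee_beta` are hypothetical.

```python
import numpy as np

def exchangeable_cov(m, sigma_u2, sigma_e2):
    """Working covariance sigma_e^2 * I + sigma_u^2 * 1 1' for a cluster of size m."""
    return sigma_e2 * np.eye(m) + sigma_u2 * np.ones((m, m))

def weighted_gee_beta(clusters, weights, sigma_u2, sigma_e2):
    """Solve sum_i w_i X_i' V_i^{-1} (y_i - X_i beta) = 0 for beta.

    clusters: list of (X_i, y_i) pairs holding the sampled rows of each cluster.
    weights:  one weight per cluster (assumed here to be 1 / pi_i).
    """
    p = clusters[0][0].shape[1]
    a = np.zeros((p, p))
    b = np.zeros(p)
    for (X, y), w in zip(clusters, weights):
        V = exchangeable_cov(len(y), sigma_u2, sigma_e2)
        V_inv_X = np.linalg.solve(V, X)   # V^{-1} X, avoids forming the inverse
        a += w * (X.T @ V_inv_X)          # w_i X' V^{-1} X
        b += w * (V_inv_X.T @ y)          # w_i X' V^{-1} y (V is symmetric)
    return np.linalg.solve(a, b)
```

In practice the variance components would be replaced by sample-based estimates, as the text indicates; they are treated as known here only to keep the example short.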
2.2. Domain means of interest
Let be a domain of interest, where . Let if the unit is within the domain and otherwise.
Let denote the population size of . Under model (1), the population mean of is where .
Assuming and , the consistent finite population parameter is . Assuming known, a
design-consistent estimator of the finite population parameter is the regression
synthetic estimator [e.g. 8, p.36] given by
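A regression synthetic estimator of a domain mean has the generic form $\bar{x}_d^\top \hat{\beta}$, where $\bar{x}_d$ is the known population mean of the covariates in domain $d$. The following minimal sketch assumes that form and that covariates are available for every population unit; the function name `synthetic_domain_means` and its arguments are hypothetical.

```python
import numpy as np

def synthetic_domain_means(x_pop, domain, beta_hat):
    """Return {domain label: xbar_d' beta_hat} from population covariates.

    x_pop:    (units, p) array of covariates for all population units.
    domain:   array of domain labels, one per population unit.
    beta_hat: estimated regression parameter of length p.
    """
    x_pop = np.asarray(x_pop, dtype=float)
    domain = np.asarray(domain)
    return {
        d: float(x_pop[domain == d].mean(axis=0) @ beta_hat)
        for d in np.unique(domain)
    }

x_pop = np.array([[1.0, 0.0], [1.0, 2.0], [1.0, 4.0], [1.0, 6.0]])
means = synthetic_domain_means(x_pop, [0, 0, 1, 1], np.array([1.0, 0.5]))
```

The estimate for a domain depends on the sample only through `beta_hat`, which is why such an estimator remains defined for domains with few or no sampled units.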
2.3. Empirical likelihood approach
Consider the empirical log-likelihood function [9] given by

where the are unknown scale-loads allocated to the [10] and denotes the vector of . Let
the maximize subject to the constraints and , where and . The maximum value of under
and (5) is given by
Suppose that the parameter of interest is a sub-parameter of ; that is, where is a
nuisance parameter. The profile empirical log-likelihood function is defined by .
The maximum empirical likelihood estimator that maximizes is given by (2).
Under some regularity conditions [11], we have that in distribution with respect to the
sampling design and can be used for testing hypotheses. Confidence intervals can be
constructed based on (7), when is scalar.
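How an empirical likelihood ratio yields a confidence interval can be shown on a deliberately simplified case. The sketch below assumes a single-stage sample, a scalar mean $\theta$ with estimating function $g_i(\theta) = y_i - \theta$, and the constraints $m_i > 0$, $\sum_i m_i \pi_i = n$ and $\sum_i m_i g_i(\theta) = 0$. Under these assumptions the maximising weights are $m_i = 1 / (\pi_i + \eta g_i)$, and the ratio statistic is $2 \sum_i \log(1 + \eta g_i / \pi_i)$. Referring it to a chi-squared distribution with one degree of freedom is also an assumption of this sketch, and the names `el_ratio` and `el_interval` are hypothetical.

```python
import numpy as np

CHI2_1_95 = 3.841458820694124  # 0.95 quantile of chi-squared with 1 d.f.

def el_ratio(g, pi):
    """Return 2 * sum(log(1 + eta * z_i)) with z_i = g_i / pi_i.

    eta solves sum(z_i / (1 + eta * z_i)) = 0, found by bisection.
    """
    z = np.asarray(g, dtype=float) / np.asarray(pi, dtype=float)
    if z.min() >= 0 or z.max() <= 0:
        return np.inf                        # no positive weights satisfy the constraint
    lo, hi = -1.0 / z.max(), -1.0 / z.min()  # keeps 1 + eta * z_i > 0 for all i
    eps = 1e-10 * (hi - lo)
    lo, hi = lo + eps, hi - eps
    for _ in range(200):
        eta = 0.5 * (lo + hi)
        if np.sum(z / (1.0 + eta * z)) > 0:  # the sum is decreasing in eta
            lo = eta
        else:
            hi = eta
    eta = 0.5 * (lo + hi)
    return 2.0 * np.sum(np.log1p(eta * z))

def _bound(y, pi, inside, outside, crit):
    """Bisect between a point inside the interval and a point outside it."""
    for _ in range(100):
        mid = 0.5 * (inside + outside)
        if el_ratio(y - mid, pi) <= crit:
            inside = mid
        else:
            outside = mid
    return inside

def el_interval(y, pi, crit=CHI2_1_95):
    """Confidence interval {theta : el_ratio(theta) <= crit} for a weighted mean."""
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    theta_hat = np.sum(y / pi) / np.sum(1.0 / pi)   # solves sum (y_i - theta) / pi_i = 0
    return _bound(y, pi, theta_hat, y.min(), crit), _bound(y, pi, theta_hat, y.max(), crit)
```

The interval is obtained by inverting the ratio statistic, so it needs no variance estimate and is not forced to be symmetric around the point estimate.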
When the finite population parameters of interest are domain means (3), in (5) and and
in (6) are respectively replaced by , and . We propose treating the regression parameters
as the nuisance parameter and the finite set of as the parameter of interest. Thus the
profile empirical log-likelihood function considered is

where denotes the parameter space of .
The maximum empirical likelihood estimator that maximizes is given by (4) with being the
solution to (2). We have .
3. RESULTS
In this section, we report the observed coverages of the empirical likelihood confidence
intervals for domain means (3). The population is generated from

where , , , with and , with . Here, and are selected randomly with replacement among the
values and , respectively. The number of clusters is . The cluster sizes are generated
randomly from , with , where is the standard deviation of , which gives ranging between
and , with . The number of domains is 250. The domain sizes were determined based on that
were generated from . The of was inflated by , another was inflated by , and the rest was
inflated by . The units were randomly allocated into 250 domains of sizes ranging from
to . The range of is [-0.33, 0.65]. We selected two-stage samples. The first stage selects
a randomized systematic sample of PSUs with unequal probabilities proportional to ,
where . We have . For the second stage, simple random samples of SSUs were selected within
the ith sampled PSU. We have . The sample sizes within domains are random.
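The two sampling stages described above can be sketched as follows. This is an illustration under assumptions, not the simulation code of the paper: the size measure, the number of PSUs selected and the number of SSUs per PSU are placeholders, and the function names `randomized_systematic_pps` and `srs_within` are hypothetical. Randomized systematic sampling is implemented in the usual way: the units are put in random order, the inclusion probabilities are cumulated, and the units hit by the points $u, u+1, \dots, u+n-1$ with $u$ uniform on $(0,1)$ are selected.

```python
import numpy as np

def randomized_systematic_pps(size_measure, n, rng):
    """Select n distinct units with inclusion probabilities pi_i = n * x_i / sum(x)."""
    x = np.asarray(size_measure, dtype=float)
    pi = n * x / x.sum()
    if pi.max() > 1.0:
        raise ValueError("size measure too large: some pi_i would exceed 1")
    order = rng.permutation(x.size)              # random ordering of the units
    cum = np.cumsum(pi[order])                   # cumulated probabilities, ends at n
    points = rng.uniform() + np.arange(n)        # u, u + 1, ..., u + n - 1
    idx = np.searchsorted(cum, points, side="right")
    idx = np.minimum(idx, x.size - 1)            # guard against rounding at the end
    return order[idx], pi

def srs_within(psu_size, m, rng):
    """Simple random sample without replacement of m units inside one PSU."""
    return rng.choice(psu_size, size=min(m, psu_size), replace=False)

rng = np.random.default_rng(3)
psu_sizes = np.arange(10, 50)                    # placeholder size measure
selected, pi = randomized_systematic_pps(psu_sizes, n=8, rng=rng)
second_stage = {int(i): srs_within(int(psu_sizes[i]), 5, rng) for i in selected}
```

Since every inclusion probability is at most one and the selection points are one unit apart, no PSU can be selected twice.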
In Table 1, we present the range of observed coverages of empirical likelihood
confidence intervals for domain means (3). The observed coverages are not
significantly different from the nominal level (95%) for all the domains. We observe
coverages close to the nominal level even when domains have few or no sample units.
Table 1. Observed coverages (%) of 95% confidence intervals for domain means (3).

Expected sample sizes   Range of domain   Number of   Range of observed
within domains          sample sizes      domains     coverages (%)
0-4                     [0, 18]           22          [94.9, 95.1]
5-9                     [0, 27]           78          [94.8, 95.1]
10-14                   [0, 33]           50          [94.9, 95.1]
15-19                   [2, 37]           30          [94.8, 95.1]
20-29                   [3, 53]           42          [94.8, 95.0]
30-69                   [10, 103]         28          [94.9, 95.1]
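Whether an observed coverage differs significantly from the nominal level can be checked with a standard binomial z-statistic. The sketch below assumes independent replications; the number of replications `n_rep` is a placeholder, since it is not stated in the text, and the function name `coverage_z` is hypothetical.

```python
import math

def coverage_z(observed_pct, n_rep, nominal=0.95):
    """z-statistic comparing an observed coverage (in %) with the nominal level."""
    p_hat = observed_pct / 100.0
    se = math.sqrt(nominal * (1.0 - nominal) / n_rep)  # standard error under the nominal level
    return (p_hat - nominal) / se

# With a hypothetical 10,000 replications, |z| < 1.96 means "not significantly different".
z = coverage_z(94.8, n_rep=10_000)
```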
4. CONCLUSIONS
The approach proposed may provide better confidence intervals even when the point
estimator is not normally distributed, the data are skewed or include outlying values, and
the sample sizes within clusters are small. The numerical work shows that the empirical
likelihood confidence intervals have coverages close to the nominal level for small domain
means. Standard confidence intervals may have poor coverages when sample sizes are not
large enough or the data include outlying values. The empirical likelihood confidence
intervals have the advantage of not depending on variance estimation, re-sampling,
linearisation or second-order inclusion probabilities. They are not based on the normality
of the point estimator. The approach proposed takes the sampling design into account, and
can accommodate informative sampling.
REFERENCES
[1] D. Pfeffermann, C. Skinner, D. Holmes, H. Goldstein, and J. Rasbash, “Weighting for
unequal selection probabilities in multilevel models,” Journal of the Royal Statistical Society.
Series B, 60, 23–40, 1998.
[2] C. J. Skinner and M. De Toledo Vieira, “Variance estimation in the analysis of clustered
longitudinal survey data,” Survey Methodology, 33 (1), 3–12, 2007.
[3] J. N. K. Rao, F. Verret, and M. Hidiroglou, “A weighted composite likelihood approach to
inference for two-level models from survey data,” Survey Methodology, 39, 263–282, 2013.
[4] M. Hansen, W. Hurwitz, and W. Madow, Sample Survey Methods and Theory, volume I.
New York: John Wiley and Sons, 1953.
[5] M. H. Hansen and W. N. Hurwitz, “On the theory of sampling from finite populations,” The
Annals of Mathematical Statistics, 14 (4), 333–362, 1943.
[6] J. Hájek, Sampling from a Finite Population. New York: Marcel Dekker, 1981.
[7] K. Liang and S. Zeger, “Longitudinal data analysis using generalized linear
models,” Biometrika, 73, 13–22, 1986.
[8] J. N. K. Rao and I. Molina, Small Area Estimation. Wiley, Hoboken, NJ, 2nd ed., 2015.
[9] Y. G. Berger and O. De La Riva Torres, “An empirical likelihood approach for inference
under complex sampling design,” Journal of the Royal Statistical Society. Series B, 78 (2),
319–341, 2016.
[10] H. O. Hartley and J. N. K. Rao, “A new estimation theory for sample surveys,” Biometrika,
55 (3), 547–557, 1968.
[11] M. Oguz-Alper and Y. G. Berger, “Empirical likelihood approach for modelling survey data,”
Biometrika, 103(2), 447–459, 2016.