A simulation study of the effect of sample size
and level of interpenetration on inference
from cross-classified multilevel logistic
regression models
Rebecca Vassallo
ESRC Research Methods Festival, July 2012
Introduction
• Influence of the interviewer and area on survey response behaviour
• Reflects unmeasured factors including the interviewer’s and area’s
characteristics
• Violation of the assumption of independence of observations
• Standard analytical techniques will underestimate standard errors
and can result in incorrect inference (Snijders & Bosker, 1999)
• Multilevel modelling has become a popular method in analysing
area and interviewer effects on nonresponse
• Estimation problem relating to the identifiability of area and
interviewer variation
• Interpenetrated sample design considered as the gold standard for
separating interviewer effects from area effects
• Restrictions in field administration capabilities and survey costs
only allow for partial interpenetration
• Multilevel cross-classified specification used in such cases (Von
Sanden, 2004)
• No studies available examining the properties of parameter
estimates from such models under different conditions
Study Aims
• Examine the implications of interviewer dispersal patterns
within different scenarios on the quality of parameter estimates
• Percentage relative bias, confidence interval coverage, power of
significance tests and correlation of random parameter estimates
• Different scenarios vary in sample sizes, overall rates of
response, and the area and interviewer variance
• Identify the smallest interviewer pool and the most
geographically-restrictive interviewer case allocation required
for acceptable levels of bias and power
Methodology: Simulation Model
• Model: $\operatorname{logit}(p_{ijs}) = \beta_0 + u_j + v_s$; $u_j \sim N(0, \sigma_u^2)$; $v_s \sim N(0, \sigma_v^2)$
• STATA Version 12 calling MLwiN Version 2.25 through the
‘runmlwin’ command (Leckie & Charlton, 2011)
• Markov Chain Monte Carlo (MCMC) estimation method
• MCMC method produces less biased estimates compared to first-order
marginal quasi-likelihood (MQL) and second-order penalised
quasi-likelihood (PQL) (Browne, 1998; Browne & Draper, 2006)
• IRIDIS High Performance Computing Facility cluster at the
University of Southampton
Methodology: Data Generating Procedure
• Overall probability 𝜋 of the outcome for the area and the
interviewer with zero random effects determines overall intercept
(fixed for all cases)
• Cluster-specific random effects for each interviewer and area
generated separately from $N(0, \sigma_u^2)$ & $N(0, \sigma_v^2)$
• $u_j$ and $v_s$ are generated for every simulation, but maintained
constant across different scenarios where the only factor that
changes is interviewer case allocations
• The allocation of workload from different areas to specific
interviewers is limited to a finite number of possibilities
• $\operatorname{logit}(p_{ijs})$ is computed for each case and converted to a probability
$p_{ijs}$
• Values of the dependent variable $Y_{ijs}$ - a dichotomous outcome for
each case - are generated from a Bernoulli distribution with
probability $p_{ijs}$
• For each scenario of the experimental design, 1000 simulated
datasets are generated using R Version 2.11.1
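
The data generating steps above can be sketched in a few lines. This is a minimal illustration in Python rather than the R code used in the study, which is not shown in the slides. The function names `inv_logit` and `simulate_dataset`, the toy allocation, and the reading of $u_j$ as the interviewer effect and $v_s$ as the area effect are all assumptions made for the example.

```python
import math
import random


def inv_logit(x):
    """Convert a value on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_dataset(allocation, pi=0.8, var_u=0.3, var_v=0.3, seed=0):
    """Simulate one cross-classified dataset.

    allocation: list of (j, s) pairs, one per case (assumed here to be
    interviewer id j and area id s).
    Returns a list of (j, s, y) tuples with y in {0, 1}.
    """
    rng = random.Random(seed)
    # Intercept: logit of the overall probability for a case whose
    # two random effects are both zero.
    beta0 = math.log(pi / (1.0 - pi))
    # Random effects are drawn separately for the two classifications.
    u = {j: rng.gauss(0.0, math.sqrt(var_u))
         for j in sorted({j for j, _ in allocation})}
    v = {s: rng.gauss(0.0, math.sqrt(var_v))
         for s in sorted({s for _, s in allocation})}
    data = []
    for j, s in allocation:
        p = inv_logit(beta0 + u[j] + v[s])
        y = 1 if rng.random() < p else 0  # Bernoulli draw with probability p
        data.append((j, s, y))
    return data


# Toy allocation (hypothetical): 4 interviewers, 2 areas,
# 3 cases per interviewer, each interviewer working in one area.
toy_allocation = [(j, j // 2) for j in range(4) for _ in range(3)]
toy_data = simulate_dataset(toy_allocation, seed=1)
```

Because the random effects are drawn before any outcomes, calling the function with the same seed and the same sets of ids reproduces the same $u_j$ and $v_s$, which loosely mirrors holding the effects constant across scenarios that differ only in case allocation.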
Methodology: Simulation Factors
• Simulated scenarios vary in the following factors:
-the overall sample size (N)
-the number of interviewers and areas (Nints ; Nareas )
-the interviewer-area classifications [which vary in terms of the
number of areas each interviewer works in (maximum 6 areas) and
the overlap in the interviewers working in neighbouring areas]
-the ICC (variances $\sigma_u^2$ & $\sigma_v^2$)
-the overall probability of the outcome variable (π)
• Medium scenario design (similar to values observed in a real
dataset - a realistic starting point): 120 areas (48 cases/area)
allocated to 240 interviewers (24 cases/int), totalling 5760 cases,
$\sigma_v^2$=0.3, $\sigma_u^2$=0.3, π=0.8
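
The arithmetic of the medium scenario, and one way of spreading interviewers over several areas, can be illustrated as follows. This is a hypothetical allocation scheme written for this example, not the scheme used in the study: areas are grouped into blocks of k neighbouring areas, and the interviewers in each block split their workload equally across those k areas. The function name `block_allocation` and its parameters are assumptions.

```python
from collections import Counter


def block_allocation(n_areas=120, cases_per_area=48, cases_per_int=24,
                     areas_per_int=1):
    """Return one (interviewer_id, area_id) pair per case.

    Hypothetical scheme: areas are grouped into blocks of `areas_per_int`
    consecutive areas; every interviewer in a block takes an equal share
    of cases from each area in that block.
    """
    k = areas_per_int
    if cases_per_area % cases_per_int or n_areas % k or cases_per_int % k:
        raise ValueError("workloads do not divide evenly for this design")
    ints_per_block = (cases_per_area // cases_per_int) * k
    share = cases_per_int // k  # cases one interviewer takes per area
    allocation = []
    for block in range(n_areas // k):
        for b in range(ints_per_block):
            interviewer = block * ints_per_block + b
            for a in range(k):
                area = block * k + a
                allocation.extend([(interviewer, area)] * share)
    return allocation


# Medium scenario with each interviewer working in 2 areas.
medium = block_allocation(areas_per_int=2)
n_cases = len(medium)                            # 120 * 48 = 240 * 24
n_interviewers = len({j for j, _ in medium})
workloads = Counter(j for j, _ in medium)        # cases per interviewer
```

Under this scheme the workloads divide evenly only for 1, 2, 3, 4 or 6 areas per interviewer, since the number of areas per interviewer must divide both 24 and 120.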
Methodology - Quality Assessment Measures
• Correlation between area and interviewer parameter estimates.
High negative values indicate identifiability problems
$$\rho = \frac{1}{1000}\sum_{i=1}^{1000}\frac{\operatorname{cov}(\sigma_u^2, \sigma_v^2)}{\sigma_u\,\sigma_v}$$
• Percentage relative bias
$$\frac{1}{1000}\sum_{i=1}^{1000}\frac{\hat{\theta}_i - \theta}{\theta} \times 100$$
• Confidence interval coverage for 95% Wald confidence interval
and the 95% MCMC quantiles are compared to nominal 95%
• Power of Wald test - proportion of simulations in which the null
hypothesis is correctly rejected
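
The measures above can be computed from per-simulation results along the following lines. This is a rough sketch, not the study's code: the function names are invented for the example, the power calculation assumes a two-sided Wald test at the 5% level (the slides do not say which form of the test was used), and the quantile interval relies on the default interpolation of Python's `statistics.quantiles`.

```python
import statistics

Z95 = 1.959963984540054  # 97.5th percentile of the standard normal


def percent_relative_bias(estimates, true_value):
    """Mean of (estimate - truth) / truth over simulations, times 100."""
    n = len(estimates)
    return 100.0 * sum((e - true_value) / true_value for e in estimates) / n


def wald_coverage(estimates, std_errors, true_value, z=Z95):
    """Share of simulations whose Wald interval contains the true value."""
    hits = sum(1 for e, se in zip(estimates, std_errors)
               if e - z * se <= true_value <= e + z * se)
    return hits / len(estimates)


def wald_power(estimates, std_errors, z=Z95):
    """Share of simulations rejecting H0: parameter = 0 (two-sided, assumed)."""
    rejections = sum(1 for e, se in zip(estimates, std_errors)
                     if abs(e / se) > z)
    return rejections / len(estimates)


def quantile_interval(chain):
    """2.5% and 97.5% quantiles of an MCMC chain for one parameter."""
    cuts = statistics.quantiles(chain, n=40)  # cut points every 2.5%
    return cuts[0], cuts[-1]


# Tiny made-up inputs, only to show the calling convention.
example_estimates = [0.33, 0.27, 0.30]
example_errors = [0.05, 0.05, 0.05]
example_bias = percent_relative_bias(example_estimates, 0.3)
example_coverage = wald_coverage(example_estimates, example_errors, 0.3)
```

Coverage of the MCMC quantile intervals can be assessed in the same way as Wald coverage, by counting how often the interval returned by `quantile_interval` contains the true value.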
Results - Power of Tests
• For medium scenario power ≈1 for all interviewer case allocations
• For smaller N, more sparse allocations are required to get power
>0.85
• Lower $\sigma_v^2$ (0.2) results in lower power
• When Nints =Nareas more interviewer dispersion is required for
acceptable levels of power
• Higher π (0.9) requires 2 areas/int for power>0.9
• Reduced interviewer overlapping for a constant number of
areas/int does not improve power
Results - Correlation between $\sigma_v^2$ & $\sigma_u^2$ Estimates
• For all scenarios, strongly negative values of ρ (magnitude above 0.4) are observed when
interviewers work in 1 area only
• No substantial change in ρ with varying total sample sizes
• Very high negative ρ (up to -0.9) for Nints = Nareas scenarios; ρ is only
reduced in magnitude to below 0.1 when interviewers work in 4+ areas (compared
to 2+ areas/int for Nints = 2*Nareas scenarios)
• Higher ρ with increasing π up till 2 areas/int allocations; thereafter
no change in ρ by π
• Lower ρ with increasing $\sigma_v^2$ up till 3 areas/int allocations
• Lower ρ with less interviewer overlapping for the 2 areas/int cases
Results - Percentage Relative Bias
• In most scenarios with N=5760, the percentage relative bias is around
1-2% once interviewers are allocated to 2+ areas
• Further interviewer dispersion (3+ areas) & less interviewer
overlapping do not yield systematic drops in bias
• When interviewers are working in 2+ areas, the bias in the $\sigma_v^2$
estimate is generally greater than the bias in the $\sigma_u^2$ estimate [when
Nints = 2*Nareas ]
• Greater bias observed for smaller sample sizes, with the scenario
including 1440 cases with Nints =Nareas obtaining bias values
between 5-13% for all allocations
Results - Confidence Interval Coverage
• Close to 95% nominal rate in all scenarios
• Some cases of under-coverage or over-coverage for scenarios
when each interviewer works in just one area
-87% coverage (N=5760, Nints = 2*Nareas, $\sigma_v^2$=0.2, $\sigma_u^2$=0.3, π=0.8,
one area/int) for the $\sigma_v^2$ CI
-88% coverage (N=2880 or N=1440, Nints = 2*Nareas, $\sigma_v^2$=0.3,
$\sigma_u^2$=0.3, π=0.8, one area/int) for the $\sigma_v^2$ CI
-100% coverage (N=5760 or 2880 or 1440, Nints = Nareas, $\sigma_v^2$=0.3,
$\sigma_u^2$=0.3, π=0.8, one area/int) for the $\sigma_v^2$ and $\sigma_u^2$ CIs
• No clear evidence that the MCMC quantiles perform better
than the Wald asymptotic normal CIs
Conclusion
• Full interpenetration is not required to distinguish between area and interviewer
variation
• Good quality estimates obtained for large sample sizes (≈6000 cases) if
interviewers work in at least two areas
• Better estimates obtained when the number of interviewers is greater than
the number of areas
• Higher overall probabilities & smaller variances (smaller ICC) require
more interviewer dispersion for some survey conditions
• The extent of interviewer overlapping shown not to be important
• Results and their implications can be extended to other applications
Acknowledgements
• University of Southampton, School of Social Sciences Teaching
Studentship
• UK Economic and Social Research Council (ESRC), PhD
Studentship (ES/1026258/1)
• Gabriele B. Durrant & Peter W. F. Smith, PhD Supervisors
References
• Browne, W. J. (1998). Applying MCMC Methods to Multi-level Models. PhD
thesis, University of Bath.
• Browne, W. & Draper, D. (2006). A comparison of Bayesian and likelihood-based
methods for fitting multilevel models. Bayesian Analysis, 1, 473-514.
• Leckie, G. & Charlton, C. (2011). runmlwin: Stata module for fitting multilevel
models in the MLwiN software package. Centre for Multilevel Modelling,
University of Bristol.
• Snijders, T.A.B. & Bosker, R.J. (1999). Multilevel Analysis: an introduction to
basic and advanced multilevel modelling. London: Sage.
• Von Sanden, N. D. (2004). Interviewer effects in household surveys: estimation
and design. Unpublished PhD thesis, School of Mathematics and Applied
Statistics, University of Wollongong.