A simulation study of the effect of sample size and level of interpenetration on inference from cross-classified multilevel logistic regression models
Rebecca Vassallo
ESRC Research Methods Festival, July 2012

Introduction
• Influence of the interviewer and area on survey response behaviour
• Reflects unmeasured factors including the interviewer’s and area’s characteristics
• Violation of the assumption of independence of observations
• Standard analytical techniques will underestimate standard errors and can result in incorrect inference (Snijders & Bosker, 1999)
• Multilevel modelling has become a popular method for analysing area and interviewer effects on nonresponse

Introduction (continued)
• Estimation problem relating to the identifiability of area and interviewer variation
• Interpenetrated sample design considered the gold standard for separating interviewer effects from area effects
• Restrictions in field administration capabilities and survey costs only allow for partial interpenetration
• Multilevel cross-classified specification used in such cases (Von Sanden, 2004)
• No studies available examining the properties of parameter estimates from such models under different conditions

Study Aims
• Examine the implications of interviewer dispersal patterns within different scenarios on the quality of parameter estimates
• Quality measures: percentage relative bias, confidence interval coverage, power of significance tests and correlation of random parameter estimates
• Scenarios vary in sample sizes, overall rates of response, and the area and interviewer variance
• Identify the smallest interviewer pool and the most geographically-restrictive interviewer case allocation required for acceptable levels of bias and power

Methodology: Simulation Model
• Model: $\mathrm{logit}(p_{ijs}) = \beta_0 + u_j + v_s$; $u_j \sim N(0, \sigma_u^2)$; $v_s \sim N(0, \sigma_v^2)$
• Stata Version 12 calling MLwiN Version 2.25 through the ‘runmlwin’ command (Leckie & Charlton, 2011)
• Markov Chain Monte Carlo (MCMC) estimation method
• MCMC method produces less biased estimates compared to first-order marginal quasi-likelihood (MQL) and second-order penalised quasi-likelihood (PQL) (Browne, 1998; Browne & Draper, 2006)
• IRIDIS High Performance Computing Facility cluster at the University of Southampton

Methodology: Data Generating Procedure
• Overall probability $\pi$ of the outcome for the area and the interviewer with zero random effects determines the overall intercept, $\beta_0 = \mathrm{logit}(\pi)$ (fixed for all cases)
• Cluster-specific random effects for each interviewer and area generated separately from $N(0, \sigma_u^2)$ and $N(0, \sigma_v^2)$
• $u_j$ and $v_s$ are generated for every simulation, but maintained constant across different scenarios where the only factor that changes is interviewer case allocations
• The allocation of workload from different areas to specific interviewers is limited to a finite number of possibilities

Methodology: Data Generating Procedure (continued)
• $\mathrm{logit}(p_{ijs})$ of each case is computed and converted to the probability $p_{ijs}$
• Values of the dependent variable $Y_{ijs}$, a dichotomous outcome for each case, are generated from a Bernoulli distribution with probability $p_{ijs}$
• For each scenario of the experimental design, 1000 simulated datasets are generated using R Version 2.11.1

Methodology: Simulation Factors
• Simulated scenarios vary in the following factors:
  - the overall sample size ($N$)
  - the number of interviewers and areas ($N_{ints}$; $N_{areas}$)
  - the interviewer-area classifications, which vary in the number of areas each interviewer works in (maximum 6 areas) and the overlap in the interviewers working in neighbouring areas
  - the ICC (variances $\sigma_u^2$ and $\sigma_v^2$)
  - the overall probability of the outcome variable ($\pi$)
• Medium scenario design (similar to values observed in a real dataset, a realistic starting point): 120 areas (48 cases/area) allocated to 240 interviewers (24 cases/int), totalling 5760 cases, $\sigma_v^2 = 0.3$, $\sigma_u^2 = 0.3$, $\pi = 0.8$

Methodology: Quality Assessment Measures
• Correlation between area and interviewer parameter estimates, averaged over the 1000 simulations; high negative values indicate identifiability problems
  $\bar{\rho} = \frac{1}{1000} \sum_{i=1}^{1000} \left[ \frac{\mathrm{cov}(\hat{\sigma}_u^2, \hat{\sigma}_v^2)}{\mathrm{sd}(\hat{\sigma}_u^2)\,\mathrm{sd}(\hat{\sigma}_v^2)} \right]_i$
• Percentage relative bias
  $\frac{1}{1000} \sum_{i=1}^{1000} \frac{\hat{\theta}_i - \theta}{\theta} \times 100$
• Confidence interval coverage: coverage of the 95% Wald confidence interval and of the 95% MCMC quantile interval, each compared to the nominal 95%
• Power of Wald test: proportion of simulations in which the null hypothesis is correctly rejected

Results: Power of Tests
• For the medium scenario, power ≈ 1 for all interviewer case allocations
• For smaller $N$, more sparse allocations are required to get power > 0.85
• Lower $\sigma_v^2$ (0.2) results in lower power
• When $N_{ints} = N_{areas}$, more interviewer dispersion is required for acceptable levels of power
• Higher $\pi$ (0.9) requires 2 areas/int for power > 0.9
• Reduced interviewer overlapping for a constant number of areas/int does not improve power

Results: Correlation between $\sigma_v^2$ and $\sigma_u^2$ Estimates
• For all scenarios, high negative $\rho$ ($|\rho| > 0.4$) is observed when interviewers work in 1 area only
• No substantial change in $\rho$ with varying total sample sizes
• Very high negative $\rho$ (up to -0.9) for $N_{ints} = N_{areas}$ scenarios; $\rho$ only falls to $|\rho| < 0.1$ when interviewers work in 4+ areas (compared to 2+ areas/int for $N_{ints} = 2 N_{areas}$ scenarios)
• Higher $\rho$ with increasing $\pi$ up to 2 areas/int allocations; thereafter no change in $\rho$ by $\pi$
• Lower $\rho$ with increasing $\sigma_v^2$ up to 3 areas/int allocations
• Lower $\rho$ with less interviewer overlapping for the 2 areas/int cases

Results: Percentage Relative Bias
• In most scenarios with $N = 5760$, the percentage relative bias is around 1-2% once interviewers are allocated to 2+ areas
• Further interviewer dispersion (3+ areas) and less interviewer overlapping do not yield systematic drops in bias
• When interviewers are working in 2+ areas, the bias in the $\sigma_v^2$ estimate is generally greater than the bias in the $\sigma_u^2$ estimate (when $N_{ints} = 2 N_{areas}$)
• Greater bias observed for smaller sample sizes, with the scenario including 1440 cases with $N_{ints} = N_{areas}$ obtaining bias values between 5% and 13% for all allocations

Results: Confidence Interval Coverage
• Close to the 95% nominal rate in all scenarios
• Some cases of under-coverage or over-coverage for scenarios where interviewers work in just one area:
  - 87% coverage ($N = 5760$, $N_{ints} = 2 N_{areas}$, $\sigma_v^2 = 0.2$, $\sigma_u^2 = 0.3$, $\pi = 0.8$, one area/int) for the $\sigma_v^2$ CI
  - 88% coverage ($N = 2880$ or $N = 1440$, $N_{ints} = 2 N_{areas}$, $\sigma_v^2 = 0.3$, $\sigma_u^2 = 0.3$, $\pi = 0.8$, one area/int) for the $\sigma_v^2$ CI
  - 100% coverage ($N = 5760$, 2880 or 1440, $N_{ints} = N_{areas}$, $\sigma_v^2 = 0.3$, $\sigma_u^2 = 0.3$, $\pi = 0.8$, one area/int) for the $\sigma_v^2$ and $\sigma_u^2$ CIs
• No clear evidence that the MCMC quantiles perform better than the Wald asymptotic normal CIs

Conclusion
• Interpenetration not required to distinguish between area and interviewer variation
• Good quality estimates obtained for large sample sizes (≈6000 cases) if interviewers work in at least two areas
• Better estimates obtained when the number of interviewers is greater than the number of areas
• Higher overall probabilities and smaller variances (smaller ICC) require more interviewer dispersion for some survey conditions
• The extent of interviewer overlapping shown not to be important
• Results and their implications can be extended to other applications

Acknowledgements
• University of Southampton, School of Social Sciences Teaching Studentship
• UK Economic and Social Research Council (ESRC), PhD Studentship (ES/1026258/1)
• Gabriele B. Durrant & Peter W. F. Smith, PhD Supervisors

References
• Browne, W. J. (1998). Applying MCMC Methods to Multi-level Models. PhD thesis, University of Bath.
• Browne, W. J. & Draper, D. (2006). A comparison of Bayesian and likelihood-based methods for fitting multilevel models. Bayesian Analysis, 1, 473-514.
• Leckie, G. & Charlton, C. (2011). runmlwin: Stata module for fitting multilevel models in the MLwiN software package. Centre for Multilevel Modelling, University of Bristol.
• Snijders, T. A. B. & Bosker, R. J. (1999). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modelling. London: Sage.
• Von Sanden, N. D. (2004). Interviewer effects in household surveys: estimation and design. Unpublished PhD thesis, School of Mathematics and Applied Statistics, University of Wollongong.
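Illustrative sketch: data generating procedure

The data generating steps described in the methodology slides (intercept from the overall probability, normal random effects for interviewers and areas, Bernoulli outcomes) can be sketched roughly as follows. This is a minimal Python illustration, not the study's R code. The function name, the parameter names and in particular the allocation rule (each interviewer's cases split evenly over consecutive areas on a ring) are assumptions made for the example; the slides do not specify how allocations were constructed.

```python
import numpy as np


def simulate_dataset(n_areas=120, n_ints=240, cases_per_int=24,
                     areas_per_int=2, var_int=0.3, var_area=0.3,
                     pi=0.8, seed=0):
    """Simulate one cross-classified dataset (hypothetical helper).

    Defaults mirror the medium scenario on the slides. The ring-style
    allocation below is an assumption made for illustration only.
    """
    if n_ints % n_areas != 0 or cases_per_int % areas_per_int != 0:
        raise ValueError("workloads must divide evenly for this simple scheme")
    rng = np.random.default_rng(seed)

    # Intercept: log-odds of the outcome when both random effects are zero.
    beta0 = np.log(pi / (1.0 - pi))

    # Random effects are drawn before any allocation-dependent step, so the
    # same seed gives the same effects for every value of areas_per_int.
    u = rng.normal(0.0, np.sqrt(var_int), size=n_ints)     # interviewer effects
    v = rng.normal(0.0, np.sqrt(var_area), size=n_areas)   # area effects

    # Assumed allocation: each interviewer has a "home" area and splits the
    # workload evenly over areas_per_int consecutive areas (wrapping round).
    int_id = np.repeat(np.arange(n_ints), cases_per_int)
    home = int_id // (n_ints // n_areas)
    block = np.repeat(np.arange(areas_per_int), cases_per_int // areas_per_int)
    area_id = (home + np.tile(block, n_ints)) % n_areas

    # Linear predictor -> probability -> Bernoulli outcome for each case.
    eta = beta0 + u[int_id] + v[area_id]
    p = 1.0 / (1.0 + np.exp(-eta))
    y = rng.binomial(1, p)
    return {"int_id": int_id, "area_id": area_id, "y": y, "u": u, "v": v}


if __name__ == "__main__":
    data = simulate_dataset(areas_per_int=2, seed=1)
    print(len(data["y"]), data["y"].mean())
```

With the defaults this yields 5760 cases, 48 per area and 24 per interviewer, matching the medium scenario. Reusing the same seed while changing only `areas_per_int` keeps the random effects fixed, in the spirit of holding $u_j$ and $v_s$ constant across allocation scenarios.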
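Illustrative sketch: quality assessment measures

The quality measures listed in the methodology slides (percentage relative bias, Wald confidence interval coverage and power of the Wald test) could be computed from the per-simulation estimates along the following lines. This is a hedged sketch: the function name and inputs are hypothetical, and the test is assumed here to be a two-sided Wald test at the 5% level, which the slides do not state.

```python
import numpy as np


def summarise_estimates(estimates, std_errors, true_value, z=1.96):
    """Summarise one parameter across simulations (hypothetical helper).

    estimates, std_errors: one value per simulated dataset.
    true_value: the value used to generate the data.
    z: normal critical value; 1.96 assumes a two-sided 5% Wald test.
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)

    # Percentage relative bias: mean of (estimate - truth) / truth, times 100.
    rel_bias = 100.0 * np.mean((est - true_value) / true_value)

    # Coverage: share of Wald intervals that contain the true value.
    lower, upper = est - z * se, est + z * se
    coverage = np.mean((lower <= true_value) & (true_value <= upper))

    # Power: share of simulations in which H0 (parameter = 0) is rejected.
    power = np.mean(np.abs(est / se) > z)

    return {"rel_bias_pct": rel_bias, "coverage": coverage, "power": power}


if __name__ == "__main__":
    summary = summarise_estimates(
        estimates=[0.33, 0.27, 0.36, 0.10],
        std_errors=[0.10, 0.10, 0.10, 0.10],
        true_value=0.3,
    )
    print(summary)
```

In the toy call above, three of the four intervals contain the true value 0.3 and three of the four Wald statistics exceed 1.96, so coverage and power both come out at 0.75.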