Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
An example on: Non-response adjustment by registers and estimation of variance by replicate weights Peter Linde, Interviewservice [email protected] Statistics Denmark >> >> Agenda Challenge with non-response Case: Household Budget Survey How to weight as correctly as possible Case: Weighting Household Budget Survey How everything is done easy with replicate weights Case: Analyzing Household Budget Survey Conclusion 2 >> Four statements No statistic is better than its weakest point The weakest point (the bias) is often one of the three points: - Population, frame and selection - Non-response or - Measurement error in the data collection Not everything that counts is countable - and not everything that is countable counts (Einstein) Every good statistics try to control all the weakest points as good as possible with the know information and resources 3 >> Bias and standard error of mean The bias is “dominating” the standard error Horizontal: Bias (assumed = 3%) Black curve: The max. standard error by SRS. E.g. 9.8% if n=100 Green curve: The total error (square root of the mean square error) Scale horizontal: 100 to 4000 sample units. Vertical: 0-15% 4 >> Challenge (1) In the 90s the response rate was around 65-80% Today the response rate is 50-65% in Denmark In 2000 the tax the income overstated by 4% i a SRS survey before adjusting for non-response To day the tax income is overstated by 7% Both demographic and social factors have a growing and more complex negative impact, e.g. sex and nationality 5 >> Challenge (2) Response rate 1992 Response rate 2012 Age 16-29 30-39 40-49 50-64 65-74 years years years years years 62% 66% 73% 68% 67% Education Basic school 62% Youth education Short further education Medium education 83% Long further education 47% 58% 57% 61% 71% 51% 67% 85% 56% 73% 74% 80% 70% 6 >> Challenge (3) Non-response on 52% in Household Budget Survey From the tax register the taxable income for the entire population (and sample) is known Average household income in the population is 520.639 DKK before tax. (69.000 euro) In the sample after non-response the average household income is 557.159 DDK or 7% higher 7 >> Challenge (4) When weighting for non-response one wish: Estimate the mean and parameters correctly Use the correct sample size n, and not the sum of the weights or weights rounded, when calculate the variance Calculate the variance with respect for - The price for different weights - The benefit of explained variation 8 >> Challenge (5) And extended to be used in multivariate analysis with model-assisted calibration of weights with GREG using all known categorical and continuous variables In standard programs as SAS (proc survey…) and STATA (svy) is this possible 9 >> How to weight (1) In the simple case weighting for sex: Men: Women: Total: If rescale with n / N 10 >> How to weight (2) When both weights and strata (respondent belong to N1 and N2) are known Two things are important 1) gj is squared (it costs variance) 2) centered on strata average (it wins variance: R**2) 11 >> How to weight (3) An indicator for the cost in variance: Example if g1=1.5, g2=0.5 and n=2: (2.25+0.25)/2=1.25 Designeffect from 1.1 to 1.25 is normal compare to g=1 When weights are known and strata are unknown: If e.g. dataproducer not disclose strata or only weights are used 12 >> How to weight (4) Calibration with generalised regression (GREG) estimation use information from registers Continuous variables may be involved Traditional post stratification: sex*age*region Different tables separately at the same time: sex*age and sex*nationality Continuous variables and separate tables It is a soft weighting – mode assisted – the weights only sum up to the population tables. If no sample bias compared to population tables (=“model”), no weighting More math in my paper on the internet, which also shows how to get from the general formula to the simple example with stratification for sex 13 >> Household Budget Survey (1) Sample: Simple Random Sample from the hole population. Before non-response therefor only small (random) difference between then sample and the population (weak point 1 ok). Mode: Personal interview in households and diaries Non-response on 52% Register information in frame and sample: Region, age, sex, socio-economic background, years in the house, education, dwelling type, family type, Danish citizen and household income and ….. Measurement variable in questionnaire: Consumption 14 >> Household Budget Survey (2) a) No weighting (no adjustment for non-response on 52%) b) Region, age and sex c) Region, age * sex, socio-economic background, years in the house, education, dwelling type, family type and Danish citizen d) Region, age * sex, socio-economic background, years in the house, education, dwelling type, family type and Danish citizen and household income in 5 groups Weighting a Weighting b (strata known) Weighting c (strata known) Weighting d (strata known) Weighting d (only weights) Population Household income Pct. consumption Standard error more than 300.000 of mean 557.159 (+7%) 46% 0.0100 542.757 (+4%) 44% 0.0089 518.297 (-½%) 39% 0.0070 520.639 39% 0.0070 520.639 39% 0.0107 520.639 not known 15 >> Household Budget Survey (3) Bias reduced by up to 7% Design effect when all information is used: Effective sample size increases from 2462 to 2462/0.51 = 4827 compared with SRS If only the weights are used (but not to strata) the bias is still control but Design effect is: Effective sample size decreases from 2462 to 2462/1.19 = 2069 compared with SRS. 16 >> Repeated weighting (1) Different methods is used to take a sample in the sample and use the new samples to estimate the initial sample variance. BRR - Balanced Repeated Replicates Bootstrapping Jackknife, e.g. JK1, where e.g. the first 10 units SRS is taken out of a SRS sample on 1000. Then the next 10 is taken out and so on until you have G=100 samples with each 990 units. 17 >> Repeated weighting (2) In Household Budget Survey the standard error of consuming more than 300.000 is 0.0070 On the basis of G=80 replicate weights the standard error is estimated in the range of 0.0065-0.0075 The Jackknife method is able roughly accurately to estimate the standard error at the same level as the GREG estimator In SAS and STATA you just inform about the names of the replicate weights and the method e.g. JK1 and a regression analysis is make in 2-3 minutes. 18 >> Conclusion If register information is used to adjust for nonresponse and if the data producer estimate the overall weight and replicate weights Researcher today can achieve the complete benefits by including registry variables in weighting for nonresponse: - least possible bias - greatest possible effective sample - and correct estimation of the variance when weighting Thank you for listening 19