Download Non-response - European Survey Research Association

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Bootstrapping (statistics) wikipedia , lookup

Taylor's law wikipedia , lookup

Misuse of statistics wikipedia , lookup

Opinion poll wikipedia , lookup

Student's t-test wikipedia , lookup

Resampling (statistics) wikipedia , lookup

Transcript
An example on:
Non-response adjustment by registers
and estimation of variance by replicate
weights
Peter Linde, Interviewservice
[email protected]
Statistics Denmark
>>
>>
Agenda
 Challenge with non-response
Case: Household Budget Survey
 How to weight as correctly as possible
Case: Weighting Household Budget Survey
 How everything is done easy with replicate
weights
Case: Analyzing Household Budget Survey
 Conclusion
2
>>
Four statements
 No statistic is better than its weakest point
 The weakest point (the bias) is often one of the three
points:
- Population, frame and selection
- Non-response or
- Measurement error in the data collection
 Not everything that counts is countable - and not
everything that is countable counts (Einstein)
 Every good statistics try to control all the weakest
points as good as possible with the know information
and resources
3
>>
Bias and standard error of mean
The bias is “dominating” the standard error
Horizontal: Bias (assumed = 3%)
Black curve: The max. standard error by SRS. E.g. 9.8% if n=100
Green curve: The total error (square root of the mean square error)
Scale horizontal: 100 to 4000 sample units. Vertical: 0-15%
4
>>
Challenge (1)
 In the 90s the response rate was around 65-80%
 Today the response rate is 50-65% in Denmark
 In 2000 the tax the income overstated by 4% i a
SRS survey before adjusting for non-response
 To day the tax income is overstated by 7%
 Both demographic and social factors have a
growing and more complex negative impact,
e.g. sex and nationality
5
>>
Challenge (2)
Response rate 1992 Response rate 2012
Age
16-29
30-39
40-49
50-64
65-74
years
years
years
years
years
62%
66%
73%
68%
67%
Education
Basic school
62%
Youth education
Short further education
Medium education
83%
Long further education
47%
58%
57%
61%
71%
51%
67%
85%
56%
73%
74%
80%
70%
6
>>
Challenge (3)
 Non-response on 52% in Household Budget
Survey
 From the tax register the taxable income for the
entire population (and sample) is known
 Average household income in the population is
520.639 DKK before tax. (69.000 euro)
 In the sample after non-response the average
household income is 557.159 DDK or 7% higher
7
>>
Challenge (4)
When weighting for non-response one wish:
 Estimate the mean and parameters correctly
 Use the correct sample size n, and not the sum
of the weights or weights rounded, when
calculate the variance
 Calculate the variance with respect for
- The price for different weights
- The benefit of explained variation
8
>>
Challenge (5)
 And extended to be used in multivariate
analysis
 with model-assisted calibration of weights with
GREG using all known categorical and
continuous variables
In standard programs as SAS (proc survey…) and
STATA (svy) is this possible
9
>>
How to weight (1)
In the simple case weighting for sex:
Men:
Women:
Total:
If rescale with n / N
10
>>
How to weight (2)
When both weights and strata (respondent
belong to N1 and N2) are known
Two things are important
1) gj is squared (it costs variance)
2) centered on strata average (it wins variance:
R**2)
11
>>
How to weight (3)
An indicator for the cost in variance:
Example if g1=1.5, g2=0.5 and n=2: (2.25+0.25)/2=1.25
Designeffect from 1.1 to 1.25 is normal compare to g=1
When weights are known and strata are unknown:
If e.g. dataproducer not disclose strata or only weights
are used
12
>>
How to weight (4)
Calibration with generalised regression (GREG) estimation use
information from registers
 Continuous variables may be involved
 Traditional post stratification: sex*age*region
 Different tables separately at the same time:
sex*age and sex*nationality
 Continuous variables and separate tables
It is a soft weighting – mode assisted – the weights only sum
up to the population tables. If no sample bias compared to
population tables (=“model”), no weighting
More math in my paper on the internet, which also shows how to
get from the general formula to the simple example with
stratification for sex
13
>>
Household Budget Survey (1)
Sample: Simple Random Sample from the hole population.
Before non-response therefor only small (random)
difference between then sample and the population (weak
point 1 ok).
Mode: Personal interview in households and diaries
Non-response on 52%
Register information in frame and sample:
Region, age, sex, socio-economic background, years in the
house, education, dwelling type, family type, Danish
citizen and household income and …..
Measurement variable in questionnaire: Consumption
14
>>
Household Budget Survey (2)
a) No weighting (no adjustment for non-response on 52%)
b) Region, age and sex
c) Region, age * sex, socio-economic background, years in the house, education,
dwelling type, family type and Danish citizen
d) Region, age * sex, socio-economic background, years in the house, education,
dwelling type, family type and Danish citizen and household income in 5 groups
Weighting a
Weighting b (strata known)
Weighting c (strata known)
Weighting d (strata known)
Weighting d (only weights)
Population
Household income Pct. consumption Standard error
more than 300.000 of mean
557.159 (+7%)
46%
0.0100
542.757 (+4%)
44%
0.0089
518.297 (-½%)
39%
0.0070
520.639
39%
0.0070
520.639
39%
0.0107
520.639
not known
15
>>
Household Budget Survey (3)
 Bias reduced by up to 7%
 Design effect when all information is used:
Effective sample size increases from 2462 to
2462/0.51 = 4827 compared with SRS
 If only the weights are used (but not to strata) the
bias is still control but
Design effect is:
Effective sample size decreases from 2462 to
2462/1.19 = 2069 compared with SRS.
16
>>
Repeated weighting (1)
 Different methods is used to take a sample in the
sample and use the new samples to estimate the initial
sample variance.
 BRR - Balanced Repeated Replicates
 Bootstrapping
 Jackknife, e.g. JK1, where e.g. the first 10 units SRS is
taken out of a SRS sample on 1000. Then the next 10 is
taken out and so on until you have G=100 samples with
each 990 units.
17
>>
Repeated weighting (2)
 In Household Budget Survey the standard error
of consuming more than 300.000 is 0.0070
 On the basis of G=80 replicate weights the
standard error is estimated in the range of
0.0065-0.0075
 The Jackknife method is able roughly accurately
to estimate the standard error at the same level
as the GREG estimator
 In SAS and STATA you just inform about the
names of the replicate weights and the method
e.g. JK1 and a regression analysis is make in 2-3
minutes.
18
>>
Conclusion
 If register information is used to adjust for nonresponse
 and if the data producer estimate the overall weight
and replicate weights
 Researcher today can achieve the complete benefits
by including registry variables in weighting for nonresponse:
- least possible bias
- greatest possible effective sample
- and correct estimation of the variance when
weighting
Thank you for listening
19