PSTAT 262 AS: Survey Sampling and Estimation
(3) Let \sigma^2 be the population variance:

\sigma^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \mu)^2

Then

Var(\bar{y}_s) = (1 - f)\,\frac{\sigma^2}{n}
Here, the factor (1 - f) is called the finite population correction.
• usually unimportant in social surveys:
n = 10,000 and N = 5,000,000: 1 - f = 0.998
n = 1,000 and N = 400,000: 1 - f = 0.9975
n = 1,000 and N = 5,000,000: 1 - f = 0.9998
• the effect of changing n is much more important than the effect of changing n/N
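The corrections listed above can be checked in a couple of lines:

```python
# Finite population correction 1 - f = 1 - n/N for the (n, N) pairs above.
pairs = [(10_000, 5_000_000), (1_000, 400_000), (1_000, 5_000_000)]
fpc = {(n, N): 1 - n / N for n, N in pairs}
for (n, N), c in fpc.items():
    print(f"n = {n}, N = {N}: 1 - f = {c}")
```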
An unbiased estimator of \sigma^2 is given by the sample variance

s^2 = \frac{1}{n-1} \sum_{i \in s} (y_i - \bar{y}_s)^2

The estimated variance is

\hat{V}(\bar{y}_s) = (1 - f)\,\frac{s^2}{n}
Usually we report the standard error of the estimate:

SE(\bar{y}_s) = \sqrt{\hat{V}(\bar{y}_s)}

Confidence intervals for \mu are based on the Central Limit Theorem:

For large n and N - n: Z = \frac{\bar{y}_s - \mu}{\sigma \sqrt{(1-f)/n}} \sim N(0, 1)

Approximate 95% CI for \mu:

(\bar{y}_s - 1.96 \cdot SE(\bar{y}_s),\; \bar{y}_s + 1.96 \cdot SE(\bar{y}_s)) = \bar{y}_s \pm 1.96 \cdot SE(\bar{y}_s)
Example
N = 341 residential blocks in Ames, Iowa
y_i = number of dwellings in block i
1000 independent SRS for different values of n

n    Proportion of samples with |Z| < 1.64    Proportion of samples with |Z| < 1.96
30   0.88                                     0.93
50   0.88                                     0.93
70   0.88                                     0.94
90   0.90                                     0.95
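The slide does not reproduce the Ames data itself, but the same coverage experiment can be sketched on a made-up population (so the exact proportions below will differ from the table):

```python
import random
import statistics

random.seed(1)

# Hypothetical population of N = 341 "block counts" (made-up values standing
# in for the Ames data, which the slide does not list).
N = 341
population = [1 + random.randint(0, 4) + random.randint(0, 20) for _ in range(N)]
mu = statistics.mean(population)
sigma2 = statistics.pvariance(population) * N / (N - 1)  # the N-1 definition of sigma^2

results = {}
for n in (30, 50, 70, 90):
    reps = 1000
    hits_164 = hits_196 = 0
    for _ in range(reps):
        ybar = statistics.mean(random.sample(population, n))
        z = (ybar - mu) / (sigma2 * (1 - n / N) / n) ** 0.5
        hits_164 += abs(z) < 1.64
        hits_196 += abs(z) < 1.96
    results[n] = (hits_164 / reps, hits_196 / reps)
    print(n, results[n])
```

As in the table, the observed proportions should sit near the nominal 0.90 and 0.95.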
For one SRS with n = 90:

\bar{y}_s = 13, \quad s^2 = 75

SE(\bar{y}_s) = \sqrt{(1 - 90/341) \cdot 75/90} \approx 0.78

Approximate 95% CI: 13 \pm 1.96 \cdot 0.78 = 13 \pm 1.53 = (11.47, 14.53)
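The interval arithmetic for this single SRS follows directly from the formulas; a quick check:

```python
# SE and 95% CI for one SRS with n = 90, N = 341, ybar = 13, s^2 = 75.
n, N = 90, 341
ybar, s2 = 13, 75
se = ((1 - n / N) * s2 / n) ** 0.5
ci = (ybar - 1.96 * se, ybar + 1.96 * se)
print(round(se, 2), [round(c, 2) for c in ci])
```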
The absolute value of the sampling error is not informative when it is not related to the value of the estimate. For example, SE = 2 is small if the estimate is 1000, but very large if the estimate is 3.

The coefficient of variation of the estimate:

CV(\bar{y}_s) = SE(\bar{y}_s) / \bar{y}_s

In the example: CV(\bar{y}_s) = 0.78/13 = 0.06 = 6%

• A measure of the relative variability of an estimate.
• It does not depend on the unit of measurement.
• More stable over repeated surveys; can be used for planning, for example determining sample size.
• More meaningful when estimating proportions.
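One way the CV enters sample-size planning, sketched under SRS: since CV(\bar{y}_s)^2 = (1/n - 1/N)(S/\mu)^2, a target CV can be solved for n. The function name and the planning inputs below are illustrative, not from the slide; the population CV is guessed from the Ames figures (S^2 = 75, \mu \approx 13).

```python
import math

def required_n(cv_target, cv_pop, N):
    # Solve cv_target^2 = (1/n - 1/N) * cv_pop^2 for n, where cv_pop = S/mu.
    return math.ceil(1 / (cv_target**2 / cv_pop**2 + 1 / N))

# Back-solving the 6% CV from the example with cv_pop = sqrt(75)/13:
print(required_n(0.06, 75**0.5 / 13, 341))
```

This lands near the n = 90 actually drawn in the example.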
Estimation of a population proportion p with a certain characteristic A

p = (number of units in the population with A)/N

Let y_i = 1 if unit i has characteristic A, and 0 otherwise. Then p is the population mean of the y_i's. Let X be the number of units in the sample with characteristic A. Then the sample mean can be expressed as

\hat{p} = \bar{y}_s = X/n
Then under SRS:

E(\hat{p}) = p

and

Var(\hat{p}) = \frac{p(1-p)}{n} \left(1 - \frac{n-1}{N-1}\right)

since the population variance equals

\sigma^2 = \frac{N p(1-p)}{N-1}

The sample variance is

s^2 = \frac{n\,\hat{p}(1-\hat{p})}{n-1}

So the unbiased estimate of the variance of the estimator is

\hat{V}(\hat{p}) = \frac{\hat{p}(1-\hat{p})}{n-1} \left(1 - \frac{n}{N}\right)
Examples

A political poll: Suppose we have a random sample of 1000 eligible voters in Norway, with 280 saying they will vote for the Labor party. Then the estimated proportion of Labor votes in Norway is given by

\hat{p} = 280/1000 = 0.28

SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}\left(1 - \frac{n}{N}\right)} = \sqrt{\frac{0.28 \cdot 0.72}{999}\left(1 - \frac{n}{N}\right)} \approx 0.0142

A confidence interval requires the normal approximation. We can use the guideline from the binomial distribution, when N - n is large: np \geq 5 and n(1-p) \geq 5.

In this example: n = 1000 and N = 4,000,000.

Approximate 95% CI: \hat{p} \pm 1.96 \cdot SE(\hat{p}) = 0.280 \pm 0.028 = (0.252, 0.308)
Ex: Psychiatric Morbidity Survey 1993 from Great Britain

p = proportion with psychiatric problems
n = 9792 (partial nonresponse on this question: 316)
N \approx 40,000,000

\hat{p} = 0.14

SE(\hat{p}) = \sqrt{(1 - 0.00024) \cdot 0.14 \cdot 0.86 / 9791} \approx 0.0035

95% CI: 0.14 \pm 1.96 \cdot 0.0035 = 0.14 \pm 0.007 = (0.133, 0.147)
General probability sampling

• Sampling design: p(s), the known probability of selection for each subset s of the population U
• Actually: the sampling design is the probability distribution p(\cdot) over all subsets of U
• Typically, for most s: p(s) = 0. In SRS of size n, all s with size different from n have p(s) = 0.
• The inclusion probability:

\pi_i = P(\text{unit } i \text{ is in the sample}) = P(i \in s) = \sum_{\{s : i \in s\}} p(s)
Illustration

U = {1, 2, 3, 4}
Sample of size 2; 6 possible samples
Sampling design:
p({1,2}) = 1/2, p({2,3}) = 1/4, p({3,4}) = 1/8, p({1,4}) = 1/8

The inclusion probabilities:

\pi_1 = \sum_{\{s: 1 \in s\}} p(s) = p(\{1,2\}) + p(\{1,4\}) = 5/8
\pi_2 = \sum_{\{s: 2 \in s\}} p(s) = p(\{1,2\}) + p(\{2,3\}) = 3/4 = 6/8
\pi_3 = \sum_{\{s: 3 \in s\}} p(s) = p(\{2,3\}) + p(\{3,4\}) = 3/8
\pi_4 = \sum_{\{s: 4 \in s\}} p(s) = p(\{3,4\}) + p(\{1,4\}) = 2/8
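The inclusion probabilities can be computed mechanically from the design; a sketch using exact fractions:

```python
from fractions import Fraction

# The sampling design from the illustration: p(s) for each size-2 sample.
design = {
    frozenset({1, 2}): Fraction(1, 2),
    frozenset({2, 3}): Fraction(1, 4),
    frozenset({3, 4}): Fraction(1, 8),
    frozenset({1, 4}): Fraction(1, 8),
}

# pi_i = sum of p(s) over all samples s containing unit i
pi = {i: sum(p for s, p in design.items() if i in s) for i in (1, 2, 3, 4)}
print(pi)  # pi_1 = 5/8, pi_2 = 3/4, pi_3 = 3/8, pi_4 = 1/4
```

Note that the probabilities sum to 2, matching the fixed sample size n = 2.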
Some results

(I) \pi_1 + \pi_2 + \cdots + \pi_N = E(n); n is the sample size
(II) If the sample size is determined to be n in advance: \pi_1 + \pi_2 + \cdots + \pi_N = n

Proof:
Let Z_i = 1 if unit i is included in the sample, and 0 otherwise. Then

\pi_i = P(Z_i = 1) = E(Z_i)

n = \sum_{i=1}^{N} Z_i \;\Rightarrow\; E(n) = \sum_{i=1}^{N} E(Z_i) = \sum_{i=1}^{N} \pi_i
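Result (I) can be checked by brute force on a small design with random sample size. Poisson sampling, where each unit enters the sample independently with probability q_i (the q values below are made up), is a convenient test case:

```python
from itertools import chain, combinations

U = (1, 2, 3)
q = {1: 0.5, 2: 0.3, 3: 0.8}  # illustrative per-unit inclusion probabilities

def p(s):
    # p(s) for Poisson sampling: independent inclusion/exclusion of each unit
    prob = 1.0
    for i in U:
        prob *= q[i] if i in s else 1 - q[i]
    return prob

subsets = list(chain.from_iterable(combinations(U, k) for k in range(len(U) + 1)))
expected_n = sum(len(s) * p(s) for s in subsets)
pi = {i: sum(p(s) for s in subsets if i in s) for i in U}
print(expected_n, sum(pi.values()))  # both equal q[1] + q[2] + q[3]
```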
Estimation theory: probability sampling in general

Problem: estimate a population quantity for the variable y.
For the sake of illustration: the population total t = \sum_{i=1}^{N} y_i

An estimator of t based on the sample: \hat{t}

Expected value: E(\hat{t}) = \sum_s \hat{t}(s)\, p(s)
Variance: Var(\hat{t}) = E[\hat{t} - E\hat{t}]^2 = \sum_s [\hat{t}(s) - E\hat{t}]^2\, p(s)
Bias: E(\hat{t}) - t
\hat{t} is unbiased if E(\hat{t}) = t
Let \hat{V}(\hat{t}) be an (unbiased, if possible) estimate of Var(\hat{t}).

The standard error of \hat{t}: SE(\hat{t}) = \sqrt{\hat{V}(\hat{t})}
Coefficient of variation of \hat{t}: CV(\hat{t}) = SE(\hat{t}) / \hat{t}

CV is a useful measure of uncertainty, especially when the standard error increases as the estimate increases.

Margin of error: 2 \cdot SE(\hat{t}), because typically we have

P(\hat{t} - 2\,SE(\hat{t}) \leq t \leq \hat{t} + 2\,SE(\hat{t})) \approx 0.95 for large n, N - n

Since \hat{t} is approximately normally distributed for large n and N - n, \hat{t} \pm 2 \cdot SE(\hat{t}) is approximately a 95% CI.
Some peculiarities in the estimation theory

Example: N = 3, n = 2, simple random sample
s_1 = \{1,2\},\; s_2 = \{1,3\},\; s_3 = \{2,3\}
p(s_k) = 1/3 for k = 1, 2, 3

Let \hat{t}_1 = 3\bar{y}_s, unbiased.
Let \hat{t}_2 be given by:

\hat{t}_2(s_1) = 3 \cdot \tfrac{1}{2}(y_1 + y_2) = \hat{t}_1(s_1)
\hat{t}_2(s_2) = 3 \cdot (\tfrac{1}{2} y_1 + \tfrac{2}{3} y_3) = \hat{t}_1(s_2) + \tfrac{1}{2} y_3
\hat{t}_2(s_3) = 3 \cdot (\tfrac{1}{2} y_2 + \tfrac{1}{3} y_3) = \hat{t}_1(s_3) - \tfrac{1}{2} y_3
Also \hat{t}_2 is unbiased:

E(\hat{t}_2) = \sum_s \hat{t}_2(s)\, p(s) = \tfrac{1}{3} \sum_{k=1}^{3} \hat{t}_2(s_k) = \tfrac{1}{3} \cdot 3t = t

Var(\hat{t}_1) - Var(\hat{t}_2) = \tfrac{1}{6}\, y_3 (3 y_2 - 3 y_1 - y_3)

\Rightarrow Var(\hat{t}_1) > Var(\hat{t}_2) if y_3 > 0 and 3 y_2 - 3 y_1 > y_3

If the y_i are 0/1 variables, this happens when y_1 = 0, y_2 = y_3 = 1. For this set of values of the y_i's:

\hat{t}_1(s_1) = 1.5,\; \hat{t}_1(s_2) = 1.5,\; \hat{t}_1(s_3) = 3: never correct (the true total is 2)
\hat{t}_2(s_1) = 1.5,\; \hat{t}_2(s_2) = 2,\; \hat{t}_2(s_3) = 2.5

\hat{t}_2 clearly has less variability than \hat{t}_1 for these y-values.
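The comparison can be reproduced by enumerating the three equally likely samples; a sketch with exact fractions (the function name is mine):

```python
from fractions import Fraction

def compare(y1, y2, y3):
    # Values of t1_hat = 3*ybar_s and t2_hat on s1={1,2}, s2={1,3}, s3={2,3}
    h = Fraction(3, 2)
    t1 = [h * (y1 + y2), h * (y1 + y3), h * (y2 + y3)]
    t2 = [t1[0], t1[1] + Fraction(y3, 2), t1[2] - Fraction(y3, 2)]
    mean = lambda v: sum(v) / 3
    var = lambda v: sum((x - mean(v)) ** 2 for x in v) / 3
    return mean(t1), mean(t2), var(t1), var(t2)

m1, m2, v1, v2 = compare(0, 1, 1)  # the 0/1 case from the slide
print(m1, m2)  # both equal the true total 2 (both estimators unbiased)
print(v1, v2)  # Var(t1_hat) = 1/2 > Var(t2_hat) = 1/6
```

The variance gap 1/2 - 1/6 = 1/3 matches the formula (1/6) y_3 (3 y_2 - 3 y_1 - y_3) evaluated at (0, 1, 1).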
Let y be the population vector of the y-values. This example shows that

N\bar{y}_s

is not uniformly best (minimum variance for all y) among linear design-unbiased estimators.

The example shows that the "usual" basic estimators do not have the same properties in design-based survey sampling as they do in ordinary statistical models. In fact, we have the following much stronger result:

Theorem: Let p(\cdot) be any sampling design. Assume each y_i can take at least two values. Then there exists no uniformly best design-unbiased estimator of the total t.
Proof:
Let \hat{t} be unbiased, and let y^0 be one possible value of y. Then there exists an unbiased \hat{t}_0 with Var(\hat{t}_0) = 0 when y = y^0:

\hat{t}_0(s, y) = \hat{t}(s, y) - \hat{t}(s, y^0) + t_0, where t_0 is the total for y^0

1) \hat{t}_0 is unbiased: E(\hat{t}_0) = t - \sum_s \hat{t}(s, y^0)\, p(s) + t_0 = t
2) When y = y^0: \hat{t}_0 = t_0 for all samples s \Rightarrow Var(\hat{t}_0) = 0

This implies that a uniformly best unbiased estimator must have variance equal to 0 for all values of y, which is impossible.