Chapter 3
Element sampling: Part 1
3.1 Introduction
Before selecting the sample, the population must be divided into parts that are
called sampling units, or units. These units must cover the whole population and
they must not overlap, in the sense that every element in the population belongs to
one and only one unit. Sometimes the appropriate unit is obvious, as in a population
of light bulbs, in which the unit is the single bulb. Sometimes there is a choice of
unit. In sampling the people in a town, the unit might be an individual person, the
members of a family, or all persons living in the same city block. In sampling an
agricultural crop, the unit might be a field, a farm, or an area of land whose shape
and dimensions are at our disposal.
The construction of this list of sampling units, called a sampling frame, is often one of the major practical problems. We use the term direct element sampling
to denote sample selection from a frame that directly identifies the individual elements of the population of interest. That is, in element sampling, the sampling unit
is equal to the reporting unit. A selection of elements can take place directly from
the frame. In this chapter, we first consider a simple type of sampling design in which the first-order inclusion probabilities are equal for every element in the population.
3.2 Simple random sampling
Consider the problem of selecting $n$ units from a finite population of size $N$. There are $\binom{N}{n}$ possible samples in this case, and the simplest way of selecting a sample is to select one realization at random with equal probability. Such a sampling design is called simple random sampling (SRS) without replacement, or simply simple random sampling. The sampling distribution of the SRS of size $n$ is given by

$$P(A) = \begin{cases} \binom{N}{n}^{-1} & \text{if } |A| = n \\ 0 & \text{otherwise.} \end{cases} \qquad (3.1)$$
In this case, the sample inclusion probabilities are computed as follows:

$$\pi_i = \frac{n}{N} \qquad (3.2)$$

$$\pi_{ij} = \begin{cases} \dfrac{n}{N} & \text{if } i = j \\[4pt] \dfrac{n(n-1)}{N(N-1)} & \text{if } i \neq j. \end{cases} \qquad (3.3)$$
Thus, the Horvitz-Thompson (HT) estimator of the population total $Y = \sum_{i=1}^N y_i$ can be written as

$$\hat{Y}_{HT} = \frac{N}{n} \sum_{i \in A} y_i = N \bar{y}, \qquad (3.4)$$

and the HT estimator satisfies design-unbiasedness under the SRS design. That is, under SRS, the sample mean $\bar{y}$ is unbiased for the population mean $\bar{Y} = N^{-1} \sum_{i=1}^N y_i$. The sampling variance of the HT estimator is, by equation (2.8),
$$\begin{aligned} V\left(\hat{Y}_{HT}\right) &= -\frac{1}{2} \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \neq i}}^{N} \left(\pi_{ij} - \pi_i \pi_j\right) \left( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \right)^2 \\ &= \frac{1}{2}\,\frac{N}{n}\,\frac{N-n}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1}^{N} (y_i - y_j)^2. \end{aligned}$$

Since

$$\sum_{i=1}^{N} \sum_{j=1}^{N} (y_i - y_j)^2 = 2N \sum_{i=1}^{N} \left(y_i - \bar{Y}\right)^2,$$
we can obtain

$$V\left(\hat{Y}_{HT}\right) = \frac{N^2}{n}\,\frac{N-n}{N}\, S^2,$$

where

$$S^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left(y_i - \bar{Y}\right)^2 = \frac{1}{2N(N-1)} \sum_{i=1}^{N} \sum_{j=1}^{N} (y_i - y_j)^2. \qquad (3.5)$$
Thus, we can derive

$$V(\bar{y}_n) = \frac{1}{n}\,\frac{N-n}{N}\, S^2. \qquad (3.6)$$

For the special case of $n = 1$, (3.6) becomes

$$\frac{1}{N} \sum_{i=1}^{N} \left(y_i - \bar{Y}\right)^2,$$

which is often called the population variance, denoted by $\sigma_y^2$. That is, $\sigma_y^2$ can be understood as the sampling variance of the single sample observation under SRS of size $n = 1$. Using the population variance, the variance formula in (3.6) can be written as

$$V(\bar{y}_n) = \frac{1}{n} \left( 1 - \frac{n-1}{N-1} \right) \sigma_y^2, \qquad (3.7)$$

where $1 - (n-1)/(N-1)$ is the variance reduction factor due to without-replacement sampling; it is often called the FPC (finite population correction) term.
The FPC term disappears under simple random sampling with replacement.
To implement the SRS method in practice, one may consider a draw-by-draw method as follows. In the first draw, select one element at random from the entire population $U$ with probability $1/N$. Let $k_1$ be the index of the element selected in the first draw. In the second draw, select one element at random from $U - \{k_1\}$ with probability $1/(N-1)$. Let $k_2$ be the index of the element selected in the second draw. We continue the process until the $n$-th draw, in which one element is selected at random from $U - \{k_1, \cdots, k_{n-1}\}$ with probability $1/(N-n+1)$. In this draw-by-draw procedure, the probability of selecting element $i$ in the first draw is $1/N$, the probability of selecting element $i$ in the second draw is $\{(N-1)/N\} \times 1/(N-1) = 1/N$, and, continuing this way, the probability of selecting element $i$ in the $j$-th draw is

$$\frac{N-1}{N} \times \frac{N-2}{N-1} \times \cdots \times \frac{N-j+1}{N-j+2} \times \frac{1}{N-j+1} = \frac{1}{N}$$

for any $j = 1, 2, \cdots, n$. Thus, each of the $n$ draws has the same selection probability $1/N$, and the total probability of being selected in one of the $n$ draws is $\pi_i = n/N$. Such a draw-by-draw method may be quite cumbersome if $N$ is very large, as it requires numbering the elements in the population in advance and then repeating the selection process $n$ times.
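As an illustration, the following is a minimal Python sketch of the draw-by-draw procedure (the toy population values are hypothetical); it also computes the HT estimate (3.4) from the resulting sample.

```python
import random

def srs_draw_by_draw(population, n):
    """Select an SRS of size n without replacement: at each draw, pick one
    of the remaining elements with equal probability."""
    remaining = list(population)
    sample = []
    for _ in range(n):
        k = random.randrange(len(remaining))  # prob 1 / (number remaining)
        sample.append(remaining.pop(k))
    return sample

# Toy population (hypothetical y-values) and the HT estimate of the total Y
y = [3.0, 5.0, 2.0, 8.0, 6.0, 4.0]
N, n = len(y), 3
s = srs_draw_by_draw(y, n)
Y_ht = N / n * sum(s)  # equals N * (sample mean), as in (3.4)
print(s, Y_ht)
```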
McLeod and Bellhouse (1983) proposed a novel method of implementing SRS in a single pass through the list of records; it requires $n$ to be fixed in advance, but $N$ need not be known until the list is exhausted. The method was later named the reservoir sampling method, in the sense that $n$ sample elements are held in a reservoir, and an element of the reservoir is replaced whenever the next element in the population list is selected. In this reservoir method, the first $n$ elements in the population are stored in the reservoir. The $k$-th element $(k = n+1, \cdots, N)$ is selected into the reservoir with probability $n/k$, and upon selection one of the $n$ elements in the reservoir is removed with equal probability. The elements remaining in the reservoir after the last record is processed are the elements of the final sample.
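The reservoir rule above translates directly into a single-pass sketch. The following Python code is a generic illustration of that rule (not the authors' published implementation) and works even when the stream length is not known in advance.

```python
import random

def reservoir_srs(stream, n):
    """Single-pass SRS of size n: keep the first n records, then select the
    k-th record with probability n/k and let it replace a uniformly chosen
    element of the reservoir."""
    reservoir = []
    for k, item in enumerate(stream, start=1):
        if k <= n:
            reservoir.append(item)            # store the first n elements
        elif random.random() < n / k:         # select the k-th element w.p. n/k
            reservoir[random.randrange(n)] = item
    return reservoir

# Usage: any iterable works, e.g. a file or a generator of unknown length.
print(reservoir_srs(range(1, 1001), n=10))
```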
Now, consider variance estimation of the HT estimator under SRS. Since SRS is a fixed-size sampling design, we can use the SYG variance estimation formula in (2.12) to get

$$\begin{aligned} \hat{V}\left(\hat{Y}_{HT}\right) &= -\frac{1}{2} \sum_{i \in A} \sum_{j \in A} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}} \left( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \right)^2 \\ &= \frac{1}{2}\,\frac{N}{n}\,\frac{N-n}{n(n-1)} \sum_{i \in A} \sum_{j \in A} (y_i - y_j)^2. \end{aligned} \qquad (3.8)$$

Since

$$\sum_{i \in A} \sum_{j \in A} (y_i - y_j)^2 = 2n \sum_{i \in A} (y_i - \bar{y})^2,$$

we can obtain

$$\hat{V}\left(\hat{Y}_{HT}\right) = \frac{N^2}{n}\,\frac{N-n}{N}\, s^2, \qquad (3.9)$$
where

$$s^2 = \frac{1}{n-1} \sum_{i \in A} (y_i - \bar{y})^2.$$

Thus, under SRS, we have

$$E(s^2) = S^2. \qquad (3.10)$$
If $y$ is dichotomous, taking either 1 or 0, the population mean of $y$ equals the proportion of $y = 1$ in the population, namely $P = \Pr(y = 1)$. In this case, we can obtain $\sigma_y^2 = P(1-P)$, and the variance of the HT estimator $\hat{P} = \bar{y}$ of $P$ is then equal to

$$V(\hat{P}) = \frac{1}{n} \left( 1 - \frac{n}{N} \right) \frac{N}{N-1}\, P(1-P),$$

and its unbiased estimator is

$$\hat{V}(\hat{P}) = \frac{1}{n-1} \left( 1 - \frac{n}{N} \right) \hat{P}\left(1 - \hat{P}\right).$$
We now discuss the determination of the sample size $n$ under simple random sampling. Given a significance level $\alpha$, the margin of error, denoted by $d$, is defined to satisfy

$$\Pr\left( \left| \hat{\theta} - \theta \right| \le d \right) = 1 - \alpha.$$

That is, $d$ is half the length of the confidence interval for $\theta$. Thus, solving

$$z_{\alpha/2} \sqrt{ \frac{1}{n} \left( 1 - \frac{n}{N} \right) }\, S \le d$$

with respect to $n$, we get

$$n \ge \frac{S^2}{ d^2 / z_{\alpha/2}^2 + S^2/N }, \qquad (3.11)$$

which provides a lower bound on the desired sample size for a given $d$. However, since we usually do not know $S^2$ before observing the sample, we need an estimate of $S^2$, from a pilot survey or a similar survey of the same population. In many public opinion surveys, $y$ is a dichotomous variable; the maximum value of $S^2$ in this case is approximately $1/4$, and, ignoring the term $S^2/N$, (3.11) becomes

$$n \ge 0.25 \left( \frac{z_{\alpha/2}}{d} \right)^2. \qquad (3.12)$$

For $\alpha = 0.05$, $z_{\alpha/2} \approx 2$ and the above inequality reduces to $n \ge d^{-2}$.
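A small Python helper, sketched below, evaluates the bound (3.11); the function name and defaults are ours, not from the text.

```python
import math
from statistics import NormalDist

def srs_sample_size(S2, d, N, alpha=0.05):
    """Lower bound (3.11) on the SRS sample size for margin of error d."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}, e.g. 1.96 for alpha = 0.05
    return math.ceil(S2 / (d**2 / z**2 + S2 / N))

# Dichotomous worst case S^2 = 1/4 with d = 0.03 and a large N:
# roughly 1066 here; the crude rule n >= d^{-2} of (3.12), with z ~ 2,
# gives about 1112.
print(srs_sample_size(S2=0.25, d=0.03, N=10**6))
```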
3.3 Simple random sampling with replacement
We now consider the sampling design in which a sample of size $n$ is selected with equal probability with replacement. The number of distinct elements in the sample can be smaller than $n$ because the sample is selected with replacement, which allows for duplication. In the $k$-th draw, $k = 1, \cdots, n$, the $i$-th element in the population is selected with probability $1/N$. Let $z_1$ be the realized $y$-value in the first draw. The probability distribution of $z_1$ is given by

$$z_1 = \begin{cases} y_1 & \text{with probability } 1/N \\ y_2 & \text{with probability } 1/N \\ \;\vdots & \\ y_N & \text{with probability } 1/N. \end{cases}$$
Similarly, let $z_k$ be the realized $y$-value in the $k$-th draw. The with-replacement sampling makes $z_1, \cdots, z_n$ independently and identically distributed (IID). The mean and variance of $z_1$ are

$$E(z_1) = N^{-1} \sum_{i=1}^{N} y_i = \bar{Y} \qquad (3.13)$$

$$V(z_1) = N^{-1} \sum_{i=1}^{N} \left(y_i - \bar{Y}\right)^2 = \sigma_y^2. \qquad (3.14)$$

Thus, the best linear unbiased estimator of the population mean $\bar{Y}$ is the sample mean $\bar{z} = n^{-1} \sum_{i=1}^{n} z_i$, and its variance is

$$V(\bar{z}) = \frac{1}{n} \sigma_y^2. \qquad (3.15)$$
The variance in (3.7) under SRS without replacement is smaller than the variance in (3.15) under SRS with replacement. SRS with replacement is less efficient because the expected number of distinct elements in the sample is smaller than $n$, owing to the duplication allowed by with-replacement sampling.
For variance estimation, since $z_1, \cdots, z_n$ are IID, the sample variance

$$s_z^2 = \frac{1}{n-1} \sum_{i=1}^{n} (z_i - \bar{z})^2$$

can be used to estimate the population variance. Since $z_1, z_2, \ldots, z_n$ are IID with mean $\bar{Y}$ and variance $\sigma_y^2$, we have

$$E(s_z^2) = \sigma_y^2 \qquad (3.16)$$

and the variance estimator of $\bar{z}$ is

$$\hat{V}(\bar{z}) = \frac{1}{n} s_z^2.$$
SRS with replacement is a special case of PPS sampling, which will be covered in Section 4.3.
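As a sketch (our own illustration, with a hypothetical population list), SRS with replacement and the estimators above can be written in Python as:

```python
import random
import statistics

def srswr_mean_estimate(y, n):
    """SRS with replacement: n independent draws z_1, ..., z_n, each element
    selected with probability 1/N; returns z-bar and V-hat(z-bar) = s_z^2 / n."""
    z = [random.choice(y) for _ in range(n)]
    z_bar = statistics.fmean(z)
    s2_z = statistics.variance(z)  # divisor n - 1, as in the definition of s_z^2
    return z_bar, s2_z / n

y = [3.0, 5.0, 2.0, 8.0, 6.0, 4.0]  # hypothetical population
print(srswr_mean_estimate(y, n=4))
```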
3.4 Bernoulli sampling
The Bernoulli sampling design is a sampling design based on independent Bernoulli trials for the elements of the population. That is, the sample indicator functions follow

$$I_i \overset{i.i.d.}{\sim} \text{Bernoulli}(\pi), \quad i = 1, 2, \cdots, N, \qquad (3.17)$$

where $\pi$ is the first-order inclusion probability for each unit. We can express $\pi = n_0/N$, where $n_0$ is the expected sample size, determined before the sample selection. Under Bernoulli sampling, the realized sample size follows a binomial distribution $\text{Bin}(N, \pi)$, and the fact that the realized sample size is a random variable can reduce the efficiency of the resulting HT estimator.
Under this Bernoulli sampling, the HT estimator of the population mean is

$$\bar{Y}_{HT} = \frac{1}{N} \sum_{i \in A} \frac{y_i}{\pi_i} = \frac{n}{n_0}\, \bar{y}.$$

Thus, the HT estimator of the mean is not necessarily equal to the sample mean. The asymptotic variance of the sample mean is

$$V(\bar{y}) = \frac{1}{n_0} \left( 1 - \frac{n_0}{N} \right) S^2. \qquad (3.18)$$

On the other hand, the variance of the HT estimator of the mean is

$$V(\bar{Y}_{HT}) = \frac{1}{n_0} \left( 1 - \frac{n_0}{N} \right) \frac{1}{N} \sum_{i=1}^{N} y_i^2. \qquad (3.19)$$

Since (3.19) involves the uncentered second moment of $y$ rather than its variance, the sample mean is usually more efficient than the HT estimator in (3.19).
Example 3.1. Suppose that we are interested in estimating the proportion of students who pass a certain test at a university, where $N = 600$ students took the test. Using Bernoulli sampling with $\pi = 1/6$, a sample of size $n = 90$ is realized. Among the 90 sampled students, 60 are found to have passed. In this case, the HT estimator of the mean is $(90/100) \times (2/3) = 0.6$, which is different from the actual passing rate of $2/3$ in the sample. In the extreme case in which all 90 sampled students pass the exam, the HT estimate is still $90/100 = 0.9$.
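The numbers in Example 3.1 can be verified with a short sketch (variable names are ours):

```python
# Example 3.1 in code: N = 600, pi = 1/6, so n0 = N * pi = 100.
N, pi = 600, 1 / 6
n0 = N * pi              # expected sample size: 100
n, passed = 90, 60       # realized sample size and number of passes
y_bar = passed / n       # sample proportion: 2/3
y_ht = (n / n0) * y_bar  # HT estimator of the mean: 0.9 * (2/3) = 0.6
print(y_bar, y_ht)

# Extreme case: all 90 sampled students pass.
print((n / n0) * 1.0)    # 0.9, even though the sample proportion is 1
```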
If the sampling procedure is such that we repeat the Bernoulli sampling until the realized sample size $n$ equals the expected sample size $n_0$, then the resulting sampling procedure is exactly equal to SRS of size $n_0$. To show this result, note that

$$\Pr\left( I_1, I_2, \cdots, I_N \,\middle|\, \sum_{i=1}^{N} I_i = n_0 \right) = \frac{ \Pr\left( I_1, I_2, \cdots, I_N,\; \sum_{i=1}^{N} I_i = n_0 \right) }{ \Pr(n = n_0) }. \qquad (3.20)$$
Since, writing $p = \pi$,

$$\Pr\left( I_1, I_2, \cdots, I_N,\; \sum_{i=1}^{N} I_i = n_0 \right) = \begin{cases} \prod_{i=1}^{N} p^{I_i} (1-p)^{1-I_i} & \text{if } \sum_{i=1}^{N} I_i = n_0 \\ 0 & \text{otherwise} \end{cases} = \begin{cases} p^{n_0} (1-p)^{N-n_0} & \text{if } \sum_{i=1}^{N} I_i = n_0 \\ 0 & \text{otherwise} \end{cases}$$

and

$$\Pr(n = n_0) = \binom{N}{n_0} p^{n_0} (1-p)^{N-n_0},$$

the conditional probability in (3.20) is equal to the sampling design (3.1) under SRS of size $n_0$.
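This equivalence suggests a simple (if inefficient) rejection sketch, assuming nothing beyond the argument above:

```python
import random

def srs_via_rejected_bernoulli(N, n0):
    """Repeat Bernoulli sampling with pi = n0/N until the realized size is
    exactly n0; by (3.20), the accepted sample is an SRS of size n0."""
    pi = n0 / N
    while True:
        sample = [i for i in range(1, N + 1) if random.random() < pi]
        if len(sample) == n0:
            return sample

print(srs_via_rejected_bernoulli(N=600, n0=100))
```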
3.5 Systematic sampling
Systematic sampling is an alternative method of selecting an equal-probability sample, and it offers several practical advantages, particularly its simplicity of execution. In systematic sampling, a first element is drawn at random with equal probability among the first $a$ elements in the population list. The positive integer $a$ is fixed in advance and is called the sampling interval. The rest of the sample is determined by systematically taking every $a$-th element thereafter, until the end of the list. Thus there are only $a$ possible samples, each having the same probability of being selected. The simplicity of only one random draw is a great advantage. For example, to select a sample of 200 students from the list of 20,000 students at Iowa State University, we first select one element among the first 100 students. Suppose that the random integer we choose is 73. Then the students numbered 73, 173, 273, $\cdots$, 19,973 would be in the sample.
For a more formal definition of systematic sampling, let $a$ be the fixed sampling interval and let $n$ be the integer part of $N/a$, where $N$ is the population size. Then

$$N = na + c,$$

where the integer $c$ satisfies $0 \le c < a$. In systematic sampling, we first select one integer from $\{1, 2, \cdots, a\}$ with equal probability $1/a$. If $r$ is the selected integer, the final sample from the systematic sampling is

$$A_r = \begin{cases} \{r, r+a, r+2a, \cdots, r+(n-1)a\} & \text{if } c < r \le a \\ \{r, r+a, r+2a, \cdots, r+na\} & \text{if } 1 \le r \le c. \end{cases}$$
The first-order inclusion probability for each unit is $\pi_i = 1/a$, but the second-order inclusion probability is

$$\pi_{ij} = \begin{cases} 1/a & \text{if } j = i + ka \text{ for some integer } k \\ 0 & \text{otherwise.} \end{cases}$$
That is, systematic sampling can be viewed as selecting one cluster at random among the $a$ possible clusters. In this case, the second-order inclusion probability of two units is positive only when the two units belong to the same cluster. Thus, an unbiased estimator of the variance of the HT estimator does not exist. Also, the efficiency of systematic sampling depends on the way the list is sorted. This can be investigated using the intracluster correlation coefficient in cluster sampling, which will be covered in Chapter 6.
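In code, the whole design reduces to one random draw. The sketch below (our own) reproduces the Iowa State example from the beginning of this section.

```python
import random

def systematic_sample(N, a, r=None):
    """Systematic sampling with interval a over labels 1, ..., N: draw the
    start r uniformly from {1, ..., a}, then take every a-th element."""
    if r is None:
        r = random.randint(1, a)  # the single random draw
    return list(range(r, N + 1, a))

# Iowa State example: N = 20000, a = 100, start r = 73
# yields 73, 173, 273, ..., 19973 (a sample of 200 students).
s = systematic_sample(20000, 100, r=73)
print(s[:3], s[-1], len(s))  # [73, 173, 273] 19973 200
```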
In systematic sampling, the finite population $U$ is partitioned into $a$ groups

$$U = U_1 \cup U_2 \cup \cdots \cup U_a,$$

where the $U_r$ are mutually disjoint. The population total is then expressed as

$$Y = \sum_{i \in U} y_i = \sum_{r=1}^{a} \sum_{k \in U_r} y_k = \sum_{r=1}^{a} t_r,$$

where $t_r = \sum_{k \in U_r} y_k$. Thus, in estimating the total, the finite population can be treated as a population of $a$ elements with measurements $t_1, \cdots, t_a$.
The HT estimator can be written as

$$\hat{Y}_{HT} = \frac{t_r}{1/a} = a \sum_{k \in A} y_k$$

if $A = U_r$. Since we are performing an SRS of size one from the population of $a$ elements $\{t_1, \cdots, t_a\}$, the variance is

$$\text{Var}\left(\hat{Y}_{HT}\right) = a^2 \left( 1 - \frac{1}{a} \right) S_t^2 = a(a-1)\, S_t^2,$$

where

$$S_t^2 = \frac{1}{a-1} \sum_{r=1}^{a} \left(t_r - \bar{t}\right)^2$$

and $\bar{t} = \sum_{r=1}^{a} t_r / a$.
Now, assuming $N = na$,

$$V\left(\hat{Y}_{HT}\right) = n^2 a \sum_{r=1}^{a} \left(\bar{y}_r - \bar{y}_U\right)^2,$$

where $\bar{y}_r = t_r/n$ and $\bar{y}_U = \bar{t}/n$. Since $U = \cup_{r=1}^{a} U_r$, we can use the ANOVA decomposition to get

$$\text{SST} = \sum_{k \in U} \left(y_k - \bar{y}_U\right)^2 = \sum_{r=1}^{a} \sum_{k \in U_r} \left(y_k - \bar{y}_r\right)^2 + n \sum_{r=1}^{a} \left(\bar{y}_r - \bar{y}_U\right)^2 = \text{SSW} + \text{SSB}.$$
Thus, the variance can be written as

$$V\left(\hat{Y}_{HT}\right) = na \cdot \text{SSB} = N \cdot \text{SSB} = N\,(\text{SST} - \text{SSW}).$$

If SSB is small, the cluster means $\bar{y}_r$ are more alike and $V(\hat{Y}_{HT})$ is small. If SSW is small, then, for a fixed SST, SSB is large and $V(\hat{Y}_{HT})$ is large.
To compare systematic sampling and SRS in terms of variance, note that

$$V_{SRS}\left(\hat{Y}_{HT}\right) = \frac{N^2}{n} \left( 1 - \frac{n}{N} \right) \frac{1}{N-1} \sum_{k=1}^{N} \left(y_k - \bar{Y}_N\right)^2$$

$$V_{SY}\left(\hat{Y}_{HT}\right) = n^2 a \sum_{r=1}^{a} \left(\bar{y}_r - \bar{y}_U\right)^2.$$
We can compare the variances by making extra assumptions about the finite population. Cochran (1946) introduced the superpopulation model, from which the finite population is believed to be generated. The superpopulation model is an assumption about the finite population, and it quantifies the characteristics of the finite population in terms of a smaller number of parameters.
If the finite population is ordered randomly, then we may use an IID model, denoted by $\zeta: y_k \overset{iid}{\sim} (\mu, \sigma^2)$. In this case, we can obtain

$$E_\zeta \left\{ V_{SRS}\left(\hat{Y}_{HT}\right) \right\} = \frac{N^2}{n} \left( 1 - \frac{n}{N} \right) \sigma^2$$

$$E_\zeta \left\{ V_{SY}\left(\hat{Y}_{HT}\right) \right\} = \frac{N^2}{n} \left( 1 - \frac{n}{N} \right) \sigma^2.$$

Thus, the model expectations of the design variances are the same under the IID model.
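As a numerical check (our own sketch, not from the text), both design variances can be computed exactly for a given population with $N = na$; for a randomly shuffled population they should be close, in line with the equal model expectations above.

```python
import random
import statistics

def design_variances(y, a):
    """Exact design variances of the HT total estimator for a list y with
    N = n * a: SRS of size n versus systematic sampling with interval a."""
    N = len(y)
    n = N // a
    S2 = statistics.variance(y)  # S^2 with divisor N - 1
    v_srs = (N**2 / n) * (1 - n / N) * S2
    ybar_u = statistics.fmean(y)
    cluster_means = [statistics.fmean(y[r::a]) for r in range(a)]  # ybar_r
    v_sy = n**2 * a * sum((m - ybar_u) ** 2 for m in cluster_means)
    return v_srs, v_sy

# Randomly ordered population: the two variances are close on average.
y = [random.gauss(0, 1) for _ in range(1000)]
print(design_variances(y, a=10))
```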
References

Cochran, W.G. (1946). Relative accuracy of systematic and stratified random samples for a certain class of populations. Annals of Mathematical Statistics, 17, 164-177.

McLeod, A.I. and Bellhouse, D.R. (1983). A convenient algorithm for drawing a simple random sample. Applied Statistics, 32, 182-184.