Survey Sampling I: Simple Random Sampling
Lecture 3
K. Zuev
January 11, 2017
Sample surveys are used to obtain information about a large population. The purpose of survey sampling is to reduce the cost and the
amount of work that it would take to survey the entire population.
Familiar examples of survey sampling include taking a spoonful of
soup to determine its taste (a cook does not need to eat the entire pot)
and taking a blood sample to measure the red blood cell count (a medical technician does not need to drain you of blood). In this lecture we
learn how to estimate the population average and how to assess the
accuracy of the estimation using simple random sampling, the most basic
rule for selecting a subset of a population.
Figure 1: “By a small sample we may judge of the whole piece,” Miguel de Cervantes, “Don Quixote.” Photo source: wikipedia.org.
A Bit of History
The first known attempt to make statements about a population
using only information about part of it was made by the English merchant John Graunt. In his famous tract (Graunt, 1662) he describes a
method to estimate the population of London based on partial information. John Graunt has frequently been credited as the founder of
demography.
The second time a survey-like method was applied was more than
a century later. Pierre-Simon Laplace realized that it was important
to have some indication of the accuracy of the estimate of the French
population (Laplace, 1812).
Figure 2: Captain John Graunt. Photo
source: http://www.york.ac.uk/
Terminology
Let us begin by introducing some key terminology.
• Target population: The group that we want to know more about.
Often called “population” for brevity [2].
• Population unit: A member of the target population. In studying
human populations, population units are often individuals.
• Population size: The total number of units in the population [3].
Usually denoted by N.
• Unit characteristic: A specific piece of information about each
member of the population [4]. For unit $i$, we denote the numerical
value of the characteristic by $x_i$, $i = 1, \dots, N$.
• Population parameter: A summary of the characteristic for all
units in the population. One could be interested in various parameters, but here are the four examples that are used most often:
Figure 3: Pierre-Simon Laplace. Photo source: wikipedia.org.
[2] Defining the target population may be nontrivial. For example, in a political poll, should the target population be all adults eligible to vote, all registered voters, or all persons who voted in the last election?
[3] For very large populations, the exact size is often not known.
[4] For example, age, weight, income, etc.
1. Population mean (our focus in this lecture):
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i. \tag{1}$$
2. Population total:
$$\tau = \sum_{i=1}^{N} x_i = N\mu. \tag{2}$$
3. Population variance (our focus in the next lecture):
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2. \tag{3}$$
4. Population standard deviation:
$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}. \tag{4}$$
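As a minimal sketch, the four parameters in (1)-(4) can be computed directly for a small population. The function name `population_parameters` and the five-value population below are assumptions made purely for illustration; note that the variance divides by $N$, as in (3), not by $N - 1$.

```python
import math

def population_parameters(x):
    """Return (mean, total, variance, std) of a finite population x_1..x_N."""
    N = len(x)
    total = sum(x)                                     # tau, Eq. (2)
    mean = total / N                                   # mu, Eq. (1)
    variance = sum((xi - mean) ** 2 for xi in x) / N   # sigma^2, Eq. (3): divide by N
    std = math.sqrt(variance)                          # sigma, Eq. (4)
    return mean, total, variance, std

# Hypothetical population of N = 5 values (illustrative only).
population = [1, 1, 2, 3, 3]
mu, tau, sigma2, sigma = population_parameters(population)
print(mu, tau, sigma2, sigma)
```

In an actual survey these quantities are unknown, since computing them requires measuring every unit; the sketch only makes the definitions concrete.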
In an “ideal survey,” we take the entire target population, measure
the value of the characteristic of interest for all units, and compute
the corresponding parameter. This ideal (as almost all ideals) is rarely
met in practice: either the population is too large, or measuring $x_i$ is
too expensive, or both. In practice, we select a subset of the target
population and estimate the population parameter using this subset.
• Sample: A subset of the target population.
• Sample unit: A member of the population selected for the sample.
• Sample size: The total number of units in the sample. Usually
denoted by $n$. Sample size is often much less than the population
size, $n \ll N$.
Let $\mathcal{P} = \{1, \dots, N\}$ be the target population and $\mathcal{S} = \{s_1, \dots, s_n\}$
be a sample from $\mathcal{P}$ [5]. When it is not ambiguous, we will identify $\mathcal{P}$
and $\mathcal{S}$ with the corresponding values of the characteristic of interest,
that is
$$\mathcal{P} = \{x_1, \dots, x_N\} \quad \text{and} \quad \mathcal{S} = \{x_{s_1}, \dots, x_{s_n}\}. \tag{5}$$
To avoid cluttered notation, we denote $x_{s_i}$ simply by $X_i$, and thus,
$$\mathcal{S} = \{X_1, \dots, X_n\} \subset \{x_1, \dots, x_N\} = \mathcal{P}. \tag{6}$$
[5] $s_i \in \{1, \dots, N\}$ and $s_i \neq s_j$ for $i \neq j$.
[6] Essentially any function of $X_1, \dots, X_n$.
• Sample statistic: A numerical summary of the characteristic of the
sampled units [6]. The statistic estimates the population parameter.
For example, a reasonable sample statistic for the population mean
$\mu$ in (1) is the sample mean:
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i. \tag{7}$$
• Selection Rule: The method for choosing a sample from the target
population.
Many selection rules used in practice are probabilistic, meaning
that $X_1, \dots, X_n$ are selected at random according to some probability
method. Probabilistic selection rules are important because they
allow us to quantify the difference between the population parameters
and their estimates obtained from the randomly chosen samples.
There are a number of different probability methods for selecting a
sample. Here we consider the simplest: simple random sampling [7].
[7] More advanced methods include stratified random sampling, cluster sampling, and systematic sampling.
Simple Random Sampling
In simple random sampling (SRS), every subset of $n$ units in the population has the same chance of being the sample [8]. Intuitively, we
first mix up the population and then grab $n$ units. Algorithmically, to
draw a simple random sample from $\mathcal{P}$, we
1. Select $s_1$ from $\{1, \dots, N\}$ uniformly at random.
2. Select $s_2$ from $\{1, \dots, N\} \setminus \{s_1\}$ uniformly at random.
3. Select $s_3$ from $\{1, \dots, N\} \setminus \{s_1, s_2\}$ uniformly at random.
4. Proceed like this until $n$ units $s_1, \dots, s_n$ are sampled.
In short, we draw $n$ units one at a time without replacement [9].
[8] This chance is $1/\binom{N}{n}$.
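The sequential procedure above can be sketched in a few lines of Python. This is only an illustration: the function name `simple_random_sample`, the `rng` parameter, and the values `N=10`, `n=4` are assumptions, not part of the lecture.

```python
import random

def simple_random_sample(N, n, rng=random):
    """Draw n distinct unit labels from {1, ..., N}, one at a time without replacement."""
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= N")
    remaining = list(range(1, N + 1))        # units not yet selected
    sample = []
    for _ in range(n):
        k = rng.randrange(len(remaining))    # uniform position among the remaining units
        sample.append(remaining.pop(k))      # select that unit and remove it from the pool
    return sample

rng = random.Random(0)   # fixed seed only so the illustration is reproducible
s = simple_random_sample(N=10, n=4, rng=rng)
print(s)
```

In practice the standard library call `random.sample(range(1, N + 1), n)` performs the same kind of draw without replacement; the explicit loop is shown only because it mirrors steps 1-4.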
Questions: What is the probability that unit #1 is the first to be
selected for the sample [10]? What is the probability that unit #1 is the
second to be selected for the sample? What is the probability that
unit #1 is selected for the sample? How about unit #k?
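One possible line of reasoning for these questions, assuming only the sequential procedure described above (each draw is uniform over the units not yet selected), is the following. For the first two draws,
$$P(s_1 = 1) = \frac{1}{N}, \qquad P(s_2 = 1) = P(s_1 \neq 1)\, P(s_2 = 1 \mid s_1 \neq 1) = \frac{N-1}{N} \cdot \frac{1}{N-1} = \frac{1}{N}.$$
The same argument suggests $P(s_j = 1) = 1/N$ for every position $j = 1, \dots, n$. Since unit #1 can occupy at most one position, the events $\{s_j = 1\}$ are disjoint, and so
$$P(\text{unit \#1 is in the sample}) = \sum_{j=1}^{n} P(s_j = 1) = \frac{n}{N}.$$
Nothing in this argument is specific to unit #1, so the same values hold for any unit #k.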
So, let $X_1, \dots, X_n$ be the SRS sample drawn from the population $\mathcal{P}$,
and let us consider the sample mean $\bar{X}_n$ in (7) as an estimate of the
population mean $\mu$ in (1).
Our goal: to investigate how accurately $\bar{X}_n$ approximates $\mu$.
Before we proceed, let me reiterate a very important point: $x_i$, and
therefore $\mu$, are deterministic; $X_i$, and therefore $\bar{X}_n$, are random.
Since $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$, it is natural to start our investigation from
the properties of a single sample element $X_i$. Its distribution is fully
described by the following Lemma.
Lemma 1. Let $\xi_1, \dots, \xi_m$ be the distinct values assumed by the population
units [11]. Denote the number of population units that have the value $\xi_j$ by $n_j$.
Then $X_i$ is a discrete random variable with probability mass function
$$P(X_i = \xi_j) = \frac{n_j}{N}, \quad j = 1, \dots, m, \tag{8}$$
and its expectation and variance are
$$E[X_i] = \mu \quad \text{and} \quad V[X_i] = \sigma^2. \tag{9}$$
[9] SRS with replacement is discussed in S.L. Lohr (2009) Sampling: Design and Analysis.
[10] i.e. what is $P(s_1 = 1)$, or, equivalently, what is $P(X_1 = x_1)$?
[11] For example, if $x_1 = 1$, $x_2 = 1$, $x_3 = 2$, $x_4 = 3$, and $x_5 = 3$, then there are $m = 3$ distinct values: $\xi_1 = 1$, $\xi_2 = 2$, $\xi_3 = 3$.
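Lemma 1 can be checked exactly on the five-value example from the footnote by listing every ordered sample of distinct units, each of which is equally likely under SRS. The sketch below is an illustration only: the sample size `n = 3` and the helper name `pmf_of_position` are assumptions, and exact fractions are used to avoid rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

x = [1, 1, 2, 3, 3]      # population values from the footnote example
N, n = len(x), 3         # n = 3 is an arbitrary choice for illustration

mu = Fraction(sum(x), N)
sigma2 = sum((Fraction(v) - mu) ** 2 for v in x) / N

# Every ordered sample (s_1, ..., s_n) of distinct units is equally likely under SRS.
ordered_samples = list(permutations(range(N), n))

def pmf_of_position(i):
    """Exact pmf of the i-th sampled value (0-based) over all ordered samples."""
    counts = Counter(x[s[i]] for s in ordered_samples)
    return {value: Fraction(c, len(ordered_samples)) for value, c in counts.items()}

# Right-hand side of Eq. (8): n_j / N for each distinct value.
expected_pmf = {value: Fraction(c, N) for value, c in Counter(x).items()}

for i in range(n):
    pmf = pmf_of_position(i)
    mean_i = sum(v * p for v, p in pmf.items())
    var_i = sum((v - mu) ** 2 * p for v, p in pmf.items())
    print(i + 1, pmf == expected_pmf, mean_i == mu, var_i == sigma2)
```

Each printed row compares one position in the sample against (8) and (9); the check is exhaustive only for this tiny population and is not a proof of the lemma.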
As an immediate corollary, we obtain the following result:
Theorem 1. With simple random sampling,
$$E[\bar{X}_n] = \mu. \tag{10}$$
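The step from Lemma 1 to Theorem 1 is short enough to write out. A sketch, using only the linearity of expectation and $E[X_i] = \mu$ from (9):
$$E[\bar{X}_n] = E\left[\frac{1}{n} \sum_{i=1}^{n} X_i\right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n\mu = \mu.$$
Linearity of expectation does not require the $X_i$ to be independent, so the argument goes through even though SRS draws without replacement.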
Intuitively, this result tells us that “on average” $\bar{X}_n = \mu$ [12]. The
property of an estimator being equal to the estimated quantity on average is so important that it deserves a special name and a definition.
[12] This is good news and justifies the characteristic “reasonable estimate” of $\mu$ that we gave to $\bar{X}_n$ above.
Definition 1. Let $\theta$ be a population parameter and $\hat{\theta} = \hat{\theta}(X_1, \dots, X_n)$
be a sample statistic that estimates $\theta$. We say that $\hat{\theta}$ is unbiased if
$$E[\hat{\theta}] = \theta. \tag{11}$$
Thus, $\bar{X}_n$ is an unbiased estimate of $\mu$. The next step is to investigate how variable $\bar{X}_n$ is. As a measure of the dispersion of $\bar{X}_n$ about $\mu$,
we will use the standard deviation of $\bar{X}_n$ [13]:
$$\mathrm{se}[\bar{X}_n] = \sqrt{V[\bar{X}_n]}. \tag{12}$$
Let us find the variance [14]:
$$V[\bar{X}_n] = V\left[\frac{1}{n} \sum_{i=1}^{n} X_i\right] = \frac{1}{n^2} V\left[\sum_{i=1}^{n} X_i\right] = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{Cov}(X_i, X_j). \tag{13}$$
To continue, we need to compute the covariance.
Lemma 2. If $i \neq j$, then the covariance between $X_i$ and $X_j$ is
$$\mathrm{Cov}(X_i, X_j) = -\frac{\sigma^2}{N-1}. \tag{14}$$
The double sum in (13) contains $n$ terms with $i = j$, each equal to $V[X_i] = \sigma^2$, and $n(n-1)$ terms with $i \neq j$, each given by (14). And, therefore, we have:
Theorem 2. The variance of $\bar{X}_n$ is given by
$$V[\bar{X}_n] = \frac{\sigma^2}{n} \left(1 - \frac{n-1}{N-1}\right). \tag{15}$$
A few important observations are in order:
1. The factor $1 - \frac{n-1}{N-1}$ is called the finite population correction. It is approximately $1 - \frac{n}{N}$. The ratio $\frac{n}{N}$ is called the sampling fraction.
2. The finite population correction is less than one whenever $n > 1$. Therefore, $V[\bar{X}_n] < \frac{\sigma^2}{n}$. This means that SRS is more efficient than sampling with replacement.
3. If the sampling fraction is small, that is if $n \ll N$, then
$$V[\bar{X}_n] \approx \frac{\sigma^2}{n} \quad \text{and} \quad \mathrm{se}[\bar{X}_n] \approx \frac{\sigma}{\sqrt{n}}. \tag{16}$$
4. To double the accuracy of approximation $\bar{X}_n \approx \mu$ [15], the sample size $n$ must be quadrupled.
5. If $\sigma$ is small [16], then a small sample will be fairly accurate. But if $\sigma$ is large, then a larger sample will be required to obtain the same accuracy.
[13] Standard deviations of estimators are often called standard errors (se). Hence the notation in Eq. (12).
[14] If sampling were done with replacement then the $X_i$ would be independent, and we would have:
$$V[\bar{X}_n] = \frac{1}{n^2} V\left[\sum_{i=1}^{n} X_i\right] = \frac{1}{n^2} \sum_{i=1}^{n} V[X_i] = \frac{1}{n^2} \sum_{i=1}^{n} \sigma^2 = \frac{\sigma^2}{n}.$$
In SRS, however, sampling is done without replacement and this introduces dependence between the $X_i$.
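Theorems 1 and 2 can be checked exactly on a small population by enumerating every possible sample, since under SRS all subsets of size $n$ are equally likely. The sketch below assumes a hypothetical five-value population and $n = 3$; it is a numerical check under those assumptions, not a proof.

```python
from fractions import Fraction
from itertools import combinations

x = [1, 1, 2, 3, 3]   # hypothetical population, N = 5
N, n = len(x), 3      # n = 3 is an arbitrary choice for illustration

mu = Fraction(sum(x), N)
sigma2 = sum((Fraction(v) - mu) ** 2 for v in x) / N

# Under SRS every subset of n units is equally likely, so list all of them
# and compute the sample mean of each.
sample_means = [Fraction(sum(x[i] for i in subset), n)
                for subset in combinations(range(N), n)]

exact_mean = sum(sample_means) / len(sample_means)
exact_var = sum((m - exact_mean) ** 2 for m in sample_means) / len(sample_means)

fpc = 1 - Fraction(n - 1, N - 1)   # finite population correction
formula_var = sigma2 / n * fpc     # right-hand side of Eq. (15)

print(exact_mean == mu)            # unbiasedness, Theorem 1
print(exact_var == formula_var)    # Theorem 2
print(formula_var < sigma2 / n)    # smaller than the with-replacement variance
```

Exact fractions are used so that the comparison with (15) is an equality test rather than a floating-point approximation. For realistic population sizes the number of subsets is far too large to enumerate, which is why the closed-form result matters.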
Further Reading
1. The history of survey sampling, in particular how sampling became an
accepted scientific method, is described in a nice discussion paper
by J. Bethlehem (2009) “The rise of survey sampling.”
Next Time
The result (15) and the above observations are nice, but we have a
serious problem: we don’t know σ! Next time we will learn how to
estimate the population variance using SRS.
[15] i.e. to reduce $\mathrm{se}[\bar{X}_n]$ by half.
[16] i.e. the population values are not very dispersed.