1 Basic Concepts of Finite Population Sampling
A sample survey is a statistical procedure which
1. selects a sample of n units from a real, finite population of N units, each unit
being identified (for the purposes of selection) by a distinct label (eg name and
address for humans)
(THE DESIGN STAGE)
2. measures the fixed values of a number of variables (eg income, expenditure) on
each of the sampled units
(THE MEASUREMENT STAGE)
3. attempts to estimate the value of a population parameter θ (eg the population
mean, or a domain mean such as the mean for men) and give a measure of accuracy:
a confidence interval, nominally 95% [1]
(THE ESTIMATION STAGE)

[1] Although we say that the interval is 95%, we recognize that the coverage
probability may well be somewhat less than 0.95 (see later).
HOW?
1. (THE DESIGN STAGE): How do we select the sample? ie what is the sample
design or sampling mechanism? There are many possible sample designs, but
the most important, even if little used in actual survey practice, is
Definition 1 (SRS: Simple Random Sampling) All possible samples of size n (ie
with n distinct units) are equally likely to be drawn, so that each has probability
$1/\binom{N}{n}$.
The sampling fraction f = n/N is the probability that any particular unit is included in
the sample (see later). It ranges from 0 (the limit of sampling from an infinite population,
where sampling with replacement is equivalent to the sampling without replacement
studied here) to 1 (a census, in which the whole population is sampled and there is
clearly no sampling error: inferences on population parameters should be exact in the
absence of non-sampling errors; see below).
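As a concrete check, here is a minimal Python sketch (the labels and population size are illustrative, not from the notes) that enumerates all $\binom{N}{n}$ equally likely samples and verifies that each unit's inclusion probability is exactly f = n/N:

    from itertools import combinations

    # Illustrative labelled population of N = 5 units; sample size n = 2.
    labels = ["u1", "u2", "u3", "u4", "u5"]
    N, n = len(labels), 2

    # Under SRS every n-subset is equally likely: probability 1 / C(N, n).
    samples = list(combinations(labels, n))
    p = 1.0 / len(samples)

    # A unit's inclusion probability is the total probability of the
    # samples containing it; it should equal the sampling fraction n / N.
    for unit in labels:
        inclusion = sum(p for s in samples if unit in s)
        assert abs(inclusion - n / N) < 1e-12   # here 4 x 0.1 = 0.4 = 2/5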
2. (THE MEASUREMENT STAGE): (Note: the process of measurement in surveys,
especially those on populations of human beings, is a largely non-statistical problem
and varies greatly across survey contexts. In this course we will touch on only a small
aspect of this huge topic, namely questionnaire design, in the latter part.) Of major
priority is to minimize non-response (when sampled units fail to respond, either totally,
unit non-response, or just to some of the questions asked, item non-response),
measurement error (for example, when asking people how many cigarettes they smoked
last week), and other NON-SAMPLING ERRORS. For further discussion see later in
this course.
3. In order to do statistical inference in surveys, we use the sampling distribution
of our estimator, which is a suitable function of the sample data:
Definition 2 An estimator of a population parameter gives an estimate whatever
the sample.
Of course some estimators are sensible and give precise inferences, while others may
not be sensible (eg giving values outside the known range of a population parameter) or
may be very imprecise (eg an estimator which ignores most of the sample data).
Definition 3 The sampling distribution of an estimator is the collection of all
possible samples together with the corresponding estimates and their probabilities
under the sampling design for the fixed population.
Let Var[e] be the variance of an estimator e under its sampling distribution for the fixed
(unknown) population, and let v[e] be a sample estimator of this variance, called a variance estimator for e. Then under a normal approximation to the sampling distribution
of e, a nominal 95% confidence interval for θ is given by
$e \pm 1.96\sqrt{v[e]} \qquad (1.1)$
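In code, the interval (1.1) is a one-liner; this hedged sketch (the function name is ours, and the illustrative values anticipate Example 2 below) shows the computation:

    import math

    def nominal_ci(e, v, z=1.96):
        """Nominal 95% confidence interval e +/- z * sqrt(v[e])."""
        half = z * math.sqrt(v)
        return e - half, e + half

    print(nominal_ci(1.2, 0.01))   # approximately (1.004, 1.396)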
Definition 4 The actual probability that this interval contains the value of θ is called
the coverage probability for the estimator, design and population (usually below 0.95:
undercoverage, but sometimes above 0.95: overcoverage).
Let E[.] denote expectation (or mean) under the sampling distribution, then
Definition 5 e is unbiased for θ if for all populations
E[e] = θ ,
otherwise bias[e] = E[e] − θ is the bias of e.
Unbiasedness is a desirable property, but we should not get too worried about a very
small bias which diminishes as the sample size increases. Rather, we should be worried
about any estimator which has a large unknown bias. This applies not only to the bias
(if any) of estimators but also to the bias of variance estimators. If E[v[e]] < Var[e],
we might expect undercoverage of our basic C.I. (1.1), as it will tend to be too narrow,
but we should not worry about this for a large sample if we know the bias decreases to
zero as n → ∞ (this is true for all reasonable variance estimators).
Example 1 (Farms) A simple population of five farms is to be studied by drawing a
simple random sample of size two in order to estimate $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$, where $Y_i$ is the
area (in acres, devoted to growing wheat) of the ith farm. The farms are respectively
identified simply by the first five letters of the Roman alphabet and have areas A: 94,
B: 129, C: 150, D: 201, E: 270. Thus Ȳ = 168.8 and the finite population variance
$S^2 = \sum_{i=1}^{N}(Y_i - \bar{Y})^2/(N - 1)$, where N = 5 is the finite population size, takes the value
4702.70. We will see in the next chapter that the variance of the sample mean as an
estimator of Ȳ, that is the variance of the sampling distribution of this estimator, is
given by

$\mathrm{Var}[\bar{y}] = \frac{1 - f}{n} S^2,$

for a sample size n under simple random sampling, where f = n/N (here equal to 0.4)
is the sampling fraction. This result will be verified empirically in the table below, as
will the result that this sampling variance can be estimated by the corresponding sample
formula

$v[\bar{y}] = \frac{1 - f}{n} s^2,$

where $s^2 = \sum_{i=1}^{n}(y_i - \bar{y})^2/(n - 1)$ is the sample variance, calculated only from the n
sample values, here denoted $y_1, y_2, \ldots, y_n$.
FARMS: Sampling distribution of the sample mean and variance estimator under
simple random sampling.

Sample   prob.   ȳ        s²        v[ȳ]      Coverage?
A,B      0.1     111.5      612.5    183.75   NO
A,C      0.1     122.0     1568.0    470.40   NO
A,D      0.1     147.5     5724.5   1717.35   YES
A,E      0.1     182.0    15488.0   4646.40   YES
B,C      0.1     139.5      220.5     66.15   NO
B,D      0.1     165.0     2592.0    777.60   YES
B,E      0.1     199.5     9940.5   2982.15   YES
C,D      0.1     175.5     1300.5    390.15   YES
C,E      0.1     210.0     7200.0   2160.00   YES
D,E      0.1     235.5     2380.5    714.15   NO
mean             168.8     4702.7   1410.81   0.6
But the prime objective of this table will be to examine a quantity for which there is
no theoretical result: the coverage probability of the nominal 95% confidence interval
formed by $\bar{y} \pm 1.96\sqrt{v[\bar{y}]}$. This is just the probability that an interval calculated from a
sample will contain, or cover, the true value, that is, the actual population quantity to be
estimated. For simple random sampling estimating the population mean, it is simply the
proportion of samples giving intervals containing Ȳ.
Note that the means have been calculated by simple averaging, as all 10 possible
samples have the same probability (0.1) of being selected. The means or expectations of
the sampling distributions of the sample mean and sample variance are therefore equal
to the values of the population mean and population variance respectively. These agree
with the general theorem in the next chapter.
The coverage probability is considerably less than the nominal level of 0.95, a
phenomenon known as undercoverage. However, if we had used a percentage point of the
t-distribution with, say, one degree of freedom instead of the Normal, then the coverage
probability would have been 1, the phenomenon of overcoverage.
The variance of the (discrete) sampling distribution of ȳ also agrees with the theorem
in the next chapter as
$\mathrm{Var}[\bar{y}] = E[\bar{y}^2] - (E[\bar{y}])^2 = \frac{1}{10}(111.5^2 + 122.0^2 + \cdots + 235.5^2) - 168.8^2 = 1410.81.$
However, it can be seen that the distribution of ȳ is a very poor approximation to the
Normal!
It is worthwhile noting the short-cut formula $(y_1 - y_2)^2/2$ for the sample variance
when n = 2.
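To make the table's mechanics concrete, here is a minimal Python sketch (our reconstruction of the computations, not code from the notes) that enumerates all ten samples, recomputes each row, and verifies the empirical claims above:

    from itertools import combinations
    from math import sqrt
    from statistics import mean

    areas = {"A": 94, "B": 129, "C": 150, "D": 201, "E": 270}
    N, n = 5, 2
    f = n / N                                                     # 0.4
    Ybar = mean(areas.values())                                   # 168.8
    S2 = sum((y - Ybar) ** 2 for y in areas.values()) / (N - 1)   # 4702.7

    rows = []
    for y1, y2 in combinations(areas.values(), n):
        ybar = (y1 + y2) / 2
        s2 = (y1 - y2) ** 2 / 2            # short-cut formula for n = 2
        v = (1 - f) / n * s2               # variance estimator v[ybar]
        covers = abs(ybar - Ybar) <= 1.96 * sqrt(v)
        rows.append((ybar, s2, v, covers))

    # Checks against the theory quoted above (up to floating-point error):
    assert abs(mean(r[0] for r in rows) - Ybar) < 1e-9    # E[ybar] = Ybar
    assert abs(mean(r[1] for r in rows) - S2) < 1e-9      # E[s2] = S2
    var_ybar = mean(r[0] ** 2 for r in rows) - Ybar ** 2  # 1410.81
    assert abs(var_ybar - (1 - f) / n * S2) < 1e-9
    print("coverage:", sum(r[3] for r in rows) / len(rows))   # 0.6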
This further example illustrates that the definitions we have made apply generally to
all sampling problems, not just SRS:
Example 2 (Unequal probability sampling) For this population e is biased (with a bias
of +0.02), and v[e] is also biased (with a bias of +0.0014). Note that the true variance
Var[e] is calculated directly from the sampling distribution using the standard formulae
for discrete distributions, giving

$E[e^2] - (E[e])^2 = 0.5 \times (1.2)^2 + 0.3 \times (0.8)^2 + 0.2 \times (0.9)^2 - (1.02)^2 = 0.0336.$
The coverage is NOT 2/3, but is obtained by adding the probabilities of the covering
samples, or equivalently by taking the mean of the coverage indicator (counting YES as
1 and NO as 0).
Sampling distribution of an estimator e of a parameter θ = 1

Sample    prob.   e      v[e]     Coverage?
1         0.5     1.2    0.01     NO
2         0.3     0.8    0.04     YES
3         0.2     0.9    0.09     YES
mean              1.02   0.035    0.5
variance          0.0336
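Again as a hedged sketch (ours, with the numbers read straight from the table), these quantities follow mechanically from the sampling distribution:

    from math import sqrt

    theta = 1.0
    # (probability, estimate e, variance estimator v[e]) for each sample
    dist = [(0.5, 1.2, 0.01), (0.3, 0.8, 0.04), (0.2, 0.9, 0.09)]

    E_e = sum(p * e for p, e, _ in dist)                 # 1.02 -> bias +0.02
    E_v = sum(p * v for p, _, v in dist)                 # 0.035 -> bias +0.0014
    Var = sum(p * e * e for p, e, _ in dist) - E_e ** 2  # 0.0336
    cov = sum(p for p, e, v in dist
              if abs(e - theta) <= 1.96 * sqrt(v))       # 0.5, not 2/3
    print(E_e, E_v, Var, cov)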
Compare unbiased estimators e by their variances Var[e]: $e_1$ is better than $e_2$ if
$\mathrm{Var}[e_1] < \mathrm{Var}[e_2]$, but this inequality often depends on the population; that is, for some
populations $e_1$ is better than $e_2$, whereas for others it is $e_2$ which is the better estimator.
One of the aims of sampling theory is to identify the types of population which make
one estimator better than another. If an estimator is always worse than another, ie for
every population, no matter what the values in it, then it is clearly not worth considering
(we say it is inadmissible).
More generally compare biased and unbiased estimators by
Definition 6 The mean square error of an estimator, MSE[e], is given by
$\mathrm{MSE}[e] = E[(e - \theta)^2] = \mathrm{Var}[e] + (\mathrm{bias}[e])^2.$
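For instance, applying this to Example 2 above:

$\mathrm{MSE}[e] = 0.0336 + (0.02)^2 = 0.0340.$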
However, a result that is beyond the scope of an intermediate course is
Godambe's Theorem: there is no best (linear) (unbiased) estimator!
Summary To summarise, the sampling distribution is used in at least two important
ways:
1. to assess the accuracy of a given estimate by a nominal 95% confidence interval
based on a variance estimator (for the unknown variance of the estimator e of θ,
the parameter of interest);
2. to compare different estimators, perhaps using different sample designs and/or
different sample sizes, by looking at variances or, more generally, mean square
errors.