Cluster Sampling

PPS Sampling
Some populations contain a number of elements that have an extremely large
value for the study variable. This is often the case in business surveys. A suitable sampling technique in such a
case, especially for the estimation of a total, is one in which the inclusion probability depends on the size of the
population element. Reduction in variance can then be expected if the size measure and the study variable are
closely related.
Because this sampling technique is based on inclusion probabilities proportional to relative sizes of the
population elements, it is called sampling with probability proportional to size (PPS). In PPS sampling,
inclusion probabilities will vary according to the relative sizes of the elements. The size of a population element
is measured by an auxiliary positive-valued variable z. It is assumed that the value Zk of the auxiliary variable is
known for each population element k, since the relative size equals the quotient pk = Zk/Tz, where Tz is the
population total of the auxiliary variable, or more precisely $T_z = \sum_{k=1}^{N} Z_k$.
Commonly used size measures are variables that physically measure the size of a population element. In
business surveys, for example, the number of employees in a business firm is a convenient measure of size, and
in a school survey the total number of pupils in a school is also a good size measure.
The auxiliary variable z is chosen so that its variability resembles that of the study variable y. More
precisely, a size measure z is sought for which the ratio Yk/Zk is as nearly constant as possible across the
population elements. This is because the efficiency of PPS sampling depends on how closely the ratio Yk/Zk
stays at a constant C for all the population elements.
If the ratio remains nearly a constant, then the design variance of an estimator will be small. In PPS
sampling, the inclusion probabilities πk are proportional to the relative sizes pk = Zk/Tz of the elements, and
the individual weighting of the sampled elements is based on the inverse values of these relative sizes. It is
possible to draw a PPS sample either without or with replacement. Calculation of the inclusion probabilities is
easier to manage under with-replacement sampling.
Obtaining these probabilities can be complicated in without-replacement PPS sampling because,
once the first element has been sampled, the relative sizes of the remaining (N − 1) elements change and new
inclusion probabilities must be calculated. Various techniques have been developed to overcome this
difficulty, and PPS sampling can be very efficient, especially for the estimation of the total, if a good size
measure is available.
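The role of the inverse relative sizes can be illustrated with a minimal sketch. It assumes with-replacement PPS sampling and the standard estimator of a total for that design, (1/n) Σ Yk/pk (often called the Hansen-Hurwitz estimator; the text above does not name an estimator). The population values and the function name `hansen_hurwitz_total` are hypothetical.

```python
import random

def hansen_hurwitz_total(y, z, n, rng):
    """Estimate the total of y from n with-replacement PPS draws."""
    t_z = sum(z)
    p = [zk / t_z for zk in z]  # relative sizes p_k = Z_k / T_z
    # Draw n element indices with single-draw probabilities p_k.
    draws = rng.choices(range(len(y)), weights=p, k=n)
    # Weight each sampled value by the inverse of its relative size.
    return sum(y[k] / p[k] for k in draws) / n

# Hypothetical population in which Y_k / Z_k is exactly the constant C = 3.
z = [10, 20, 30, 40, 400]
y = [3 * zk for zk in z]

rng = random.Random(1)
estimates = [hansen_hurwitz_total(y, z, n=2, rng=rng) for _ in range(5)]
print(estimates, sum(y))
```

Because Yk/Zk is constant in this made-up population, every draw contributes the same value Yk/pk, so each estimate coincides with the true total up to rounding, whichever elements are drawn. With real data the ratio only approximately holds and the estimates vary accordingly.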
Sample Selection
A number of sampling schemes have been proposed for selecting a sample with probability proportional to size.
The starting point is knowledge of the values of the auxiliary variable z for each population element so that
probabilities of selection can be calculated. The inclusion probability πk for a population element k is
proportional to the relative size Zk/Tz. For example, in the trivial case of simple random sampling with
replacement, the relative sizes are pk = 1/N for each k. The quantity 1/N is also called the single-draw selection
probability of a population element k. The inclusion probability of an element for a sample of size n would be
πk = n × pk = n/N. But in PPS sampling, the inclusion probabilities πk vary and, thus, it is not an
equal-probability sampling design, in contrast to simple random sampling and systematic sampling. In practice, the
selection of a PPS sample can be based on the relative sizes of the population elements or, alternatively, on the
cumulative sum of size measures. The cumulative total for the kth element is $G_k = \sum_{i=1}^{k} Z_i$, so that $G_N = T_z$.
The natural numbers [1, G1] are associated with the first population element, and the numbers [G1 + 1, G2]
with the second element; generally, the kth element receives the numbers belonging to the interval [Gk−1 + 1,
Gk]. The sample selection process is based on these figures. We consider five specific selection schemes for
PPS sampling. These are Poisson sampling, which resembles Bernoulli sampling, the cumulative total method
with replacement or without replacement, systematic sampling with unequal probabilities and the Rao–Hartley–
Cochran method (RHC method; Rao et al. 1962). Of these, the cumulative total method with replacement and
systematic sampling with unequal probabilities are considered in more detail. In the examples, the variable
HOU85 measures the size of a population element. It is register-based and gives the number of households in
each population municipality.
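The interval assignment can be sketched in code. This is a minimal illustration of the cumulative total method with replacement, assuming integer size measures and assuming that each draw is a random integer between 1 and Tz that selects the element whose interval contains it. The sizes below are made up (they are not HOU85 values), and the function name is hypothetical.

```python
import bisect
import itertools
import random

def cumulative_total_sample(sizes, n, rng):
    """Select n elements with replacement, proportional to integer sizes."""
    cum = list(itertools.accumulate(sizes))  # G_1, ..., G_N
    total = cum[-1]                          # G_N = T_z
    sample = []
    for _ in range(n):
        r = rng.randint(1, total)            # random integer in [1, T_z]
        # First index k with G_k >= r, i.e. r lies in [G_{k-1} + 1, G_k].
        sample.append(bisect.bisect_left(cum, r))
    return cum, sample

sizes = [5, 1, 14]  # hypothetical integer size measures
cum, sample = cumulative_total_sample(sizes, n=4, rng=random.Random(0))
print(cum, sample)  # indices in the sample are 0-based
```

Here the first element owns the numbers 1 to 5, the second owns 6, and the third owns 7 to 20, so the third element is selected on any draw with probability 14/20.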
Poisson sampling
This sampling scheme uses a list-sequential selection procedure. First the inclusion probabilities πk = n×Zk/Tz
are calculated. Then, let ε1, . . . , εk, . . . , εN be independent random numbers drawn from the uniform (0,1)
distribution. If εk < πk, then the element k is selected. This procedure is applied to all population elements k =
1, . . . ,N, in turn.
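The list-sequential procedure can be sketched as follows. The size measures and the function name are hypothetical, and the handling of elements with n×Zk/Tz above 1 is an assumption of this sketch (it simply refuses to proceed), since the text does not discuss that case.

```python
import random

def poisson_sample(sizes, n, rng):
    """Return the indices selected by Poisson sampling, pi_k = n * Z_k / T_z."""
    total = sum(sizes)
    pi = [n * zk / total for zk in sizes]
    if max(pi) > 1:
        # Assumption: very large elements would need separate treatment,
        # which this sketch does not attempt.
        raise ValueError("an inclusion probability exceeds 1")
    # One independent uniform(0,1) number per element, taken in list order.
    return [k for k, pik in enumerate(pi) if rng.random() < pik]

sizes = [10, 20, 30, 40]  # hypothetical size measures
sample = poisson_sample(sizes, n=2, rng=random.Random(42))
print(sample)
```

Since each element is accepted or rejected independently, the realized sample size is random; n is only its expected value.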
Cluster Sampling
Cluster sampling in social and business surveys is motivated by the need for practical, economic and sometimes
also administrative efficiency.
An important advantage of cluster sampling is that a sampling frame at the element level is not needed.
The only requirements are for cluster-level sampling frames and frames for subsampling elements from the
sampled clusters.
Cluster-level frames are often easily accessible, for example, for establishments, schools, blocks or block-like
units etc.
Moreover, these existing structures provide the opportunity to include important structural information as part of
the analysis. For instance, in an educational survey it is practical to use the information that pupils are clustered
within schools and further clustered as classes or teaching groups within schools.
Schools can be taken as the population of clusters from which a sample of schools is first drawn and then a
further sample of teaching groups can be drawn from those schools that have been sampled.
If all the pupils in the sampled teaching groups are measured, then the design belongs to the class of two-stage
cluster-sampling designs. And in addition to sample selection and data collection, the multi-level structure can
be used in the analysis, for example, for examining differences between schools.
Thompson (2012), p. 157
Cluster and Systematic Sampling
Although systematic sampling and cluster sampling seem on the surface to be opposites—the one spacing out
the units of a sample and the other bunching them together—the two designs share the same structure.
The population is partitioned into primary units, each primary unit being composed of secondary units.
Whenever a primary unit is included in the sample, the y-values of every secondary unit within it are observed.
In systematic sampling, a single primary unit consists of secondary units spaced in some systematic fashion
throughout the population.
In cluster sampling, a primary unit consists of a cluster of secondary units, usually in close proximity to each
other. In the spatial setting, a systematic sample primary unit may be composed of a collection of plots in a grid
pattern over the study area.
Cluster primary units include such spatial arrangements as square collections of adjacent plots or long, narrow
strips of adjacent units.
The key point in any of the systematic or clustered arrangements is that whenever any secondary unit of a
primary unit is included in the sample, all the secondary units of that primary unit are included.
Even though the actual measurements may be made on secondary units, it is the primary units that are selected.
In principle, one could dispense with the concept of the secondary units, regarding the primary units as the
sampling units and using, as the variable of interest for any primary unit, the total of the y-values of the
secondary units within it. Then all properties of estimators may be obtained based on the design by which the
sample of primary units is selected.
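Treating primary units as the sampling units can be sketched as follows. The sketch assumes that primary units are drawn by simple random sampling without replacement (the passage does not fix a design) and uses the usual expansion estimator of the population total, N/n times the sum of the sampled primary-unit totals. The data and the function name are hypothetical.

```python
import random

def estimate_total(clusters, n, rng):
    """Expansion estimate of the population total from n sampled primary units."""
    N = len(clusters)
    # The variable of interest for a primary unit is the total of its y-values.
    totals = [sum(values) for values in clusters]
    chosen = rng.sample(range(N), n)  # SRS of primary units, without replacement
    return N / n * sum(totals[i] for i in chosen)

# Hypothetical population: four primary units of unequal size.
clusters = [[1, 2, 3], [4, 5], [6], [2, 2, 2]]
estimate = estimate_total(clusters, n=2, rng=random.Random(3))
census = estimate_total(clusters, n=len(clusters), rng=random.Random(3))
print(estimate, census)
```

When every primary unit is sampled (n = N), the estimate reduces to the population total itself, which is a quick sanity check on the weighting.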
However, several common features of systematic and cluster sampling make these designs worth considering as
special cases:
1. In systematic sampling, it is not uncommon to have a sample size of 1, that is, a single primary unit.
2. In cluster sampling, the size of the cluster may serve as auxiliary information that may be used either in
selecting clusters with unequal probabilities or in forming ratio estimators.
3. The size and shape of clusters may affect efficiency.
Notation (Thompson and Steczkowski):

Quantity                                  Thompson        Steczkowski
Number of clusters                        N               K
Number of clusters in the sample          n               k
Number of elements in a cluster           Mi              Nj
Number of elements in the population      M = Σi Mi       N
Value of the variable Y in a cluster      yij             Yij
Total of Y in a cluster                   yi = Σj yij
Total of Y in the population
Mean of Y in the population
Mean of Y in a cluster

THE BASIC PRINCIPLE
Since every secondary unit is observed within a selected primary unit, the within-primary-unit
variance does not enter into the variances of the estimators.
Thus, the basic systematic and cluster sampling principle is that to obtain estimators of low variance or mean
square error, the population should be partitioned into clusters in such a way that one cluster is similar to
another.
Equivalently, the within-primary-unit variance should be as great as possible in order to obtain the most
precise estimators of the population mean or total. The ideal primary unit contains the full diversity of
the population and hence is “representative.”
With natural populations of spatially distributed plants, animals, or minerals, and with many human populations,
the condition above is typically satisfied by systematic primary units, in which the secondary units are spaced
apart, but not by clusters of geographically adjacent units.
Cluster sampling is more often than not carried out for reasons of convenience or practicality rather than
to obtain the lowest variance for a given number of secondary units observed.
With many natural populations, units near each other tend to be similar, so with compact clusters, the
within-primary-unit correlation coefficient ρ is greater than zero. For such populations, the value of ρ, and hence
the variance of the estimator $\hat{\tau}$ of the population total, will tend to be larger with square
clusters, in which the secondary units are close together, than with long, thin clusters, in which at least some of
the secondary units are far apart. With systematic sampling, the secondary units of each primary unit are spaced
relatively far apart, so that ρ may well be less than zero. For this reason, systematic sampling is inherently
efficient with many real populations. The advantage of cluster sampling is that it is often less costly to sample a
collection of units in a cluster than to sample an equal number of secondary units selected at random from the
population.
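The contrast between compact clusters and systematic primary units can be checked numerically. The sketch below assumes simple random sampling of n primary units out of N and the standard variance of the expansion estimator of the total for that design, N(N − n)σ²/n, where σ² is the variance of the primary-unit totals computed with divisor N − 1. The population, a smooth trend 1, 2, ..., 12, is made up for illustration.

```python
def between_unit_variance(primary_units):
    """Variance of the primary-unit totals, with divisor N - 1."""
    totals = [sum(unit) for unit in primary_units]
    N = len(totals)
    mean = sum(totals) / N
    return sum((t - mean) ** 2 for t in totals) / (N - 1)

def variance_of_total(primary_units, n):
    """Design variance of the expansion estimator under SRS of n primary units."""
    N = len(primary_units)
    return N * (N - n) * between_unit_variance(primary_units) / n

# Hypothetical population with a smooth trend: neighbouring units are similar.
population = list(range(1, 13))

# Compact clusters: three adjacent units each.
compact = [population[i:i + 3] for i in range(0, 12, 3)]
# Systematic primary units: every fourth unit, from four possible starts.
systematic = [population[start::4] for start in range(4)]

var_compact = variance_of_total(compact, n=1)
var_systematic = variance_of_total(systematic, n=1)
print(var_compact, var_systematic)
```

Both partitions use four primary units of three secondary units each, so the comparison is for the same number of observed units. The systematic units each span the whole trend, which makes their totals more alike and the variance smaller than with the compact clusters.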