Selecting Districts for the Uganda Health Workforce Retention Study Using Probability
Proportional to Size (PPS) Sampling
Probability proportional to size (PPS) refers to a sampling technique where the probability that a
particular sampling unit will be chosen in the sample is proportional to some known variable
such as population size or geographic size.1
Probability Sampling: A Very Brief Review
All sampling is either probability or non-probability sampling. In probability sampling, each
member of the population has a known probability of being selected. Random sampling is a
common form of probability sampling.2 Non-probability sampling is not random – instead
individuals are chosen based on some characteristic like being in the right place at the right time
(convenience sample) or being part of a social network (snowball sample).
Random sampling is not the only probability sampling option. Stratified sampling is a commonly
used probability method that is often considered superior to random sampling because it reduces
sampling error.3 A population is subdivided into strata whose members share at least one common
characteristic, such as gender, job type, geography, or income. To take a stratified sample,
you first identify the relevant strata and their actual representation in the population. You then
use random sampling techniques to choose a sufficient number of subjects from each stratum.
This can be very helpful when one stratum makes up a small share of the general population and
is therefore likely to be under-represented in a straight random sample.4
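As a rough illustration of the stratified approach described above, the following Python sketch groups a population into strata and then draws a simple random sample from each one. The function name, the `n_per_stratum` parameter, and the nurse/doctor population are all hypothetical and chosen only for illustration; they do not come from the study.

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Draw a simple random sample of up to n_per_stratum units from each stratum."""
    rng = random.Random(seed)
    # Group the population by stratum label.
    strata = defaultdict(list)
    for unit in population:
        strata[stratum_of(unit)].append(unit)
    # Sample without replacement inside each stratum.
    sample = {}
    for name, units in strata.items():
        k = min(n_per_stratum, len(units))
        sample[name] = rng.sample(units, k)
    return sample

# Hypothetical population: 95 nurses and 5 doctors (doctors are the rare stratum).
workers = [("nurse", i) for i in range(95)] + [("doctor", i) for i in range(5)]
picked = stratified_sample(workers, stratum_of=lambda w: w[0], n_per_stratum=3, seed=1)
```

Because each stratum is sampled separately, the rare stratum is guaranteed a fixed number of subjects, which a straight random sample of the same total size would not guarantee.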
PPS can be thought of as a modified version of stratified sampling and is usually used in
multi-stage cluster (or stratified) sampling for population-level studies. It can also be called
“unequal probability sampling” because the odds that a sampling unit will be chosen in the
sample increase with its size. PPS is used when the populations of sampling units vary in size. If
the sampling units are selected with equal probability, then the likelihood of an element from a
sampling unit with a large population being selected for the survey is actually less than the
likelihood of an element from a sampling unit with a small population being selected. PPS reduces
standard error and bias by increasing the likelihood that a sampling unit with a larger population
will be chosen over a sampling unit with a smaller population. The same correction can be made
after the sample is complete by using weighting techniques, but using PPS instead of weighting
allows you to do your calculations up front, before sample selection, as opposed to having to do
your calculations during analysis.5 If you use PPS and then select the same number of elements
from each chosen sampling unit, you can generally avoid having to do weighting later in the
process.
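The arithmetic behind this argument can be sketched as follows. The notation is an assumption introduced here for illustration and does not appear in the source, and the PPS inclusion probability is the usual approximation for selection without replacement.

```latex
% Assumed notation (not from the source):
%   N   = number of sampling units (e.g. districts)
%   M_i = population of sampling unit i,   M = \sum_{i=1}^{N} M_i
%   n   = number of sampling units selected
%   m   = number of elements selected within each chosen unit

% Units selected with equal probability:
\[
  P_{\text{equal}}(\text{element in unit } i)
    = \frac{n}{N} \cdot \frac{m}{M_i}
\]

% Units selected with PPS (approximate inclusion probability, assuming n M_i \le M):
\[
  P_{\text{PPS}}(\text{element in unit } i)
    \approx \frac{n M_i}{M} \cdot \frac{m}{M_i}
    = \frac{n m}{M}
\]
```

Under equal-probability selection, an element's chance of selection shrinks as the population of its unit grows. Under PPS the unit's population cancels out, so every element has roughly the same chance, provided the same number of elements is taken from each chosen unit.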
1 CDC PPS Module. Available at: http://www.cdc.gov/descd/MiniModules/pps.
2 Probability and Non-Probability Sampling. Available at: http://www.statpac.com/surveys/sampling.htm
3 Probability and Non-Probability Sampling. Available at: http://www.statpac.com/surveys/sampling.htm
4 Probability and Non-Probability Sampling. Available at: http://www.statpac.com/surveys/sampling.htm
5 CDC PPS Module. Available at: http://www.cdc.gov/descd/MiniModules/pps.
October 21, 2006
Page 1
PPS sampling is used when the populations of the sampling units vary in size, to ensure that
every element in the target population has an equal chance of being selected. There is no rule
for how much your sampling units should vary in size before you use PPS, but if there is very
little difference between the sizes of your sampling units, then PPS sampling may not be
warranted.6
An Example of PPS in Action
To give an example, let’s say that you want to interview health workers across a country to learn
about their job satisfaction and their intention to stay in their jobs. You could get a list of all
health workers in the country and then take a random sample of them, maybe first stratifying by
gender, cadre, geography, or other factors. But, there are many challenges with this approach.
First of all, a good list may or may not exist. Even if this list did exist, tracking down individual
health workers by name and last known location of work may exceed the amount of resources
available for your study.
Using PPS helps to address these issues. Stratifying and analyzing at the district level, instead of
at the individual level, may be more meaningful to the study. Plus, if the districts and facilities
are chosen randomly, you will have access to a broad sample of health workers without having to
track down individuals. But because districts instead of individuals are now your unit of
analysis, it does not make statistical sense to give two districts with very different
populations an equal probability of being chosen for the sample. The end goal is for the health
workers (the final sampling unit) to be representative of the population. This means that the
Nebbi District, with a small population, should not have the same chance of being selected as
Kampala District, with a very large population. If the probabilities were equal, your final sample
would not be generalizable to the population. By using PPS to select the districts where you
will recruit health workers, you end up with a sample of districts that is more likely
to be representative of the country as a whole.
Here are the steps to take to determine the sample for the Uganda Health Workforce Retention
Study. The example used below is from the 2006 study, but a new set of random numbers would
generate a new set of selected districts. In this study, because we are interested in reaching three
“Hard to Reach” districts, the instructions require some oversampling until the three Hard to
Reach districts are obtained.
Step 1: Identify the names and population sizes of all districts.
Step 2: Determine a “hardness” score for each district, using criteria from MOH HRD (Charles
Isabirye): Security (weight 50%); Remoteness (weight 10%); Social amenities & utilities (weight
10%); Human resources for health (weight 30%). These scores range from 0 to 0.97.
Step 3: Create a spreadsheet with the following information (see end of posting for copy of
spreadsheet):
• District name
• Hardness score
• Population size
• Percent of the total population of the country
• Cumulative percent of population (add the population percent of the district to the
  population percent of all the earlier districts on the spreadsheet)
• The cumulative population range for the district (the lower bound of the range is the
  cumulative percent of the previous district on the list and the upper bound of the
  range is the cumulative percent for the district)
6 CDC PPS Module. Available at: http://www.cdc.gov/descd/MiniModules/pps.
Step 4: Generate a list of 20 random numbers between 0 and 1 using any random number
generating program (Microsoft Excel will do this for you).
Step 5: Match each random number with a district by finding the cumulative population range
(lower and upper bounds) in which the random number falls in the spreadsheet. Because the
bounds are based on the percent of the total population in each district, larger districts
will have wider ranges and therefore will be more likely to be selected.
When you get close to using nine numbers, check whether three hard to reach districts
have been chosen in your random sample. If not, skip numbers that do not match hard to
reach districts until you find the next random number that matches a hard to reach
district. In the end, you should have a list of 9 districts, including three hard to reach
districts. Because this project oversamples hard to reach districts, you may have to go 12
or 15 numbers down your random number list to get 3 hard to reach districts included in
your sample.
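The steps above can be sketched in Python as follows. This is one possible reading of the procedure, not the study's actual spreadsheet: the district names, populations, and hard to reach flags are hypothetical, cumulative percentages are expressed as proportions between 0 and 1 so they can be compared directly with the random numbers, a district matched a second time is assumed to be skipped, and the oversampling rule is implemented by reserving the last open slots for hard to reach districts.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def cumulative_bounds(populations):
    """Upper bound of each district's cumulative population range, as a proportion."""
    total = sum(populations)
    return [running / total for running in accumulate(populations)]

def match_district(u, upper_bounds):
    """Index of the district whose range [lower, upper) contains the random number u."""
    return min(bisect_right(upper_bounds, u), len(upper_bounds) - 1)

def select_districts(districts, random_numbers, n_total=9, n_hard=3):
    """districts: list of (name, population, is_hard_to_reach) tuples.

    Returns district names in order of selection. The list may be shorter
    than n_total if the random numbers run out first.
    """
    upper = cumulative_bounds([pop for _, pop, _ in districts])
    chosen = []
    hard_count = 0
    for u in random_numbers:
        if len(chosen) == n_total:
            break
        name, _, is_hard = districts[match_district(u, upper)]
        if name in chosen:
            continue  # assumption: a district already selected is not counted twice
        slots_left = n_total - len(chosen)
        hard_needed = max(0, n_hard - hard_count)
        if not is_hard and slots_left <= hard_needed:
            continue  # remaining slots are reserved for hard to reach districts
        chosen.append(name)
        if is_hard:
            hard_count += 1
    return chosen

# Hypothetical districts (names, populations, and flags are made up for illustration).
districts = [
    ("District A", 1_200_000, False), ("District B", 450_000, False),
    ("District C", 300_000, True),    ("District D", 700_000, False),
    ("District E", 150_000, True),    ("District F", 520_000, False),
    ("District G", 380_000, False),   ("District H", 90_000, True),
    ("District I", 610_000, False),   ("District J", 240_000, True),
    ("District K", 830_000, False),   ("District L", 410_000, False),
]

rng = random.Random(2006)  # arbitrary seed; a new seed gives a new set of districts
random_numbers = [rng.random() for _ in range(20)]
selected = select_districts(districts, random_numbers)
```

Using a binary search over the cumulative upper bounds is equivalent to looking up the range by eye in the spreadsheet: a district holding a larger share of the population covers a wider interval and is therefore hit by more random numbers.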
Once you have chosen your clusters (in this example, districts), you can then use a variety of
methods to choose the sample from each cluster. If you want to maintain the integrity of your
sample, it is important to use another probability sampling technique and not resort to a
convenience sample or some other “nonscientific” methods for choosing your sample.7
See the spreadsheet of the Uganda districts chosen for the 2007 Health Workforce Retention
Study for an example of how this process works.
7 CDC PPS Module. Available at: http://www.cdc.gov/descd/MiniModules/pps.