TECHNICAL PAPER
Sampling and Weighting
©Copyright 2000-2005 Vision Critical Communications Inc.
1750 - 1111 West Georgia Street
Vancouver, BC, V6E 4N5
http://www.visioncritical.com
Table of Contents
Introduction
Sampling Strategies
  Probability Sampling
  Non-probability sampling
Stratified Quotas Sampling
  Multidimensionality and Tolerance
Mutually Exclusive Samples
Frequency of Inclusion
Sample Size and Estimates
  Simple Random Sampling
    Sample Size for Estimating Population Mean and Population Total
    Sample Size for Estimating Population Proportion
    Estimating Population Mean, Population Total, and Bound on Error of Estimation
  Stratified Sampling
    Sample Size for Estimating Population Mean and Population Total
    Sample Size for Estimating Population Proportion
    Estimating Population Mean, Population Total, and Bound on Error of Estimation
  Stratified Quota Sampling
    Sample Size for Estimating Population Mean, Population Total and Population Proportion
    Estimating Population Mean, Population Total, and Bound on Error of Estimation
Adjustment Factors
Weighting
  Convergence
References
Introduction
Information from surveys¹ has a large effect on every facet of our everyday lives.
Recorded measurements shape a whole range of policies, including government, economic and
social programs. Businesses conduct surveys for their internal operations and, more importantly, to
inform crucial management decisions. One area of business activity that relies heavily
on surveying techniques is marketing. Decisions such as which products should be marketed, in
which area, and most importantly at what price are regularly made on the basis of survey data.
An ideal opinion poll would gather information from all members of the population of
interest. In most cases, however, the population, and hence the cost of conducting such a poll, is too
large for the researcher to examine all of its units. In fact the very first recorded opinion
poll, conducted by The Harrisburg Pennsylvanian in 1824, gathered information from only a portion
of the population of interest. It showed Andrew Jackson leading John Quincy Adams by 335 votes to
169 in the race for the United States presidency. Unscientific surveys then grew in popularity but
remained mostly local until 1916, when the Literary Digest embarked on a national survey and
correctly predicted that Woodrow Wilson would be the next president of the United States. In
1936, however, sampling bias caught up with the Literary Digest's surveying practice. The esteemed
journal wrongly reported that Alf Landon was likely to defeat Franklin D. Roosevelt, based on
information collected from its readership (circulation at the time was estimated at 2.5 million).
At the same time George Gallup conducted a far smaller but more scientifically based survey, in
which he polled a demographically representative sample and correctly predicted Roosevelt's
victory. Needless to say, the Literary Digest went out of business shortly afterwards and the era of
“scientific surveys”² began.
Since in most cases the objective of a modern survey is inference, it is of utmost
importance that the medium of inference, the sample, is chosen carefully so that it can be used to
represent the population. In fact sampling is a major operational step for anyone creating a
statistically valid survey, a quality control study, a check on the accuracy of records, or any other
study in which a conclusion is drawn from an inspection of a fragment of a population.
¹ The word "survey" is used most often to describe a method of gathering information from a sample of
population units.
² The term “scientific survey” is restricted to those studies that produce analytical information about
society for the needs of social or economic decision-making, scientific research or international comparisons.
However, a potential obstacle to inference, even when the sampling step is completed
correctly, is unit non-response (the situation in which the characteristics of the population units that
responded to a particular survey differ from those present in the targeted population). The most
efficient way to minimize non-response bias is to perform Weighting (i.e., post-stratification).
Weighting is the process of assigning weights to respondents so that the marginal totals of the weights on
specified characteristics agree with the corresponding totals for the population [1]. Once the
weights are applied, the collected data match the overall characteristics of the population. Most
researchers today routinely apply weights to survey respondents when under-response is not
substantial. This is because weight adjustments, when used judiciously, bring the overall
proportions of respondents in line with the targeted population, thereby reducing the effect of
non-response bias.
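As a minimal sketch of the idea, assuming a single characteristic only: the weight for a category can be taken as its population share divided by its share among respondents. The function name and the respondent counts below are hypothetical, and matching several characteristics at once requires an iterative procedure that this sketch does not attempt.

```python
def poststratification_weights(sample_counts, population_shares):
    """Return one weight per category: population share / respondent share.

    Assumption: a single categorical characteristic, with shares given as
    fractions that sum to 1.
    """
    total = sum(sample_counts.values())
    weights = {}
    for category, count in sample_counts.items():
        if count == 0:
            raise ValueError(f"no respondents in category {category!r}")
        sample_share = count / total
        weights[category] = population_shares[category] / sample_share
    return weights


# Hypothetical respondents: 380 men and 620 women; population is 48% / 52%.
sample_counts = {"Male": 380, "Female": 620}
population_shares = {"Male": 0.48, "Female": 0.52}
weights = poststratification_weights(sample_counts, population_shares)

# After weighting, the weighted share of each category matches the population.
weighted_total = sum(sample_counts[c] * weights[c] for c in sample_counts)
weighted_male_share = sample_counts["Male"] * weights["Male"] / weighted_total
```

Under-represented categories receive weights above 1 and over-represented categories weights below 1, so the weighted marginal totals agree with the population totals.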
Sampling Strategies
The distinction between a representative sample and any given subset of population units is best
described by defining two major models of data use: descriptive and inferential statistics.
Descriptive statistics summarize a collection of data in a clear and understandable way but do
not attempt to go beyond the data set (the sample response) in order to make inferences. Inferential
statistics, on the other hand, provide conclusions that extend beyond the data. That is, inferential
statistics make inferences from the sample about the population from which it was drawn. To
accomplish this task, members of the sample must accurately reflect the characteristics of the
population they represent. Hence, for the purpose of inferential statistics a sample cannot be just any
given subset of population units; it must be representative of the population. Sampling techniques
in social science can be divided into two major categories:

Probability Sampling - Sampling models that utilize some form of random selection.

Non-Probability Sampling - Sampling models in which the selection of population units is
arbitrary or subjective. Non-probability samples cannot depend upon the rationale of
probability theory.
Probability Sampling
For a sampling model to be considered a probability sampling model, each population unit has
to have a known probability of being included in a sample. This allows for the statistical projection
of characteristics based on the sample to the population of interest. The most common probability
sampling models used in an online panel environment are as follows:

Figure 1. Simple Random Sample
Figure 2. Stratified Random Sample

Random Sampling - Any sort of sampling where, in advance of the selection of the sample,
each member of the population has a calculable and non-zero chance of selection.

Simple Random Sampling - Sampling model in which each member of the population has
the same probability of being chosen. Moreover, the sample is drawn in such a way that every
possible sample of the same size has the same chance of being selected.

Stratified Random Sampling - Sampling model in which samples are obtained by grouping
the members of the population into non-overlapping sub-groups (i.e., strata) and then
selecting a simple random sample from each sub-group. This method often improves the
representativeness of the sample by reducing sampling error.

Stratified Quotas Sampling - Sampling model in which the population is first segmented
into mutually exclusive sub-groups, just as in stratified sampling. However, each sub-group
is defined by setting quotas on the categorical variables of interest (for example gender, age,
employment, etc.) to ensure a proper mix of different social groups. The final sample
consists of simple random samples drawn from each sub-group.

Other models used in social science are: Systematic, Clustered and Multistage Sampling.
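The difference between the simple random and stratified random models defined above can be sketched as follows. This is an illustration under assumptions, not the paper's sampling tool: the frame is a hypothetical list of dictionaries, allocation across strata is assumed to be proportional, and the function names are invented for the example.

```python
import random
from collections import defaultdict


def simple_random_sample(frame, n, rng):
    """Every subset of size n has the same chance of being selected."""
    return rng.sample(frame, n)


def stratified_random_sample(frame, n, stratum_of, rng):
    """Draw a simple random sample inside each stratum.

    Assumption: proportional allocation, so a stratum holding 30% of the
    frame receives about 30% of the sample. Rounding can make the total
    differ slightly from n when the shares do not divide evenly.
    """
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(frame))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical frame: 300 units in BC and 700 in QC.
frame = [{"id": i, "province": "BC" if i < 300 else "QC"} for i in range(1000)]

srs = simple_random_sample(frame, 100, rng)
strat = stratified_random_sample(frame, 100, lambda u: u["province"], rng)
bc_in_strat = sum(1 for u in strat if u["province"] == "BC")
```

The stratified draw always contains exactly 30 BC units here, whereas the share of BC units in the simple random sample varies from draw to draw.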
Figures 1, 2 and 3 depict the differences among probability sampling techniques. Members of
the sample are the red units. The categorical variable of interest, “Province”, has 13 categories. (See
the top right corner of Figure 2.) The population of interest for the sampling models depicted in Figures
1 and 2 consists of Canadian residents, while the population of interest for the sampling model
depicted in Figure 3 consists of potential customers of an imaginary franchise, say “Samples R
Us”, which has 30% of its locations in BC, 30% in AB, 15% in SK and 25% in QC. The Sample
Frame³ consists of Canadian residents.

³ The list of members of the population from which the actual sample for the survey is selected. Ideally,
the sample frame contains every member of the target population. However, more often than not there are
substantial differences.

Figure 3. Stratified Quotas Sample
Note that the sample in Figure 1 is not a representative sample. This is because Simple Random
Sampling does not guarantee proportionality for a finite population (and sample) size.
In fact this model is not recommended if a variable of interest is highly segmented and the difference
between sample and population size is large. Comparing Figures 1 and 2, one can see that in this
particular example Stratified Sampling has reduced sampling error, thereby improving the
representativeness of the sample. Finally, Figure 3 depicts a scenario in which neither of the two
traditional models would produce a representative sample, because of the inherent bias in the sample
frame. In this case Stratified Quota Sampling is an appropriate model since it eliminates the existing
bias. (For further details please refer to the Stratified Quota Sampling section.)
Non-probability sampling
Non-probability sampling strategies in general cannot be used to infer from the sample to
the population. Non-probability sampling models are considerably less expensive, but the results
obtained from a non-probability study are of limited value and must be treated with caution when
used for anything but descriptive statistics. The most common non-probability sampling
methods used in an online panel environment are as follows:

Convenience Sampling - Sampling model in which members of the population are chosen
based on their relative ease of access.

Snowball Sampling - Sampling model in which the first respondent refers a friend, who
then refers a friend, and so on.

Purposive (Panel Filter) Sampling - Sampling model in which the researcher chooses the
sample based on who they think would be appropriate for the study.

Ad Hoc Quotas Sampling - A quota is established (say 70% men and 30% women) and
researchers are free to choose any respondent they wish as long as the quota is met.

There are numerous other non-probability sampling models used in social science (for
example Modal or Expert sampling).
Stratified Quotas Sampling
The origins of Stratified Quota Sampling can be traced to the sample-balancing problem defined
by Deming back in 1940. The US Census Bureau was required to derive a cross-tabulation for the
joint distribution of two (or more) variables. However, the joint distribution was available only from
sample data, while the distribution of each variable was available across the population. Sample-balancing
is a process which adjusts individual cell counts (sample data) to marginal totals (census data).
When presented with a form of quota sampling, most traditional researchers restrict
data analysis to descriptive statistics. This is because in its truest form quota sampling does not
require any randomness (see the definition of Ad Hoc Quotas Sampling). However, when the sample
frame is highly skewed with respect to the targeted population (as is often the case for web-based
panels), Stratified Quotas, if used sensibly, actually reduce sampling bias, therefore yielding a much
more representative sample. To further clarify this point consider the following example:
Example 1: Suppose that the three categorical variables of interest for our study are “Gender”, “Age”
and “Education”. “Gender” has two categories: (Male) and (Female). Let “Age” have three
categories: (1-19), (20-64) and (65+), and finally let “Education” have five categories: (No High
School), (High School), (Trade or Diploma), (Non Degree College) and (Some University).
Suppose that the distribution with respect to the variables of interest is as follows, where the first
figure in each pair is the share among panel members (bold in the original) and the second is the
census figure⁴ (red in the original):
Gender: (Male -- 38%, 48%), (Female -- 62%, 52%);
Age: ((1-19) -- 12%, 22%), ((20-64) -- 78%, 46%), ((65+) -- 10%, 32%);
Education: (No High School -- 13%, 22%), (High School -- 12%, 24%),
(Trade or Diploma -- 18%, 13%), (Non Degree College -- 25%, 18%),
(Some University -- 32%, 23%);
Moreover, suppose that our research requires us to conduct a scientifically valid survey where
the targeted population consists of Canadian residents. If we were to use any standard
probability model, it is likely that the resulting sample would be significantly skewed with respect
to census data.
Since the distribution of any representative sample needs to be in accordance with census
data, not the sample frame, we have to set appropriate quotas on the variables of interest (in fact
matching census data), thereby eliminating (or at least reducing; see for example Table
1) sampling bias. Moreover, if we define an outlier as an observation that lies an abnormal distance
from other values in a representative sample from a population, a properly implemented stratified
quota sampling module automatically isolates and excludes outliers from the sample.
Variable    Category              Sample Frame    Population Distribution    Sample
                                  Distribution    (Quotas)                   Distribution
Gender      Male                  38%             48%                        50%
            Female                62%             52%                        50%
Age         1-19                  12%             22%                        22%
            20-64                 78%             46%                        45%
            65+                   10%             32%                        33%
Education   No High School        13%             22%                        20%
            High School           12%             24%                        26%
            Trade or Diploma      18%             13%                        12%
            Non Degree College    25%             18%                        20%
            Some University       32%             23%                        22%
Table 1: Output from a Stratified Quotas Sampling module.
⁴ Census of Canada data (printed in red in the original).
It is important to note that in practice, unlike Simple Random Sampling and traditional
Stratified Sampling, a Stratified Quotas Sampling module does not necessarily produce a sample of
the required size and/or the requested distribution. It is not hard to imagine a sample frame (list of
panelists) so dissimilar to the targeted population that it is highly unlikely a representative sample
can be devised.
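The shortfall described above can be illustrated with a deliberately simplified sketch. It assumes quotas on one variable only and a hypothetical frame of dictionaries; the function name, the `key` parameter and the panel counts are inventions for this example and do not describe the actual module.

```python
import random


def quota_sample(frame, n, quotas, key, rng):
    """Draw round(n * quota) units at random from each category of one variable.

    Returns (sample, shortfall), where shortfall maps a category to the number
    of units that the frame could not supply.
    """
    sample, shortfall = [], {}
    for category, share in quotas.items():
        target = round(n * share)
        members = [u for u in frame if u[key] == category]
        drawn = rng.sample(members, min(target, len(members)))
        sample.extend(drawn)
        if len(drawn) < target:
            shortfall[category] = target - len(drawn)
    return sample, shortfall


rng = random.Random(1)
# Hypothetical panel of 100 members: 38 men and 62 women.
frame = [{"id": i, "gender": "Male" if i < 38 else "Female"} for i in range(100)]
quotas = {"Male": 0.48, "Female": 0.52}  # population (census) shares

sample, shortfall = quota_sample(frame, 100, quotas, "gender", rng)
```

Requesting 100 units from this panel asks for 48 men when only 38 are available, so the draw returns 90 units and reports a shortfall of 10 in the (Male) category.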
Multidimensionality and Tolerance
Stratified Quota Sampling produces a sample by substituting the equal inclusion probabilities of
units in the sample frame with unequal inclusion probabilities calculated with respect to the specified
quotas. In other words, the Simple Random Sampling model within a stratum is replaced with the Random
Sampling model. In complex studies the number of quotas (i.e., the number of variables of interest and/or
the number of categories per variable) tends to increase the complexity of producing a representative
sample. The problem of multidimensionality (often referred to as the “Curse of Multidimensionality”) is
present in many areas of data analysis, such as Segmentation, Conjoint Analysis and Cross-Tabulation. As
the number of quotas increases, the task of calculating proper inclusion probabilities becomes more
demanding. Hence, it is important to choose the variables of interest wisely. Ideal candidates are those
variables which are closely related to key survey outcome variables. (For technical details on
multidimensionality issues please refer to the Weighting section, as a panelist's inclusion probability is
closely related to its weight.)
In practice not all variables influence key survey outcome variables equally. For example, the
variables “Salary”, “Gender”, “Age” and “Education” are all related to a variable such as “Computer
Literacy”, but it might be reasonable to assume that “Education” and “Age” are somewhat more
closely related to it than “Gender” and “Salary”. To ease the burden of multidimensionality the Stratified
Quota Sampling module allows different tolerance levels to be specified for variables, where tolerance
refers to an acceptable difference between sample and population totals. For example, if the quotas for
categories (Male) and (Female) of the variable “Gender” are set at 48% and 52% respectively, and the
tolerance level is 5%, the distribution in the representative sample must agree with the specified quotas
within 5 percentage points. Consequently, the percentage of males in a representative sample (call it Pm)
must be in the range of 43 to 53 while the percentage of females (call it Pf) must be in the
range of 47 to 57, where Pm + Pf = 100 (see the rightmost two columns of Table 1). Combining
tolerance levels with quotas (especially where the sample frame is significantly skewed
with respect to the targeted population) allows more flexibility, thereby increasing the likelihood of
obtaining a representative sample.
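The tolerance rule can be written as a short check. This is only a sketch: the function names are hypothetical, and shares and tolerance are assumed to be expressed in percentage points.

```python
def allowed_range(quota, tolerance):
    """Lower and upper limits for a category share, in percentage points."""
    return (quota - tolerance, quota + tolerance)


def within_tolerance(sample_shares, quotas, tolerance):
    """True when every category share is within `tolerance` of its quota."""
    return all(abs(sample_shares[c] - quotas[c]) <= tolerance for c in quotas)


gender_quotas = {"Male": 48.0, "Female": 52.0}
gender_sample = {"Male": 50.0, "Female": 50.0}  # sample column of Table 1

male_range = allowed_range(gender_quotas["Male"], 5.0)
female_range = allowed_range(gender_quotas["Female"], 5.0)
accepted = within_tolerance(gender_sample, gender_quotas, tolerance=5.0)
rejected = within_tolerance({"Male": 42.0, "Female": 58.0}, gender_quotas, 5.0)
```

With a 5-point tolerance the 50/50 gender split of Table 1 is accepted, while a 42/58 split is rejected because both categories fall outside their ranges.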
Mutually Exclusive Samples
As Cross-Sectional⁵-like designs grew in popularity in the polling community, so did
the need for an automated tool capable of devising multiple mutually exclusive⁶
representative samples. When such samples are produced sequentially and the sample frame
contains a large number of population units, the task is computationally manageable. However, when
the sample frame is limited and representative samples need to be produced simultaneously (sometimes
using different sampling models), the complexity of the problem soon becomes substantial. In fact
producing a single optimal representative sample with the Stratified Quota Sampling model is a
hard problem (see the Weighting section for details), so simultaneously producing multiple
mutually exclusive representative samples with Stratified Quota Sampling only adds to the
complexity.
Since in most cases in an online panel environment the sample frame is highly restrictive,
researchers who have the choice are advised to draw mutually exclusive samples sequentially
(especially if one of the samples requires the Stratified Quota Sampling module). To see this consider
a scenario in which the sample frame has cardinality 10,000 and two mutually exclusive yet
similarly defined “Stratified Quota Samples”, each of cardinality 1000, are needed. When the samples are
requested in parallel, the restrictions of the sample frame mean that the obtainable representative
samples would most likely have a size smaller than required (making both samples inadequate for
research purposes). However, if they are drawn sequentially there is a much greater chance that at least one
of the samples would have the required number of units.
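A sequential draw can be sketched as below. To keep the example short, simple random sampling stands in for the Stratified Quota Sampling module, which is an assumption of this sketch; the point shown is only that each draw removes its units from the frame before the next draw begins.

```python
import random


def draw_sequentially(frame_ids, sizes, rng):
    """Draw samples one after another; each draw removes its units from the frame.

    Assumption: simple random sampling replaces the quota module here.
    A later sample may come back smaller than requested if the frame runs out.
    """
    remaining = list(frame_ids)
    samples = []
    for size in sizes:
        chosen = rng.sample(remaining, min(size, len(remaining)))
        chosen_set = set(chosen)
        remaining = [u for u in remaining if u not in chosen_set]
        samples.append(chosen)
    return samples


rng = random.Random(2)
frame_ids = range(10_000)  # sample frame of cardinality 10,000
first, second = draw_sequentially(frame_ids, [1000, 1000], rng)
overlap = set(first) & set(second)
```

Because the first sample is completed before the second is attempted, any shortage in the frame is concentrated in the later sample instead of degrading both.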
Frequency of Inclusion
The key to maintaining a high response rate in an online panel environment is developing an
automated tool which helps researchers maximize the likelihood of respondents accepting an
invitation to participate while minimizing the respondent burden of going through a study. To increase
the likelihood of acceptance the goal is not to overburden population units with research requests but at
the same time to maintain high coverage of the sample frame. An automated tool needs to
continuously monitor panelists' activity in order to adjust their inclusion probability accordingly (for
example, more recent behavioral patterns ought to be weighted more heavily than past ones because of
the evolving and dynamic nature of web-based panels). This is an essential feature since it relieves
researchers of the responsibility to filter out population units based on their inclusion frequency (and/or
response rate) prior to sampling, allowing them to shift their focus toward question development
and project management.

⁵ A study that collects measurements on a population over time by repeating the same survey on two or more
occasions. During each time period, a separate but comparable representative sample of population units is
drawn from the population.
⁶ Mutually exclusive samples have no population units in common.
Sample Size and Estimates
Sample size is the number of observations included in the sample in order to make inferences
about the targeted population with the required precision. If the sample is too large it will certainly carry
greater precision, but it will also waste resources. Conversely, small samples are less costly but
may produce erroneous results. Therefore it is important to determine the proper size of the
sample. In general, key survey outcome variables estimate either a population total, mean or
percentage. The following sections describe methods for determining the proper sample size for the
common probability sampling models.
Simple Random Sampling
The simple random sampling model is a suitable choice when the population of interest is relatively
small and the sampling frame is complete and up to date.
Sample Size for Estimating Population Mean and Population Total
The sample size n that is required to estimate the population mean \(\mu\) with bound of error B
can be found by setting two standard deviations of the estimator equal to B:

\[ 2\sqrt{V(\bar{x})} = B, \]

where the variance of the estimator \(\bar{x}\) is given by

\[ V(\bar{x}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) \]

and N is the size of the targeted population. Similarly, the sample size n required to estimate the
population total is calculated by setting

\[ 2\sqrt{V(N\bar{x})} = 2N\sqrt{V(\bar{x})} = B. \]

Solving for n, the following formula is obtained:

\[ n = \frac{N\sigma^2}{(N-1)D + \sigma^2}, \]

where

\[ D = \frac{B^2}{4} \quad \text{for estimating the population mean, and} \]

\[ D = \frac{B^2}{4N^2} \quad \text{for estimating the population total.} \]

However, the population variance \(\sigma^2\) is unknown and needs to be estimated from prior
knowledge.
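The formula translates directly into a small helper. This is a sketch under assumptions: the function name and the `estimate` parameter are hypothetical, and rounding the result up to a whole unit is a choice made here, not something the text prescribes.

```python
import math


def srs_sample_size(N, sigma, B, estimate="mean"):
    """Sample size for simple random sampling: n = N*sigma^2 / ((N-1)*D + sigma^2).

    D = B^2 / 4 for the population mean and B^2 / (4 N^2) for the total.
    The result is rounded up (an assumption of this sketch).
    """
    if estimate == "mean":
        D = B**2 / 4
    elif estimate == "total":
        D = B**2 / (4 * N**2)
    else:
        raise ValueError("estimate must be 'mean' or 'total'")
    n = N * sigma**2 / ((N - 1) * D + sigma**2)
    return math.ceil(n)


# Illustrative inputs: 25,000 units, sigma estimated as range / 4 = 10,000.
n_mean = srs_sample_size(25_000, 10_000, B=500)
n_total = srs_sample_size(25_000, 10_000, B=20_000_000, estimate="total")
```

Halving the bound B roughly quadruples the required sample while n is small relative to N, which is why the choice of B dominates survey cost.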
Example 2: Suppose the average salary \(\mu\) of a local newspaper subscriber is to be
estimated. There are 25,000 subscribers, typical salaries range from $30,000 to
$70,000, and the bound on the error of estimation is $500. The range is frequently estimated as four
standard deviations, \(4\sigma\), giving

\[ \sigma \approx \frac{\text{range}}{4} = \frac{\$70{,}000 - \$30{,}000}{4} = \$10{,}000. \]

Therefore, the required sample size is

\[ n = \frac{4 \cdot 25{,}000 \cdot (10{,}000)^2}{(25{,}000 - 1)\,(500)^2 + 4\,(10{,}000)^2} \approx 1504. \]
Example 3: Suppose the total salary of the 25,000 local newspaper subscribers is to be
estimated with a bound on the error of estimation of $20,000,000. As in the previous example, the
population standard deviation \(\sigma\) is estimated at $10,000. Then the required sample size is

\[ n = \frac{4\,(25{,}000)^3\,(10{,}000)^2}{(25{,}000 - 1)\,(20{,}000{,}000)^2 + 4\,(25{,}000)^2\,(10{,}000)^2} \approx 610. \]
Sample Size for Estimating Population Proportion
The estimate of a population proportion reflects the proportion of the targeted population that
possesses some specified characteristic. Because each population unit either possesses or does not
possess a particular attribute a, this has the characteristics of a binomial experiment, where a=1 and a=0
correspond to the presence and absence of attribute a respectively. Therefore, the population
proportion p can be viewed as the population mean \(\mu\) of 1's and 0's. The sample size n that is
required to estimate the population mean \(\mu\) with bound of error B and population variance
\(\sigma^2\) is given by

\[ n = \frac{4N\sigma^2}{(N-1)B^2 + 4\sigma^2}. \]

Substituting \(p(1-p)\) for \(\sigma^2\) we get

\[ n = \frac{4Np(1-p)}{(N-1)B^2 + 4p(1-p)}. \]

Therefore, in order to obtain the required sample size, p needs to be estimated (for example, from
surveys conducted in the past). The value p=0.5 can be used if no prior knowledge exists.
Example 4: Suppose the percentage of local newspaper subscribers with a salary of $60,000 or more is
required. There are 25,000 subscribers, and the bound on the error of estimation is 0.05. Since no prior
knowledge exists, p is set to 0.5 and the required sample size is

\[ n = \frac{4 \cdot 25{,}000 \cdot 0.5\,(1 - 0.5)}{(25{,}000 - 1)\,(0.05)^2 + 4 \cdot 0.5\,(1 - 0.5)} \approx 394. \]
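The proportion formula can be checked numerically with a short sketch. The function name is hypothetical, and rounding up to a whole unit is an assumption of this example.

```python
import math


def srs_sample_size_proportion(N, B, p=0.5):
    """n = 4 N p(1-p) / ((N-1) B^2 + 4 p(1-p)), rounded up.

    p = 0.5 is the conservative default when no prior estimate exists,
    because p(1-p) is largest there.
    """
    variance = p * (1 - p)
    n = 4 * N * variance / ((N - 1) * B**2 + 4 * variance)
    return math.ceil(n)


n_default = srs_sample_size_proportion(25_000, B=0.05)        # no prior knowledge
n_prior = srs_sample_size_proportion(25_000, B=0.05, p=0.2)   # hypothetical prior
```

With N = 25,000 and B = 0.05 the default p = 0.5 gives 394 units; any other prior value of p gives a smaller requirement.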
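Estimating Population Mean, Population Total, and Bound on Error of Estimation

Estimating the population mean:

\[ \hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \]

with bound on the error of estimation

\[ 2\sqrt{\frac{s^2}{n}\left(\frac{N-n}{N}\right)}, \quad \text{where} \quad s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}; \]

Estimating the population total:

\[ \hat{\tau} = N\bar{x} = \frac{N\sum_{i=1}^{n} x_i}{n}, \]

with bound on the error of estimation

\[ 2\sqrt{N^2\,\frac{s^2}{n}\left(\frac{N-n}{N}\right)}, \quad \text{where} \quad s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}; \]

Estimating the population proportion:

\[ \hat{p} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \]

with bound on the error of estimation

\[ 2\sqrt{\frac{\hat{p}\hat{q}}{n-1}\left(\frac{N-n}{N}\right)}, \quad \text{where} \quad \hat{q} = 1 - \hat{p}. \]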
Stratified Sampling
The stratified sampling model is a suitable choice when the population of interest is large and when
the members of the sample frame can be subdivided into heterogeneous segments (especially if
incentives differ among segments). By forming representative groups that parallel the entire
population in some key characteristics, and adding partial sums rather than individually sampled
points, estimates of higher precision are obtained.
Sample Size for Estimating Population Mean and Population Total
The sample size n that is required to estimate the population mean \(\mu\) with bound of error B
can be found by setting two standard deviations of the estimator equal to B: \(2\sqrt{V(\bar{x})} = B\).
The variance of the estimator \(\bar{x}\) for large N can be approximated by

\[ V(\bar{x}) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2\left(\frac{N_i - n_i}{N_i}\right)\left(\frac{\sigma_i^2}{n_i}\right), \]

where L is the number of strata, \(N_i\) is the size of the population in stratum i, \(\sigma_i^2\) is the
population variance in stratum i, and \(N = N_1 + N_2 + \dots + N_L\) is the population size. Similarly,
the sample size n required to estimate the population total is calculated by setting
\(2\sqrt{V(N\bar{x})} = 2N\sqrt{V(\bar{x})} = B\). Let \(n_i = n\,w_i\), where \(w_i\) is the
fraction of the sample n allocated to stratum i; then, solving for n, the following formula is obtained:

\[ n = \frac{\sum_{i=1}^{L} N_i^2\sigma_i^2 / w_i}{N^2 D + \sum_{i=1}^{L} N_i\sigma_i^2}, \]

where

\[ D = \frac{B^2}{4} \quad \text{for estimating the population mean, and} \]

\[ D = \frac{B^2}{4N^2} \quad \text{for estimating the population total.} \]
It is often the case that a different cost \(c_i\) of observation is associated with each stratum
i. In order to minimize cost let

\[ w_i = \frac{N_i\sigma_i/\sqrt{c_i}}{\sum_{k=1}^{L} N_k\sigma_k/\sqrt{c_k}}, \]

which gives

\[ n = \frac{\left(\sum_{k=1}^{L} N_k\sigma_k/\sqrt{c_k}\right)\left(\sum_{i=1}^{L} N_i\sigma_i\sqrt{c_i}\right)}{N^2 D + \sum_{i=1}^{L} N_i\sigma_i^2}. \]

If the costs are unknown or equal, \(c_1 = c_2 = \dots = c_L\), then

\[ w_i = \frac{N_i\sigma_i}{\sum_{k=1}^{L} N_k\sigma_k} \]

and

\[ n = \frac{\left(\sum_{k=1}^{L} N_k\sigma_k\right)^2}{N^2 D + \sum_{i=1}^{L} N_i\sigma_i^2}. \]

(This method is known as Neyman allocation.)
Example 5: Suppose a sample across 5 strata is to be chosen. Given that the budget for the
survey is $2,000, choose the sample size and allocation that minimize \(V(\bar{x})\). N=2306.

Stratum    N_i    σ_i     c_i
A          220    5.26    $20
B          412    4.72    $5
C          375    6.29    $11
D          778    3.29    $11
E          521    7.35    $5
Table 2. Stratum definition
First calculate \(w_i\), i = 1, 2, 3, 4 and 5, using

\[ w_i = \frac{N_i\sigma_i/\sqrt{c_i}}{\sum_{k=1}^{5} N_k\sigma_k/\sqrt{c_k}}. \]

\[ \sum_{k=1}^{5} \frac{N_k\sigma_k}{\sqrt{c_k}} = \frac{220 \cdot 5.26}{\sqrt{20}} + \frac{412 \cdot 4.72}{\sqrt{5}} + \frac{375 \cdot 6.29}{\sqrt{11}} + \frac{778 \cdot 3.29}{\sqrt{11}} + \frac{521 \cdot 7.35}{\sqrt{5}} = 4323.91 \]

\[ w_1 = \frac{220 \cdot 5.26/\sqrt{20}}{4323.91} = .06, \quad w_2 = \frac{412 \cdot 4.72/\sqrt{5}}{4323.91} = .20, \quad w_3 = \frac{375 \cdot 6.29/\sqrt{11}}{4323.91} = .16, \]

\[ w_4 = \frac{778 \cdot 3.29/\sqrt{11}}{4323.91} = .18, \quad w_5 = \frac{521 \cdot 7.35/\sqrt{5}}{4323.91} = .40 \]

Because the total cost is $2,000, it must be that \(c_1 n_1 + c_2 n_2 + c_3 n_3 + c_4 n_4 + c_5 n_5 = \$2{,}000\).
Substituting \(n_i = n\,w_i\),

\[ 20n(.06) + 5n(.20) + 11n(.16) + 11n(.18) + 5n(.40) = 2{,}000 \;\Rightarrow\; n = \frac{2000}{7.94} \approx 251.9. \]

In order to keep the cost below $2,000, n = 251 is chosen. The allocation per stratum
is \(n_1 = 15,\ n_2 = 50,\ n_3 = 40,\ n_4 = 45,\ n_5 = 101\).
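The allocation in Example 5 can be reproduced with a few lines. This sketch assumes that the weights divide by the square root of each cost, which is the reading that yields the stated total of 4323.91; the function names are hypothetical, and unrounded weights are used when solving for n.

```python
import math

# Table 2: stratum -> (N_i, sigma_i, cost per observation c_i)
strata = {
    "A": (220, 5.26, 20),
    "B": (412, 4.72, 5),
    "C": (375, 6.29, 11),
    "D": (778, 3.29, 11),
    "E": (521, 7.35, 5),
}


def cost_optimal_weights(strata):
    """w_i proportional to N_i * sigma_i / sqrt(c_i)."""
    terms = {k: N * s / math.sqrt(c) for k, (N, s, c) in strata.items()}
    total = sum(terms.values())
    return {k: t / total for k, t in terms.items()}


def sample_size_for_budget(strata, weights, budget):
    """Largest whole n with sum(c_i * n * w_i) <= budget."""
    cost_per_unit = sum(c * weights[k] for k, (_, _, c) in strata.items())
    return math.floor(budget / cost_per_unit)


weights = cost_optimal_weights(strata)
rounded = {k: round(w, 2) for k, w in weights.items()}
n = sample_size_for_budget(strata, weights, budget=2000)
```

Strata that are large, variable and cheap to observe (such as E) receive the largest share of the sample, while the expensive stratum A receives the smallest.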
Sample Size for Estimating Population Proportion
As in the discussion given for Simple Random Sampling, the population proportion p can be viewed as
the population mean \(\mu\) of 1's and 0's (where a=1/a=0 corresponds to the presence/absence of
attribute a in a population unit). The sample size n that is required to estimate the population mean
\(\mu\) with bound of error B and stratum variances \(\sigma_i^2\) is given by

\[ n = \frac{4\sum_{i=1}^{L} N_i^2\sigma_i^2 / w_i}{N^2 B^2 + 4\sum_{i=1}^{L} N_i\sigma_i^2}. \]

Substituting \(p_i(1-p_i)\) for \(\sigma_i^2\) we get

\[ n = \frac{4\sum_{i=1}^{L} N_i^2\,p_i(1-p_i) / w_i}{N^2 B^2 + 4\sum_{i=1}^{L} N_i\,p_i(1-p_i)}. \]

Accordingly, the fraction of the sample allocated to stratum i is

\[ w_i = \frac{N_i\sqrt{p_i(1-p_i)/c_i}}{\sum_{k=1}^{L} N_k\sqrt{p_k(1-p_k)/c_k}}. \]
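Estimating Population Mean, Population Total, and Bound on Error of Estimation

Estimating the population mean:

\[ \bar{x} = \frac{1}{N}\sum_{i=1}^{L} N_i\bar{x}_i, \]

with bound on the error of estimation

\[ 2\sqrt{\frac{1}{N^2}\sum_{i=1}^{L} N_i^2\left(\frac{N_i - n_i}{N_i}\right)\left(\frac{s_i^2}{n_i}\right)}, \quad \text{where} \quad s_i^2 = \frac{\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2}{n_i - 1}; \]

Estimating the population total:

\[ \hat{\tau} = \sum_{i=1}^{L} N_i\bar{x}_i, \]

with bound on the error of estimation

\[ 2\sqrt{\sum_{i=1}^{L} N_i^2\left(\frac{N_i - n_i}{N_i}\right)\left(\frac{s_i^2}{n_i}\right)}, \quad \text{where} \quad s_i^2 = \frac{\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2}{n_i - 1}; \]

Estimating the population proportion:

\[ \hat{p} = \frac{1}{N}\sum_{i=1}^{L} N_i\hat{p}_i, \]

with bound on the error of estimation

\[ 2\sqrt{\frac{1}{N^2}\sum_{i=1}^{L} N_i^2\left(\frac{N_i - n_i}{N_i}\right)\left(\frac{\hat{p}_i\hat{q}_i}{n_i - 1}\right)}, \quad \text{where} \quad \hat{q}_i = 1 - \hat{p}_i. \]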
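The stratified mean and its bound can be sketched as follows. The function name, the input layout (a list of stratum sizes paired with their observations) and the data are hypothetical choices for this example.

```python
import math
from statistics import mean, variance


def stratified_mean_estimate(strata):
    """strata is a list of (N_i, observations_i) pairs.

    Returns (estimate, bound), where the estimate is the N_i-weighted average
    of the stratum means and the bound is two estimated standard errors.
    """
    N = sum(N_i for N_i, _ in strata)
    estimate = sum(N_i * mean(xs) for N_i, xs in strata) / N
    var_sum = 0.0
    for N_i, xs in strata:
        n_i = len(xs)
        var_sum += N_i**2 * ((N_i - n_i) / N_i) * (variance(xs) / n_i)
    bound = 2 * math.sqrt(var_sum / N**2)
    return estimate, bound


# Hypothetical data: two strata of 100 and 300 units, 3 observations each.
strata = [
    (100, [1.0, 2.0, 3.0]),
    (300, [10.0, 12.0, 14.0]),
]
estimate, bound = stratified_mean_estimate(strata)
total_estimate = 400 * estimate  # the total is N times the mean
```

Each stratum mean is weighted by its population size N_i and not by its sample size n_i, so a stratum that was sampled lightly still counts in proportion to its true size.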
Stratified Quota Sampling
The stratified quota sampling model is a suitable choice when the sampling frame is skewed with
respect to the targeted population.
Sample Size for Estimating Population Mean, Population Total and
Population Proportion
In Stratified Quotas Sampling the number of strata (cross-tabulation cells) more often than not
dramatically outgrows the cardinality of the sample frame. To see this, consider, say, 7 variables
each defined by 5 categories. Effectively, sample frame units are divided into $5^7 = 78125$
sub-groups, where a sub-group is defined by population units of equal inclusion probability.
Acquiring a sample of size, say, 2000 leaves at least 78125 - 2000 = 76125 (or about 97.4%) of the
sub-groups empty. Also, knowing or estimating the population variance for each sub-group is
infeasible in an online environment. Instead, it is reasonable to assume that the σ's and costs are
alike across sub-groups. That being the case, the Neyman allocation formula reduces to the
allocation formula for Simple Random Sampling when estimating sample size for population mean and
population total. Similarly, when calculating sample size for population proportion, it is
reasonable to assume that the p's are alike, in which case the formula for calculating sample size
for estimating population proportion simplifies to the one given for estimating population
proportion in Simple Random Sampling.
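The cell-count argument can be checked with a few lines of arithmetic. The variable names below are illustrative only.

```python
variables = 7        # number of stratification variables
categories = 5       # categories per variable
sample_size = 2000   # size of the acquired sample

cells = categories ** variables      # number of cross-tabulation cells
min_empty = cells - sample_size      # lower bound on the number of empty cells
share_empty = min_empty / cells      # share of cells that must stay empty

print(cells, min_empty, f"{share_empty:.1%}")
```

Even if every sampled unit fell into a different cell, more than 97% of the cells would remain empty.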
Estimating Population Mean, Population Total, and Bound on Error of
Estimation
By the same argument as in the previous section, the estimates are approximated by the
corresponding formulas in the Simple Random Sampling section.
Final Note: If the calculated sample sizes for the variables of interest are relatively close, the
researcher should use the largest calculated value as the sample size (and then adjust it for the
anticipated return rate, see the following section). However, if there is substantial variation
among the calculated values (and research is conducted on a limited budget), researchers should
relax the desired standard of precision in order to allow the use of a smaller sample size.
Adjustment Factors
Due to the voluntary nature of the collection methods in social science, response rates are
below 100% (usually ranging from 20% to 80%). Hence, it is common practice to oversample in order
to obtain the required sample size (not the minimum calculated) and/or to adjust Stratified Quotas
Sampling totals so that the projected response rate is taken into account. Researchers have
utilized different techniques to estimate response rate (for example, conducting a pilot or a
two-step study (see footnote 7), reviewing the literature for similar populations, etc.), but the
most common approach is to use response rates from previous studies of the same (and/or similar)
population, field window (see footnote 8) and key outcome variables, if such are available. In an
online panel environment, it is the job of an automated sampling tool to collect such data and to
incorporate them in the calculation of the adjusted sample size. In fact, once the projected
response rate is properly estimated, the Bayes' theorem calculation simplifies to a single
division, as shown in the following example:
Let the required minimum sample size be 1000 and the anticipated return rate be 70%. Then the
sample size, N, adjusted for return rate is

$$N = \frac{1000}{0.70} \approx 1430.$$
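The single division can be wrapped in a small helper. This is a sketch only: the function name `adjusted_sample_size` is hypothetical, and rounding up to a whole unit is an assumption (it gives 1429 here, which the example rounds to roughly 1430).

```python
from math import ceil

def adjusted_sample_size(required, response_rate):
    """Number of units to invite so that `required` responses are expected."""
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    # Round up: a fractional invitation cannot be sent.
    return ceil(required / response_rate)

invitations = adjusted_sample_size(1000, 0.70)
print(invitations)
```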
Weighting
The Weighting process aims to improve the relation between the sample and the population by
adjusting the sampling weights of the population units in the sample so that the marginal totals of
the adjusted weights on specified variables of interest agree with the corresponding population
totals [1]. The most common applications of the Weighting process are reduction of the Sampling
Inconsistency (sample frame vs. population) and adjustment for Non-response and Non-coverage
Biases. Throughout the literature the Weighting process is also referred to as Raking or Sample
Balancing [1, 3]. The adjustment itself is commonly achieved through various iterative methods,
such as Iterative Proportional Fitting (IPF) [2]. The easiest way to explain IPF is by example.
Example 6: Consider a sample of cardinality 20 and a study return rate of 80%, i.e., 16
respondents. The goal is to adjust the sampling weights (originally set to one for each population
unit) to compensate for non-response bias. Suppose that the variables of interest, Gender and
Province, are each defined by two categories, (Male -- 60%, Female -- 40%) and (BC -- 70%,
AL -- 30%) respectively. Consider the cell counts given in the two-dimensional cross-tabulation
shown in Figure 4.
With respect to category (Male) of the variable Gender, the number of respondents is 10 (the first
value in each cell of the (Male) column), while the required total is 20*0.6=12; similarly, the
number of respondents for the category (Female) is 6, while the required total is 20*0.4=8. With
respect to the variable Province, the required totals for (AL) and (BC) are 6 and 14, while the
numbers of respondents are 6 and 10 respectively.

Footnote 7: Conduct a first step (contact just a small portion of the sample) and use the resulting
return rate to estimate the number of responses that is to be expected from the second step.
Footnote 8: The length of the study.
Formal Definition

$$cw_{ij}^{(0)} = w_{ij}$$
$$cw_{ij}^{(1)} = cw_{ij}^{(0)} \cdot \left( t_{i\cdot} / cw_{i\cdot}^{(0)} \right)$$
$$cw_{ij}^{(2)} = cw_{ij}^{(1)} \cdot \left( t_{\cdot j} / cw_{\cdot j}^{(1)} \right)$$

where i = {1, 2, …, Total Number Of Columns}
and j = {1, 2, …, Total Number Of Rows}

Example IPF
Each cell lists three values: respondents | value after the Gender adjustment | value after the
Province adjustment. Values are computed from unrounded intermediate results and rounded to two
decimals.

                    Gender (Male)        Gender (Female)      Required Total
Province (AL)       4 | 4.80 | 3.86      2 | 2.67 | 2.14       6
Province (BC)       6 | 7.20 | 8.04      4 | 5.33 | 5.96      14
Required Total      12                   8                    20

Column adjustments (Gender):
(Male): 12/10 = 1.2; 4 * 1.2 = 4.8; 6 * 1.2 = 7.2
(Female): 8/6 ≈ 1.33; 2 * 1.33 ≈ 2.67; 4 * 1.33 ≈ 5.33

Row adjustments (Province):
(AL): 6/7.47 ≈ 0.80; 4.80 * 0.80 ≈ 3.86; 2.67 * 0.80 ≈ 2.14
(BC): 14/12.53 ≈ 1.12; 7.20 * 1.12 ≈ 8.04; 5.33 * 1.12 ≈ 5.96

Figure 4. Weighting process example
IPF calculation progresses one variable at a time. Calculations for each category of the same
variable are carried out independently. With respect to Example 6, the first step of the procedure
proportionally adjusts the cell counts of the (Male) category (i.e., column) according to the
formulas given in the Formal Definition block (for details, refer to the computation listed for the
(Male) column). The obtained values are 4.8 and 7.2, the second value in each cell of the (Male)
column. Note that 4.8 + 7.2 = 12, precisely the required column total. The next step adjusts the
individual cells of the (Female) column to the required total of 8. After these first two steps,
each marginal of the variable Gender perfectly matches its required total, but the row marginals
(variable Province) still differ from their corresponding totals. The next two steps (see the row
adjustments listed with Figure 4) adjust the row marginals. The obtained values are the third value
in each cell. At this point the row marginals match their required totals exactly, while the column
marginals (3.86 + 8.04 = 11.90 and 2.14 + 5.96 = 8.10) are within roughly 1% of the required totals
12 and 8. Final weights are as follows:
(Male)(AL): 3.86/4 = 0.965
(Male)(BC): 8.04/6 = 1.340
(Female)(AL): 2.14/2 = 1.07
(Female)(BC): 5.96/4 = 1.490
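The procedure walked through above can be sketched as a short Python function. This is a minimal two-variable illustration under stated assumptions, not a production raking routine: the function name `ipf`, the relative tolerance check on the column marginals, and the `max_cycles` cap are all choices made for this example. On the Example 6 counts, one cycle yields adjusted cells of roughly 3.86, 2.14, 8.04 and 5.96.

```python
def ipf(counts, row_targets, col_targets, tol=0.05, max_cycles=100):
    """Rake a 2-D table of cell counts to the required marginal totals.

    counts[r][c] holds the respondents in row r, column c. Columns are
    adjusted first and rows second, as in Example 6. Returns the adjusted
    table and the number of cycles performed.
    """
    table = [[float(v) for v in row] for row in counts]
    n_rows, n_cols = len(table), len(table[0])
    for cycle in range(1, max_cycles + 1):
        # Column step: scale each column to its required total.
        for c in range(n_cols):
            factor = col_targets[c] / sum(table[r][c] for r in range(n_rows))
            for r in range(n_rows):
                table[r][c] *= factor
        # Row step: scale each row to its required total.
        for r in range(n_rows):
            factor = row_targets[r] / sum(table[r])
            for c in range(n_cols):
                table[r][c] *= factor
        # Rows now match exactly; compare column marginals with the tolerance.
        worst = max(
            abs(sum(table[r][c] for r in range(n_rows)) - col_targets[c])
            / col_targets[c]
            for c in range(n_cols)
        )
        if worst <= tol:
            return table, cycle
    raise RuntimeError("IPF did not converge within max_cycles")

# Example 6: rows are Province (AL, BC), columns are Gender (Male, Female).
counts = [[4, 2],
          [6, 4]]
adjusted, cycles = ipf(counts, row_targets=[6, 14], col_targets=[12, 8])
# Final weight of a cell = adjusted value / number of respondents in the cell.
weights = [[adjusted[r][c] / counts[r][c] for c in range(2)] for r in range(2)]
print(cycles, [[round(w, 3) for w in row] for row in weights])
```

The sketch assumes every row and column contains at least one respondent; an empty margin would cause a division by zero and would need collapsing before raking.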
In general, the Weighting process proceeds until “convergence” is achieved, that is, until the
difference between the required totals and the obtained marginals is within a predefined tolerance
(the variable tolerance level), usually set at 5%. Convergence is not always attainable. To see
this, it is helpful to view IPF as a method for determining solutions to a system of linear
equations.
4·X1 + 6·Y1 = 12
2·X2 + 4·Y2 = 8
4·X1 + 2·X2 = 6
6·Y1 + 4·Y2 = 14
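Example 6 yields a system of four equations in four unknowns. (Strictly, only three of the four
equations are independent, because the two column equations and the two row equations both sum to
the grand total of 20; IPF resolves the remaining freedom by preserving the association structure
of the original cell counts.) However, if a researcher infers that 3 rather than 2 dichotomous
variables are closely related to key outcome variables, the corresponding system of linear
equations becomes markedly underdetermined.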
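The structure of the four-equation system can be inspected directly. The sketch below computes the rank of its coefficient matrix with exact fractions; the helper name `rank` and the ordering of the unknowns (X1, Y1, X2, Y2) are assumptions made for this illustration. It indicates that only three of the four equations are independent, since the two Gender equations and the two Province equations add up to the same grand total.

```python
from fractions import Fraction

def rank(matrix):
    """Rank of a matrix by Gaussian elimination over exact fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    found = 0  # number of pivot rows found so far
    for col in range(len(rows[0])):
        pivot = next(
            (r for r in range(found, len(rows)) if rows[r][col] != 0), None
        )
        if pivot is None:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for r in range(len(rows)):
            if r != found and rows[r][col] != 0:
                factor = rows[r][col] / rows[found][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[found])]
        found += 1
    return found

# Coefficients of X1, Y1, X2, Y2 in the four equations of Example 6.
A = [
    [4, 6, 0, 0],   # (Male) total   = 12
    [0, 0, 2, 4],   # (Female) total = 8
    [4, 0, 2, 0],   # (AL) total     = 6
    [0, 6, 0, 4],   # (BC) total     = 14
]
print(rank(A))
```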
Figure 5. Weighting process with 3 variables: Gender (MALE, FEMALE), Province (BC, AL),
Employed (YES, NO)
The total number of cells in the cube shown in Figure 5 (i.e., the number of unknowns) is equal to
$2^3 = 8$, while the number of required totals is 2·3 = 6. (The same could be achieved in the
2-variable example by increasing the number of categories in the Province variable from 2 to 4.)
In other words, Weighting can be viewed as an instance of the well-known problem of finding a
maximal nonnegative solution (i.e., the nonnegative x with the most non-zeros satisfying y = Ax),
where $y \in R^d$, $x \in R^n$, A is a sparse d×n matrix, d < n, and y is considered known but x is
unknown. Because the number of unknowns grows exponentially with the number of variables (and
multiplicatively with the number of categories per variable), while the number of required totals
grows only additively, it is to be expected that iterative methods applied to a large number of
variables may fail to converge. It is left to researchers to formulate the proper granularity for
a particular study (for further guidance please refer to the subsequent section).
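The gap between unknowns and required totals can be tabulated for any design. In this sketch the function name `cells_and_totals` is hypothetical; it returns the number of cells (the product of the category counts) and the number of required marginal totals (their sum).

```python
def cells_and_totals(categories_per_variable):
    """Return (number of cells, number of required marginal totals)."""
    cells = 1
    for k in categories_per_variable:
        cells *= k
    return cells, sum(categories_per_variable)

for design in ([2, 2], [2, 2, 2], [2, 4], [5] * 10):
    print(design, cells_and_totals(design))
```

For ten variables of five categories each, fewer than a hundred required totals constrain several million cells.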
The importance of the Weighting process can be seen through the following example:
Example 7: Suppose that research is conducted to determine the proportion of people who are
willing to purchase new product A rather than well-established product B. Focusing only on the
female respondents from Example 6, suppose that a single population unit in (Female)(AL) answered
positively while the rest gave a negative response. Without performing the Weighting process one
might conclude that 1/6 ≈ 17% of women are willing to purchase the “new” product A. However, after
the non-response adjustment is carried out, the percentage drops to 1.07/8 ≈ 13%. If the threshold
for a successful launch of product A was set at 15%, it is easy to see how non-response bias could
steer the researcher in the wrong direction.
In practice, IPF provides a solid building block for adjusting cell values to required totals.
Combined with proper heuristics and the computational power of today's computers, the Weighting
process can be completed in a matter of seconds on ten (or even more) variables, even in an online
environment.
Convergence
As noted earlier, when research requires specification of ten or more variables of interest,
each consisting of several categories (for example Province), the Weighting process may fail to
converge. In such cases it is recommended to collapse “slow converging” categories of less
important variables, change the variable tolerance level, or, in an extreme case, remove some of
the variables completely. However, deciding which variables are causing non-convergence is not an
easy task. Battaglia et al. suggest in [1] that once “non-convergence” is detected it is helpful to
view a plot of the logarithm of the absolute value of the difference between the adjusted cell
values (category marginal totals) and the required totals.
Figure 7. Converging process (x-axis: Number of Cycles; y-axis: Log10 difference)
Figure 7 portrays a converging process involving 8 variables having on average 6 categories (where
each category is represented by a different color). The x-axis gives the cycle number (a new cycle
commences once adjustment has been performed for each category of each variable), and the y-axis is
the log10 of the absolute value of the difference between the adjusted marginal totals and the
required totals according to the predefined tolerance level (see the Multidimensionality and
Tolerance section for details on tolerance levels).
Figure 8. Non-converging process (x-axis: Number of Cycles; y-axis: Log10 difference)
On the other hand, Figure 8 shows a highly skewed non-converging process involving only 4
variables having on average 4 categories. By inspection, non-converging categories are easily
singled out (see the lines that do not cross the x-axis in Figure 8). This information, along with
the variable type (nominal, ordinal, etc.), allows the researcher to decide which categories ought
to be collapsed to improve the probability of successful completion once the Weighting process is
repeated. In practice, a non-convergence detection algorithm should take no more than a few
seconds, and the plot data should be available in a file format supported by readily available
statistical packages (for example csv).
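The diagnostic described above can be sketched as follows. Everything here is illustrative: the function names, the per-cycle marginal history and the rule used to flag a category (its log10 difference is still above zero in the last recorded cycle, i.e. its line has not crossed the x-axis) are assumptions, not part of the method in [1].

```python
import csv
import io
from math import log10

def log_differences(marginals_by_cycle, targets, floor=1e-12):
    """Return rows of (cycle, category, log10 |adjusted - required|)."""
    rows = []
    for cycle, marginals in enumerate(marginals_by_cycle, start=1):
        for category, value in marginals.items():
            # The floor avoids log10(0) once a marginal matches exactly.
            diff = max(abs(value - targets[category]), floor)
            rows.append((cycle, category, log10(diff)))
    return rows

def non_converging(rows):
    """Categories whose log10 difference is above 0 in the last cycle."""
    last = max(cycle for cycle, _, _ in rows)
    return sorted({cat for cycle, cat, v in rows if cycle == last and v > 0})

def to_csv(rows):
    """Serialize the plot data as CSV text for a statistical package."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["cycle", "category", "log10_difference"])
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical marginal totals recorded after each of three cycles.
targets = {"Male": 12, "BC": 14}
history = [
    {"Male": 9.0, "BC": 40.0},
    {"Male": 11.5, "BC": 35.0},
    {"Male": 11.95, "BC": 34.0},
]
rows = log_differences(history, targets)
print(non_converging(rows))
print(to_csv(rows))
```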
References
[1] Battaglia, MP, Izrael, D, Hoaglin, DC, and Frankel, MR (2004), “Tips and Tricks for Raking
Survey Data (a.k.a. Sample Balancing).”, Fifty-Ninth Annual AAPOR Conference Program, Public Opin
Q. 2004; 68: 451-480.
[2] Deming, WE and Stephan, FF (1940), “On a Least Squares Adjustment of a Sampled Frequency
Table When the Expected Marginal Totals are Known.”, Annals of Mathematical Statistics, 11,
427-444.
[3] Izrael, D, Hoaglin, DC, and Battaglia, MP (2000), “A SAS Macro for Balancing a Weighted
Sample.”, Paper 258 SUGI (SAS Users Group International) 25.
[4] Scheaffer, RL, Mendenhall, W, and Ott, L (1986), “Elementary Survey Sampling.”, 3rd ed.,
Boston: Duxbury Press.