TECHNICAL PAPER
Sampling and Weighting

©Copyright 2000-2005 Vision Critical Communications Inc.
1750 - 1111 West Georgia Street, Vancouver, BC, V6E 4N5
http://www.visioncritical.com

Table of Contents

Introduction
Sampling Strategies
    Probability Sampling
    Non-probability Sampling
Stratified Quotas Sampling
    Multidimensionality and Tolerance
Mutually Exclusive Samples
Frequency of Inclusion
Sample Size and Estimates
    Simple Random Sampling
        Sample Size for Estimating Population Mean and Population Total
        Sample Size for Estimating Population Proportion
        Estimating Population Mean, Population Total, and Bound on Error of Estimation
    Stratified Sampling
        Sample Size for Estimating Population Mean and Population Total
        Sample Size for Estimating Population Proportion
        Estimating Population Mean, Population Total, and Bound on Error of Estimation
    Stratified Quota Sampling
        Sample Size for Estimating Population Mean, Population Total and Population Proportion
        Estimating Population Mean, Population Total, and Bound on Error of Estimation
Adjustment Factors
Weighting
    Convergence
References

Introduction

Information from surveys¹ has a large effect on every facet of our everyday lives. Recorded measurements shape a whole range of policies, covering government, the economy and social programs. Businesses conduct surveys for their internal operations and, more importantly, to formulate crucial management decisions. One area of business activity that relies heavily on surveying techniques is marketing. Decisions such as which products should be marketed, in which area, and most importantly at what price are regularly made on the basis of survey data.
An ideal opinion poll would gather information from all members of the population of interest. In most cases, however, the population, and hence the cost of conducting such a poll, is too large for the researcher to examine every unit. In fact, the very first recorded opinion poll, conducted by The Harrisburg Pennsylvanian in 1824, gathered information from only a portion of the population of interest. It showed Andrew Jackson leading John Quincy Adams by 335 votes to 169 in the race for the United States presidency. Unscientific surveys then grew in popularity but remained mostly local until 1916, when the Literary Digest embarked on a national survey and correctly predicted that Woodrow Wilson would be the next president of the United States. In 1936, however, sampling bias caught up with the Literary Digest's surveying practice. The esteemed journal wrongly reported that Alf Landon was likely to defeat Franklin D. Roosevelt, based on information collected from its readership (circulation at the time was estimated at 2.5 million). At the same time, George Gallup conducted a far smaller but more scientifically based survey, in which he polled a demographically representative sample and correctly predicted Roosevelt's victory. Needless to say, the Literary Digest went out of business shortly afterwards and the era of “scientific surveys”² began.

Since in most cases the objective of modern surveying is inference, it is of utmost importance that the medium of inference, the sample, is chosen carefully so that it can be used to represent the population. In fact, sampling is a major operational step in any statistically valid survey, quality control study, measurement of the accuracy of records, or any other situation in which a conclusion is drawn from an inspection of a fragment of a population.

¹ The word "survey" is used most often to describe a method of gathering information from a sample of population units.
² The term “scientific survey” is restricted to those studies that produce analytical information about society for the needs of social or economic decision-making, scientific research or international comparisons.

However, a potential obstacle to inference, even when the sampling step is completed correctly, is unit non-response (the situation in which the characteristics of the population units that responded to a particular survey differ from those present in the targeted population). The most efficient way to minimize non-response bias is to perform Weighting (i.e., post-stratification). Weighting is the process of assigning weights to respondents so that the marginal totals of the weights on specified characteristics agree with the corresponding totals for the population [1]. Once the weights are applied, the collected data match the overall characteristics of the population. Most researchers today routinely apply weights to survey respondents when under-response is not substantial. This is because weight adjustments, when used judiciously, bring the overall proportions of respondents in line with the targeted population, thereby reducing the effect of non-response bias.

Sampling Strategies

The distinction between a representative sample and an arbitrary subset of population units is best described by defining the two major models of data use: descriptive and inferential statistics. Descriptive statistics summarizes a collection of data in a clear and understandable way, but it does not attempt to go beyond the data set (the sample response) in order to make inferences. Inferential statistics, on the other hand, provides conclusions that extend beyond the data. That is, inferential statistics makes inferences from the sample about the population from which it was drawn. To accomplish this task, members of the sample must accurately reflect the characteristics of the population they represent.
Hence, for the purpose of inferential statistics, a sample cannot be just any subset of population units; it must be representative of the population. Sampling techniques in social science can be divided into two major categories:

Probability Sampling - Sampling models that utilize some form of random selection.

Non-Probability Sampling - Sampling models in which the selection of population units is arbitrary or subjective. Non-probability samples cannot depend upon the rationale of probability theory.

Probability Sampling

For a sampling model to be considered a probability sampling model, each population unit must have a known probability of being included in the sample. This allows for the statistical projection of characteristics from the sample to the population of interest. The most common probability sampling models used in an online panel environment are as follows:

[Figure 1. Simple Random Sample]

Random Sampling - Any sort of sampling where, in advance of the selection of the sample, each member of the population has a calculable and non-zero chance of selection.

Simple Random Sampling - Sampling model in which each member of the population has the same probability of being chosen. Moreover, the sample is drawn in such a way that every possible sample of the same size has the same chance of being selected.

Stratified Random Sampling - Sampling model in which samples are obtained by grouping the members of the population into non-overlapping sub-groups (i.e., strata) and then selecting a simple random sample from each sub-group. This method often improves the representativeness of the sample by reducing sampling error.

[Figure 2. Stratified Random Sample]

Stratified Quotas Sampling - Sampling model in which the population is first segmented into mutually exclusive sub-groups, just as in stratified sampling.
However, each sub-group is defined by setting quotas on the categorical variables of interest (for example gender, age, employment, etc.) to ensure a proper mix of different social groups. The final sample consists of simple random samples drawn from each sub-group.

Other models used in social science are Systematic, Clustered and Multistage Sampling.

Figures 1, 2 and 3 depict the differences among the probability sampling techniques. Members of the sample are the red-colored units. The categorical variable of interest, “Province”, has 13 categories. (See the top right corner of Figure 2.) The population of interest for the sampling models depicted in Figures 1 and 2 consists of Canadian residents, while the population of interest for the sampling model depicted in Figure 3 consists of potential customers of an imaginary franchise, say “Samples R us”, which has 30% of its locations across BC, 30% across AL, 15% across SK and 25% across QC. The sample frame³ consists of Canadian residents.

³ The list of members of the population from which one selects the actual sample for the survey. Ideally, the sample frame contains every member of the target population. More often than not, however, there are substantial differences.

[Figure 3. Stratified Quotas Sample]

Note that the sample in Figure 1 is not a representative sample. This is because Simple Random Sampling does not guarantee proportionality for a finite population (and sample) size. In fact, this model is not recommended if a variable of interest is highly segmented and the difference between sample and population size is large. Comparing Figures 1 and 2, one can see that in this particular example Stratified Sampling has reduced sampling error, thereby improving the representativeness of the sample. Finally, Figure 3 depicts the scenario in which neither of the two traditional models would produce a representative sample because of the inherent bias in the sample frame.
In this case Stratified Quota Sampling is an appropriate model since it eliminates the existing bias. (For further details please refer to the Stratified Quota Sampling section.)

Non-probability sampling

Non-probability sampling strategies in general cannot be used to infer from the sample to the population. Non-probability sampling models are considerably less expensive, but the results obtained from a non-probability study are of limited value and must be treated with caution when used for anything but descriptive statistics. The most common non-probability sampling methods used in an online panel environment are as follows:

Convenience Sampling - Sampling model in which members of the population are chosen based on their relative ease of access.

Snowball Sampling - Sampling model in which the first respondent refers a friend, who then refers a friend, and so on.

Purposive (Panel Filter) Sampling - Sampling model in which the researcher chooses the sample based on who they think would be appropriate for the study.

Ad Hoc Quotas Sampling - A quota is established (say 70% men and 30% women) and researchers are free to choose any respondent they wish as long as the quota is met.

There are numerous other non-probability sampling models used in social science (for example Modal or Expert sampling).

Stratified Quotas Sampling

The origins of Stratified Quota Sampling can be traced to the sample-balancing problem defined by Deming back in 1940. The US Census Bureau was required to derive a cross-tabulation for the joint distribution of two (or more) variables. However, the joint distribution was available only from sample data, while the distribution of each individual variable was available for the whole population. Sample-balancing is a process which adjusts individual cell counts (sample data) to marginal totals (census data).

When presented with a form of quota sampling, most traditional researchers restrict data analysis to descriptive statistics.
This is because, in its truest form, quota sampling does not require any randomness (see the definition of Ad Hoc Quotas Sampling). However, when the sample frame is highly skewed with respect to the targeted population (as is often the case for web-based panels), Stratified Quotas, if used sensibly, actually reduce sampling bias and therefore yield a much more representative sample. To clarify this point, consider the following example.

Example 1: Suppose that the three categorical variables of interest for our study are “Gender”, “Age” and “Education”. “Gender” has two categories: (Male) and (Female). Let “Age” have three categories: (1-19), (20-64) and (65+), and let “Education” have five categories: (No High School), (High School), (Trade or Diploma), (Non Degree College) and (Some University).

Suppose that the distribution with respect to the variables of interest is as follows, with the share among panel members listed first and the census share⁴ listed second:

Gender: (Male: 38%, 48%), (Female: 62%, 52%);
Age: ((1-19): 12%, 22%), ((20-64): 78%, 46%), ((65+): 10%, 32%);
Education: (No High School: 13%, 22%), (High School: 12%, 24%), (Trade or Diploma: 18%, 13%), (Non Degree College: 25%, 18%), (Some University: 32%, 23%).

Moreover, suppose that our research requires us to conduct a scientifically valid survey in which the targeted population consists of Canadian residents. If we were to use any standard probability model, it is likely that the resulting sample would be significantly skewed with respect to census data. Since the distribution of any representative sample needs to be in accordance with census data, not the sample frame, we have to set appropriate quotas on the variables of interest (in fact matching census data), thereby eliminating (or at least reducing; for an example see Table 1) sampling bias.
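The degree of skew in Example 1 can be made concrete with a short Python sketch. For each category it divides the census share by the panel share; a ratio above 1 marks a category that is under-represented on the panel relative to its quota. The function name `representation_ratios` and the dictionary layout are assumptions made for this illustration, and the ratio is only a simple diagnostic, not the calculation performed by the Stratified Quotas Sampling module.

```python
# Shares (in percent) from Example 1: panel (sample frame) and census targets.
PANEL = {
    "Gender": {"Male": 38, "Female": 62},
    "Age": {"1-19": 12, "20-64": 78, "65+": 10},
    "Education": {
        "No High School": 13, "High School": 12, "Trade or Diploma": 18,
        "Non Degree College": 25, "Some University": 32,
    },
}
CENSUS = {
    "Gender": {"Male": 48, "Female": 52},
    "Age": {"1-19": 22, "20-64": 46, "65+": 32},
    "Education": {
        "No High School": 22, "High School": 24, "Trade or Diploma": 13,
        "Non Degree College": 18, "Some University": 23,
    },
}


def representation_ratios(panel, census):
    """Return census share / panel share for every category of every variable."""
    return {
        variable: {cat: census[variable][cat] / share for cat, share in cats.items()}
        for variable, cats in panel.items()
    }


ratios = representation_ratios(PANEL, CENSUS)
for variable, cats in ratios.items():
    for cat, ratio in cats.items():
        # ratio > 1: category is scarcer on the panel than in the census
        print(f"{variable:<10} {cat:<20} {ratio:.2f}")
```

In this data the (65+) category has the largest ratio (32 / 10 = 3.2), so panelists aged 65 and over are the scarcest relative to the census quotas.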
Moreover, if we define an outlier as an observation that lies an abnormal distance from the other values in a representative sample from a population, a properly implemented Stratified Quota Sampling module automatically isolates and excludes outliers from the sample.

Variable    Category             Sample Frame    Population Distribution    Sample
                                 Distribution    (Quotas)                   Distribution
Gender      Male                 38%             48%                        50%
            Female               62%             52%                        50%
Age         1-19                 12%             22%                        22%
            20-64                78%             46%                        45%
            65+                  10%             32%                        33%
Education   No High School       13%             22%                        20%
            High School          12%             24%                        26%
            Trade or Diploma     18%             13%                        12%
            Non Degree College   25%             18%                        20%
            Some University      32%             23%                        22%

Table 1: Output from a Stratified Quotas Sampling module.

⁴ The second figure in each pair (the red digits in the print layout) is census Canada data.

It is important to note that in practice, unlike Simple Random Sampling and traditional Stratified Sampling, a Stratified Quotas Sampling module does not necessarily produce a sample of the required size and/or the requested distribution. It is not hard to imagine a sample frame (list of panelists) so dissimilar to the targeted population that it is highly unlikely a representative sample can be devised.

Multidimensionality and Tolerance

Stratified Quota Sampling produces a sample by substituting the equal inclusion probabilities of units in the sample frame with unequal inclusion probabilities calculated with respect to the specified quotas. In other words, the Simple Random Sampling model within a stratum is replaced with the Random Sampling model.

In complex studies the number of quotas (i.e., the number of variables of interest and/or the number of categories per variable) tends to increase the complexity of producing a representative sample. The problem of multidimensionality (often referred to as the “Curse of Multidimensionality”) is present in many areas of data analysis, such as Segmentation, Conjoint Analysis and Cross-Tabulation. As the number of quotas increases, the task of calculating proper inclusion probabilities becomes more demanding. Hence, it is important to choose variables of interest wisely.
Ideal candidates are those variables which are closely related to key survey outcome variables. (For technical details on multidimensionality issues please refer to the Weighting section, as a panelist's inclusion probability is closely related to its weight.)

In practice, not all variables influence key survey outcome variables equally. For example, the variables “Salary”, “Gender”, “Age” and “Education” are all related to a variable such as “Computer Literacy”, but it might be reasonable to assume that “Education” and “Age” are somewhat more closely related to it than “Gender” and “Salary”. To ease the burden of multidimensionality, the Stratified Quota Sampling module allows different tolerance levels to be specified for variables, where tolerance refers to an acceptable difference between sample and population totals. For example, if the quotas for the categories (Male) and (Female) of the variable “Gender” are set at 48% and 52% respectively and the tolerance level is 5%, the distribution in the representative sample must agree with the specified quotas to within 5 percentage points. Consequently, the percentage of males in a representative sample (call it Pm) must lie between 43 and 53, while the percentage of females (call it Pf) must lie between 47 and 57, where Pm + Pf = 100 (see the rightmost two columns of Table 1). Combining tolerance levels with quotas (especially when the sample frame is significantly skewed with respect to the targeted population) allows more flexibility, thereby increasing the likelihood of obtaining a representative sample.

Mutually Exclusive Samples

As Cross-Sectional⁵-like designs grew in popularity throughout the polling community, so did the need for an automated tool capable of devising multiple mutually exclusive⁶ representative samples. When such samples are produced sequentially and the sample frame contains a large number of population units, the task is computationally manageable.
However, when the sample frame is limited and representative samples need to be produced simultaneously (sometimes utilizing different sampling models), the complexity of the problem soon becomes substantial. In fact, producing a single optimal representative sample with the Stratified Quota Sampling model is a hard problem (see the Weighting section for details), so simultaneously producing multiple mutually exclusive representative samples with Stratified Quota Sampling only adds to the complexity.

Since in most cases in an online panel environment the sample frame is highly restrictive, researchers who have the choice are advised to draw mutually exclusive samples sequentially (especially if one of the samples requires the Stratified Quota Sampling module). To see this, consider a scenario in which the sample frame has cardinality 10,000 and two mutually exclusive yet similarly defined “Stratified Quota Samples”, each of cardinality 1,000, are needed. If the samples are drawn in parallel, the restrictions of the sample frame mean that the obtainable representative samples would most likely be smaller than required (making both samples inadequate for research purposes). If they are drawn sequentially, there is a much greater chance that at least one of the samples will have the required number of units.

Frequency of Inclusion

The key to maintaining a high response rate in an online panel environment is developing an automated tool which helps researchers maximize the likelihood of respondents accepting an invitation to participate while minimizing the respondent burden of going through the study.

⁵ A study that collects measurements on a population over time by repeating the same survey on two or more occasions. During each time period, a separate but comparable representative sample of population units is drawn from the population.
⁶ Mutually exclusive samples have no population units in common.
To increase the likelihood of acceptance, the goal is not to overburden population units with research requests while still maintaining high coverage of the sample frame. An automated tool needs to monitor panelists' activity continuously in order to adjust their inclusion probability accordingly (for example, more recent behavioral patterns ought to be weighted more heavily than past ones because of the evolving and dynamic nature of web-based panels). This is an essential feature, since it relieves researchers of the responsibility to filter out population units based on their inclusion frequency (and/or response rate) prior to sampling, allowing them to shift their focus toward question development and project management.

Sample Size and Estimates

Sample size is the number of observations included in the sample in order to make inferences about the targeted population with the required precision. If the sample is too large it will certainly carry greater precision, but it will also waste resources. Conversely, small samples are less costly but may produce erroneous results. It is therefore important to determine the proper size of the sample. In general, key survey outcome variables estimate a population total, mean or percentage. The following sections describe methods for determining the proper sample size for the common probability sampling models.

Simple Random Sampling

The Simple Random Sampling model is a suitable choice when the population of interest is relatively small and the sampling frame is complete and up to date.

Sample Size for Estimating Population Mean and Population Total

The sample size n required to estimate the population mean with bound of error B can be found by setting two standard deviations of the estimator equal to B:

$$2\sqrt{V(\bar{x})} = B,$$

where the variance of the estimator $\bar{x}$ is given by

$$V(\bar{x}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)$$

and N is the size of the targeted population.
Similarly, the sample size n required to estimate the population total is calculated by setting

$$2\sqrt{V(N\bar{x})} = 2N\sqrt{V(\bar{x})} = B.$$

Solving for n, the following formula is obtained:

$$n = \frac{N\sigma^2}{(N-1)D + \sigma^2},$$

where

$$D = \frac{B^2}{4}$$ for estimating the population mean, and

$$D = \frac{B^2}{4N^2}$$ for estimating the population total.

However, the population variance $\sigma^2$ is unknown and needs to be estimated from prior knowledge.

Example 2: Suppose the average salary of a local newspaper subscriber is to be estimated. There are 25,000 subscribers, typical salaries range from $30,000 to $70,000, and the error of estimation is $500. The range is frequently taken to be about four standard deviations, $4\sigma$, giving

$$\sigma \approx \frac{\text{range}}{4} = \frac{\$70{,}000 - \$30{,}000}{4} = \$10{,}000.$$

Therefore, the required sample size is

$$n = \frac{4(25{,}000)(10{,}000)^2}{(25{,}000-1)(500)^2 + 4(10{,}000)^2} \approx 1{,}504.$$

Example 3: Suppose the total salary of the 25,000 local newspaper subscribers is to be estimated with an error of estimation of $10,000,000. As in the previous example, the population standard deviation $\sigma$ is estimated at $10,000. Then the required sample size is

$$n = \frac{4(25{,}000)^3(10{,}000)^2}{(25{,}000-1)(10{,}000{,}000)^2 + 4(25{,}000)^2(10{,}000)^2} \approx 2{,}273.$$

Sample Size for Estimating Population Proportion

A population proportion is the fraction of the targeted population that possesses some specified characteristic. Because each population unit either possesses or does not possess a particular attribute a, the setting has the characteristics of a binomial experiment in which a = 1 and a = 0 correspond to the presence and absence of attribute a respectively. The population proportion p can therefore be viewed as the population mean of 1's and 0's. The sample size n required to estimate a population mean with bound of error B and population variance $\sigma^2$ is given by

$$n = \frac{4N\sigma^2}{(N-1)B^2 + 4\sigma^2}.$$

Substituting $p(1-p)$ for $\sigma^2$ we get

$$n = \frac{4Np(1-p)}{(N-1)B^2 + 4p(1-p)}.$$

Therefore, in order to obtain the required sample size, p needs to be estimated (for example, from surveys conducted in the past).
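The sample-size formulas for Simple Random Sampling can be collected in a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and the functions return the unrounded value of n so that the caller decides how to round.

```python
def srs_sample_size(N, variance, D):
    """n = N * sigma^2 / ((N - 1) * D + sigma^2), returned unrounded."""
    return N * variance / ((N - 1) * D + variance)


def n_for_mean(N, sigma, B):
    """Sample size for estimating a population mean: D = B^2 / 4."""
    return srs_sample_size(N, sigma ** 2, B ** 2 / 4)


def n_for_total(N, sigma, B):
    """Sample size for estimating a population total: D = B^2 / (4 N^2)."""
    return srs_sample_size(N, sigma ** 2, B ** 2 / (4 * N ** 2))


def n_for_proportion(N, B, p=0.5):
    """Sample size for estimating a proportion: sigma^2 = p (1 - p), D = B^2 / 4."""
    return srs_sample_size(N, p * (1 - p), B ** 2 / 4)


# Population of 25,000 units, bound of error 0.05, no prior estimate of p.
print(round(n_for_proportion(N=25_000, B=0.05)))  # 394
```

The default p = 0.5 maximizes p(1 - p), so it gives the largest (most conservative) sample size for a given N and B.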
A value of p = 0.5 can be used if no prior knowledge exists.

Example 4: Suppose the percentage of local newspaper subscribers with a salary of $60,000 or more is required. There are 25,000 subscribers and the error of estimation is 0.05. Since no prior knowledge exists, p is set to 0.5 and the required sample size is

$$n = \frac{4(25{,}000)(0.5)(1-0.5)}{(25{,}000-1)(0.05)^2 + 4(0.5)(1-0.5)} \approx 394.$$

Estimating Population Mean, Population Total, and Bound on Error of Estimation

Estimating the population mean:

$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,$$

with bound on error of estimation

$$2\sqrt{\frac{s^2}{n}\left(\frac{N-n}{N}\right)}, \quad \text{where } s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1};$$

Estimating the population total:

$$\hat{\tau} = N\bar{x} = \frac{N}{n}\sum_{i=1}^{n} x_i,$$

with bound on error of estimation

$$2\sqrt{N^2\,\frac{s^2}{n}\left(\frac{N-n}{N}\right)}, \quad \text{where } s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1};$$

Estimating the population proportion:

$$\hat{p} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,$$

with bound on error of estimation

$$2\sqrt{\frac{\hat{p}\hat{q}}{n-1}\left(\frac{N-n}{N}\right)}, \quad \text{where } \hat{q} = 1-\hat{p}.$$

Stratified Sampling

The Stratified Sampling model is a suitable choice when the population of interest is large and the members of the sample frame can be subdivided into heterogeneous segments (especially if incentives differ among segments). By forming representative groups that parallel the entire population in some key characteristics, and by adding partial sums rather than individually sampled points, estimates of higher precision are obtained.

Sample Size for Estimating Population Mean and Population Total

The sample size n required to estimate the population mean with bound of error B can be found by setting two standard deviations of the estimator equal to B:

$$2\sqrt{V(\bar{x})} = B.$$

For large N, the variance of the estimator $\bar{x}$ can be approximated by

$$V(\bar{x}) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2\left(\frac{N_i-n_i}{N_i}\right)\left(\frac{\sigma_i^2}{n_i}\right),$$

where L is the number of strata, $N_i$ is the size of the population in stratum i, $n_i$ is the size of the sample in stratum i, $\sigma_i^2$ is the population variance in stratum i, and $N = N_1 + N_2 + \dots + N_L$ is the population size. Similarly, the sample size n required to estimate the population total is calculated by setting

$$2\sqrt{V(N\bar{x})} = 2N\sqrt{V(\bar{x})} = B.$$
Let $n_i = n w_i$, where $w_i$ is the fraction of the sample n allocated to stratum i. Solving for n, the following formula is obtained:

$$n = \frac{\sum_{i=1}^{L} N_i^2\sigma_i^2 / w_i}{N^2 D + \sum_{i=1}^{L} N_i\sigma_i^2},$$

where

$$D = \frac{B^2}{4}$$ for estimating the population mean, and

$$D = \frac{B^2}{4N^2}$$ for estimating the population total.

It is often the case that a different cost of observation $c_i$ is associated with each stratum i. In order to minimize cost, let

$$w_i = \frac{N_i\sigma_i/\sqrt{c_i}}{\sum_{k=1}^{L} N_k\sigma_k/\sqrt{c_k}},$$

which gives

$$n = \frac{\left(\sum_{k=1}^{L} N_k\sigma_k/\sqrt{c_k}\right)\left(\sum_{i=1}^{L} N_i\sigma_i\sqrt{c_i}\right)}{N^2 D + \sum_{i=1}^{L} N_i\sigma_i^2}.$$

If the costs are unknown or equal, $c_1 = c_2 = \dots = c_L$, then

$$w_i = \frac{N_i\sigma_i}{\sum_{k=1}^{L} N_k\sigma_k} \quad \text{and} \quad n = \frac{\left(\sum_{k=1}^{L} N_k\sigma_k\right)^2}{N^2 D + \sum_{i=1}^{L} N_i\sigma_i^2}.$$

(This method is known as Neyman allocation.)

Example 5: Suppose a sample across 5 strata is to be chosen. Given that the budget for the survey is $2,000, choose the sample size and allocation that minimize $V(\bar{x})$. N = 2306.

Stratum    N_i    σ_i     c_i
A          220    5.26    $20
B          412    4.72    $5
C          375    6.29    $11
D          778    3.29    $11
E          521    7.35    $5

Table 2. Stratum definition

First calculate $w_i$, i = 1, 2, 3, 4 and 5, using

$$w_i = \frac{N_i\sigma_i/\sqrt{c_i}}{\sum_{k=1}^{5} N_k\sigma_k/\sqrt{c_k}}.$$

The denominator is

$$\sum_{k=1}^{5} \frac{N_k\sigma_k}{\sqrt{c_k}} = \frac{220(5.26)}{\sqrt{20}} + \frac{412(4.72)}{\sqrt{5}} + \frac{375(6.29)}{\sqrt{11}} + \frac{778(3.29)}{\sqrt{11}} + \frac{521(7.35)}{\sqrt{5}} = 4323.91,$$

so that

$$w_1 = \frac{220(5.26)/\sqrt{20}}{4323.91} = .06, \quad w_2 = \frac{412(4.72)/\sqrt{5}}{4323.91} = .20, \quad w_3 = \frac{375(6.29)/\sqrt{11}}{4323.91} = .16,$$

$$w_4 = \frac{778(3.29)/\sqrt{11}}{4323.91} = .18, \quad w_5 = \frac{521(7.35)/\sqrt{5}}{4323.91} = .40.$$

Because the total cost is $2,000, it must hold that

$$c_1 n_1 + c_2 n_2 + c_3 n_3 + c_4 n_4 + c_5 n_5 = \$2{,}000.$$

Substituting $n_i = n w_i$,

$$20n(.06) + 5n(.20) + 11n(.16) + 11n(.18) + 5n(.40) = 2{,}000, \qquad n = \frac{2000}{7.94} \approx 251.9.$$

In order to keep the cost below $2,000, n = 251 is chosen. The allocation per stratum is $n_1 = 15$, $n_2 = 50$, $n_3 = 40$, $n_4 = 45$, $n_5 = 101$.

Sample Size for Estimating Population Proportion

As in the discussion given for Simple Random Sampling, the population proportion p can be viewed as the population mean of 1's and 0's (where a = 1 and a = 0 correspond to the presence and absence of attribute a in a population unit).
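The cost-based allocation fractions from Example 5 apply unchanged to proportions once each $\sigma_i$ is replaced by $\sqrt{p_i(1-p_i)}$, so a small Python sketch of the calculation may be useful at this point. The function names are hypothetical, and taking the floor of the budget divided by the cost per sampled unit is an assumption chosen to keep the total cost within the budget.

```python
import math


def allocation_fractions(sizes, sigmas, costs):
    """w_i proportional to N_i * sigma_i / sqrt(c_i), normalized to sum to 1."""
    terms = [N_i * s_i / math.sqrt(c_i)
             for N_i, s_i, c_i in zip(sizes, sigmas, costs)]
    total = sum(terms)
    return [t / total for t in terms]


def sample_size_for_budget(fractions, costs, budget):
    """Largest n with n * sum(c_i * w_i) <= budget."""
    cost_per_unit = sum(c * w for c, w in zip(costs, fractions))
    return math.floor(budget / cost_per_unit)


# Strata A-E from Table 2.
sizes = [220, 412, 375, 778, 521]
sigmas = [5.26, 4.72, 6.29, 3.29, 7.35]
costs = [20, 5, 11, 11, 5]

w = allocation_fractions(sizes, sigmas, costs)
n = sample_size_for_budget(w, costs, budget=2000)
print([round(x, 2) for x in w])  # [0.06, 0.2, 0.16, 0.18, 0.4]
print(n)                         # 251
```

For proportions, the same functions can be called with `sigmas = [math.sqrt(p * (1 - p)) for p in ps]`, where `ps` holds the assumed stratum proportions.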
The sample size n required to estimate the population mean with bound of error B and stratum variances $\sigma_i^2$ is given by

$$n = \frac{4\sum_{i=1}^{L} N_i^2\sigma_i^2 / w_i}{N^2B^2 + 4\sum_{i=1}^{L} N_i\sigma_i^2}.$$

Substituting $p_i(1-p_i)$ for $\sigma_i^2$ we get

$$n = \frac{4\sum_{i=1}^{L} N_i^2\,p_i(1-p_i) / w_i}{N^2B^2 + 4\sum_{i=1}^{L} N_i\,p_i(1-p_i)}.$$

Accordingly, the fraction of the sample allocated to stratum i is

$$w_i = \frac{N_i\sqrt{p_i(1-p_i)/c_i}}{\sum_{k=1}^{L} N_k\sqrt{p_k(1-p_k)/c_k}}.$$

Estimating Population Mean, Population Total, and Bound on Error of Estimation

Estimating the population mean:

$$\hat{\mu} = \bar{x} = \frac{1}{N}\sum_{i=1}^{L} N_i\bar{x}_i,$$

with bound on error of estimation

$$2\sqrt{\frac{1}{N^2}\sum_{i=1}^{L} N_i^2\left(\frac{N_i-n_i}{N_i}\right)\left(\frac{s_i^2}{n_i}\right)}, \quad \text{where } s_i^2 = \frac{\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_i)^2}{n_i-1};$$

Estimating the population total:

$$\hat{\tau} = \sum_{i=1}^{L} N_i\bar{x}_i,$$

with bound on error of estimation

$$2\sqrt{\sum_{i=1}^{L} N_i^2\left(\frac{N_i-n_i}{N_i}\right)\left(\frac{s_i^2}{n_i}\right)}, \quad \text{where } s_i^2 = \frac{\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_i)^2}{n_i-1};$$

Estimating the population proportion:

$$\hat{p} = \frac{1}{N}\sum_{i=1}^{L} N_i\hat{p}_i,$$

with bound on error of estimation

$$2\sqrt{\frac{1}{N^2}\sum_{i=1}^{L} N_i^2\left(\frac{N_i-n_i}{N_i}\right)\left(\frac{\hat{p}_i\hat{q}_i}{n_i-1}\right)}, \quad \text{where } \hat{q}_i = 1-\hat{p}_i.$$

Stratified Quota Sampling

The Stratified Quota Sampling model is a suitable choice when the sampling frame is skewed with respect to the targeted population.

Sample Size for Estimating Population Mean, Population Total and Population Proportion

In Stratified Quotas Sampling the number of different strata (cross-tabulation cells) more often than not dramatically outgrows the cardinality of the sample frame. To see this, consider, say, 7 variables each defined by 5 categories. Effectively, the sample frame units are divided into $5^7 = 78125$ sub-groups, where a sub-group is defined by population units of equal inclusion probability. Acquiring a sample of size, say, 2000 leaves at least 78125 - 2000 = 76125 (or 97.4%) of the sub-groups empty. Also, knowing or estimating the population variance for each sub-group is infeasible in an online environment. Instead, it is reasonable to assume that the σ's and costs are alike across sub-groups.
That being the case, the Neyman allocation formula reduces to the formula for Simple Random Sampling when estimating the sample size for the population mean and population total. Similarly, when calculating the sample size for a population proportion, it is reasonable to assume that the p's are alike, in which case the formula simplifies to the one given for estimating a population proportion in Simple Random Sampling.

Estimating Population Mean, Population Total, and Bound on Error of Estimation

By the same argument as in the previous section, the estimates are approximated by the corresponding formulas in the Simple Random Sampling section.

Final Note: If the calculated sample sizes for the variables of interest are relatively close, the researcher should use the largest calculated value as the sample size (and then adjust it for the anticipated return rate, as described in the following section). However, if the calculated values vary considerably (and the research is conducted on a limited budget), researchers should relax the desired standard of precision in order to allow the use of a smaller sample size.

Adjustment Factors

Because collection methods in social science are voluntary, response rates are below 100% (usually ranging from 20% to 80%). Hence, it is common practice to oversample in order to obtain the required sample size (not merely the minimum calculated) and/or to adjust Stratified Quotas Sampling totals so that the projected response rate is taken into account. Researchers have used different techniques to estimate the response rate (for example, conducting a pilot or a two-step⁷ study, or reviewing the literature for a similar population), but the most common approach is to use response rates from previous studies of the same (and/or a similar) population, field window⁸ and key outcome variables, if such are available.
In an online panel environment, it is the job of an automated sampling tool to collect such data and to incorporate them in the calculation of the adjusted sample size. In fact, once the projected response rates are properly estimated, Bayes' theorem simplifies to a single division, as shown in the following example: Let the required minimum sample size be 1000 and the anticipated return rate be 70%. Then the sample size, N, adjusted for the return rate is

$$N = \frac{1000}{0.70} \approx 1430.$$

Weighting

The Weighting process aims to improve the relation between the sample and the population by adjusting the sampling weights of the population units in the sample so that the marginal totals of the adjusted weights on specified variables of interest agree with the corresponding population totals [1]. The most common applications of the Weighting process are reduction of sampling inconsistency (sample frame vs. population) and adjustment for non-response and non-coverage biases. Throughout the literature the Weighting process is also referred to as Raking or Sample Balancing [1, 3]. The adjustment itself is commonly achieved through various iterative methods, such as Iterative Proportional Fitting (IPF) [2]. The easiest way to explain IPF is by example.

Example 6: Consider a sample of cardinality 20 and a study return rate of 80%. The goal is to adjust the sampling weights (originally set to one for each population unit) to compensate for non-response bias. Suppose that the variables of interest, Gender and Province, are each defined by two categories: (Male -- 60%, Female -- 40%) and (AL -- 30%, BC -- 70%) respectively. Consider the cell counts given in the two-dimensional cross-tabulation shown in Figure 4. With respect to the category (Male) of the variable Gender, the number of respondents is 10 (refer to the leftmost of the three values in each cell), while the required total is 20*0.6 = 12; similarly, the number of respondents for the category (Female) is 6, while the required total is 20*0.4 = 8.
With respect to variable Province, the required category totals are 6 and 14, while the numbers of respondents are 6 and 10 respectively.

7 Conduct a first step (contact just a small portion of the sample) and use the resulting return rate to estimate the number of responses to be expected from the second step.
8 The length of the study.

Formal Definition

$$cw_{ij}^{(0)} = w_{ij}, \qquad cw_{ij}^{(1)} = cw_{ij}^{(0)} \cdot \frac{t_{i\cdot}}{cw_{i\cdot}^{(0)}}, \qquad cw_{ij}^{(2)} = cw_{ij}^{(1)} \cdot \frac{t_{\cdot j}}{cw_{\cdot j}^{(1)}},$$

where i = {1, 2, …, Total Number Of Columns} and j = {1, 2, …, Total Number Of Rows}, $w_{ij}$ is the initial value of the cell in column i and row j, $t_{i\cdot}$ and $t_{\cdot j}$ are the required totals of column i and row j, and $cw_{i\cdot}^{(k)}$ and $cw_{\cdot j}^{(k)}$ are the column and row sums of the cell values at step k.

Example IPF

Each cell lists three values: original count / after column (Gender) adjustment / after row (Province) adjustment.

| Province       | Gender (Male)   | Gender (Female) | Required Total |
|----------------|-----------------|-----------------|----------------|
| (AL)           | 4 / 4.80 / 3.86 | 2 / 2.67 / 2.14 | 6              |
| (BC)           | 6 / 7.20 / 8.04 | 4 / 5.33 / 5.96 | 14             |
| Required Total | 12              | 8               | 20             |

Column adjustments: (Male) 12/10 = 1.2, 4 * 1.2 = 4.8, 6 * 1.2 = 7.2; (Female) 8/6 = 1.333, 2 * 1.333 = 2.667, 4 * 1.333 = 5.333.
Row adjustments: (AL) 6/7.467 = 0.804, 4.8 * 0.804 = 3.86, 2.667 * 0.804 = 2.14; (BC) 14/12.533 = 1.117, 7.2 * 1.117 = 8.04, 5.333 * 1.117 = 5.96.

Figure 4. Weighting process example

The IPF calculation progresses one variable at a time. Calculations for each category of the same variable are carried out independently. With respect to Example 6, the first step of the procedure proportionally adjusts the cell counts of the (Male) category (i.e., column) according to the formulas given in the Formal Definition block (for details, refer to the column adjustments listed with Figure 4). The obtained values are 4.8 and 7.2, the middle values in the cells of the (Male) column. Note that 4.8 + 7.2 = 12, precisely the required column total. The next iteration adjusts the individual cells of the (Female) column to the required total of 8. After the first two iterations, each marginal of the variable Gender perfectly matches the required totals, but the row marginals (variable Province) still differ from their corresponding totals. The next two iterations (see the row adjustments listed with Figure 4) adjust the row marginals. The obtained values are the rightmost values in the cells. At this point the row marginals match their required totals exactly, while the column marginals (3.86 + 8.04 = 11.90 and 2.14 + 5.96 = 8.10) are within 1.5% of the required totals of 12 and 8.

Final weights are as follows:

(Male)(AL) 3.86/4 = 0.965
(Male)(BC) 8.04/6 = 1.340
(Female)(AL) 2.14/2 = 1.07
(Female)(BC) 5.96/4 = 1.490

In general, the Weighting process proceeds until "convergence" is achieved, that is, until the difference between the required totals and the obtained marginals is within a predefined difference (i.e., the variable tolerance level), which is usually set at 5%.

It is not always the case that convergence is attainable. To see this, it is helpful to view IPF as a method for determining solutions to a system of linear equations. For Example 6, let X1, Y1, X2 and Y2 denote the weights of the cells (Male)(AL), (Male)(BC), (Female)(AL) and (Female)(BC):

4·X1 + 6·Y1 = 12
2·X2 + 4·Y2 = 8
4·X1 + 2·X2 = 6
6·Y1 + 4·Y2 = 14

Example 6 yields a system with as many equations as unknowns, of which only three are independent (the first pair and the second pair of equations both add up to 20); IPF selects the solution obtained by proportional scaling of the original cell counts. However, if a researcher infers that 3 rather than 2 dichotomous variables are closely related to the key outcome variables, the number of unknowns clearly exceeds the number of equations and the corresponding system of linear equations is underdetermined.

[Figure 5. Weighting process with 3 variables: a cube with axes Gender (Male, Female), Province (BC, AL) and Employed (Yes, No).]

The total number of cells in the cube shown in Figure 5 (i.e., the number of unknowns) is equal to $2^3 = 8$, while the number of required totals is 2·3 = 6. (The same could be achieved in the 2-variable example by increasing the number of categories in the Province variable from 2 to 4.) In other words, Weighting can be viewed as an instance of the well-known problem of finding a maximal nonnegative solution (i.e., the nonnegative x with the most non-zeros satisfying y = Ax), where $y \in R^d$, $x \in R^n$, A is a sparse d×n matrix, d < n, and y is considered known but x is unknown. Because the number of unknowns grows exponentially as the number of variables (and/or categories per variable) increases, it is to be expected that iterative methods on a large number of variables may fail to converge. It is left to researchers to formulate the proper granularity for a particular study (for further guidance, refer to the subsequent section).
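The IPF procedure and the tolerance-based stopping rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular sampling tool: the function name ipf, the parameters tol and max_cycles, the relative-difference form of the tolerance check, and the use of NumPy are all choices made here for the example, and the sketch assumes that every cell count is positive. The input is the Example 6 table, with rows (AL), (BC) and columns (Male), (Female).

```python
import numpy as np


def ipf(counts, col_targets, row_targets, tol=0.05, max_cycles=100):
    """Rake a 2-D table of cell counts to required column and row totals.

    One cycle adjusts every column, then every row. The loop stops when all
    marginals are within the relative tolerance `tol` of their targets.
    Returns the adjusted cells, the per-unit weights, and the cycles used.
    """
    counts = np.asarray(counts, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    row_targets = np.asarray(row_targets, dtype=float)
    cells = counts.copy()
    for cycle in range(1, max_cycles + 1):
        # Scale each column so that it sums to its required total.
        cells *= col_targets / cells.sum(axis=0)
        # Scale each row so that it sums to its required total.
        cells *= (row_targets / cells.sum(axis=1))[:, None]
        # Relative difference between obtained and required marginals.
        col_gap = np.abs(cells.sum(axis=0) - col_targets) / col_targets
        row_gap = np.abs(cells.sum(axis=1) - row_targets) / row_targets
        if max(col_gap.max(), row_gap.max()) <= tol:
            # Weight of a unit = adjusted cell value / original cell count.
            return cells, cells / counts, cycle
    raise RuntimeError("IPF did not converge within max_cycles")


# Example 6: rows are (AL), (BC); columns are (Male), (Female).
counts = [[4, 2], [6, 4]]
cells, weights, cycles = ipf(counts, col_targets=[12, 8], row_targets=[6, 14])
print(np.round(cells, 2))
print(np.round(weights, 3))
print(cycles)
```

With the default 5% tolerance the sketch stops after a single cycle on this table; passing a smaller tol lets it continue until the column marginals also match their required totals to the requested precision.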
The importance of the Weighting process can be seen through the following example:

Example 7: Suppose that research is conducted in order to determine the proportion of people who are willing to purchase a new product A over a well-established product B. Focusing only on the female customers from Example 6, suppose that a single population unit (Female)(AL) answered positively while the rest gave a negative response. Without performing the Weighting process one might conclude that 1/6 ≈ 17% of women are willing to purchase the "new" product A. However, after the non-response adjustment is carried out, the percentage drops to 1.07/8 ≈ 13%. If the threshold for a successful launch of product A was set at 15%, it is easy to see how non-response bias could steer the researcher in the wrong direction.

In practice IPF provides a solid building block for adjusting cell values to required totals. Combined with proper heuristics and the computational power of today's computers, even in an online environment the Weighting process can be completed in a matter of seconds on ten (or even more) variables.

Convergence

As noted earlier, when research requires the specification of ten or more variables of interest, each consisting of several categories (for example, Province), the Weighting process may fail to converge. In such cases it is recommended to collapse the "slow converging" categories of less important variables, to change the variable tolerance level, or, in an extreme case, to remove some of the variables completely. However, deciding which variables are causing non-convergence is not an easy task. Battaglia et al. have suggested in [1] that once "non-convergence" is detected it is helpful to view a plot showing the logarithm of the absolute value of the difference between the adjusted cell values (category marginal totals) and the required totals.
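The data behind such a diagnostic plot can be produced with a few lines of Python. The sketch below is illustrative only: the function names rake_with_trace, stuck_categories and trace_to_csv, the CSV column names, the floor value used to avoid log10(0), and the rule that a category counts as non-converging when its log10 difference is still above zero after the last cycle are all assumptions made here, not part of the method in [1]. The second input table is hypothetical and was chosen so that its required totals cannot all be met.

```python
import csv
import io
import math

import numpy as np


def rake_with_trace(counts, targets_by_axis, cycles=20, floor=1e-12):
    """Run IPF on an n-dimensional table and record, after every full cycle,
    log10(|adjusted marginal - required total|) for each category."""
    cells = np.asarray(counts, dtype=float).copy()
    targets = [np.asarray(t, dtype=float) for t in targets_by_axis]
    trace = []  # rows of (cycle, axis, category, log10_gap)
    for cycle in range(1, cycles + 1):
        # One cycle: adjust every category of every variable (axis).
        for axis, target in enumerate(targets):
            other_axes = tuple(a for a in range(cells.ndim) if a != axis)
            shape = [1] * cells.ndim
            shape[axis] = -1
            cells *= (target / cells.sum(axis=other_axes)).reshape(shape)
        # Record the remaining gap of every category after the cycle.
        for axis, target in enumerate(targets):
            other_axes = tuple(a for a in range(cells.ndim) if a != axis)
            gaps = np.abs(cells.sum(axis=other_axes) - target)
            for category, gap in enumerate(gaps):
                # `floor` keeps log10 defined when the gap is exactly zero.
                trace.append((cycle, axis, category, math.log10(max(gap, floor))))
    return cells, trace


def stuck_categories(trace, threshold=0.0):
    """Categories whose log10 gap is still above `threshold` in the last cycle."""
    last_cycle = max(row[0] for row in trace)
    return [(axis, category) for cycle, axis, category, log_gap in trace
            if cycle == last_cycle and log_gap > threshold]


def trace_to_csv(trace):
    """Serialize the trace as CSV text for use in a statistical package."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["cycle", "axis", "category", "log10_gap"])
    writer.writerows(trace)
    return buffer.getvalue()


# Example 6 table (axis 0 = Gender, axis 1 = Province): converges.
_, ok_trace = rake_with_trace([[4, 6], [2, 4]], [[12, 8], [6, 14]])
# Hypothetical table with empty cells whose required totals conflict.
_, bad_trace = rake_with_trace([[5, 0], [0, 5]], [[6, 4], [4, 6]])

print(stuck_categories(ok_trace))
print(stuck_categories(bad_trace))
```

Plotting log10_gap against cycle, one line per (axis, category) pair, gives the kind of chart described in the text; categories reported by stuck_categories are candidates for collapsing.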
[Figure 7. Converging process. x-axis: Number of Cycles (1 to 141); y-axis: Log10 difference (-5 to 25).]

Figure 7 portrays a converging process involving 8 variables having on average 6 categories (where each category is represented by a different color). The x-axis gives the cycle number (a new cycle commences when the adjustment has been performed for each category of each variable), and the y-axis is the log10 of the absolute value of the difference between the adjusted marginal totals and the required totals according to the predefined tolerance level (see the Multidimensionality and Tolerance section for details on tolerance levels).

[Figure 8. Non-converging process. x-axis: Number of Cycles (1 to 51); y-axis: Log10 difference (-1 to 5).]

On the other hand, Figure 8 shows a highly skewed non-converging process involving only 4 variables having on average 4 categories. By inspection, the non-converging categories are easily singled out (see the lines that do not cross the x-axis in Figure 8). This information, along with the variable type (nominal, ordinal, etc.), allows the researcher to decide which categories ought to be collapsed to improve the probability of successful completion once the Weighting process is repeated. In practice, a non-convergence detection algorithm should take no more than a few seconds, and the plot data should be available in a file format supported by readily available statistical packages (for example, csv).

References

[1] Michael P. Battaglia, David Izrael, David C. Hoaglin, and Martin R. Frankel, "Tips and Tricks for Raking Survey Data (a.k.a. Sample Balancing)", Fifty-Ninth Annual AAPOR Conference Program, Public Opin Q. 2004; 68: 451-480.
[2] Deming, WE and Stephan, FF (1940), "On a Least Squares Adjustment of a Sampled Frequency Table When the Expected Marginal Totals are Known", Annals of Mathematical Statistics, 11, 427-444.
[3] Izrael, D, Hoaglin, DC, and Battaglia, MP (2000), "A SAS Macro for Balancing a Weighted Sample", Paper 258, SUGI (SAS Users Group International) 25.
[4] Richard L. Scheaffer, William Mendenhall, and Lyman Ott, "Elementary Survey Sampling", 3rd ed., Boston: Duxbury Press, 1986.