Cutoff vs. Design-Based Sampling and Inference For Establishment Surveys
James R. Knaub, Jr.
Electric Power Division, Energy Information Administration
Key words: classical ratio estimator, conditionality principle, model failure, probability
proportionate to size (PPS), randomization principle, regression, skewed data,
superpopulation, total survey error
Abstract:
Most sample surveys, especially household surveys, are design-based, meaning sampling
and inference (e.g., means and standard errors) are determined by a process of
randomization. That is, each member of the population has a predetermined probability
of being selected for a sample. Establishment surveys generally have many smaller
entities, and a relatively few large ones. Still, each can be assigned a given probability of
selection. However, an alternative may be to collect data from generally the largest
establishments – some being larger for some attributes than others – and use regression to
estimate for the remainder. For design-based sampling, or even for a census survey, such
models are often needed to impute for nonresponse. When such modeling would be
needed for many small respondents, generally if sample data are collected on a frequent
basis, but regressor (related) data are available for all of the population, then cutoff
sampling with regression used for inference may be a better alternative. Note that with
regression, one can always calculate an estimate of variance for an estimated total. (For
example, see Knaub(1996), and note Knaub(2007d).)
1. Introduction:
First, an elementary description of the design-based approach, using the principle of
randomization, is provided in section 2. That section ends by mentioning the kind of
design-based approaches used for establishment surveys. Section 3 provides some
references and a short transition to a cutoff approach with regression estimation,
described in section 4. A regression model approach to estimation for a cutoff sample
relies on the principle of conditionality (i.e., conditioning results on the portion of the
population selected). Following that, section 5 reviews the derivation of regression
weights, and section 6 concentrates on three cases described in Royall(1970). Section 7
further explores alternatives for regression weights, searching for an improvement.
Finally a brief history of the debate between the principles of randomization and
conditionality, that is, between a design-based approach and a model-based approach (see
Hansen, Madow, and Tepping(1978)) is given in section 8, with conclusions in section 9.
Extensive references are supplied only as a guide to further reading, and to acknowledge
sources, but the article may be read without interruption.
2. Design-Based Approach:
There are many more complicated variations of sampling (complex sampling) involving
multiple stages of sampling, probability proportional to size (PPS) sampling, and cluster
sampling, area frames, and network sampling. However, here a brief background will be
given for only the simplest design: simple random sampling without replacement,
SRSWOR. (Many readers may want to skim ahead.) The main purpose here is to
consider when cutoff sampling may be preferable to design-based sampling.
2.1 SRSWOR:
Often when this is meant, the abbreviation used is just SRS.
Consider a population of 1000 establishments (N = 1000). If a sample of n = 200 is taken
at random (no duplicates allowed) then the probability of selection is p=200/1000 = 20%
= 0.2. Probabilities of selecting given individuals at any step along the way are more
complicated, but overall, 20% of the population is chosen at random in this example, by
continuing to randomly select from the population until 20% of it has been chosen.
2.2 SRS Design Weights:
SRS design weights are the inverse of the selection probabilities. See Knaub(2007c).
Therefore, in this example, the selection probability is 0.2, and the design weight is 1/0.2
= 5. Thus, each observation in this example will represent five units from the population,
itself and four others.
2.3 Estimation of Total from SRS:
The sum of the n = 200 observations to be made in this example would be written as

Σ_{i=1}^{n} y_i = Σ_{i=1}^{200} y_i = t_s

Then an estimate, T̂, of T is T̂ = (1000/200) t_s = 5 t_s.
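As a sketch of the arithmetic above (the population values here are hypothetical, generated only for illustration):

```python
import random

# Hypothetical skewed "establishment" population, N = 1000
random.seed(0)
population = [random.lognormvariate(3, 1.5) for _ in range(1000)]

N = len(population)                    # 1000
n = 200
sample = random.sample(population, n)  # SRSWOR: no duplicates

p = n / N                              # selection probability = 0.2
design_weight = 1 / p                  # = 5; each sampled unit represents five

t_s = sum(sample)                      # sum of the n observations
T_hat = (N / n) * t_s                  # expansion estimate of the total, 5 * t_s
```

The lognormal population is only meant to mimic the skewness typical of establishment data; any list of values would do.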
2.4 SRS: Standard Error:
SRS assumes that the set of data from the population (or category within the population
for stratified SRS) from which the random sample is drawn, consists of data that are
distributed ‘evenly’ about the mean. That is, the mean and mode are nearly equal. The
scatter of data about the mean is measured by the standard error, which is the square root
of the mean of the sum of the squared differences between the observations and their
means:
se = sqrt( Σ_{i=1}^{n} (y_i − ȳ)² / n ),

where ȳ = Σ_{i=1}^{n} y_i / n.
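A direct transcription of the formula above (note the divisor n, exactly as written; the helper name is hypothetical):

```python
from math import sqrt

def scatter_about_mean(y):
    """Square root of the mean of the squared deviations from the mean,
    as written in section 2.4 (divisor n, not n - 1)."""
    n = len(y)
    y_bar = sum(y) / n
    return sqrt(sum((yi - y_bar) ** 2 for yi in y) / n)

# Example: mean is 5, mean squared deviation is 4, so the result is 2.0
scatter_about_mean([2, 4, 4, 4, 5, 5, 7, 9])
```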
2.5 Stratified Random Sampling Skewed Establishment Survey Data:
In the Electric Power Division (EPD) of the Energy Information Administration (EIA),
stratified random sampling was used, with auxiliary data (design-based ratio estimation)
for a short time, nearly two decades ago. See Knaub(1989). Use was made of the
Keyfitz method. See Cochran(1977). Collecting monthly data from the smallest
establishments was highly problematic.
2.6 Probability Proportionate to Size (PPS) Sampling - Skewed Establishment Survey
Data:
The EPD also considered PPS sampling for a time. See Knaub(1990). PPS sampling
weights small observations greatly. If data quality is low for those points, then the
detrimental effect can be substantial.
3. Some General References:
For further information on a variety of techniques, see Brewer(2002), Sarndal, Swensson,
and Wretman(1992), and Valliant, Dorfman, and Royall(2000), among others, as well as
classics such as Cochran(1977). A technique known as “calibration” may be used to
adjust survey weights based on auxiliary data, and may be used to account for
nonresponse. (See Särndal, and Lundström(2005).) Here, however, we will concentrate
on cases where cutoff sampling with model-based ratio or regression prediction may be
superior to design-based sampling and inference. Knaub(1999b) is an introduction to
cutoff sampling that is found on the EIA website.
4. Cutoff Sampling and Regression for Skewed Establishment Survey Data:
One thing that sets establishment surveys apart from household surveys is that the data
are highly skewed: that is, there are a relatively large number of relatively tiny
respondents for a given data item, fewer mid-sized respondents, and a very few, very
large respondents.
4.1 Skewed Establishment Survey Data:
Design-based sampling often handles these using stratified random sampling or
probability proportionate to a measure of size. Cutoff sampling gives no survey design
weight to data below the cutoff, but collects data from establishments over a certain size
with certainty.
4.2 Cutoff sampling:
Data that are collected in a cutoff sample may be collected with substantially less
nonsampling error, such as measurement error, because the larger establishments receive
more of the attention. However, to avoid totals that are biased downward, one
should estimate for the data not collected. (When those data are not readily available,
estimation for those data may be more accurate than if observations were actually
attempted.) See Knaub(2008a).
4.3 Regression:
To estimate for data not collected in a cutoff sample, regression can be used.
Note that in model-assisted survey techniques, the generalized regression estimator
(GREG) uses a combination of calibrated weights from the design, and regression
weights. See Särndal, Swensson, and Wretman(1992) and Särndal and
Lundström(2005). This may not be all that one may find useful, however. See
Knaub(2006).
Regression is very useful, because it can take advantage of the sample data, and good
related (regressor/auxiliary) data to estimate for both nonresponse and for cases not in the
sample, i.e., for all ‘missing’ data. (Note: For model-assisted design-based sampling, we
make use of ‘auxiliary data.’ These same data used for strictly model-based or cutoff
sampling are called ‘regressor data.’) For cutoff sampling, the idea is to use regression to
estimate for data that could not be collected, or at least not reasonably well. Regression
can also be used to impute for nonresponse in a census.
When there is nonresponse, or data are considered nonresponses because no reasonable
explanations for failing edits are offered, then regression can be used to impute for those
numbers. For data not-in-sample (left out below the cutoff on purpose) we use regression
for ‘mass imputation,’ or ‘prediction.’ If there is sufficient information, then
‘nonresponse’ and ‘not-in-sample’ may be treated differently.
If a change has occurred within a reporting establishment (for example, a new fuel is
being used, not previously used at that plant for regressor data, such that a model
showing growth in use of fuels by plant does not apply to the new data), then the new
data will be a unique case that is considered an “add-on.” (See Knaub(2002).) That is,
once data are estimated to contribute to a total, the “add-on” is then included. This is
undesirable because regression can neither (1) be used to estimate for such data if they
are missing, nor (2) account for their contribution to variance.
Standard errors for totals can be found using prediction, just as they can be for means and
thus totals for random sampling. The difference is that the variance is about an estimated
regression line, not about an estimated mean value.
In the case of ordinary least squares regression, OLS, all data points are treated as if they
are known equally well. This is not reasonable for most establishment survey data (see
Brewer (2002)). Further, with establishment survey applications, it is common that
regression should go through the origin (Brewer (2002)). See also Knaub (2005) and
Knaub (2007a).
So let us consider model-based ratio estimation, which is a specific form of the weighted
least squares (WLS) regression. (But then, OLS is a special form of WLS also, where all
data have the same regression weights.) Notice that this may be considered a simple
application of econometrics. (See Maddala(1992).) We will be using weights of the
form w_i = x_i^(−2γ) (Cochran(1953)), with an intercept of zero (Brewer(2002)).
5. Derivation of Regression Weights:
Here, we use y_i = b x_i + e_0i / w_i^(1/2), or, using w_i = x_i^(−2γ), we have
y_i = b x_i + e_0i x_i^γ. This format for y_i = b x_i + e_i indicates heteroscedasticity,
where e_i, the estimated residual, is expressed in terms of a random factor, e_0i, and a
nonrandom (or systematic) factor, x_i^γ, that expresses the degree of heteroscedasticity.
(In Brewer(2002), γ is labeled the “coefficient of heteroscedasticity.”) When γ = 0, the
variance is constant, so the standard error is the same for all points, and confidence bands
about a regression line would be lines parallel to the regression line, so that at a given
confidence one might say 1,000,000 ± 500, and also say 10 ± 500, which usually makes
little sense with regression through the origin, which is the usual case for survey
statistics. (Note an exception to regression through the origin, noted on page 37 of
Knaub(2007a).) Later, more will be said regarding the fact that some use the format
w_i = x_i^(−γ), so that γ in that system is twice what is used here.
In the above, y_i is the variable of interest, and x_i is the regressor (or a matrix of
regressors, in which case the estimated residual is e_0i z_i^γ, where z_i is a measure of
size [see Knaub(2003)]), and e_0i is the random factor of the residual.
Let Q be the sum of squares of the weighted residuals (Abdi(2003), Maddala(1992),
et al.), so here, for one regressor and a zero intercept:

Q = Σ_{i=1}^{n} w_i (y_i − b x_i)² = Σ_{i=1}^{n} e_0i²

If we set ∂Q/∂b = 0, then we may determine the best form for the ‘slope’ parameter so
that the ‘best’ fitting regression line goes through the data points (i.e., the variance of the
data points about the line is minimized). However, there will be a weight factor involved.
The data themselves can be used to determine the weight that ‘fits’ best (see page 2 of
Knaub(1997)), and this may not yield the minimum variance among those to be
determined by ∂Q/∂b = 0, but would be more realistic under this model. Lower data
quality near the origin, and the fact that a model is an approximation that will not exactly
fit the situation for each application, may make it best if we underestimate γ, as will be
mentioned again later.
So, if Q = Σ_{i=1}^{n} [w_i^0.5 (y_i − b x_i)]², then

∂Q/∂b = 0 = 2 Σ_{i=1}^{n} [w_i^0.5 (y_i − b x_i)][−x_i w_i^0.5],

and we will be using weights of the form w_i = x_i^(−2γ). (See Cochran(1953), pages
210-212, with an intercept of zero (Brewer(2002), pages 109-110).)
So,

Σ_{i=1}^{n} w_i^0.5 (y_i − b x_i)(x_i w_i^0.5) = 0, Σ_{i=1}^{n} w_i x_i (y_i − b x_i) = 0,
Σ_{i=1}^{n} w_i x_i y_i = b Σ_{i=1}^{n} w_i x_i²,

and therefore we have b = Σ_{i=1}^{n} w_i x_i y_i / Σ_{i=1}^{n} w_i x_i².
Using w_i = x_i^(−2γ), we have the following:

b = Σ_{i=1}^{n} w_i x_i y_i / Σ_{i=1}^{n} w_i x_i² = Σ_{i=1}^{n} x_i^(−2γ) x_i y_i / Σ_{i=1}^{n} x_i^(−2γ) x_i²
  = Σ_{i=1}^{n} x_i^(1−2γ) y_i / Σ_{i=1}^{n} x_i^(2−2γ).
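The slope formula just derived can be sketched directly in code (a hypothetical helper, not from the paper; γ is passed in):

```python
def wls_slope_through_origin(x, y, gamma):
    """b = sum(w_i * x_i * y_i) / sum(w_i * x_i**2) with w_i = x_i**(-2*gamma),
    which simplifies to sum(x_i**(1-2g) * y_i) / sum(x_i**(2-2g))."""
    num = sum(xi ** (1 - 2 * gamma) * yi for xi, yi in zip(x, y))
    den = sum(xi ** (2 - 2 * gamma) for xi in x)
    return num / den
```

With gamma = 0.5 this reduces to sum(y)/sum(x), the classical ratio estimator of section 5.1; for data lying exactly on a line through the origin, every choice of gamma returns the same slope.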
5.1 One may then arrive at the three cases in Royall(1970) below for regression through
the origin:

1) If γ = 0, then b = Σ_{i=1}^{n} x_i y_i / Σ_{i=1}^{n} x_i², which is the case for ordinary
least squares (OLS) regression, when the intercept is zero.

2) If γ = 0.5, b = Σ_{i=1}^{n} y_i / Σ_{i=1}^{n} x_i, and since a zero intercept is used here,
this is the Classical Ratio Estimator, CRE, form of weighted least squares, WLS.
3) If γ = 1, b = (Σ_{i=1}^{n} y_i / x_i) / n, which is a special case with a design-based
equivalent used in a famous study of Greek elections – Jessen, et al.(1947).
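The three cases can be written out and compared on the same data; a minimal sketch with hypothetical numbers:

```python
def slope_case1(x, y):
    """gamma = 0: OLS through the origin."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)

def slope_case2(x, y):
    """gamma = 0.5: Classical Ratio Estimator, a ratio of sums."""
    return sum(y) / sum(x)

def slope_case3(x, y):
    """gamma = 1: a mean of the individual ratios y_i / x_i."""
    return sum(yi / xi for xi, yi in zip(x, y)) / len(x)

x = [1.0, 4.0]
y = [3.0, 4.0]
# The three estimators generally differ unless the data fall exactly on a
# line through the origin.
```

For these two points the estimators give 19/17 ≈ 1.12, 7/5 = 1.4, and 2, illustrating how the heavier weighting of small x in case 3 pulls the slope toward the ratio for the smallest unit.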
So, when regression weights are all equal, we have the special case of WLS, where we
can set γ = 0, referred to as Ordinary Least Squares (OLS) regression. As will be shown
here, for establishment surveys, usually with regression through the origin, this is hardly
to be “ordinarily” expected.
We will continue considering regression through the origin (intercept zero, see Brewer
(2002), pp. 109-110, but note caution in Knaub (2007a), p. 37). We will also continue
concentrating on one regressor, but this could be extended to multiple regressors. (See
Knaub (2003), pp. 3-5.) Further, we will show that it is difficult to do better than the
Classical Ratio Estimator, CRE (see Knaub (2005)), for cutoff sampling of establishment
surveys (Knaub (2007a, 2008a)).
6. More on the three cases studied in Royall(1970):
Consider the three cases from Royall (1970):
b_{γ=0} = Σ_{i=1}^{n} x_i y_i / Σ_{i=1}^{n} x_i², b_{γ=0.5} = Σ_{i=1}^{n} y_i / Σ_{i=1}^{n} x_i,
b_{γ=1} = (Σ_{i=1}^{n} y_i / x_i) / n,

where y_i = b x_i + e_0i / w_i^(1/2), and w_i = x_i^(−2γ). When γ = 0, variance is
constant, but variance increases with x for γ > 0.
SAS PROC REG produces the STDI (standard error of the individual y-value), the square
root of the variance of the prediction error (Maddala(1992)). This is “S1” in
Knaub(1999a, 2000, & 2001).
Error of the Individual y-value (EI):
For γ = 0, y_i = b x_i + e_0i x_i^γ becomes y_i = b_{γ=0} x_i + e_0i.
For γ = 0.5, we have y_i = b_{γ=0.5} x_i + e_0i x_i^0.5, and
for γ = 1, y_i = b_{γ=1} x_i + e_0i x_i = (b_{γ=1} + e_0i) x_i.
We will concentrate on the individual error terms.
6.1 Relative Error of the Individual y-value (REI):
The random factor of the residual is e_0i. Note that the magnitude of this value may
change drastically with changes in the model, and perhaps we should at least call it
e_0iγ, but for simplicity we will call it e_0i. However, below, please note that the e_0i
differ by model. Then for y_i = b x_i + e_0i x_i^γ = b x_i + e_i, the relative error of the
individual y-value, or relative EI, is e_i / y_i. We will call this the REI.
Thus e_i / y_i may now be found for each of the three cases we have been examining, as
well as for any other regression weights.

For γ = 0, Case 1, e_i / y_i = REI = e_0i / (b_{γ=0} x_i + e_0i), which means that the
relative error of each point, the REI, in general, decreases linearly with x.

For γ = 0.5, e_i / y_i = REI = e_0i / (b_{γ=0.5} x_i^0.5 + e_0i), which means that the REI
generally decreases with the square root of x, and

for γ = 1, e_i / y_i = REI = e_0i / (b_{γ=1} + e_0i), which means that the REI for each
y-value is not influenced by the value of x.
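The REI behavior in the three cases can be checked numerically; the values of b and e_0i below are hypothetical:

```python
def rei(gamma, b, x, e0):
    """Relative error of the individual y-value, e_i / y_i, for the model
    y = b*x + e0 * x**gamma (so e_i = e0 * x**gamma)."""
    e_i = e0 * x ** gamma
    y_i = b * x + e_i
    return e_i / y_i

# gamma = 0: the REI shrinks as x grows
# gamma = 1: the REI is the same no matter the value of x
```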
6.2 STDI vs REI:
1. For Case 1, OLS, the STDI is identical for each y-value, while the REI decreases
linearly with x.
2. For Case 2, CRE, the STDIs increase while the REIs decrease with increasing x.
3. For Case 3 with the more extreme weights, the STDIs increase greatly with increasing
x, while the REI for each y-value is not related to x at all.
Clearly the CRE represents a compromise between two extremes: (1) a constant standard
error (OLS, no matter the y-value, so that one could have 1,000,000 ± 1,000 and
10 ± 1,000), and (2) a relative error of the individual y-value not related to x at all (so
that the slope, b_{γ=1} = (Σ_{i=1}^{n} y_i / x_i) / n, is a mean of ratios, such as giving the
cost per can of one can of corn equal weight with that of 1000 cans of another brand of
corn of a very different unit cost). For survey data with regression through the origin,
Brewer(2002) argues that the CRE is at the lower end of expected heteroscedasticity.
More follows regarding this.
6.3 Classical Ratio Estimator (CRE) and other degrees of heteroscedasticity:
The slope for regression through the origin in the case of the CRE, recall, is
b_{γ=0.5} = Σ_{i=1}^{n} y_i / Σ_{i=1}^{n} x_i. This ratio seems quite natural for many
relationships. However, there
are several ways to estimate the heteroscedasticity (degree of increase in variance for
larger x [Knaub (2007b)]). See Sweet and Sigman(1995), p. 493, section 2.6, Carroll and
Ruppert(1988), and Knaub(1993, 1997). This normally shows that for establishment
surveys, results fall between that of Case 2 and Case 3. See Brewer(2002). The
iteratively reweighted least squares method (IRLS, Carroll and Ruppert(1988), et al.)
may not converge to find a coefficient of heteroscedasticity (a term in Brewer(2002)),
and Knaub(1993, 1997) may only show the value that comes closest, but this can be
expected because models are never of exactly the ‘right’ format.
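As a rough illustration of estimating the coefficient of heteroscedasticity (this is a simple log-log regression sketch, not the specific procedures of Carroll and Ruppert(1988) or Knaub(1993, 1997)): fit a preliminary slope, then regress log|residual| on log x. When residual size behaves like a constant times x**gamma, the fitted log-log slope approximates gamma.

```python
from math import log

def estimate_gamma(x, y):
    """Illustrative only: preliminary CRE fit, then OLS of log|residual|
    on log(x); the slope approximates gamma when |residual| ~ const * x**gamma."""
    b = sum(y) / sum(x)                        # preliminary CRE slope
    pts = [(log(xi), log(abs(yi - b * xi)))    # (log x, log |residual|)
           for xi, yi in zip(x, y) if abs(yi - b * xi) > 0.0]
    n = len(pts)
    mu = sum(u for u, _ in pts) / n
    mv = sum(v for _, v in pts) / n
    num = sum((u - mu) * (v - mv) for u, v in pts)
    den = sum((u - mu) ** 2 for u, _ in pts)
    return num / den
```

A production estimate would iterate between the slope fit and the weight fit (as in IRLS), which, as noted above, may not converge.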
7. Improving Regression Weights for Establishment Surveys:
Regression through the origin is often useful for establishment surveys (Brewer(2002)).
Further, Case 2, the CRE, has often appeared appropriate in this author’s experience;
even though the data may demonstrate even higher heteroscedasticity, the CRE seems
robust, especially against data quality problems near the origin not already eliminated by
cutoff sampling. Holmberg and Swensson(2001) present a case for at least
slightly underestimating heteroscedasticity in model-assisted, design-based sampling.
Note the adaptations to the usual regression weight format, Cochran(1953), found in
Sweet and Sigman(1995), and Steel and Fay(1995), but the CRE, or at least the format
w_i = x_i^(−2γ), is difficult to better. (In Holmberg and Swensson(2001), and Sarndal,
Swensson, and Wretman(1992), the format used is w_i = x_i^(−γ). In either case for the
CRE we have w_i = 1/x_i, with a zero intercept.)
In email correspondence with Anders Holmberg (Statistics Sweden), he expressed a
view approaching the one held in this document, that it may be reasonable to use a
smaller value for γ than indicated by an estimation of that value, although Brewer(2002)
warns that for survey sampling, one would normally have 0.5 < γ < 1.0. (Estimating γ
for electric power surveys, using approaches such as those found in Carroll and
Ruppert(1988) or Knaub(1993, 1997), confirms this, with many cases between 0.8 and
1.0.)
Following is a quote from Anders Holmberg in an email correspondence dated October
22, 2007. “Ken’s interval” refers to Brewer(2002) where as indicated above, it is argued
that for establishment surveys, appropriate estimation generally falls between Case 2 and
Case 3:
“… I am keen to agree with you that it is better/safer to choose 0.5 [Case 2] over a
value in the higher end of Ken's interval. I have noticed that the bad effect of
guessing it wrong is higher if you overestimate the gamma compared to
underestimating it. (At least effects on anticipated variances and variance).”
In addition, improvements have been attempted for regression schemes, Sweet and
Sigman(1995), Steel and Fay(1995), and Karmel and Jain(1987), for example. There are
interesting questions to consider. Should the heteroscedasticity expressed in the model
be different for random nonresponse as opposed to predicting for smaller establishments?
(See Knaub(1999a).) It seems apparent that the smallest observations often have
disproportionately large nonsampling error in cutoff sampling. (See Knaub(2002).)
Removal of the worst cases from direct consideration is a major advantage of cutoff
sampling, that can make it a more accurate option, not just easier and less expensive. See
Knaub(2007a and 2008a). In general, establishment survey data demonstrate a large
amount of heteroscedasticity, often closer to Case 3 than to Case 2, except closer to the
origin. In the past, some partitioning of the data into strata has been tried, but this did not
work well in those particular attempts. Recently, experiments have been performed using
weights which approach w_i = x_i^(−1.6) for the larger establishments, while lowering
the extremely high weights near the origin. (Because the relative weights between larger
and smaller establishments are then still larger than with w_i = x_i^(−1.6), in future
experiments we may want to use weights which approach w_i = x_i^(−2) for the larger
establishments.)
Although weights should probably be deflated near the origin due to less reliable data,
they must still approach infinity (so zero variance) right at the origin due to the fact that
we are dealing with data here to which regression through the origin reasonably applies.
That is, these are for cases where one expects y→0 when x→0 (unlike what is depicted
on page 37 of Knaub(2007a)).
Following are some weighting formulas that could go directly into SAS PROC REG, but
remember that it is the relative weight from one point to the next that matters. For
example, for OLS, all weights could be 0.001, or they could all be 1, or they could all be
3,784. The same results would occur, regardless.
Figure 1: Adjusted Regression Weights – weight (vertical axis, 0 to 0.1) vs. measure of
size (horizontal axis, 1 to about 500), for the following weight functions:
w1 = 1/x
w2 = 15/x**1.6
w3 = 15/(x**1.6 + 38,600/x**2)
w4 = 0.75/(x**1.6) + 15/(x**1.6 + 38,600/x**2)
w5 = 15.75/x**1.6
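The point about relative weights can be verified directly; the sketch below uses w2 from Figure 1 and hypothetical data, and checks that rescaling every weight by a constant leaves the fitted slope unchanged:

```python
def wls_slope(x, y, w):
    """Weighted least squares slope through the origin."""
    return (sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
            / sum(wi * xi ** 2 for wi, xi in zip(w, x)))

def w2(x):
    """w2 from Figure 1 (x is the measure of size); the factor 15 is cosmetic."""
    return 15 / x ** 1.6

x = [10.0, 50.0, 120.0, 300.0, 500.0]     # hypothetical sizes
y = [21.0, 98.0, 245.0, 590.0, 1010.0]    # hypothetical values of interest

base = [w2(xi) for xi in x]
scaled = [0.001 * wi for wi in base]      # any constant rescaling

# wls_slope(x, y, base) and wls_slope(x, y, scaled) agree
```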
Note that w2 in Figure 1 is for Case 3. (A double asterisk in the figure means that an
exponent follows, which is a computer programming convention.) All weights for w2 are
15 times the nominal formula, but that makes no difference to estimation by regression
(i.e., prediction). What it does is to put w2 on the graph in a position more comparable
visually to w1.
w3 contains an added term to reduce the weight near the origin where data quality is
weakest. However, a zero weight means an infinite variance, and that is definitely not
true at the origin for regression through the origin.
w4 adds yet another term to reverse the trend near the origin. This yields a format that
might better fit the variability of the data. However, results have appeared mixed.
Consider that a major advantage of the CRE has not been that it appears to perform best
in most cases, but that it does not seem to often perform inadequately. Can we improve
on that? Note that even when a relative standard error for a total, RSE, appears to be
smaller for one alternative over others, it may be less accurate (the variance and/or bias of
the variance could be relatively larger), so more work may be desired than the
comparisons shown next. (See Knaub(2001).)
A relative standard error (estimated) is the estimated standard error of a total (or other
parameter) divided by that estimated total (or other parameter), usually expressed as a
percent. (Also see “relative variance” in Hansen, et.al.(1953).) In the graphs to follow,
programmed by Maren Blackham, Science Applications International Corporation
(SAIC), results were shown for relative standard errors (RSEs), and another survey
performance measure, relative standard error from a superpopulation (RSESP). (See
Cochran(1977), page 158, regarding the term “superpopulation.”) The latter measure,
found in Knaub (2002, 2003, 2004, and 2007d) is the model equivalent of removing the
finite population correction factor (fpc, see Knaub(2008b)) from a simple random
sample. It can be used for overall data variance comparisons, even for a census, which
could indicate data quality problems.
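As defined above, the RSE is simply the estimated standard error of the total divided by the estimated total, expressed as a percent; trivially:

```python
def relative_standard_error(se_of_total, total_hat):
    """Estimated standard error of an estimated total, as a percent
    of that estimated total."""
    return 100.0 * se_of_total / total_hat

# e.g., a standard error of 50 on an estimated total of 1000 is an RSE of 5%
```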
Lower RSE and RSESP values indicate superior performance over other alternatives.
However, the accuracy of these measures might vary substantially. See Knaub(2001). In
the first example, Figures 2 and 3, “IndepPP” means we are considering independent
power producers of electricity; “Waste” is the fuel used; “FL” is for the State of Florida.
Small area estimation is used here, meaning that all such plants using this fuel across
southern States are used in the model (estimation group) to estimate what is needed to
publish results for Florida. See Knaub(1999a). The other examples, Figures 4 through 7
are for two more examples, aggregated to the national level.
The weights used are as follows:

w_i = x_i^(−1),

w_i = x_i^(−1.6) + 20(x_i^(1.6) + c x_i^(−2))^(−1), where c = 0.8 × h^3.6, with h = 5th,
10th, and 25th percentile,

and

w_i = x_i^(−1.6) + 20(x_i^(1.6) + (c/3) x_i^(−2))^(−1), where c = 0.8 × h^3.6, with
h = 10th percentile.
Note that the author chose “c” as a parameter indicating the maximum weight for the
second case, which corresponds to w3 in the graph of regression weights given
previously. At approximately 3 times c, the alternatives to the CRE nearly converge.
Note that the approximation for small area estimation in Knaub(1999) was used and that
we are currently working on improving the value used for “delta,” δ, which is a function
of the distributional form of the ‘size’ of the establishments. (Perhaps the coefficient of
skewness or that and the coefficient of kurtosis may be sufficient parameters.) Adjusting
the deltas so that variance matches a more “exact” variance estimate for each “estimation
group” (Knaub(1999)) seems feasible.
Also note that the size measure used in the regression weights can be used as a single
regressor, if the components and their coefficients are made the same for groups of data
being compared, so that one slope and its standard error can be used to compare
estimation groups to see if collapsing them may be helpful for purposes of small area
estimation.
Example Result 1: waste-fired gross generation for independent power plants in Florida
– Shown in Figures 2 and 3.
Figure 2
Figure 3
Example Result 2: natural gas-fired gross generation
Figure 4
Figure 5
Example Result 3: wind created gross generation
Figure 6
Figure 7
Although the CRE appears to do consistently better in the cases above, there were cases
in the many examples investigated, often for smaller (area) publication groups, for which
this was not always true. A variance and bias study, such as in Knaub(2001), should be
considered in any case since the accuracy of these estimates impacts their comparisons.
At any rate, the CRE seems robust. (See Knaub(2005).)
8. History of Survey Statistics:
In the United States, there was a debate over the use of probability (design-based)
sampling vs. purposive sampling (which would include cutoff sampling) in the 1940s.
Design-based sampling was decided to be superior, and sampling for Official Statistics
became dominated by it. Later, Ken Brewer(1963) and Richard Royall(1970) made a
case for the use of models, which was noted in Cochran(1977), but there was little
enthusiasm, at least at first, at least in the US, for the approach Royall used which did not
include randomization. His thought was that conditionality on a sample was more
important. This was not popular. Royall later modified his views to include “balanced
sampling” (see “Conclusions”) as a way to avoid “model failure” (i.e., not adequately
representing the part of the population not eligible for sampling). Still, many recognize
the usefulness of models in survey statistics, while still favoring a design-based approach.
Note Sarndal, Swensson, and Wretman(1992). Cutoff sampling has survived, however,
perhaps more in Europe and other parts of the world (see Knaub(2007a, 2008a)),
although Sarndal, Swensson, and Wretman(1992) have little good to say about cutoff
sampling. In Brewer(1995, 2002), Ken Brewer argues for a combined approach, and
does not ever suggest abandoning a design-based approach.
One of the strongest opponents of purposive sampling was Morris Hansen. If anyone is
interested in the debate over model-based approaches, then perhaps the most famous
debate over this can be found online on the American Statistical Association website,
from the 1978 North American Joint Statistical Meetings (JSM), where, in a session
chaired by H.O. Hartley, a paper presented by Hansen, Madow, and Tepping put forth
their arguments, which were then commented on by some of the most well-known
statisticians of the day, or even today. Hansen, Madow, and Tepping(1978) was
commented on by V.P. Godambe, Keith Eberhardt, William G. Cochran, Carl Erik
Sarndal, Oscar Kempthorne, and Richard Royall, after which Hansen, Madow, and
Tepping responded. It seemed, in this author’s opinion, that the arguments by Hansen,
et.al., may be more applicable to household surveys than to establishment surveys. Of
course, examples may be found to make either method perform better than the other.
Note that this proceedings paper was followed by a journal article: Hansen, Madow, and
Tepping(1983).
9. Conclusions:
For highly skewed establishment survey data where the smallest establishments have
difficulty supplying good quality data on a frequent basis (say, monthly), but may supply
good regressor data (e.g., from annual census data), cutoff sampling and regression
estimation may be a viable alternative to design-based methods.
For electric power data, such cutoff sampling has appeared superior in that results
compared well to those from a larger, design-based sample.
Comparing the totals of 12 monthly sample estimates of totals to corresponding annual
census totals taken later has shown that cutoff sampling with regression estimation
appears to work quite well. Also see Knaub(2001).
Because some data just above cutoff thresholds are still of rather low quality, estimated
RSEs are sometimes lower (better) when those data are deleted and imputed, as if the
low quality collected data had been nonresponses. This may be why the CRE
works well even when heteroscedasticity appears more extreme: it may represent a
compromise. Stratifying the data so that lower coefficients of heteroscedasticity are
applied closer to the origin has proved problematic, but may still be worth further effort.
(Note also that Holmberg and Swensson(2001) found less inaccuracy resulting
from underestimating the coefficient of heteroscedasticity than from overestimating it.)
Another way to think of this is that, because low quality data may remain among the
smallest observations collected in a cutoff sample, the level of
heteroscedasticity apparent in the data overall is probably greater than the level that
should be used in the model for estimation. (See Holmberg and Swensson(2001) and
Knaub(1993, 1997).) The area near the origin would have inflated variance, so the
coefficient of heteroscedasticity would be overestimated. This tends
to cause substantial overestimation of variance for totals: confidence bands are made
too wide near the origin and grow wider still beyond it.
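To make the role of the coefficient of heteroscedasticity concrete, here is an illustrative sketch (not the paper's exact formulation; the data are invented) of a through-the-origin weighted least squares fit with regression weight w = x**(-2*gamma). Setting gamma = 0.5 reproduces the classical ratio estimator, gamma = 0 is ordinary least squares, and larger gamma puts relatively more weight on small establishments:

```python
# Weighted least squares slope for the model y = b*x, with regression
# weight w_i = x_i**(-2*gamma); gamma is the coefficient of
# heteroscedasticity. gamma = 0.5 gives the classical ratio estimator.

def wls_slope(xs, ys, gamma):
    """Slope b = sum(w*x*y) / sum(w*x**2) with w = x**(-2*gamma)."""
    w = [x ** (-2.0 * gamma) for x in xs]
    num = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    den = sum(wi * x * x for wi, x in zip(w, xs))
    return num / den

xs = [210.0, 320.0, 500.0]
ys = [225.0, 330.0, 530.0]  # toy reported values

for gamma in (0.0, 0.5, 1.0):  # OLS, CRE, and a larger assumed gamma
    print(gamma, wls_slope(xs, ys, gamma))
```

With gamma = 0.5 the slope reduces to sum(y)/sum(x), the CRE; varying gamma here shows how an over- or under-stated coefficient of heteroscedasticity shifts the fit, which is the quantity Holmberg and Swensson(2001) studied.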
Consider also that cutoff sampling, with regression used to predict for the out-of-sample
establishments, may be more accurate than design-based sampling, because the latter
means collecting some data that may be very inaccurate. We need to consider all sources
of survey error (total survey error), including reasonable limits on how much a
cutoff-sampling estimate might be in error if the part of the population given no chance
of selection were not modeled well. (For more on model failure, see Hansen, Madow,
and Tepping(1978).)
Model-assisted design-based sampling (e.g., Sarndal, Swensson, and Wretman(1992)) or
some other combination of approaches (Brewer(1995, 2002)) may be best; or, for a
model, balanced sampling (from Royall and Herson(1973a,b), defined well on page 593
of Brewer(1995)) may be useful. (See Chaudhuri and Stenger(1992), pages 74-89, and
Cumberland and Royall(1982).) However, establishment survey data are highly skewed,
and, again, when the smallest establishments cannot provide data of reasonable quality on
a frequent basis, cutoff sampling supplemented with estimation through regression
may provide more accurate results.
Note some problems that smaller establishments might encounter: (1) they may not
collect data as frequently as they are asked to report it, and (2) they may lack
personnel specialized in providing the information. This could often be the case with
Official Statistics.
For establishment surveys, generally a number of data elements are of interest from a
given set of establishments. Thus a ‘cutoff sample’ may actually be a compromise
between the best cutoffs for a number of attributes. Further complications arise when
small area estimation is necessary. Also, bias and variance need to be studied
(Knaub(2001)).
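One simple way to form the compromise described above (a sketch only; the establishment names, attribute names, and cutoff values are invented) is to include an establishment whenever it exceeds the cutoff for any attribute of interest:

```python
# Hypothetical compromise cutoff sample over several attributes: an
# establishment enters the sample if it exceeds the cutoff for ANY
# attribute; all names and values below are illustrative.

establishments = {
    "A": {"generation": 900.0, "fuel_receipts": 20.0},
    "B": {"generation": 50.0, "fuel_receipts": 800.0},
    "C": {"generation": 700.0, "fuel_receipts": 650.0},
    "D": {"generation": 30.0, "fuel_receipts": 25.0},
}
cutoffs = {"generation": 100.0, "fuel_receipts": 100.0}

sample = sorted(
    name for name, attrs in establishments.items()
    if any(attrs[k] >= c for k, c in cutoffs.items())
)
print(sample)  # A, B, and C each pass some cutoff; D passes none
```

A union rule like this trades sample size for coverage: an establishment that is large for even one attribute is retained, so no single attribute's cutoff is optimal, which is the compromise the text describes.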
All error sources considered, cutoff sampling will often be competitive with, and
sometimes quite superior to, design-based alternatives from the perspective of accuracy
of results. It may also be less costly, easier to implement, and more flexible in the face of
small area needs. It may be possible to do better than the CRE with cutoff sampling, but
the CRE is quite often a good choice for establishment surveys: it is very simple to
implement, and its results are relatively easy to check, which helps avoid data processing
errors.
Acknowledgments:
Thank you to Ken Brewer for help over a number of years, to Nancy Kirkendall for her
1990 presentation on models in survey estimation, and to Anders Holmberg, Joe
Sedransk, and others for discussions. Thanks to Maren Blackham for most of the graphs,
and thanks to those who provide online graphical calculators, one of which was used for
experimenting with regression weight formats.
References:
Abdi, H. (2003). Least-squares. In M. Lewis-Beck, A. Bryman, T. Futing (Eds):
Encyclopedia for research methods for the social sciences. Thousand Oaks (CA): Sage.
pp. 559-561.
Brewer, K.R.W. (1963), "Ratio Estimation in Finite Populations: Some Results
Deducible from the Assumption of an Underlying Stochastic Process," Australian
Journal of Statistics, 5, pp. 93-105.
Brewer, K.R.W. (1995), “Combining Design-Based and Model-Based Inference,”
Business Survey Methods, ed. by B.G. Cox, D.A. Binder, B.N. Chinnappa, A.
Christianson, M.J. Colledge, and P.S. Kott, John Wiley & Sons, pp. 589-606.
Brewer, K.R.W. (2002), Combined Survey Sampling Inference: Weighing Basu's
Elephants, Arnold: London, and Oxford University Press.
Carroll, R.J., and Ruppert, D. (1988), Transformation and Weighting in Regression,
Chapman &Hall.
Chaudhuri, A. and Stenger, H. (1992), Survey Sampling: Theory and Methods, Marcel
Dekker, Inc.
Cochran, W.G.(1953), Sampling Techniques, pp. 210-212, 1st ed., John Wiley & Sons.
Cochran, W.G.(1977), Sampling Techniques, 3rd ed., John Wiley & Sons.
Cumberland, W.G., and Royall, R.M. (1982), “Does SRS Provide Adequate Balance?,”
Proceedings of the Survey Research Methods Section, ASA, pp. 226-229,
http://www.amstat.org/sections/srms/proceedings/
Hansen, M.H., Hurwitz, W.N., and Madow, W.G. (1953). Sample Survey Methods and
Theory, Volumes I and II. Wiley.
Hansen, M.H., Madow, W.G., and Tepping, B.J. (1978), “On Inference and Estimation
from Sample Surveys,” with comments from discussants, Proceedings of the Survey
Research Methods Section, ASA, pp. 82-107.
http://www.amstat.org/sections/srms/proceedings/
Hansen, M.H., Madow, W.G., and Tepping, B.J. (1983), “An Evaluation of Model-Dependent
and Probability-Sampling Inferences in Sample Surveys: Rejoinder,”
Journal of the American Statistical Association, Vol. 78, No. 384 (Dec., 1983), pp. 805-807.
Holmberg, A., and Swensson, B. (2001), “On Pareto πps Sampling: Reflections on
Unequal Probability Sampling Strategies,” Theory of Stochastic Processes, Vol. 7, pp.
142-155.
Jessen, R. J., et al. (1947). "On a Population Sample for Greece," Journal of the
American Statistical Association, pp. 357-384.
Karmel, T.S., and Jain, M. (1987), “Comparison of Purposive and Random Sampling
Schemes for Estimating Capital Expenditure,” Journal of the American Statistical
Association, American Statistical Association, 82, pp. 52-57.
Kirkendall, et al. (1990), “Sampling and Estimation: Making Best Use of Available
Data,” seminar at the EIA, September 1990.
Knaub, J.R., Jr. (1989), "Ratio Estimation and Approximate Optimum Stratification in
Electric Power Surveys," Proceedings of the Section on Survey Research Methods,
American Statistical Association, pp. 848-853.
http://www.amstat.org/sections/srms/proceedings/
Knaub, J.R., Jr. (1990), “Some Theoretical and Applied Investigations of Model and
Unequal Probability Sampling for Electric Power Generation and Cost,” Proceedings of
the Section on Survey Research Methods, American Statistical Association, pp. 748-753.
http://www.amstat.org/sections/srms/proceedings/
Knaub, J.R., Jr. (1993), "Alternative to the Iterated Reweighted Least Squares Method:
Apparent Heteroscedasticity and Linear Regression Model Sampling," Proceedings of the
International Conference on Establishment Surveys, American Statistical Association, pp.
520-525.
Knaub, J.R., Jr. (1996), “Weighted Multiple Regression Estimation for Survey Model
Sampling,” InterStat, May 1996, http://interstat.statjournals.net/. (Note that there is a
shorter version in the ASA Survey Research Methods Section proceedings, 1996.)
Knaub, J.R., Jr. (1997), "Weighting in Regression for Use in Survey Methodology,"
InterStat, April 1997, http://interstat.statjournals.net/. (Note shorter, but improved version
in the 1997 Proceedings of the Section on Survey Research Methods, American
Statistical Association, pp. 153-157.)
Knaub, J.R., Jr. (1999a), “Using Prediction-Oriented Software for Survey Estimation,”
InterStat, August 1999, http://interstat.statjournals.net/, partially covered in "Using
Prediction-Oriented Software for Model-Based and Small Area Estimation," in ASA
Survey Research Methods Section proceedings, 1999, and partially covered in "Using
Prediction-Oriented Software for Estimation in the Presence of Nonresponse,” presented
at the International Conference on Survey Nonresponse, 1999.
Knaub, J.R. Jr. (circa 1999b), “Model-Based Sampling, Inference and Imputation,” EIA
web site: http://www.eia.doe.gov/cneaf/electricity/forms/eiawebme.pdf
Knaub, J.R., Jr. (2000), “Using Prediction-Oriented Software for Survey Estimation -
Part II: Ratios of Totals,” InterStat, June 2000, http://interstat.statjournals.net/. (Note
shorter, more recent version in ASA Survey Research Methods Section proceedings,
2000.)
Knaub, J.R., Jr. (2001), “Using Prediction-Oriented Software for Survey Estimation -
Part III: Full-Scale Study of Variance and Bias,” InterStat, June 2001,
http://interstat.statjournals.net/. (Note another version in ASA Survey Research Methods
Section proceedings, 2001.)
Knaub, J.R., Jr. (2002), “Practical Methods for Electric Power Survey Data,” InterStat,
July 2002, http://interstat.statjournals.net/. (Note another version in ASA Survey
Research Methods Section proceedings, 2002.)
Knaub, J.R., Jr. (2003), “Applied Multiple Regression for Surveys with Regressors of
Changing Relevance: Fuel Switching by Electric Power Producers,” InterStat, May 2003,
http://interstat.statjournals.net/. (Note another version in ASA Survey Research Methods
Section proceedings, 2003.)
Knaub, J.R., Jr. (2004), “Modeling Superpopulation Variance: Its Relationship to Total
Survey Error,” InterStat, August 2004, http://interstat.statjournals.net/. (Note another
version in ASA Survey Research Methods Section proceedings, 2004.)
Knaub, J.R., Jr. (2005), “Classical Ratio Estimator,” InterStat, October 2005,
http://interstat.statjournals.net/.
Knaub, J.R., Jr. (2006), Book Review, Journal of Official Statistics, Vol. 22, No. 2, 2006,
pp. 351–355, http://www.jos.nu/Articles/article.asp
Knaub, J.R., Jr. (2007a), “Cutoff Sampling and Inference,” InterStat, April 2007,
http://interstat.statjournals.net/.
Knaub, J.R., Jr. (2007b), “Heteroscedasticity and Homoscedasticity” in Encyclopedia of
Measurement and Statistics, Editor: Neil J. Salkind, Sage, Vol. 2, pp. 431-432.
Knaub, J.R., Jr. (2007c), “Survey Weights” in Encyclopedia of Measurement and
Statistics, Editor: Neil J. Salkind, Sage, Vol. 3, p. 981.
Knaub, J.R., Jr. (2007d), "Model and Survey Performance Measurement by the RSE and
RSESP," Proceedings of the Section on Survey Research Methods, American Statistical
Association, pp. 2730-2736.
http://www.amstat.org/sections/srms/proceedings/
Knaub, J.R., Jr. (2008a), forthcoming. “Cutoff Sampling.” In Encyclopedia of Survey
Research Methods, Editor: Paul J. Lavrakas, Sage, to appear in September 2008.
Knaub, J.R., Jr. (2008b), forthcoming. “Finite Population Correction.” In Encyclopedia of
Survey Research Methods, Editor: Paul J. Lavrakas, Sage, to appear in September 2008.
Maddala, G.S. (1992), Introduction to Econometrics, 2nd ed., Macmillan Pub. Co.
Royall, R.M. (1970), "On Finite Population Sampling Theory Under Certain Linear
Regression Models," Biometrika, 57, pp. 377-387.
Royall, R.M., and Herson, J.(1973a), “Robust Estimation in Finite Populations I,”
Journal of the American Statistical Association, Vol. 68, pp. 880-889.
Royall, R.M., and Herson, J.(1973b), “Robust Estimation in Finite Populations II:
Stratification on a Size Variable,” Journal of the American Statistical Association, Vol.
68, pp. 890-893.
Särndal, C.-E., and Lundström, S., (2005), Estimation in Surveys with Nonresponse,
Wiley.
Särndal, C.-E., Swensson, B. and Wretman, J. (1992), Model Assisted Survey Sampling,
Springer-Verlag.
Sweet, E.M. and Sigman, R.S. (1995), “Evaluation of Model-Assisted Procedures for
Stratifying Skewed Populations Using Auxiliary Data,” Proceedings of the Section on
Survey Research Methods, Vol. I, American Statistical Association, pp. 491-496.
http://www.amstat.org/sections/srms/proceedings/
Steel, P. and Fay, R.E. (1995), “Variance Estimation for Finite Populations with Imputed
Data,” Proceedings of the Section on Survey Research Methods, Vol. I, American
Statistical Association, pp. 374-379.
http://www.amstat.org/sections/srms/proceedings/
Valliant, R., Dorfman, A.H., and Royall, R.M. (2000), Finite Population Sampling and
Inference, A Predictive Approach, John Wiley & Sons.