Cutoff vs. Design-Based Sampling and Inference For Establishment Surveys
James R. Knaub, Jr.
Electric Power Division, Energy Information Administration
Key words: classical ratio estimator, conditionality principle, model failure, probability
proportionate to size (PPS), randomization principle, regression, skewed data,
superpopulation, total survey error
Abstract:
Most sample surveys, especially household surveys, are design-based, meaning sampling
and inference (e.g., means and standard errors) are determined by a process of
randomization. That is, each member of the population has a predetermined probability
of being selected for a sample. Establishment surveys generally have many smaller
entities, and a relatively few large ones. Still, each can be assigned a given probability of
selection. However, an alternative may be to collect data from generally the largest
establishments – some being larger for some attributes than others – and use regression to
estimate for the remainder. For design-based sampling, or even for a census survey, such
models are often needed to impute for nonresponse. When such modeling would be
needed for many small respondents, generally if sample data are collected on a frequent
basis, but regressor (related) data are available for all of the population, then cutoff
sampling with regression used for inference may be a better alternative. Note that with
regression, one can always calculate an estimate of variance for an estimated total. (For
example, see Knaub(1996), and note Knaub(2007d).)
1. Introduction:
First, an elementary description of the design-based approach, using the principle of
randomization, is provided in section 2. That section ends by mentioning the kind of
design-based approaches used for establishment surveys. Section 3 provides some
references and a short transition to a cutoff approach with regression estimation,
described in section 4. A regression model approach to estimation for a cutoff sample
relies on the principle of conditionality (i.e., conditioning results on the portion of the
population selected). Following that, section 5 reviews the derivation of regression
weights, and section 6 concentrates on three cases described in Royall(1970). Section 7
further explores alternatives for regression weights, searching for an improvement.
Finally a brief history of the debate between the principles of randomization and
conditionality, that is, between a design-based approach and a model-based approach (see
Hansen, Madow, and Tepping(1978)) is given in section 8, with conclusions in section 9.
Extensive references are supplied only as a guide to further reading, and to acknowledge
sources, but the article may be read without interruption.
2. Design-Based Approach:
There are many more complicated variations of sampling (complex sampling) involving
multiple stages of sampling, probability proportional to size (PPS) sampling, and cluster
sampling, area frames, and network sampling. However, here a brief background will be
given for only the simplest design: simple random sampling without replacement,
SRSWOR. (Many readers may want to skim ahead.) The main purpose here is to
consider when cutoff sampling may be preferable to design-based sampling.
2.1 SRSWOR:
Often when this is meant, the abbreviation used is just SRS.
Consider a population of 1000 establishments (N = 1000). If a sample of n = 200 is taken
at random (no duplicates allowed) then the probability of selection is p=200/1000 = 20%
= 0.2. Probabilities of selecting given individuals at any step along the way are more
complicated, but overall, 20% of the population is chosen at random in this example, by
continuing to randomly select from the population until 20% of it has been chosen.
2.2 SRS Design Weights:
SRS design weights are the inverse of the selection probabilities. See Knaub(2007c).
Therefore, in this example, the selection probability is 0.2, and the design weight is 1/0.2
= 5. Thus, each observation in this example will represent five units from the population,
itself and four others.
2.3 Estimation of Total from SRS:
The sum of the n = 200 observations to be made in this example would be written as

Σ_{i=1}^{n} y_i = Σ_{i=1}^{200} y_i = t_s

Then an estimate, T̂, of T is T̂ = (1000/200) t_s = 5 t_s.
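As a sketch of the arithmetic above (the population values here are hypothetical, generated only for illustration):

```python
import random

# Hypothetical skewed "establishment" population, N = 1000
random.seed(0)
population = [random.lognormvariate(3, 1.5) for _ in range(1000)]

N = len(population)                    # 1000
n = 200
sample = random.sample(population, n)  # SRSWOR: no duplicates

p = n / N                              # selection probability = 0.2
design_weight = 1 / p                  # = 5; each sampled unit represents five

t_s = sum(sample)                      # sum of the n observations
T_hat = (N / n) * t_s                  # expansion estimate of the total, 5 * t_s
```

The lognormal population is only meant to mimic the skewness typical of establishment data; any list of values would do.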
2.4 SRS: Standard Error:
SRS assumes that the set of data from the population (or category within the population
for stratified SRS) from which the random sample is drawn, consists of data that are
distributed ‘evenly’ about the mean. That is, the mean and mode are nearly equal. The
scatter of data about the mean is measured by the standard error, which is the square root
of the mean of the sum of the squared differences between the observations and their
means:
se = sqrt( Σ_{i=1}^{n} (y_i − ȳ)² / n ),

where ȳ = Σ_{i=1}^{n} y_i / n.
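A direct transcription of the formula above (note the divisor n, exactly as written; the helper name is hypothetical):

```python
from math import sqrt

def scatter_about_mean(y):
    """Square root of the mean of the squared deviations from the mean,
    as written in section 2.4 (divisor n, not n - 1)."""
    n = len(y)
    y_bar = sum(y) / n
    return sqrt(sum((yi - y_bar) ** 2 for yi in y) / n)

# Example: mean is 5, mean squared deviation is 4, so the result is 2.0
scatter_about_mean([2, 4, 4, 4, 5, 5, 7, 9])
```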
2.5 Stratified Random Sampling Skewed Establishment Survey Data:
In the Electric Power Division (EPD) of the Energy Information Administration (EIA),
stratified random sampling was used, with auxiliary data (design-based ratio estimation)
for a short time, nearly two decades ago. See Knaub(1989). Use was made of the
Keyfitz method. See Cochran(1977). Collecting monthly data from the smallest
establishments was highly problematic.
2.6 Probability Proportionate to Size (PPS) Sampling - Skewed Establishment Survey
Data:
The EPD also considered PPS sampling for a time. See Knaub(1990). PPS sampling
weights small observations greatly. If data quality is low for those points, then the
detrimental effect can be substantial.
3. Some General References:
For further information on a variety of techniques, see Brewer(2002), Sarndal, Swensson,
and Wretman(1992), and Valliant, Dorfman, and Royall(2000), among others, as well as
classics such as Cochran(1977). A technique known as “calibration” may be used to
adjust survey weights based on auxiliary data, and may be used to account for
nonresponse. (See Särndal, and Lundström(2005).) Here, however, we will concentrate
on cases where cutoff sampling with model-based ratio or regression prediction may be
superior to design-based sampling and inference. Knaub(1999b) is an introduction to
cutoff sampling that is found on the EIA website.
4. Cutoff Sampling and Regression for Skewed Establishment Survey Data:
One thing that sets establishment surveys apart from household surveys is that the data
are highly skewed: that is, there are a relatively large number of relatively tiny
respondents for a given data item, fewer mid-sized respondents, and a very few, very
large respondents.
4.1 Skewed Establishment Survey Data:
Design-based sampling often handles these using stratified random sampling or
probability proportionate to a measure of size. Cutoff sampling gives no survey design
weight to data below the cutoff, but collects data from establishments over a certain size
with certainty.
4.2 Cutoff sampling:
Data that are collected in a cutoff sample may be collected with substantially less
nonsampling error, such as measurement error, because the larger establishments receive
more of the attention. However, to avoid totals that are biased downward, one
should estimate for the data not collected. (When those data are not readily available,
estimation for those data may be more accurate than if observations were actually
attempted.) See Knaub(2008a).
4.3 Regression:
To estimate for data not collected in a cutoff sample, regression can be used.
Note that in model-assisted survey techniques, the generalized regression estimator
(GREG) uses a combination of calibrated weights from the design, and regression
weights. See Särndal, Swensson, and Wretman(1992) and Särndal and
Lundström(2005). This may not be all that one may find useful, however. See
Knaub(2006).
Regression is very useful, because it can take advantage of the sample data, and good
related (regressor/auxiliary) data to estimate for both nonresponse and for cases not in the
sample, i.e., for all ‘missing’ data. (Note: For model-assisted design-based sampling, we
make use of ‘auxiliary data.’ These same data used for strictly model-based or cutoff
sampling are called ‘regressor data.’) For cutoff sampling, the idea is to use regression to
estimate for data that could not be collected, or at least not reasonably well. Regression
can also be used to impute for nonresponse in a census.
When there is nonresponse, or data are considered nonresponses because no reasonable
explanations for failing edits are offered, then regression can be used to impute for those
numbers. For data not-in-sample (left out below the cutoff on purpose) we use regression
for ‘mass imputation,’ or ‘prediction.’ If there is sufficient information, then
‘nonresponse’ and ‘not-in-sample’ may be treated differently.
If a change has occurred within a reporting establishment (for example, a new fuel is
being used, not previously used at that plant for regressor data, such that a model
showing growth in use of fuels by plant does not apply to the new data), then the new
data will be a unique case that is considered an “add-on.” (See Knaub(2002).) That is,
once data are estimated to contribute to a total, the “add-on” is then included. This is
undesirable because regression can neither (1) be used to estimate for such data if they
are missing, nor (2) account for their contribution to variance.
Standard errors for totals can be found using prediction, just as they can be for means and
thus totals for random sampling. The difference is that the variance is about an estimated
regression line, not about an estimated mean value.
In the case of ordinary least squares regression, OLS, all data points are treated as if they
are known equally well. This is not reasonable for most establishment survey data (see
Brewer (2002)). Further, with establishment survey applications, it is common that
regression should go through the origin (Brewer (2002)). See also Knaub (2005) and
Knaub (2007a).
So let us consider model-based ratio estimation, which is a specific form of the weighted
least squares (WLS) regression. (But then, OLS is a special form of WLS also, where all
data have the same regression weights.) Notice that this may be considered a simple
application of econometrics. (See Maddala(1992).) We will be using weights of the
form w_i = x_i^(−2γ) (Cochran(1953)), with an intercept of zero (Brewer(2002)).
5. Derivation of Regression Weights:
Here, we use y_i = b x_i + e_0i / w_i^(1/2), or, using w_i = x_i^(−2γ), we have
y_i = b x_i + e_0i x_i^γ. This format for y_i = b x_i + e_i indicates heteroscedasticity,
where e_i, the estimated residual, is expressed in terms of a random factor, e_0i, and a
nonrandom (or systematic) factor, x_i^γ, that expresses the degree of heteroscedasticity.
(In Brewer(2002), γ is labeled the “coefficient of heteroscedasticity.”) When γ = 0, the
variance is constant, so the standard error is the same for all points, and confidence bands
about a regression line would be lines parallel to the regression line, so that at a given
confidence one might say 1,000,000 ± 500, and also say 10 ± 500, which usually makes
little sense with regression through the origin, which is the usual case for survey
statistics. (Note an exception to regression through the origin, noted on page 37 of
Knaub(2007a).) Later, more will be said regarding the fact that some use the format
w_i = x_i^(−γ), so that γ in that system is twice what is used here.
In the above, y_i is the variable of interest, and x_i is the regressor (or a matrix of
regressors, in which case the estimated residual is e_0i z_i^γ, where z_i is a measure of
size [see Knaub(2003)]), and e_0i is the random factor of the residual.
Let Q be the sum of squares of the weighted residuals (Abdi(2003), Maddala(1992),
et al.), so here, for one regressor and a zero intercept:

Q = Σ_{i=1}^{n} w_i (y_i − b x_i)² = Σ_{i=1}^{n} e_0i²

If we set ∂Q/∂b = 0, then we may determine the best form for the ‘slope’ parameter so
that the ‘best’ fitting regression line goes through the data points (i.e., the variance of the
data points about the line is minimized). However, there will be a weight factor involved.
The data themselves can be used to determine the weight that ‘fits’ best (see page 2 of
Knaub(1997)), and this may not yield the minimum variance among those to be
determined by ∂Q/∂b = 0, but would be more realistic under this model. Lower data
quality near the origin, and the fact that a model is an approximation that will not exactly
fit the situation for each application, may make it best if we underestimate γ, as will be
mentioned again later.
So, if Q = Σ_{i=1}^{n} [w_i^0.5 (y_i − b x_i)]², then

∂Q/∂b = 0 = 2 Σ_{i=1}^{n} [w_i^0.5 (y_i − b x_i)][−x_i w_i^0.5],

and we will be using weights of the form w_i = x_i^(−2γ). (See Cochran(1953), pages
210-212, with an intercept of zero (Brewer(2002), pages 109-110).)
So,

Σ_{i=1}^{n} w_i^0.5 (y_i − b x_i)(x_i w_i^0.5) = 0, Σ_{i=1}^{n} w_i x_i (y_i − b x_i) = 0,
Σ_{i=1}^{n} w_i x_i y_i = b Σ_{i=1}^{n} w_i x_i²,

and therefore we have b = Σ_{i=1}^{n} w_i x_i y_i / Σ_{i=1}^{n} w_i x_i².
Using w_i = x_i^(−2γ), we have the following:

b = Σ_{i=1}^{n} w_i x_i y_i / Σ_{i=1}^{n} w_i x_i² = Σ_{i=1}^{n} x_i^(−2γ) x_i y_i / Σ_{i=1}^{n} x_i^(−2γ) x_i²
  = Σ_{i=1}^{n} x_i^(1−2γ) y_i / Σ_{i=1}^{n} x_i^(2−2γ).
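The slope formula just derived can be sketched directly in code (a hypothetical helper, not from the paper; γ is passed in):

```python
def wls_slope_through_origin(x, y, gamma):
    """b = sum(w_i * x_i * y_i) / sum(w_i * x_i**2) with w_i = x_i**(-2*gamma),
    which simplifies to sum(x_i**(1-2g) * y_i) / sum(x_i**(2-2g))."""
    num = sum(xi ** (1 - 2 * gamma) * yi for xi, yi in zip(x, y))
    den = sum(xi ** (2 - 2 * gamma) for xi in x)
    return num / den
```

With gamma = 0.5 this reduces to sum(y)/sum(x), the classical ratio estimator of section 5.1; for data lying exactly on a line through the origin, every choice of gamma returns the same slope.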
5.1 One may then arrive at the three cases in Royall(1970) below for regression through
the origin:

1) If γ = 0, then b = Σ_{i=1}^{n} x_i y_i / Σ_{i=1}^{n} x_i², which is the case for ordinary
least squares (OLS) regression, when the intercept is zero.

2) If γ = 0.5, b = Σ_{i=1}^{n} y_i / Σ_{i=1}^{n} x_i, and since a zero intercept is used here,
this is the Classical Ratio Estimator, CRE, form of weighted least squares, WLS.
3) If γ = 1, b = (Σ_{i=1}^{n} y_i / x_i) / n, which is a special case with a design-based
equivalent used in a famous study of Greek elections – Jessen, et al.(1947).
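The three cases can be written out and compared on the same data; a minimal sketch with hypothetical numbers:

```python
def slope_case1(x, y):
    """gamma = 0: OLS through the origin."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)

def slope_case2(x, y):
    """gamma = 0.5: Classical Ratio Estimator, a ratio of sums."""
    return sum(y) / sum(x)

def slope_case3(x, y):
    """gamma = 1: a mean of the individual ratios y_i / x_i."""
    return sum(yi / xi for xi, yi in zip(x, y)) / len(x)

x = [1.0, 4.0]
y = [3.0, 4.0]
# The three estimators generally differ unless the data fall exactly on a
# line through the origin.
```

For these two points the estimators give 19/17 ≈ 1.12, 7/5 = 1.4, and 2, illustrating how the heavier weighting of small x in case 3 pulls the slope toward the ratio for the smallest unit.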
So, when regression weights are all equal, we have the special case of WLS, where we
can set γ = 0, referred to as Ordinary Least Squares (OLS) regression. As will be shown
here, for establishment surveys, usually with regression through the origin, this is hardly
to be “ordinarily” expected.
We will continue considering regression through the origin (intercept zero, see Brewer
(2002), pp. 109-110, but note caution in Knaub (2007a), p. 37). We will also continue
concentrating on one regressor, but this could be extended to multiple regressors. (See
Knaub (2003), pp. 3-5.) Further, we will show that it is difficult to do better than the
Classical Ratio Estimator, CRE (see Knaub (2005)), for cutoff sampling of establishment
surveys (Knaub (2007a, 2008a)).
6. More on the three cases studied in Royall(1970):
Consider the three cases from Royall (1970):
b_{γ=0} = Σ_{i=1}^{n} x_i y_i / Σ_{i=1}^{n} x_i², b_{γ=0.5} = Σ_{i=1}^{n} y_i / Σ_{i=1}^{n} x_i,
b_{γ=1} = (Σ_{i=1}^{n} y_i / x_i) / n,

where y_i = b x_i + e_0i / w_i^(1/2), and w_i = x_i^(−2γ). When γ = 0, variance is
constant, but variance increases with x for γ > 0.
SAS PROC REG produces the STDI (standard error of the individual y-value), the square
root of the variance of the prediction error (Maddala(1992)). This is “S1” in
Knaub(1999a, 2000, & 2001).
Error of the Individual y-value (EI):
For γ = 0, y_i = b x_i + e_0i x_i^γ becomes y_i = b_{γ=0} x_i + e_0i.
For γ = 0.5, we have y_i = b_{γ=0.5} x_i + e_0i x_i^0.5, and
for γ = 1, y_i = b_{γ=1} x_i + e_0i x_i = (b_{γ=1} + e_0i) x_i.
We will concentrate on the individual error terms.
6.1 Relative Error of the Individual y-value (REI):
The random factor of the residual is e_0i. Note that the magnitude of this value may
change drastically with changes in the model, and perhaps we should at least call it
e_0iγ, but for simplicity we will call it e_0i. However, below, please note that the e_0i
differ by model. Then for y_i = b x_i + e_0i x_i^γ = b x_i + e_i, the relative error of the
individual y-value, or relative EI, is e_i / y_i. We will call this the REI.
Thus e_i / y_i may now be found for each of the three cases we have been examining, as
well as for any other regression weights.

For γ = 0, Case 1, e_i / y_i = REI = e_0i / (b_{γ=0} x_i + e_0i), which means that the
relative error of each point, the REI, in general, decreases linearly with x.

For γ = 0.5, e_i / y_i = REI = e_0i / (b_{γ=0.5} x_i^0.5 + e_0i), which means that the REI
generally decreases with the square root of x, and

for γ = 1, e_i / y_i = REI = e_0i / (b_{γ=1} + e_0i), which means that the REI for each
y-value is not influenced by the value of x.
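The REI behavior in the three cases can be checked numerically; the values of b and e_0i below are hypothetical:

```python
def rei(gamma, b, x, e0):
    """Relative error of the individual y-value, e_i / y_i, for the model
    y = b*x + e0 * x**gamma (so e_i = e0 * x**gamma)."""
    e_i = e0 * x ** gamma
    y_i = b * x + e_i
    return e_i / y_i

# gamma = 0: the REI shrinks as x grows
# gamma = 1: the REI is the same no matter the value of x
```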
6.2 STDI vs REI:
1. For Case 1, OLS, the STDI is identical for each y-value, while the REI decreases
linearly with x.
2. For Case 2, CRE, the STDIs increase while the REIs decrease with increasing x.
3. For Case 3 with the more extreme weights, the STDIs increase greatly with increasing
x, while the REI for each y-value is not related to x at all.
Clearly the CRE represents a compromise between two extremes: (1) a constant standard
error (OLS, no matter the y-value, so that one could have 1,000,000 ± 1,000 and
10 ± 1,000), and (2) a relative error of the individual y-value not related to x at all (so
that the slope, b_{γ=1} = (Σ_{i=1}^{n} y_i / x_i) / n, is a mean of ratios, such as giving the
cost per can of one can of corn equal weight with that of 1000 cans of another brand of
corn of a very different unit cost). For survey data with regression through the origin,
Brewer(2002) argues that the CRE is at the lower end of expected heteroscedasticity.
More follows regarding this.
6.3 Classical Ratio Estimator (CRE) and other degrees of heteroscedasticity:
The slope for regression through the origin in the case of the CRE, recall, is
b_{γ=0.5} = Σ_{i=1}^{n} y_i / Σ_{i=1}^{n} x_i. This ratio seems quite natural for many
relationships. However, there
are several ways to estimate the heteroscedasticity (degree of increase in variance for
larger x [Knaub (2007b)]). See Sweet and Sigman(1995), p. 493, section 2.6, Carroll and
Ruppert(1988), and Knaub(1993, 1997). This normally shows that for establishment
surveys, results fall between that of Case 2 and Case 3. See Brewer(2002). The
iteratively reweighted least squares method (IRLS, Carroll and Ruppert(1988), et al.)
may not converge to find a coefficient of heteroscedasticity (a term in Brewer(2002)),
and Knaub(1993, 1997) may only show the value that comes closest, but this can be
expected because models are never of exactly the ‘right’ format.
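As a rough illustration of estimating the coefficient of heteroscedasticity (this is a simple log-log regression sketch, not the specific procedures of Carroll and Ruppert(1988) or Knaub(1993, 1997)): fit a preliminary slope, then regress log|residual| on log x. When residual size behaves like a constant times x**gamma, the fitted log-log slope approximates gamma.

```python
from math import log

def estimate_gamma(x, y):
    """Illustrative only: preliminary CRE fit, then OLS of log|residual|
    on log(x); the slope approximates gamma when |residual| ~ const * x**gamma."""
    b = sum(y) / sum(x)                        # preliminary CRE slope
    pts = [(log(xi), log(abs(yi - b * xi)))    # (log x, log |residual|)
           for xi, yi in zip(x, y) if abs(yi - b * xi) > 0.0]
    n = len(pts)
    mu = sum(u for u, _ in pts) / n
    mv = sum(v for _, v in pts) / n
    num = sum((u - mu) * (v - mv) for u, v in pts)
    den = sum((u - mu) ** 2 for u, _ in pts)
    return num / den
```

A production estimate would iterate between the slope fit and the weight fit (as in IRLS), which, as noted above, may not converge.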
7. Improving Regression Weights for Establishment Surveys:
Regression through the origin is often useful for establishment surveys (Brewer(2002)).
Further, Case 2, the CRE, has often appeared appropriate in this author’s experience;
even though the data may demonstrate even higher heteroscedasticity, the CRE seems
robust, especially against data quality problems near the origin not already eliminated by
cutoff sampling. Holmberg and Swensson(2001) present a case for at least
slightly underestimating heteroscedasticity in model-assisted, design-based sampling.
Note the adaptations to the usual regression weight format, Cochran(1953), found in
Sweet and Sigman(1995), and Steel and Fay(1995), but the CRE, or at least the format
w_i = x_i^(−2γ), is difficult to better. (In Holmberg and Swensson(2001), and Sarndal,
Swensson, and Wretman(1992), the format used is w_i = x_i^(−γ). In either case for the
CRE we have w_i = 1/x_i, with a zero intercept.)
In email correspondence with Anders Holmberg (Statistics Sweden), he expressed a
view approaching the one held in this document, that it may be reasonable to use a
smaller value for γ than indicated by an estimation of that value, although Brewer(2002)
warns that for survey sampling, one would normally have 0.5 < γ < 1.0. (Estimating γ
for electric power surveys, using approaches such as those found in Carroll and
Ruppert(1988) or Knaub(1993, 1997), confirms this, with many cases between 0.8 and
1.0.)
Following is a quote from Anders Holmberg in an email correspondence dated October
22, 2007. “Ken’s interval” refers to Brewer(2002) where as indicated above, it is argued
that for establishment surveys, appropriate estimation generally falls between Case 2 and
Case 3:
“… I am keen to agree with you that it is better/safer to choose 0.5 [Case 2] over a
value in the higher end of Ken's interval. I have noticed that the bad effect of
guessing it wrong is higher if you overestimate the gamma compared to
underestimating it. (At least effects on anticipated variances and variance).”
In addition, improvements have been attempted for regression schemes, Sweet and
Sigman(1995), Steel and Fay(1995), and Karmel and Jain(1987), for example. There are
interesting questions to consider. Should the heteroscedasticity expressed in the model
be different for random nonresponse as opposed to predicting for smaller establishments?
(See Knaub(1999a).) It seems apparent that the smallest observations often have
disproportionately large nonsampling error in cutoff sampling. (See Knaub(2002).)
Removal of the worst cases from direct consideration is a major advantage of cutoff
sampling, that can make it a more accurate option, not just easier and less expensive. See
Knaub(2007a and 2008a). In general, establishment survey data demonstrate a large
amount of heteroscedasticity, often closer to Case 3 than to Case 2, except closer to the
origin. In the past, some partitioning of the data into strata has been tried, but this did not
work well in those particular attempts. Recently, experiments have been performed using
weights which approach w_i = x_i^(−1.6) for the larger establishments, while lowering
the extremely high weights near the origin. (Because the relative weights between larger
and smaller establishments are then still larger than with w_i = x_i^(−1.6), in future
experiments we may want to use weights which approach w_i = x_i^(−2) for the larger
establishments.)
Although weights should probably be deflated near the origin due to less reliable data,
they must still approach infinity (so zero variance) right at the origin due to the fact that
we are dealing with data here to which regression through the origin reasonably applies.
That is, these are for cases where one expects y→0 when x→0 (unlike what is depicted
on page 37 of Knaub(2007a)).
Following are some weighting formulas that could go directly into SAS PROC REG, but
remember that it is the relative weight from one point to the next that matters. For
example, for OLS, all weights could be 0.001, or they could all be 1, or they could all be
3,784. The same results would occur, regardless.
Figure 1: Adjusted Regression Weights – weight (vertical axis, 0 to 0.1) vs. measure of
size (horizontal axis, 1 to about 500), for the following weight functions:
w1 = 1/x
w2 = 15/x**1.6
w3 = 15/(x**1.6 + 38,600/x**2)
w4 = 0.75/(x**1.6) + 15/(x**1.6 + 38,600/x**2)
w5 = 15.75/x**1.6
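The point about relative weights can be verified directly; the sketch below uses w2 from Figure 1 and hypothetical data, and checks that rescaling every weight by a constant leaves the fitted slope unchanged:

```python
def wls_slope(x, y, w):
    """Weighted least squares slope through the origin."""
    return (sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
            / sum(wi * xi ** 2 for wi, xi in zip(w, x)))

def w2(x):
    """w2 from Figure 1 (x is the measure of size); the factor 15 is cosmetic."""
    return 15 / x ** 1.6

x = [10.0, 50.0, 120.0, 300.0, 500.0]     # hypothetical sizes
y = [21.0, 98.0, 245.0, 590.0, 1010.0]    # hypothetical values of interest

base = [w2(xi) for xi in x]
scaled = [0.001 * wi for wi in base]      # any constant rescaling

# wls_slope(x, y, base) and wls_slope(x, y, scaled) agree
```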
Note that w2 in Figure 1 is for Case 3. (A double asterisk in the figure means that an
exponent follows, which is a computer programming convention.) All weights for w2 are
15 times the nominal formula, but that makes no difference to estimation by regression
(i.e., prediction). What it does is to put w2 on the graph in a position more comparable
visually to w1.
w3 contains an added term to reduce the weight near the origin where data quality is
weakest. However, a zero weight means an infinite variance, and that is definitely not
true at the origin for regression through the origin.
w4 adds yet another term to reverse the trend near the origin. This yields a format that
might better fit the variability of the data. However, results have appeared mixed.
Consider that a major advantage of the CRE has not been that it appears to perform best
in most cases, but that it does not seem to often perform inadequately. Can we improve
on that? Note that even when a relative standard error for a total, RSE, appears to be
smaller for one alternative over others, it may be less accurate (the variance and/or bias of
the variance could be relatively larger), so more work may be desired than the
comparisons shown next. (See Knaub(2001).)
A relative standard error (estimated) is the estimated standard error of a total (or other
parameter) divided by that estimated total (or other parameter), usually expressed as a
percent. (Also see “relative variance” in Hansen, et.al.(1953).) In the graphs to follow,
programmed by Maren Blackham, Science Applications International Corporation
(SAIC), results were shown for relative standard errors (RSEs), and another survey
performance measure, relative standard error from a superpopulation (RSESP). (See
Cochran(1977), page 158, regarding the term “superpopulation.”) The latter measure,
found in Knaub (2002, 2003, 2004, and 2007d) is the model equivalent of removing the
finite population correction factor (fpc, see Knaub(2008b)) from a simple random
sample. It can be used for overall data variance comparisons, even for a census, which
could indicate data quality problems.
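As defined above, the RSE is simply the estimated standard error of the total divided by the estimated total, expressed as a percent; trivially:

```python
def relative_standard_error(se_of_total, total_hat):
    """Estimated standard error of an estimated total, as a percent
    of that estimated total."""
    return 100.0 * se_of_total / total_hat

# e.g., a standard error of 50 on an estimated total of 1000 is an RSE of 5%
```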
Lower RSE and RSESP values indicate superior performance over other alternatives.
However, the accuracy of these measures might vary substantially. See Knaub(2001). In
the first example, Figures 2 and 3, “IndepPP” means we are considering independent
power producers of electricity; “Waste” is the fuel used; “FL” is for the State of Florida.
Small area estimation is used here, meaning that all such plants using this fuel across
southern States are used in the model (estimation group) to estimate what is needed to
publish results for Florida. See Knaub(1999a). The other examples, Figures 4 through 7
are for two more examples, aggregated to the national level.
The weights used are as follows:

w_i = x_i^(−1),

w_i = x_i^(−1.6) + 20(x_i^(1.6) + c x_i^(−2))^(−1), where c = 0.8 × h^3.6, with h = 5th,
10th, and 25th percentile,

and

w_i = x_i^(−1.6) + 20(x_i^(1.6) + (c/3) x_i^(−2))^(−1), where c = 0.8 × h^3.6, with
h = 10th percentile.
Note that the author chose “c” as a parameter indicating the maximum weight for the
second case, which corresponds to w3 in the graph of regression weights given
previously. At approximately 3 times c, the alternatives to the CRE nearly converge.
Note that the approximation for small area estimation in Knaub(1999) was used and that
we are currently working on improving the value used for “delta,” δ, which is a function
of the distributional form of the ‘size’ of the establishments. (Perhaps the coefficient of
skewness or that and the coefficient of kurtosis may be sufficient parameters.) Adjusting
the deltas so that variance matches a more “exact” variance estimate for each “estimation
group” (Knaub(1999)) seems feasible.
Also note that the size measure used in the regression weights can be used as a single
regressor, if the components and their coefficients are made the same for groups of data
being compared, so that one slope and its standard error can be used to compare
estimation groups to see if collapsing them may be helpful for purposes of small area
estimation.
Example Result 1: waste-fired gross generation for independent power plants in Florida
– Shown in Figures 2 and 3.
Figure 2
Figure 3
Example Result 2: natural gas-fired gross generation
Figure 4
Figure 5
Example Result 3: wind created gross generation
Figure 6
Figure 7
Although the CRE appears to do consistently better in the cases above, there were cases
in the many examples investigated, often for smaller (area) publication groups, for which
this was not always true. A variance and bias study, such as in Knaub(2001), should be
considered in any case since the accuracy of these estimates impacts their comparisons.
At any rate, the CRE seems robust. (See Knaub(2005).)
8. History of Survey Statistics:
In the United States, there was a debate over the use of probability (design-based)
sampling vs. purposive sampling (which would include cutoff sampling) in the 1940s.
Design-based sampling was decided to be superior, and sampling for Official Statistics
became dominated by it. Later, Ken Brewer(1963) and Richard Royall(1970) made a
case for the use of models, which was noted in Cochran(1977), but there was little
enthusiasm, at least at first, at least in the US, for the approach Royall used which did not
include randomization. His thought was that conditionality on a sample was more
important. This was not popular. Royall later modified his views to include “balanced
sampling” (see “Conclusions”) as a way to avoid “model failure” (i.e., not adequately
representing the part of the population not eligible for sampling). Still, many recognize
the usefulness of models in survey statistics, while still favoring a design-based approach.
Note Sarndal, Swensson, and Wretman(1992). Cutoff sampling has survived, however,
perhaps more in Europe and other parts of the world (see Knaub(2007a, 2008a)),
although Sarndal, Swensson, and Wretman(1992) have little good to say about cutoff
sampling. In Brewer(1995, 2002), Ken Brewer argues for a combined approach, and
does not ever suggest abandoning a design-based approach.
One of the strongest opponents of purposive sampling was Morris Hansen. If anyone is
interested in the debate over model-based approaches, then perhaps the most famous
debate over this can be found online on the American Statistical Association website,
from the 1978 North American Joint Statistical Meetings (JSM), where, in a session
chaired by H.O. Hartley, a paper presented by Hansen, Madow, and Tepping put forth
their arguments, which were then commented on by some of the most well-known
statisticians of the day, or even today. Hansen, Madow, and Tepping(1978) was
commented on by V.P. Godambe, Keith Eberhardt, William G. Cochran, Carl Erik
Sarndal, Oscar Kempthorne, and Richard Royall, after which Hansen, Madow, and
Tepping responded. It seemed, in this author’s opinion, that the arguments by Hansen,
et.al., may be more applicable to household surveys than to establishment surveys. Of
course, examples may be found to make either method perform better than the other.
Note that this proceedings paper was followed by a journal article: Hansen, Madow, and
Tepping(1983).
9. Conclusions:
For highly skewed establishment survey data where the smallest establishments have
difficulty supplying good quality data on a frequent basis (say, monthly), but may supply
good regressor data (e.g., from annual census data), cutoff sampling and regression
estimation may be a viable alternative to design-based methods.
For electric power data, such cutoff sampling has appeared superior in that results
compared well to those from a larger, design-based sample.
Comparing the totals of 12 monthly sample estimates of totals to corresponding annual
census totals taken later has shown that cutoff sampling with regression estimation
appears to work quite well. Also see Knaub(2001).
Because some data just above cutoff thresholds are still of rather low quality, estimated
RSEs are sometimes lower (better) when those data are deleted and imputed, as if the
low quality collected data had been nonresponses. This may be why the CRE
works well even when heteroscedasticity appears more extreme: it may represent a
compromise. Stratifying the data so that lower coefficients of heteroscedasticity are
applied closer to the origin has proved problematic, but may still be worth further effort.
(Note also that Holmberg and Swensson(2001) found less inaccuracy resulting
from underestimating the coefficient of heteroscedasticity than from overestimating it.)
Another way to think of this is that, because low quality data may remain among the
smallest observations collected in a cutoff sample, the level of
heteroscedasticity apparent in the data overall is probably greater than the level that
should be used in the model for estimation. (See Holmberg and Swensson(2001) and
Knaub(1993, 1997).) The area near the origin would have inflated variance, so the
coefficient of heteroscedasticity would be overestimated. This tends
to cause substantial overestimation of variance for totals: confidence bands are made
too wide near the origin and grow wider still beyond it.
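To make the role of the coefficient of heteroscedasticity concrete, here is an illustrative sketch (not the paper's exact formulation; the data are invented) of a through-the-origin weighted least squares fit with regression weight w = x**(-2*gamma). Setting gamma = 0.5 reproduces the classical ratio estimator, gamma = 0 is ordinary least squares, and larger gamma puts relatively more weight on small establishments:

```python
# Weighted least squares slope for the model y = b*x, with regression
# weight w_i = x_i**(-2*gamma); gamma is the coefficient of
# heteroscedasticity. gamma = 0.5 gives the classical ratio estimator.

def wls_slope(xs, ys, gamma):
    """Slope b = sum(w*x*y) / sum(w*x**2) with w = x**(-2*gamma)."""
    w = [x ** (-2.0 * gamma) for x in xs]
    num = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    den = sum(wi * x * x for wi, x in zip(w, xs))
    return num / den

xs = [210.0, 320.0, 500.0]
ys = [225.0, 330.0, 530.0]  # toy reported values

for gamma in (0.0, 0.5, 1.0):  # OLS, CRE, and a larger assumed gamma
    print(gamma, wls_slope(xs, ys, gamma))
```

With gamma = 0.5 the slope reduces to sum(y)/sum(x), the CRE; varying gamma here shows how an over- or under-stated coefficient of heteroscedasticity shifts the fit, which is the quantity Holmberg and Swensson(2001) studied.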
Consider also that cutoff sampling, with regression used to predict for the out-of-sample
establishments, may be more accurate than design-based sampling, because the latter
means collecting some data that may be very inaccurate. We need to consider all sources
of survey error (total survey error), including reasonable limits on how much a
cutoff-sampling estimate might be in error if the part of the population given no chance
of selection were not modeled well. (For more on model failure, see Hansen, Madow,
and Tepping(1978).)
Model-assisted design-based sampling (e.g., Sarndal, Swensson, and Wretman(1992)) or
some other combination of approaches (Brewer(1995, 2002)) may be best; or, for a
model, balanced sampling (from Royall and Herson(1973a,b), defined well on page 593
of Brewer(1995)) may be useful. (See Chaudhuri and Stenger(1992), pages 74-89, and
Cumberland and Royall(1982).) However, establishment survey data are highly skewed,
and, again, when the smallest establishments cannot provide data of reasonable quality on
a frequent basis, cutoff sampling supplemented with estimation through regression
may provide more accurate results.
Note some problems that smaller establishments might encounter: (1) they may not
collect data as frequently as they are asked to report it, and (2) they may lack
personnel specialized in providing the information. This could often be the case with
Official Statistics.
For establishment surveys, generally a number of data elements are of interest from a
given set of establishments. Thus a ‘cutoff sample’ may actually be a compromise
between the best cutoffs for a number of attributes. Further complications arise when
small area estimation is necessary. Also, bias and variance need to be studied
(Knaub(2001)).
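One simple way to form the compromise described above (a sketch only; the establishment names, attribute names, and cutoff values are invented) is to include an establishment whenever it exceeds the cutoff for any attribute of interest:

```python
# Hypothetical compromise cutoff sample over several attributes: an
# establishment enters the sample if it exceeds the cutoff for ANY
# attribute; all names and values below are illustrative.

establishments = {
    "A": {"generation": 900.0, "fuel_receipts": 20.0},
    "B": {"generation": 50.0, "fuel_receipts": 800.0},
    "C": {"generation": 700.0, "fuel_receipts": 650.0},
    "D": {"generation": 30.0, "fuel_receipts": 25.0},
}
cutoffs = {"generation": 100.0, "fuel_receipts": 100.0}

sample = sorted(
    name for name, attrs in establishments.items()
    if any(attrs[k] >= c for k, c in cutoffs.items())
)
print(sample)  # A, B, and C each pass some cutoff; D passes none
```

A union rule like this trades sample size for coverage: an establishment that is large for even one attribute is retained, so no single attribute's cutoff is optimal, which is the compromise the text describes.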
All error sources considered, cutoff sampling will often be competitive with, and
sometimes quite superior to, design-based alternatives from the perspective of accuracy
of results. It may also be less costly, easier to implement, and more flexible in the face of
small area needs. It may be possible to do better than the CRE with cutoff sampling, but
the CRE is quite often a good choice for establishment surveys: it is very simple to
implement, and its results are relatively easy to check, which helps avoid data processing
errors.
Acknowledgments:
Thank you to Ken Brewer for help over a number of years, to Nancy Kirkendall for her
1990 presentation on models in survey estimation, and to Anders Holmberg, Joe
Sedransk, and others for discussions. Thanks to Maren Blackham for most of the graphs,
and thanks to those who provide online graphical calculators, one of which was used for
experimenting with regression weight formats.
References:
Abdi, H. (2003). Least-squares. In M. Lewis-Beck, A. Bryman, T. Futing (Eds):
Encyclopedia for research methods for the social sciences. Thousand Oaks (CA): Sage.
pp. 559-561.
Brewer, K.R.W. (1963), "Ratio Estimation in Finite Populations: Some Results
Deducible from the Assumption of an Underlying Stochastic Process," Australian
Journal of Statistics, 5, pp. 93-105.
Brewer, K.R.W. (1995), “Combining Design-Based and Model-Based Inference,”
Business Survey Methods, ed. by B.G. Cox, D.A. Binder, B.N. Chinnappa, A.
Christianson, M.J. Colledge, and P.S. Kott, John Wiley & Sons, pp. 589-606.
Brewer, K.R.W. (2002), Combined Survey Sampling Inference: Weighing Basu's
Elephants, Arnold: London, and Oxford University Press.
Carroll, R.J., and Ruppert, D. (1988), Transformation and Weighting in Regression,
Chapman &Hall.
Chaudhuri, A. and Stenger, H. (1992), Survey Sampling: Theory and Methods, Marcel
Dekker, Inc.
Cochran, W.G.(1953), Sampling Techniques, pp. 210-212, 1st ed., John Wiley & Sons.
Cochran, W.G.(1977), Sampling Techniques, 3rd ed., John Wiley & Sons.
Cumberland, W.G., and Royall, R.M. (1982), “Does SRS Provide Adequate Balance?,”
Proceedings of the Survey Research Methods Section, ASA, pp. 226-229,
http://www.amstat.org/sections/srms/proceedings/
Hansen, M.H., Hurwitz, W.N., and Madow, W.G. (1953). Sample Survey Methods and
Theory, Volumes I and II. Wiley.
Hansen, M.H., Madow, W.G., and Tepping, B.J. (1978), “On Inference and Estimation
from Sample Surveys,” with comments from discussants, Proceedings of the Survey
Research Methods Section, ASA, pp. 82-107.
http://www.amstat.org/sections/srms/proceedings/
Hansen, M.H., Madow, W.G., and Tepping, B.J. (1983), “An Evaluation of Model-Dependent
and Probability-Sampling Inferences in Sample Surveys: Rejoinder,”
Journal of the American Statistical Association, Vol. 78, No. 384 (Dec., 1983), pp. 805-807.
Holmberg, A., and Swensson, B. (2001), “On Pareto πps Sampling: Reflections on
Unequal Probability Sampling Strategies,” Theory of Stochastic Processes, Vol. 7, pp.
142-155.
Jessen, R. J., et al. (1947). "On a Population Sample for Greece," Journal of the
American Statistical Association, pp. 357-384.
Karmel, T.S., and Jain, M. (1987), “Comparison of Purposive and Random Sampling
Schemes for Estimating Capital Expenditure,” Journal of the American Statistical
Association, American Statistical Association, 82, pp. 52-57.
Kirkendall, et al. (1990), “Sampling and Estimation: Making Best Use of Available
Data,” seminar at the EIA, September 1990.
Knaub, J.R., Jr. (1989), "Ratio Estimation and Approximate Optimum Stratification in
Electric Power Surveys," Proceedings of the Section on Survey Research Methods,
American Statistical Association, pp. 848-853.
http://www.amstat.org/sections/srms/proceedings/
Knaub, J.R., Jr. (1990), “Some Theoretical and Applied Investigations of Model and
Unequal Probability Sampling for Electric Power Generation and Cost,” Proceedings of
the Section on Survey Research Methods, American Statistical Association, pp. 748-753.
http://www.amstat.org/sections/srms/proceedings/
Knaub, J.R., Jr. (1993), "Alternative to the Iterated Reweighted Least Squares Method:
Apparent Heteroscedasticity and Linear Regression Model Sampling," Proceedings of the
International Conference on Establishment Surveys, American Statistical Association, pp.
520-525.
Knaub, J.R., Jr. (1996), “Weighted Multiple Regression Estimation for Survey Model
Sampling,” InterStat, May 1996, http://interstat.statjournals.net/. (Note that there is a
shorter version in the ASA Survey Research Methods Section proceedings, 1996.)
Knaub, J.R., Jr. (1997), "Weighting in Regression for Use in Survey Methodology,"
InterStat, April 1997, http://interstat.statjournals.net/. (Note shorter, but improved version
in the 1997 Proceedings of the Section on Survey Research Methods, American
Statistical Association, pp. 153-157.)
Knaub, J.R., Jr. (1999a), “Using Prediction-Oriented Software for Survey Estimation,”
InterStat, August 1999, http://interstat.statjournals.net/, partially covered in "Using
Prediction-Oriented Software for Model-Based and Small Area Estimation," in ASA
Survey Research Methods Section proceedings, 1999, and partially covered in "Using
Prediction-Oriented Software for Estimation in the Presence of Nonresponse,” presented
at the International Conference on Survey Nonresponse, 1999.
Knaub, J.R. Jr. (circa 1999b), “Model-Based Sampling, Inference and Imputation,” EIA
web site: http://www.eia.doe.gov/cneaf/electricity/forms/eiawebme.pdf
Knaub, J.R., Jr. (2000), “Using Prediction-Oriented Software for Survey Estimation -
Part II: Ratios of Totals,” InterStat, June 2000, http://interstat.statjournals.net/. (Note
shorter, more recent version in ASA Survey Research Methods Section proceedings,
2000.)
Knaub, J.R., Jr. (2001), “Using Prediction-Oriented Software for Survey Estimation -
Part III: Full-Scale Study of Variance and Bias,” InterStat, June 2001,
http://interstat.statjournals.net/. (Note another version in ASA Survey Research Methods
Section proceedings, 2001.)
Knaub, J.R., Jr. (2002), “Practical Methods for Electric Power Survey Data,” InterStat,
July 2002, http://interstat.statjournals.net/. (Note another version in ASA Survey
Research Methods Section proceedings, 2002.)
Knaub, J.R., Jr. (2003), “Applied Multiple Regression for Surveys with Regressors of
Changing Relevance: Fuel Switching by Electric Power Producers,” InterStat, May 2003,
http://interstat.statjournals.net/. (Note another version in ASA Survey Research Methods
Section proceedings, 2003.)
Knaub, J.R., Jr. (2004), “Modeling Superpopulation Variance: Its Relationship to Total
Survey Error,” InterStat, August 2004, http://interstat.statjournals.net/. (Note another
version in ASA Survey Research Methods Section proceedings, 2004.)
Knaub, J.R., Jr. (2005), “Classical Ratio Estimator,” InterStat, October 2005,
http://interstat.statjournals.net/.
Knaub, J.R., Jr. (2006), Book Review, Journal of Official Statistics, Vol. 22, No. 2, 2006,
pp. 351–355, http://www.jos.nu/Articles/article.asp
Knaub, J.R., Jr. (2007a), “Cutoff Sampling and Inference,” InterStat, April 2007,
http://interstat.statjournals.net/.
Knaub, J.R., Jr. (2007b), “Heteroscedasticity and Homoscedasticity” in Encyclopedia of
Measurement and Statistics, Editor: Neil J. Salkind, Sage, Vol. 2, pp. 431-432.
Knaub, J.R., Jr. (2007c), “Survey Weights” in Encyclopedia of Measurement and
Statistics, Editor: Neil J. Salkind, Sage, Vol. 3, p. 981.
Knaub, J.R., Jr. (2007d), "Model and Survey Performance Measurement by the RSE and
RSESP," Proceedings of the Section on Survey Research Methods, American Statistical
Association, pp. 2730-2736.
http://www.amstat.org/sections/srms/proceedings/
Knaub, J.R., Jr. (2008a), forthcoming. “Cutoff Sampling.” In Encyclopedia of Survey
Research Methods, Editor: Paul J. Lavrakas, Sage, to appear in September 2008.
Knaub, J.R., Jr. (2008b), forthcoming. “Finite Population Correction.” In Encyclopedia of
Survey Research Methods, Editor: Paul J. Lavrakas, Sage, to appear in September 2008.
Maddala, G.S. (1992), Introduction to Econometrics, 2nd ed., Macmillan Pub. Co.
Royall, R.M. (1970), "On Finite Population Sampling Theory Under Certain Linear
Regression Models," Biometrika, 57, pp. 377-387.
Royall, R.M., and Herson, J.(1973a), “Robust Estimation in Finite Populations I,”
Journal of the American Statistical Association, Vol. 68, pp. 880-889.
Royall, R.M., and Herson, J.(1973b), “Robust Estimation in Finite Populations II:
Stratification on a Size Variable,” Journal of the American Statistical Association, Vol.
68, pp. 890-893.
Särndal, C.-E., and Lundström, S., (2005), Estimation in Surveys with Nonresponse,
Wiley.
Särndal, C.-E., Swensson, B. and Wretman, J. (1992), Model Assisted Survey Sampling,
Springer-Verlag.
Sweet, E.M. and Sigman, R.S. (1995), “Evaluation of Model-Assisted Procedures for
Stratifying Skewed Populations Using Auxiliary Data,” Proceedings of the Section on
Survey Research Methods, Vol. I, American Statistical Association, pp. 491-496.
http://www.amstat.org/sections/srms/proceedings/
Steel, P. and Fay, R.E. (1995), “Variance Estimation for Finite Populations with Imputed
Data,” Proceedings of the Section on Survey Research Methods, Vol. I, American
Statistical Association, pp. 374-379.
http://www.amstat.org/sections/srms/proceedings/
Valliant, R., Dorfman, A.H., and Royall, R.M. (2000), Finite Population Sampling and
Inference, A Predictive Approach, John Wiley & Sons.