Posters
USING SAS TO COMPUTE VARIANCES FOR STRATIFIED SAMPLES
Lawrence Helbers
Statistical Support Solutions
ABSTRACT
Standard SAS procedures for computing sampling variability for numerical variables do not use the appropriate error model in the case of stratified samples. The standard error estimate reported by SAS can be either larger or smaller than the desired estimate, depending on the specific situation. SAS's flexibility allows analysts to compute the desired statistic, but this requires multiple steps and programming. A SAS macro that simplifies this computation is discussed below.

INTRODUCTION
Stratified samples are in common use in social and market research surveys. Such samples offer increased sampling efficiency for a given sample size. They also make it possible to target the sample selectively at subpopulations of particular interest to a research study. Such samples subdivide the sample frame into mutually exclusive groups called strata. A sample is then selected from each group at a sampling rate specific to that group, according to the sample allocation plan.

Using the additional information about the population size of each stratum that is generated by stratification, analysts can produce more efficient estimates. Such estimates also have a lower risk of nonresponse bias compared to simple random samples. Analysts weight the responses by the inverse of their effective sampling rates, considering both the design sampling rate and the response rate by stratum.

SAS's PROC MEANS and PROC SUMMARY procedures can be used to produce weighted summary statistics for numerical variables. Sample statistics calculated by these procedures, such as the mean, variance, and standard deviation, are at least close approximations to the desired estimators (i.e., sample statistic formulas) when the proper VARDEF option is specified. The standard error statistic is an exception to this rule. The standard error statistic SAS uses expresses the sampling error as deviations from the grand mean rather than deviations from individual stratum means. SAS does not incorporate finite population correction factors either. As a result, the estimated standard errors and resulting confidence intervals can be larger than the estimates that would be generated by the desired estimator. Sampling variability could be overstated, defeating the purpose of the special sample design efforts made to utilize stratification.

The section below discusses the mathematical formulas appropriate for statistics from stratified samples of both finite and infinite populations. A SAS macro is then developed in the subsequent section which implements these formulas and prints the desired standard error estimates. Following that, a section provides some examples of how to call the macro and shows output results.

ESTIMATORS FOR STRATIFIED SAMPLES
Standard textbooks on sampling (e.g., Cochran, 1977 or Kish, 1965) define estimators for stratified samples as weighted combinations of stratum-specific summary measures. For example, an unbiased estimate of the population mean $\bar{y}$ is the weighted sum of the stratum means $\bar{y}_h$, as follows:

$$\bar{y} = \sum_h W_h \bar{y}_h \qquad (1)$$

where $W_h$ is the proportion of the population in stratum h, or $N_h/N$. SAS's standard weighting is applied at the case level, not the stratum level, so survey analysts use case weights that produce an identical result. Case weights are defined as the inverse of the stratum sampling rate, $N_h/n_h$, where upper case letters denote the population and lower case letters denote sample counts. SAS uses the following formula for calculating weighted means from individual observations i:

$$\bar{y} = \frac{\sum_h \sum_i \frac{N_h}{n_h} y_{hi}}{\sum_h \sum_i \frac{N_h}{n_h}} = \sum_h \frac{N_h}{N} \left( \frac{1}{n_h} \sum_i y_{hi} \right) \qquad (2)$$

Because the rightmost summation expression is equivalent to $\bar{y}_h$, we see that Equations 1 and 2 are equivalent. SAS divides the case-weighted summation of the $y_{hi}$ by the sum of the case weights, which in this case is the population total N ($n_h$ observations in the stratum, each with a weight of $N_h/n_h$, sum to $N_h$, and summing these across all strata gives N). SAS uses a procedure equivalent to the textbook one for means and is more general, in that the weights can be multiplied by any arbitrary positive constant and still produce the same result.

Variance and standard deviation statistics are measures of the spread of a numeric variable in the population, measured from the population mean. Deviations from the population mean are weighted by their frequency of occurrence $N_h/n_h$ and averaged by dividing by N. SAS uses a variance formula equivalent to the following when you specify the VARDEF=WGT option in PROC MEANS:

$$s^2 = \frac{\sum_h \sum_i \frac{N_h}{n_h} \left( y_{hi} - \bar{y} \right)^2}{N} \qquad (3)$$

Like the mean, the variance statistic is invariant to multiplication of the weights by an arbitrary positive constant if this VARDEF option is used. It is not invariant if other options are used.

This statistic is biased because it uses $\bar{y}$, the sample estimate of the population mean, to calculate deviations. Using the standard degrees of freedom adjustment with the VARDEF=DF (n-1) or VARDEF=WDF (sum of weights - 1) options can be either problematic or inconsequential. Using the VARDEF=DF option with $N_h/n_h$ case weights will be problematic, producing incorrect estimates. Using the VARDEF=WDF option will be inconsequential if the population is large, because dividing by N or N-1 is nearly equivalent. If the sample size is large, the SAS-generated estimate should be adequate (i.e., subject to a small bias) and approximate what is called for by theory. Analysts will need to decide whether their particular situation can be handled in this way.

The standard deviation statistic generated by SAS is simply the square root of the estimated variance. Like the variance, it is invariant to rescaling of the weights if the VARDEF=WGT option is specified. The comments about the limitations due to degrees of freedom adjustments apply here too. The SAS estimate should generally be a good approximation of the desired statistic.

The standard error statistic measures the sample-to-sample variability one can expect for the estimated population mean. SAS uses the estimator appropriate for a simple random sample from infinite populations. That is, it reports the standard deviation from the grand mean divided by the square root of the sample size. This will often not be a good estimator for the sampling variability of a stratified sample. It should be replaced with the formula below, which incorporates information about the stratification and recognizes that some samples are from finite populations.

$$\mathrm{Std\ Err}(\bar{y}) = \sqrt{\sum_h W_h^2 \, \frac{s_h^2}{n_h} \, (1 - r_h)} \qquad (4)$$

In this formula deviations are expressed from stratum rather than population means ($s_h^2$ is the variance within stratum h). The more homogeneous the population within strata, the smaller the standard errors, despite heterogeneity among strata. For finite populations, the formula also reflects that as the sampling rate $r_h$ (i.e., $n_h/N_h$) approaches 100 percent, the sampling error should fall to zero in that stratum. The $(1 - r_h)$ term is called the finite population correction factor (fpc). When populations can be subdivided into homogeneous (low-variance) groups, or high-variance groups can be sampled at high sampling rates (so that the fpc approaches zero), the standard error for a stratified sample will be smaller.
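To make Eq. 4 concrete, here is a small worked example. The two strata and all of their numbers are invented for illustration and do not come from the survey data discussed in this paper. Assume stratum 1 has $N_1 = 800$, $n_1 = 40$, $s_1^2 = 25$ and stratum 2 has $N_2 = 200$, $n_2 = 50$, $s_2^2 = 100$, so that $N = 1000$.

```latex
\begin{aligned}
W_1 &= 800/1000 = 0.8, & r_1 &= 40/800 = 0.05,\\
W_2 &= 200/1000 = 0.2, & r_2 &= 50/200 = 0.25,\\[4pt]
\sum_h W_h^2 \frac{s_h^2}{n_h}(1 - r_h)
  &= (0.64)(0.625)(0.95) + (0.04)(2)(0.75) \\
  &= 0.38 + 0.06 = 0.44,\\[4pt]
\mathrm{Std\ Err}(\bar{y}) &= \sqrt{0.44} \approx 0.663 .
\end{aligned}
```

Setting both fpc terms to 1 (the infinite-population case) gives $\sqrt{0.40 + 0.08} = \sqrt{0.48} \approx 0.693$, so in this sketch the finite population correction lowers the standard error by roughly 4 percent, almost all of it from the heavily sampled second stratum.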
Unlike the mean, variance, and standard deviation formulas above, there is no simple transformation of the case weights SAS uses that will achieve or approximate this formula for the standard error. So, an alternative computational approach is needed.

For infinite populations, such as those subject to sampling with replacement, the standard error formula becomes:

$$\mathrm{Std\ Err}(\bar{y}) = \sqrt{\sum_h W_h^2 \, \frac{s_h^2}{n_h}} \qquad (5)$$

The fpc reduces to 1 and $W_h$ becomes the proportion expected from stratum h. Like the finite population formula (Eq. 4), this formula requires stratum-specific statistics to be generated in the computation.

MACRO FOR COMPUTING STANDARD ERRORS
The macro developed below reports the standard error statistics for finite and infinite populations based on stratified sample data. It also reports 95 percent confidence intervals under a normal distribution assumption. It allows the user to calculate these standard errors for several variables at a time.

The standard error equations above (Eqs. 4 and 5) are functions of the stratum weights, stratum variance, stratum sample size and population size. Such statistics can readily be computed at the stratum level using the CLASS statement for subpopulations in PROC SUMMARY. The resulting output can then be combined in a DATA step to produce population-level estimates of standard errors and confidence intervals.

PROC SUMMARY calculates stratum variances ($s_h^2$), sample sizes ($n_h$) and weighted counts ($N_h$) and writes them to an output dataset with one record per stratum. It also provides the overall sum of the weights (N) and estimates of the population mean $\bar{y}$ in its first (_TYPE_=0, overall summary) output record. Case weights equal to $N_h/n_h$ must be used on a SAS WEIGHT statement to obtain these results.

This makes the calculation of $W_h$ ($N_h/N$) and $r_h$ ($n_h/N_h$) from Eq. 4 possible. The macro below uses the VAR=, N=, SUMWGT=, and MEAN= output variables from PROC SUMMARY. It uses the sample count (N=) data to make a degrees of freedom adjustment to the SAS-generated variance estimates by multiplying by $n_h/(n_h - 1)$. The macro code is as follows:

    %* STRATSMP.MAC;
    %* macro to calculate standard error and;
    %* confidence limits for stratified samples of;
    %* finite and infinite populations;
    %*;
    %* user specifies the name of the;
    %*   dataset and libname        &dsname;
    %*   stratification variable    &stratvar;
    %*   case weight variable       &wgtvar;
    %*   a list of variable names   &vars;
    %global vars;
    %macro stratsmp(dsname,stratvar,wgtvar);
      %local numvar dsname stratvar wgtvar;
      %let n=%words(&vars);

      /* calculate statistics by stratum */
      proc summary data=&dsname vardef=wgt;
        class &stratvar;
        weight &wgtvar;
        var &vars;
        output out=sumstats mean=
               var=var1-var&n
               n=n1-n&n
               sumwgt=wgt1-wgt&n;
      run;

      /* Combine together: 1st record is Pop. Summary */
      data compute;
        retain popsum1-popsum&n
               finsum1-finsum&n
               infsum1-infsum&n
               gblmean1-gblmean&n;
        array vmean  {*} &vars;
        array gmeans {*} gblmean1-gblmean&n;
        array vrnce  {*} var1-var&n;
        set sumstats end=lastrec;
        array popsum {*} popsum1-popsum&n;
        array finsum {*} finsum1-finsum&n;
        array infsum {*} infsum1-infsum&n;
        array wgts   {*} wgt1-wgt&n;
        array ns     {*} n1-n&n;

        do i=1 to &n;
          if _N_ eq 1 then do;
            /* first record (_TYPE_=0) holds the population totals */
            popsum{i}=wgts{i};
            gmeans{i}=vmean{i};
            finsum{i}=0;
            infsum{i}=0;
          end;
          else do;
            /* bias adjustment for df */
            vrnce{i}=vrnce{i}*(ns{i}/(ns{i}-1));
            /* accumulate weighted stratum statistics */
            finsum{i}=finsum{i}+(wgts{i}/popsum{i})**2
                      *(vrnce{i}/ns{i})*(1-ns{i}/wgts{i});
            infsum{i}=infsum{i}+(wgts{i}/popsum{i})**2
                      *(vrnce{i}/ns{i});
          end;
        end;

        if lastrec then do; /* Print the results */
          file print header=toppage linesleft=ll notitle;
          length vnam $ 8 vlbl $ 40;
          do i=1 to &n;
            stderrf=sqrt(finsum{i});
            confintf=1.96*stderrf;
            stderri=sqrt(infsum{i});
            confinti=1.96*stderri;
            /* Get label and var name info */
            call label(vmean{i},vlbl);
            call vname(vmean{i},vnam);
            if vlbl eq vnam then vlbl=' ';
            put @2 vnam $10. vlbl $25. gmeans{i} 12.3
                @51 (stderrf stderri confintf confinti) (12.3);
          end;
          return;
        toppage:
          put @30 'Standard Error Calculations';
          put @30 'for finite and infinite populations' //;
          put ' Variable'
              @44 '          Standard Error          95% Confidence';
          put ' Name      Description'
              @44 'Mean     Finite    Infinite      Finite    Infinite';
          put 100*'_' /;
          return;
        end;
      run;
    %mend stratsmp;

    %* Words macro: SAS v6 Macro Guide, p. 256;
    %* find the number of variables in string;
    %macro words(string);
      %local count word;
      %let count=1;
      %let word=%scan(&string,&count,%str( ));
      %do %while(&word ne);
        %let count=%eval(&count+1);
        %let word=%scan(&string,&count,%str( ));
      %end;
      %eval(&count-1)
    %mend words;

The WORDS macro from the SAS Guide to Macro Processing is used simply to provide a count of the variables specified by the user. This generalizes the array variable lists and the ranges for looping.

USAGE AND EXAMPLE OUTPUT
Users must specify the dataset to use, the name of the stratification variable, the name of the case weight variable, and the list of variables to analyze. The example code listing below shows how the macro can be included in code and called.

    options ps=66 ls=132 nosource2;
    libname data './sfiles';
    %include 'stratsmp.mac';
    %let vars=var131 var153 var155;
    %stratsmp(data.survey,stratum,wgt);

The macro variable VARS in the %LET statement specifies the variable list. The call to the %STRATSMP macro specifies the data set, stratum, and weight variables, respectively. Note that the data set macro variable can contain both libname and member name (e.g., data.survey) references.

Figure 1 below shows an example of the output generated by this macro. Each row reports statistics for a single variable. Variable names and user-specified labels are used for identification. The weighted estimate of the population mean is reported for comparison to the estimated sampling errors.
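The macro expects the case weight $N_h/n_h$ to already exist on the input data set. The following is a minimal sketch of one way to build it, not part of the original macro. The data set names work.survey and work.popcounts and the variable popsize (holding $N_h$) are assumptions made up for this illustration; $n_h$ is taken as the number of respondent records per stratum.

```sas
/* Hypothetical inputs (names are assumptions for this sketch):           */
/*   work.survey    - one row per respondent, with variable STRATUM       */
/*   work.popcounts - one row per stratum, with STRATUM and POPSIZE (N_h) */
proc sql;
  create table work.weighted as
  select s.*,
         p.popsize / c.n_h as wgt       /* case weight N_h / n_h */
  from work.survey as s
       inner join work.popcounts as p
         on s.stratum = p.stratum
       inner join (select stratum, count(*) as n_h
                   from work.survey
                   group by stratum) as c
         on s.stratum = c.stratum;
quit;

/* Call the macro on the weighted data set, as in the usage listing */
%let vars=var131 var153 var155;
%stratsmp(work.weighted,stratum,wgt);
```

Counting respondents rather than sampled cases folds the stratum response rate into the weight, which matches the "effective sampling rate" described in the introduction.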
                    Standard Error Calculations
                for finite and infinite populations

Variable                                            Standard Error       95% Confidence
Name      Description                     Mean     Finite  Infinite     Finite  Infinite
----------------------------------------------------------------------------------------
NUMADLT   Number of Adults               2.201      0.045     0.045      0.087     0.087
AGE       Age of Respondent             54.534      0.705     0.705      1.383     1.383
Q141      Length of Membership: Rptd    17.861      0.520     0.520      1.019     1.019
mEARS     Years of membership: Actual   12.220      0.442     0.442      0.867     0.867
X         Variable X                    20.828      6.706     6.740     13.143    13.210
Y                                       18.675      1.967     1.987      3.855     3.894

Figure 1: Example Output
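As a quick reading aid, the confidence columns in Figure 1 are half-widths: 1.96 times the corresponding standard error, to be added to and subtracted from the mean. A check using the printed values for variable X (the arithmetic is ours, applied to the rounded figures in the table):

```latex
\begin{aligned}
\text{Finite:}   \quad & 1.96 \times 6.706 = 13.144 \quad (\text{printed } 13.143)\\
\text{Infinite:} \quad & 1.96 \times 6.740 = 13.210 \quad (\text{printed } 13.210)\\
\text{95\% interval (finite):} \quad & 20.828 \pm 13.143 = (7.685,\ 33.971)
\end{aligned}
```

The difference in the last digit of the finite case is consistent with the macro multiplying the unrounded standard error before printing.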
Variables in the first four rows were from a population that numbered in the millions. In this case the sampling rate for each stratum was quite small, so the reported finite and infinite population statistics are identical at this level of precision. The last two rows were extracted from a report with a higher sampling rate, and we see that the finite population statistics are somewhat lower than those in the infinite column.

For these two samples the statistics calculated by the macro were 30 percent larger than the standard errors generated by default SAS routines. That resulted because the degrees of freedom adjustment, mentioned above, can decrease the denominator for the variance calculation and increase the error estimate. For heterogeneous populations or samples with high sampling rates, statistics reported by the macro have been as much as 50 percent lower due to stratification and the finite population correction factor.

SUMMARY
For stratified samples, standard errors should be calculated using special estimation procedures. For other statistics such as the mean, variance, or standard deviation, available SAS procedures should be satisfactory using the proper weighting and procedure options.

This poster developed a macro which simplifies the calculation of standard errors for stratified samples. The macro calculates standard errors and confidence intervals for finite and infinite populations. Users include the macro in their code and can call the calculation for several variables at a time with just two statements.

TRADEMARKS
SAS is a registered trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

REFERENCES
Cochran, William G. Sampling Techniques, 3rd Edition. New York: John Wiley & Sons, 1977.
Kish, L. Survey Sampling. New York: John Wiley & Sons, 1965.
SAS Institute Inc. SAS Guide to Macro Processing, Version 6, Second Edition. Cary, NC: SAS Institute Inc., 1991, pp. 256-257.