Posters

USING SAS TO COMPUTE VARIANCES FOR STRATIFIED SAMPLES

Lawrence Helbers
Statistical Support Solutions

ABSTRACT

Standard SAS procedures for computing sampling variability for numerical variables do not use the appropriate error model in the case of stratified samples. The standard error estimate reported by SAS can be either larger or smaller than the desired estimate, depending on the specific situation. SAS's great flexibility allows analysts to compute the desired statistic, but this requires multiple steps and programming. A SAS macro that simplifies this computation is discussed below.

INTRODUCTION

Stratified samples are in common use in social and market research surveys. Such samples offer increased sampling efficiency for a given sample size. They also make it possible to target the sample selectively at subpopulations of particular interest to a research study. Such samples subdivide the sample frame into mutually exclusive groups called strata. A sample is then selected from each group at a sampling rate specific to that group, according to the sample allocation plan.

Using the additional information about the population size of each stratum that stratification generates, analysts can produce more efficient estimates. Such estimates also have a lower risk of nonresponse bias than estimates from simple random samples. Analysts weight the responses by the inverse of their effective sampling rates, considering both the design sampling rate and the response rate by stratum.

SAS's PROC MEANS and PROC SUMMARY procedures can be used to produce weighted summary statistics for numerical variables. Sample statistics calculated by these procedures, such as the mean, variance, and standard deviation, are at least close approximations to the desired estimators (i.e., sample statistic formulas) when the proper VARDEF option is specified.

The standard error statistic is an exception to this rule. The standard error that SAS reports expresses the sampling error as deviations from the grand mean rather than deviations from individual stratum means. SAS does not incorporate finite population correction factors either. As a result, the estimated standard errors and resulting confidence intervals can be larger than the estimates that the desired estimator would generate. Sampling variability could be overstated, defeating the purpose of the special sample design efforts made to utilize stratification.

The section below discusses the mathematical formulas appropriate for statistics from stratified samples of both finite and infinite populations. The subsequent section develops a SAS macro that implements these formulas and prints the desired standard error estimates. A final section provides an example of how to call the macro and shows output results.

ESTIMATORS FOR STRATIFIED SAMPLES

Standard textbooks on sampling (e.g., Cochran, 1977 or Kish, 1965) define estimators for stratified samples as weighted combinations of stratum-specific summary measures. For example, an unbiased estimate of the population mean $\bar{y}$ is the weighted sum of the stratum means $\bar{y}_h$, as follows:

$$\bar{y} = \sum_h W_h \, \bar{y}_h \tag{1}$$

where $W_h$ is the proportion of the population in stratum $h$, or $N_h/N$. SAS's standard weighting is applied at the case level, not the stratum level, so survey analysts use case weights that produce an identical result. Case weights are defined as the inverse of the stratum sampling rate, $N_h/n_h$, where upper case letters denote population counts and lower case letters denote sample counts. SAS uses the following formula for calculating weighted means from individual observations $i$:

$$\bar{y} = \frac{\sum_h \sum_{i=1}^{n_h} (N_h/n_h)\, y_{hi}}{\sum_h \sum_{i=1}^{n_h} (N_h/n_h)} = \sum_h \frac{N_h}{N} \left( \frac{1}{n_h} \sum_{i=1}^{n_h} y_{hi} \right) \tag{2}$$

Because the rightmost summation expression is equivalent to $\bar{y}_h$, Equations 1 and 2 are equivalent. SAS divides the case-weighted summation of the $y_{hi}$ by the sum of the case weights, which in this case is the population total $N$ (the $n_h$ observations in a stratum, each with a weight of $N_h/n_h$, sum to $N_h$, and summing these across all strata totals $N$). SAS uses a procedure equivalent to the textbook one for means, and it is more general in that the weights can be multiplied by any arbitrary positive constant and still produce the same result.

Variance and standard deviation statistics are measures of the spread of a numeric variable in the population, measured from the population mean. Deviations from the population mean are weighted by their frequency of occurrence, $N_h/n_h$, and averaged by dividing by $N$. SAS uses a variance formula equivalent to the following when you specify the VARDEF=WGT option in PROC MEANS:

$$s^2 = \frac{1}{N} \sum_h \sum_{i=1}^{n_h} \frac{N_h}{n_h} \, (y_{hi} - \bar{y})^2 \tag{3}$$

Like the mean, the variance statistic is invariant to multiplication of the weights by an arbitrary positive constant if this VARDEF option is used. It is not invariant if other options are used. The statistic is biased because it uses $\bar{y}$, the sample estimate of the population mean, to calculate deviations. Using the standard degrees of freedom adjustment with the VARDEF=DF ($n - 1$) or VARDEF=WDF (sum of weights $- 1$) options can be either problematic or inconsequential. Using the VARDEF=DF option with $N_h/n_h$ case weights will be problematic, producing incorrect estimates. Using the VARDEF=WDF option will be inconsequential if the population is large, because dividing by $N$ or $N - 1$ is nearly equivalent. If the sample size is large, the SAS-generated estimate should be adequate (i.e., subject to a small bias) and approximate what is called for by theory. Analysts will need to decide whether their particular situation can be handled in this way.

The standard deviation statistic generated by SAS is simply the square root of the estimated variance. Like the variance, it is invariant to rescaling of the weights if the VARDEF=WGT option is specified. The comments about the limitations due to degrees of freedom adjustments apply here too. The SAS estimate should generally be a good approximation of the desired statistic.

The standard error statistic measures the sample-to-sample variability one can expect for the estimated population mean. SAS uses the estimator appropriate for a simple random sample from an infinite population. That is, it reports the standard deviation from the grand mean divided by the square root of the sample size. This will often not be a good estimator of the sampling variability of a stratified sample. It should be replaced with the formula below, which incorporates information about the stratification and recognizes that some samples are from finite populations:

$$\mathrm{Std\ Err}(\bar{y}) = \sqrt{\sum_h W_h^2 \, \frac{s_h^2}{n_h} \, (1 - r_h)} \tag{4}$$

In this formula, deviations are expressed from stratum means rather than the population mean, so $s_h^2$ is the variance within stratum $h$. The more homogeneous the population within strata, the smaller the standard errors, despite heterogeneity among strata. For finite populations, the formula also reflects that as the sampling rate $r_h$ (i.e., $n_h/N_h$) approaches 100 percent, the sampling error in that stratum should fall to zero. The $(1 - r_h)$ term is called the finite population correction factor (fpc). When populations can be subdivided into homogeneous (low-variance) groups, or when high-variance groups can be sampled at high sampling rates (so that the fpc approaches zero), the standard error for a stratified sample will be smaller.

Unlike the mean, variance, and standard deviation formulas above, there is no simple transformation of the case weights SAS uses that will achieve or approximate this formula for the standard error. An alternative computational approach is therefore needed.

For infinite populations, such as those subject to sampling with replacement, the standard error formula becomes:

$$\mathrm{Std\ Err}(\bar{y}) = \sqrt{\sum_h W_h^2 \, \frac{s_h^2}{n_h}} \tag{5}$$

The fpc reduces to 1, and $W_h$ becomes the expected proportion from stratum $h$. Like the finite population formula (Eq. 4), this formula requires stratum-specific statistics to be generated in the computation.

MACRO FOR COMPUTING STANDARD ERRORS

The macro developed below reports the standard error statistics for finite and infinite populations based on stratified sample data. It also reports 95 percent confidence intervals under a normal distribution assumption. It allows the user to calculate these standard errors for several variables at a time.

The standard error equations above (Eqs. 4 and 5) are functions of the stratum weights, stratum variance, stratum sample size, and population size. Such statistics can readily be computed at the stratum level using the CLASS statement for subpopulations in PROC SUMMARY. The resulting output can then be combined in a DATA step to produce population-level estimates of standard errors and confidence intervals.

PROC SUMMARY calculates stratum variances ($s_h^2$), sample sizes ($n_h$), and weighted counts ($N_h$) and writes them to an output dataset with one record per stratum. It also provides the overall sum of the weights ($N$) and the estimate of the population mean $\bar{y}$ in its first (_TYPE_=0, overall summary) output record. Case weights equal to $N_h/n_h$ must be used on a SAS WEIGHT statement to obtain these results. This makes the calculation of $W_h$ ($N_h/N$) and $r_h$ ($n_h/N_h$) from Eq. 4 possible.

The macro below uses the VAR=, N=, SUMWGT=, and MEAN= output variables from PROC SUMMARY. It uses the sample count (N=) data to make a degrees of freedom adjustment to the SAS-generated variance estimates by multiplying them by $n_h/(n_h - 1)$. The macro code is as follows:

    %* STRATSMP.MAC;
    %* macro to calculate standard error and       ;
    %* confidence limits for stratified samples of ;
    %* finite and infinite populations             ;
    %* user specifies the name of the              ;
    %*   dataset and libname        &dsname        ;
    %*   stratification variable    &stratvar      ;
    %*   case weight variable       &wgtvar        ;
    %*   a list of variable names   &vars          ;

    %global vars;

    %macro stratsmp(dsname,stratvar,wgtvar);
    %local numvar dsname stratvar wgtvar;
    %let n=%words(&vars);

    /* calculate statistics by stratum */
    proc summary data=&dsname vardef=wgt;
      class &stratvar;
      weight &wgtvar;
      var &vars;
      output out=sumstats mean= var=var1-var&n n=n1-n&n
             sumwgt=wgt1-wgt&n;
    run;

    /* Combine together - 1st record is Pop. Summary */
    data compute;
      retain popsum1-popsum&n finsum1-finsum&n
             infsum1-infsum&n gblmean1-gblmean&n;
      set sumstats end=lastrec;
      array vmean  {*} &vars;
      array gmeans {*} gblmean1-gblmean&n;
      array vrnce  {*} var1-var&n;
      array popsum {*} popsum1-popsum&n;
      array finsum {*} finsum1-finsum&n;
      array infsum {*} infsum1-infsum&n;
      array wgts   {*} wgt1-wgt&n;
      array ns     {*} n1-n&n;

      do i=1 to &n;
        if _N_ eq 1 then do;
          /* overall record: save N and the population mean */
          popsum{i}=wgts{i};
          gmeans{i}=vmean{i};
          finsum{i}=0;
          infsum{i}=0;
        end;
        else do;
          /* bias adjustment for df */
          vrnce{i}=vrnce{i}*(ns{i}/(ns{i}-1));
          /* accumulate weighted stratum statistics */
          finsum{i}=finsum{i}+(wgts{i}/popsum{i})**2
                    *(vrnce{i}/ns{i})*(1-ns{i}/wgts{i});
          infsum{i}=infsum{i}+(wgts{i}/popsum{i})**2
                    *(vrnce{i}/ns{i});
        end;
      end;

      if lastrec then do;
        /* Print the results */
        file print header=toppage linesleft=ll notitle;
        length vnam $ 8 vlbl $ 40;
        do i=1 to &n;
          stderrf=sqrt(finsum{i});
          confintf=1.96*stderrf;
          stderri=sqrt(infsum{i});
          confinti=1.96*stderri;
          /* Get label and var name info */
          call label(vmean{i},vlbl);
          call vname(vmean{i},vnam);
          if vlbl eq vnam then vlbl=' ';
          put @2 vnam $10. vlbl $25. gmeans{i} 12.3
              @51 (stderrf stderri confintf confinti) (12.3);
        end;
      end;
      return;

      toppage:
        put @30 ' Standard Error Calculations';
        put @30 ' for finite and infinite populations' //;
        put ' Variable'
            @44 '              Standard Error        95% Confidence';
        put ' Name      Description '
            @44 '    Mean    Finite  Infinite     Finite  Infinite';
        put 100*'_' /;
      return;
    run;
    %mend stratsmp;

    %* Words Macro - SAS v6 Macro Guide, p. 256;
    %* find the number of variables in string;
    %macro words(string);
    %local count word;
    %let count=1;
    %let word=%scan(&string,&count,%str( ));
    %do %while(&word ne);
      %let count=%eval(&count+1);
      %let word=%scan(&string,&count,%str( ));
    %end;
    %eval(&count-1)
    %mend words;

The WORDS macro from SAS's Guide to Macro Processing is used simply to provide a count of the variables specified by the user. This generalizes the array variable lists and the ranges for looping.

USAGE AND EXAMPLE OUTPUT

Users must specify the dataset to use, the name of the stratification variable, the name of the case weight variable, and the list of variables to analyze. The example code listing below shows how the macro can be included in code and called.

    options ps=66 ls=132 nosource2;
    libname data './sfiles';
    %include 'stratsmp.mac';

    %let vars=var131 var153 var155;
    %stratsmp(data.survey,stratum,wgt1);

The macro variable VARS in the %LET statement specifies the variable list. The call to the %STRATSMP macro specifies the data set, stratum variable, and weight variable, respectively. Note that the data set macro variable can contain both libname and member name references (e.g., data.survey).

Figure 1 below shows an example of the output generated by this macro. Each row reports statistics for a single variable. Variable names and user-specified labels are used for identification. The weighted estimate of the population mean is reported for comparison with the estimated sampling errors.

                             Standard Error Calculations
                             for finite and infinite populations

    Variable                                        Standard Error      95% Confidence
    Name      Description                    Mean   Finite  Infinite    Finite  Infinite
    ____________________________________________________________________________________
    NUMADLT   Number of Adults              2.201    0.045     0.045     0.087     0.087
    AGE       Age of Respondent            54.534    0.705     0.705     1.383     1.383
    Q141      Length of Membership: Rptd   17.861    0.520     0.520     1.019     1.019
    mEARS     Years of membership: Actual  12.220    0.442     0.442     0.867     0.867
    X         Variable X                   20.828    6.706     6.740    13.143    13.210
    Y                                      18.675    1.967     1.987     3.855     3.894

    Figure 1: Example Output

Variables in the first four rows were from a population that numbered in the millions. In this case the sampling rate for each stratum was quite small, so the reported finite and infinite population statistics are identical at this level of precision. The last two rows were extracted from a report with a higher sampling rate, and the finite population statistics are somewhat lower than those in the infinite column.

For these two samples the statistics calculated by the macro were 30 percent larger than the standard errors generated by default SAS routines. That resulted because the degrees of freedom adjustment mentioned above can decrease the denominator for the variance calculation and increase the error estimate. For heterogeneous populations or samples with high sampling rates, statistics reported by the macro have been as much as 50 percent lower due to stratification and the finite population correction factor.

SUMMARY

This poster developed a macro which simplifies the calculation of standard errors for stratified samples. The macro calculates standard errors and confidence intervals for finite and infinite populations. Users include the macro in their code and can call the calculation for several variables at a time with just two statements.

For stratified samples, standard errors should be calculated using special estimation procedures. For other statistics such as the mean, variance, or standard deviation, available SAS procedures should be satisfactory when the proper weighting and procedure options are used.

TRADEMARKS

SAS is a registered trademark of the SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

REFERENCES

Cochran, William G. Sampling Techniques, 3rd Edition. New York: John Wiley & Sons, 1977.

Kish, L. Survey Sampling. New York: John Wiley & Sons, 1965.

SAS Institute Inc. SAS Guide to Macro Processing, Reference Version 6, Second Edition. Cary, NC: SAS Institute Inc., 1991, pp. 256-257.
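
As a supplementary worked illustration of the finite and infinite population standard error formulas (Eqs. 4 and 5), consider a hypothetical two-stratum design. The stratum sizes, sample sizes, and variances below are assumptions invented for this sketch; they are not taken from the survey data shown in Figure 1. The symbols follow the paper's notation: population counts N_h, sample counts n_h, stratum variances s_h^2 (already adjusted by n_h/(n_h - 1)), stratum proportions W_h = N_h/N, and sampling rates r_h = n_h/N_h.

```latex
% Hypothetical inputs (assumed for illustration only):
%   stratum 1: N_1 = 1000, n_1 = 100, s_1^2 = 4
%   stratum 2: N_2 =  200, n_2 = 100, s_2^2 = 25
\begin{align*}
N &= 1000 + 200 = 1200, \qquad
W_1 = \tfrac{1000}{1200} = \tfrac{5}{6}, \qquad
W_2 = \tfrac{200}{1200} = \tfrac{1}{6}, \qquad
r_1 = 0.1, \qquad r_2 = 0.5 \\[4pt]
% Per-stratum terms W_h^2 s_h^2 / n_h used by both formulas
W_1^2 \frac{s_1^2}{n_1} &= \frac{25}{36}\cdot\frac{4}{100} \approx 0.027778, \qquad
W_2^2 \frac{s_2^2}{n_2} = \frac{1}{36}\cdot\frac{25}{100} \approx 0.006944 \\[4pt]
% Eq. 5: infinite population, no finite population correction
\mathrm{Std\ Err}_{\mathrm{inf}}(\bar{y})
  &= \sqrt{0.027778 + 0.006944} = \sqrt{0.034722} \approx 0.1863 \\[4pt]
% Eq. 4: finite population, each term multiplied by (1 - r_h)
\mathrm{Std\ Err}_{\mathrm{fin}}(\bar{y})
  &= \sqrt{0.027778\,(0.9) + 0.006944\,(0.5)} = \sqrt{0.028472} \approx 0.1687 \\[4pt]
% 95 percent confidence half-widths under the normal assumption
1.96 \times 0.1863 &\approx 0.365, \qquad 1.96 \times 0.1687 \approx 0.331
\end{align*}
```

In this sketch the finite population standard error is about 9 percent smaller than the infinite population one. A little over half of that reduction comes from stratum 2, whose 50 percent sampling rate halves its contribution to the variance, which is the pattern the fpc discussion describes for strata sampled at high rates.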