Download Using SAS to Analyze the Summary Data

NESUG 2006 Data Manipulation and Analysis Using SAS to Analyze the Summary Data Zhenyi Xue, Cardiovascular Research Institute, MedStar Health, Inc., Washington, DC ABSTRACT Most of the time, we have the raw data to conduct the necessary statistical analysis in SAS®. However, when there is only summary data available, some additional SAS coding is necessary in order to perform the hypothesis test. SAS procedures, as well as simplified macros (%SUM_TTEST, %SUM_ANOVA, %P_ANOVA, %SUM_CHI) with examples, will be discussed in this paper for chi-square test, two sample t-test, and analysis of variance with summary data as input. INTRODUCTION Before initiating a new study, there is often extensive literature review to retrieve background information, compare existent findings, and support the significance of the study. Since most of the papers only present the summary statistics (sample sizes, means, standard deviations, percentages, etc.), it is very helpful if we have the right tools to generate other statistics in a quick and convenient way. For example, if means and standard deviations are known for two independent groups, it is always interesting to know whether there is statistical significant difference in the mean between the two groups. Similarly, if the number of events and sample size are known for two groups, the relative risk, odds ratio, or risk difference between those groups, could also be calculated. This paper discusses the concept of statistical analysis using the summary data and introduces a series of simplified macros for such analysis. ANALYSIS OF SUMMARY CONTINUOUS DATA The continuous variable is usually reported as sample size ( n ), mean (Y ) and standard deviation ( s ). If there are two or more groups of data, additional hypothesis tests of the existence of differences among groups are sometimes very helpful in understanding the background of the research field. The t and F statistics as well as their degree of freedom can be calculated from the sufficient summary statistics n, Y and s (Table 1 and 2); therefore always as to construct the hypothesis tests without the original individual level data. Table 1. Test Statistic for Independent Two Samples T-TEST Assumption Equal Variance Unequal Variance t Statistic y1 − y 2 (n − 1) s12 + (n 2 − 1) s 22 t= , where s 2 = 1 n1 + n 2 − 2 1 1 s + n1 n 2 t' = y1 − y 2 w1 + w2 , where w1 = 2 1 2 2 s s and w2 = n1 n2 DF n1 + n2 − 2 ( w1 + w2 ) 2 w12 w22 + n1 − 1 n 2 − 1 PROC TTEST can use the summary statistics as the input data. Required by this procedure, this data set should include a character variable called _TYPE_ or _STAT_ which contains the name of the statistic (‘N’, ‘MEAN’,’STD’), a group variable (either numeric or character) for the CLASS statement, and a numeric variable which has the values for the summary statistics. Once the data set is constructed, the 1 NESUG 2006 Data Manipulation and Analysis remaining code for this procedure is no different than the procedure with individual level data. The following is an example to create such a data set: DATA SUMMARY; INPUT GROUP $ _STAT_ $ CONTINUOUS_VAR; DATALINES; GROUP1 N 40 GROUP1 MEAN 1.8 GROUP1 STD 0.8 GROUP2 N 50 GROUP2 MEAN 1.4 GROUP2 STD 0.7 ; Although other variables in the above dataset may vary for different studies, the character variable _STAT_ (or _TYPE_) with its value remains the same. If you don’t want to construct the data set in the specified format as above, the macro %SUM_TTEST introduced in this paper has the minimum coding requirement. The available information n, Y , s will be entered in a sequence for each independent group in the format of N_MEAN_STD=n1 mean1 std1 n2 mean2 std2. Once invoked, it will generate all the necessary output for a two-sample t test. For example, this macro can be invoked by %SUM_TTEST(N_MEAN_STD=40 1.8 0.8 50 1.4 0.7) Table 2. ANOVA Table for One Way Analysis of Variance Source Sum of Squares DF MS SST k −1 SSE k Treatments SST = ∑ ni ( y i − y.. ) 2 k −1 i =1 ni k Error k SSE = ∑∑ ( y ij − y i ) 2 = ∑ (ni − 1) s i2 i =i j =1 i =1 k ∑n i −k i =1 F MST MSE k ∑n i −k i =1 Corrected Total k ni k SS = ∑∑ ( y ij − y.. ) 2 ∑n i −1 i =1 i =1 j =1 Unlike PROC TTEST, there are no SAS procedures that can read the summary statistics data and perform the analysis of variances. In order to test the global null hypothesis H 0 : µ1 = µ 2 = ... = µ k , it is necessary to write additional codes in a DATA step to perform the calculation (Table 2). The macro %P_ANOVA can perform such a task by reading the external number inputted by the analyst. However, with only the summary data, this macro can only perform the global test and provide the p value. Larson (1992) introduced the technique to use the summary sufficient statistics to generate a surrogate data set which can be used for further analysis in PROC GLM. With the surrogate dataset, it can not only calculate the F statistic and p value for the global test, but also generate the post hoc multiple group comparisons controlled by family type II error. By modifying the original macro published on the SAS support website, %SUM_ANOVA presented here further simplifies the coding. Similar to %SUM_TTEST, the user can invoke this macro simply as 2 NESUG 2006 Data Manipulation and Analysis %SUM_ANOVA(N_MEAN_STD=20 71.7 2.1 13 69.9 2.9 9 67.2 5.9) Similar to the one way ANOVA, for multi-way ANOVA, the surrogate dataset can be created from the sufficient statistics for each combined factor level. Therefore this dataset can be analyzed in PROC GLM. ANALYSIS OF SUMMARY COUNT DATA From the summary count data, the raw data can be easily recovered and analyzed. PROC FREQ and PROC LOGISTIC have already developed their functions to analyze such data in the same way as the original individual level data, except SAS requires creating a working data set based on the summary statistics in advance. The following are examples of those data sets: DATA _working1; INPUT event $ trial $ freq; CARDS; Yes Treatment 20 No Treatment 5 Yes Control 5 No Control 20 ; DATA _working2; INPUT event total trial $; CARDS; 20 25 Treatment 5 25 Control ; Once the data is created in the above format, the remaining code for the data analysis, to create frequency tables, perform chi-square test and conduct logistic regression, are straightforward by using the WEIGHT or FREQ statement. The following are the sample codes for PROC FREQ and PROC LOGISTIC using the summary data created above. PROC FREQ DATA=_working1; TABLES trial * event /CHISQ; WEIGHT freq; RUN; PROC LOGISTIC DATA=_working1 DESCENDING; CLASS trial; MODEL event = trial; FREQ freq; RUN; PROC LOGISTIC DATA=_working2 DESCENDING; CLASS trial; MODEL event/total = trial; RUN; In order to avoid repeated SAS coding for this type of analysis, a macro %SUM_CHI for summarized count data was introduced in this paper. This macro requires the input of the number of events and sample size in a sequence for each group in the format EVENT_TOTAL=event_n1 total_N1 event_n2 total_N2……. If there are two groups (2×2 numbers) inputted in this macro, it will generate a 2×2 cross 3 NESUG 2006 Data Manipulation and Analysis tabulation, the chi-square test, Fisher’s exact test, odds ratio, relative risk, and risk difference with their confidence intervals. If there are more than two groups (2×N numbers), it will generate a 2×N cross tabulation and the associated chi-square test and Fisher’s exact test. If the input is not in the above format, it will print an error message. For example, the macro can be invoked as %SUM_CHI(EVENT_TOTAL=20 25 5 25) CONCLUSION This paper introduced the basic concept of statistical analysis in SAS using summary data. A series of macros presented here provides a convenient way to generate hypothesis testing with minimum SAS codes using summary data. Those macros can also be easily modified to add other options supported by SAS/STAT. However, when we want to retrieve more information by pooling the summary results of several studies, a carefully designed meta-analysis is necessary in order to test the same or similar hypothesis. REFERENCES Larson, Davia A. Analysis of Variance with just Summary Statistics as Input. American Statistician, 1992, 46: 151-152. Neter, Kutner, Nachtsheim, and Wasierman. Applied Linear Statistical Models, 4th edition. WCB/McGraw-Hill Publications, Inc. Hamer, Robert M. Simpson, Pippa M. SAS Tools for Meta-Analysis. SUGI 27 proceedings paper 250. SAS/STAT Users Guide. SAS OnlineDoc 9.1.3 http://support.sas.com/onlinedoc/913/docMainpage.jsp ACKNOWLEDGMENTS The author would like to thank David Hellinga for his support and suggestions in the development of this paper. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies. CONTACT INFORMATION Your comments and questions are valued and encouraged. If you want an electronic copy of the macros, please contact the author at: Zhenyi Xue Cardiovascular Research Institute Washington Hospital Center MedStar Health, Inc. 110 Irving Street NW, Suite 6D Washington, DC 20010 Work Phone: 202-877-2987 Fax: 202-877-9337 Email: [email protected] 4 NESUG 2006 Data Manipulation and Analysis APPENDIX SUM_TTEST(N_MEAN_STD=); DATA SUMMARY; ARRAY TEMPVALUE(6) (&N_MEAN_STD); ARRAY GP(2) $ ('GROUP1' 'GROUP2'); ARRAY ST(3) $ ('N' 'MEAN' 'STD'); DO I=1 TO 2; DO J=1 TO 3; GROUP=GP(I); _STAT_=ST(J); CONTINUOUS_VAR=TEMPVALUE(J+(I-1)*3); OUTPUT; END; END; PROC TTEST DATA=SUMMARY; CLASS GROUP; VAR CONTINUOUS_VAR; RUN; %MEND SUM_TTEST; %MACRO P_ANOVA(N_MEAN_STD=); DATA _SUMMARY; %LET I=0; %DO %WHILE (%SCAN(&N_MEAN_STD, &I+1, %STR( )) NE %STR( )); %LET I=%EVAL(&I+1); %END; ARRAY TEMPVALUE(&I) (&N_MEAN_STD); TOTAL=0; SSE=0; N_TOTAL=0; SST=0; %DO K=1 %TO &I/3; TOTAL=TOTAL + TEMPVALUE(1+(&K-1)*3)*TEMPVALUE(2+(&K-1)*3); SSE=SSE+(TEMPVALUE(1+(&K-1)*3) -1) * (TEMPVALUE(3+(&K-1)*3)**2); N_TOTAL=N_TOTAL + TEMPVALUE(1+(&K-1)*3); %END; GRAND_MEAN=TOTAL / N_TOTAL; %DO K=1 %TO &I/3; SST=SST + TEMPVALUE(1+(&K-1)*3)*((TEMPVALUE(2+(&K-1)*3)GRAND_MEAN)**2); %END; MST=SST/(&I/3-1); MSE=SSE/(N_TOTAL - &I/3); F=MST/MSE; P_VALUE=1-PROBF(F,&I/3-1,N_TOTAL - &I/3); KEEP N_TOTAL GRAND_MEAN MST MSE F P_VALUE; PROC PRINT DATA=_SUMMARY; TITLE1 'P_ANOVA: P VALUE FOR SUMMARY ANOVA'; TITLE2 "YOU HAVE ENTERED: N_MEAN_STD=&N_MEAN_STD"; RUN; TITLE; %MEND P_ANOVA; %MACRO %MACRO SUM_ANOVA(N_MEAN_STD=); TITLE1 'SUM_ANOVA: ANALYSIS OF VARIANCE ON SUMMARY STATISTICS'; TITLE2 "YOU HAVE ENTERED: N_MEAN_STD=&N_MEAN_STD"; DATA _SUMMARY; %LET I=0; %DO %WHILE (%SCAN(&N_MEAN_STD, &I+1, %STR( )) NE %STR( )); %LET I=%EVAL(&I+1); %END; ARRAY TEMPVALUE(&I) (&N_MEAN_STD); %DO K=1 %TO &I/3; GROUP = "GROUP&K"; 5 NESUG 2006 Data Manipulation and Analysis N_I = TEMPVALUE(%EVAL((&K-1)*3+1)); YBAR_I = TEMPVALUE(%EVAL((&K-1)*3+2)); STD_I = TEMPVALUE(%EVAL((&K-1)*3+3)); YIS = YBAR_I + SQRT((STD_I**2)/N_I); YNS = N_I * YBAR_I - (N_I - 1)*YIS; Y = YIS; FREQ = N_I - 1; OUTPUT; Y = YNS; FREQ=1; OUTPUT; %END; PROC GLM DATA=_SUMMARY; CLASS GROUP; FREQ FREQ; MODEL Y = GROUP; LSMEANS GROUP/ADJ=BON STDERR PDIFF; RUN; TITLE; %MEND SUM_ANOVA; SUM_CHI(EVENT_TOTAL=); DATA _SUMMARY; %LET I=0; %DO %WHILE (%SCAN(&EVENT_TOTAL, &I+1, %STR( )) NE %STR( )); %LET I=%EVAL(&I+1); %END; %DO J=1 %TO &I/2; GROUP = "GROUP&J"; EVENT = 'YES'; COUNT_NUMBER=%EVAL(%SCAN(&EVENT_TOTAL, %EVAL((&J-1)*2+1), %STR( ))); OUTPUT; GROUP = "GROUP&J"; EVENT = 'NO'; COUNT_NUMBER= %EVAL(%SCAN(&EVENT_TOTAL, %EVAL((&J-1)*2+2), %STR( ))) %EVAL(%SCAN(&EVENT_TOTAL, (&J-1)*2+1, %STR( ))); OUTPUT; %END; %IF &I>4 AND %SYSFUNC(MOD(&I,2))=0 %THEN %DO; PROC FREQ DATA=_SUMMARY; TITLE1 'SUM_CHI: N*2 WAY CHI-SQUARE TEST ON SUMMARY STATISTICS'; TITLE2 "YOU HAVE ENTERED: EVENT_TOTAL=&EVENT_TOTAL"; WEIGHT COUNT_NUMBER; TABLES GROUP * EVENT /CHISQ FISHER NOCOL NOPERCENT; RUN; %END; %ELSE %IF &I=4 %THEN %DO; PROC FREQ DATA=_SUMMARY; TITLE1 'SUM_CHI: 2*2 WAY CHI-SQUARE TEST ON SUMMARY STATISTICS'; TITLE2 "YOU HAVE ENTERED: EVENT_TOTAL=&EVENT_TOTAL"; WEIGHT COUNT_NUMBER; TABLES GROUP * EVENT /CHISQ FISHER NOCOL NOPCT RELRISK RISKDIFF; RUN; %END; %ELSE %DO; PROC PRINT DATA=_SUMMARY; TITLE1 'ERROR: NOT VALID INPUT VALUE!! PLEASE CHECK!'; TITLE2 "YOU HAVE ENTERED: EVENT_TOTAL=&EVENT_TOTAL"; RUN; %END; TITLE; %MEND SUM_CHI; %MACRO 6

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Using SAS to Analyze the Summary Data