Download Using SAS to Analyze the Summary Data

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Foundations of statistics wikipedia , lookup

History of statistics wikipedia , lookup

Analysis of variance wikipedia , lookup

Time series wikipedia , lookup

Misuse of statistics wikipedia , lookup

Transcript
NESUG 2006
Data Manipulation and Analysis
Using SAS to Analyze the Summary Data
Zhenyi Xue, Cardiovascular Research Institute, MedStar Health, Inc., Washington, DC
ABSTRACT
Most of the time, we have the raw data to conduct the necessary statistical analysis in SAS®. However,
when there is only summary data available, some additional SAS coding is necessary in order to perform
the hypothesis test. SAS procedures, as well as simplified macros (%SUM_TTEST, %SUM_ANOVA,
%P_ANOVA, %SUM_CHI) with examples, will be discussed in this paper for chi-square test, two
sample t-test, and analysis of variance with summary data as input.
INTRODUCTION
Before initiating a new study, there is often extensive literature review to retrieve background
information, compare existent findings, and support the significance of the study. Since most of the
papers only present the summary statistics (sample sizes, means, standard deviations, percentages, etc.),
it is very helpful if we have the right tools to generate other statistics in a quick and convenient way. For
example, if means and standard deviations are known for two independent groups, it is always
interesting to know whether there is statistical significant difference in the mean between the two
groups. Similarly, if the number of events and sample size are known for two groups, the relative risk,
odds ratio, or risk difference between those groups, could also be calculated. This paper discusses the
concept of statistical analysis using the summary data and introduces a series of simplified macros for
such analysis.
ANALYSIS OF SUMMARY CONTINUOUS DATA
The continuous variable is usually reported as sample size ( n ), mean (Y ) and standard deviation ( s ). If
there are two or more groups of data, additional hypothesis tests of the existence of differences among
groups are sometimes very helpful in understanding the background of the research field. The t and F
statistics as well as their degree of freedom can be calculated from the sufficient summary statistics
n, Y and s (Table 1 and 2); therefore always as to construct the hypothesis tests without the original
individual level data.
Table 1. Test Statistic for Independent Two Samples T-TEST
Assumption
Equal Variance
Unequal Variance
t Statistic
y1 − y 2
(n − 1) s12 + (n 2 − 1) s 22
t=
, where s 2 = 1
n1 + n 2 − 2
1
1
s
+
n1 n 2
t' =
y1 − y 2
w1 + w2
, where w1 =
2
1
2
2
s
s
and w2 =
n1
n2
DF
n1 + n2 − 2
( w1 + w2 ) 2
w12
w22
+
n1 − 1 n 2 − 1
PROC TTEST can use the summary statistics as the input data. Required by this procedure, this data set
should include a character variable called _TYPE_ or _STAT_ which contains the name of the statistic
(‘N’, ‘MEAN’,’STD’), a group variable (either numeric or character) for the CLASS statement, and a
numeric variable which has the values for the summary statistics. Once the data set is constructed, the
1
NESUG 2006
Data Manipulation and Analysis
remaining code for this procedure is no different than the procedure with individual level data. The
following is an example to create such a data set:
DATA SUMMARY;
INPUT GROUP $ _STAT_ $ CONTINUOUS_VAR;
DATALINES;
GROUP1 N
40
GROUP1 MEAN 1.8
GROUP1 STD 0.8
GROUP2 N
50
GROUP2 MEAN 1.4
GROUP2 STD 0.7
;
Although other variables in the above dataset may vary for different studies, the character variable
_STAT_ (or _TYPE_) with its value remains the same. If you don’t want to construct the data set in the
specified format as above, the macro %SUM_TTEST introduced in this paper has the minimum coding
requirement. The available information n, Y , s will be entered in a sequence for each independent group
in the format of N_MEAN_STD=n1 mean1 std1 n2 mean2 std2. Once invoked, it will generate all the
necessary output for a two-sample t test. For example, this macro can be invoked by
%SUM_TTEST(N_MEAN_STD=40 1.8 0.8 50 1.4 0.7)
Table 2. ANOVA Table for One Way Analysis of Variance
Source
Sum of Squares
DF
MS
SST
k −1
SSE
k
Treatments
SST = ∑ ni ( y i − y.. ) 2
k −1
i =1
ni
k
Error
k
SSE = ∑∑ ( y ij − y i ) 2 = ∑ (ni − 1) s i2
i =i j =1
i =1
k
∑n
i
−k
i =1
F
MST
MSE
k
∑n
i
−k
i =1
Corrected
Total
k
ni
k
SS = ∑∑ ( y ij − y.. ) 2
∑n
i
−1
i =1
i =1 j =1
Unlike PROC TTEST, there are no SAS procedures that can read the summary statistics data and
perform the analysis of variances. In order to test the global null hypothesis H 0 : µ1 = µ 2 = ... = µ k , it is
necessary to write additional codes in a DATA step to perform the calculation (Table 2). The macro
%P_ANOVA can perform such a task by reading the external number inputted by the analyst. However,
with only the summary data, this macro can only perform the global test and provide the p value. Larson
(1992) introduced the technique to use the summary sufficient statistics to generate a surrogate data set
which can be used for further analysis in PROC GLM. With the surrogate dataset, it can not only
calculate the F statistic and p value for the global test, but also generate the post hoc multiple group
comparisons controlled by family type II error. By modifying the original macro published on the SAS
support website, %SUM_ANOVA presented here further simplifies the coding. Similar to
%SUM_TTEST, the user can invoke this macro simply as
2
NESUG 2006
Data Manipulation and Analysis
%SUM_ANOVA(N_MEAN_STD=20 71.7 2.1 13 69.9 2.9 9 67.2 5.9)
Similar to the one way ANOVA, for multi-way ANOVA, the surrogate dataset can be created from the
sufficient statistics for each combined factor level. Therefore this dataset can be analyzed in PROC
GLM.
ANALYSIS OF SUMMARY COUNT DATA
From the summary count data, the raw data can be easily recovered and analyzed. PROC FREQ and
PROC LOGISTIC have already developed their functions to analyze such data in the same way as the
original individual level data, except SAS requires creating a working data set based on the summary
statistics in advance. The following are examples of those data sets:
DATA _working1;
INPUT event $ trial $ freq;
CARDS;
Yes
Treatment
20
No
Treatment
5
Yes
Control
5
No
Control
20
;
DATA _working2;
INPUT event total trial $;
CARDS;
20
25
Treatment
5
25
Control
;
Once the data is created in the above format, the remaining code for the data analysis, to create
frequency tables, perform chi-square test and conduct logistic regression, are straightforward by using
the WEIGHT or FREQ statement. The following are the sample codes for PROC FREQ and PROC
LOGISTIC using the summary data created above.
PROC FREQ DATA=_working1;
TABLES trial * event /CHISQ;
WEIGHT freq;
RUN;
PROC LOGISTIC DATA=_working1 DESCENDING;
CLASS trial;
MODEL event = trial;
FREQ freq;
RUN;
PROC LOGISTIC DATA=_working2 DESCENDING;
CLASS trial;
MODEL event/total = trial;
RUN;
In order to avoid repeated SAS coding for this type of analysis, a macro %SUM_CHI for summarized
count data was introduced in this paper. This macro requires the input of the number of events and
sample size in a sequence for each group in the format EVENT_TOTAL=event_n1 total_N1 event_n2
total_N2……. If there are two groups (2×2 numbers) inputted in this macro, it will generate a 2×2 cross
3
NESUG 2006
Data Manipulation and Analysis
tabulation, the chi-square test, Fisher’s exact test, odds ratio, relative risk, and risk difference with their
confidence intervals. If there are more than two groups (2×N numbers), it will generate a 2×N cross
tabulation and the associated chi-square test and Fisher’s exact test. If the input is not in the above
format, it will print an error message. For example, the macro can be invoked as
%SUM_CHI(EVENT_TOTAL=20 25 5 25)
CONCLUSION
This paper introduced the basic concept of statistical analysis in SAS using summary data. A series of
macros presented here provides a convenient way to generate hypothesis testing with minimum SAS
codes using summary data. Those macros can also be easily modified to add other options supported by
SAS/STAT. However, when we want to retrieve more information by pooling the summary results of
several studies, a carefully designed meta-analysis is necessary in order to test the same or similar
hypothesis.
REFERENCES
Larson, Davia A. Analysis of Variance with just Summary Statistics as Input. American Statistician,
1992, 46: 151-152.
Neter, Kutner, Nachtsheim, and Wasierman. Applied Linear Statistical Models, 4th edition.
WCB/McGraw-Hill Publications, Inc.
Hamer, Robert M. Simpson, Pippa M. SAS Tools for Meta-Analysis. SUGI 27 proceedings paper 250.
SAS/STAT Users Guide. SAS OnlineDoc 9.1.3 http://support.sas.com/onlinedoc/913/docMainpage.jsp
ACKNOWLEDGMENTS
The author would like to thank David Hellinga for his support and suggestions in the development of
this paper.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and
product names are registered trademarks or trademarks of their respective companies.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. If you want an electronic copy of the macros,
please contact the author at:
Zhenyi Xue
Cardiovascular Research Institute
Washington Hospital Center
MedStar Health, Inc.
110 Irving Street NW, Suite 6D
Washington, DC 20010
Work Phone: 202-877-2987
Fax: 202-877-9337
Email: [email protected]
4
NESUG 2006
Data Manipulation and Analysis
APPENDIX
SUM_TTEST(N_MEAN_STD=);
DATA SUMMARY;
ARRAY TEMPVALUE(6) (&N_MEAN_STD);
ARRAY GP(2) $ ('GROUP1' 'GROUP2');
ARRAY ST(3) $ ('N' 'MEAN' 'STD');
DO I=1 TO 2;
DO J=1 TO 3;
GROUP=GP(I);
_STAT_=ST(J);
CONTINUOUS_VAR=TEMPVALUE(J+(I-1)*3);
OUTPUT;
END;
END;
PROC TTEST DATA=SUMMARY; CLASS GROUP; VAR CONTINUOUS_VAR; RUN;
%MEND SUM_TTEST;
%MACRO
P_ANOVA(N_MEAN_STD=);
DATA _SUMMARY;
%LET I=0;
%DO %WHILE (%SCAN(&N_MEAN_STD, &I+1, %STR( )) NE %STR( ));
%LET I=%EVAL(&I+1);
%END;
ARRAY TEMPVALUE(&I) (&N_MEAN_STD);
TOTAL=0; SSE=0; N_TOTAL=0; SST=0;
%DO K=1 %TO &I/3;
TOTAL=TOTAL + TEMPVALUE(1+(&K-1)*3)*TEMPVALUE(2+(&K-1)*3);
SSE=SSE+(TEMPVALUE(1+(&K-1)*3) -1) * (TEMPVALUE(3+(&K-1)*3)**2);
N_TOTAL=N_TOTAL + TEMPVALUE(1+(&K-1)*3);
%END;
GRAND_MEAN=TOTAL / N_TOTAL;
%DO K=1 %TO &I/3;
SST=SST + TEMPVALUE(1+(&K-1)*3)*((TEMPVALUE(2+(&K-1)*3)GRAND_MEAN)**2);
%END;
MST=SST/(&I/3-1);
MSE=SSE/(N_TOTAL - &I/3);
F=MST/MSE;
P_VALUE=1-PROBF(F,&I/3-1,N_TOTAL - &I/3);
KEEP N_TOTAL GRAND_MEAN MST MSE F P_VALUE;
PROC PRINT DATA=_SUMMARY;
TITLE1 'P_ANOVA: P VALUE FOR SUMMARY ANOVA';
TITLE2 "YOU HAVE ENTERED: N_MEAN_STD=&N_MEAN_STD";
RUN;
TITLE;
%MEND P_ANOVA;
%MACRO
%MACRO
SUM_ANOVA(N_MEAN_STD=);
TITLE1 'SUM_ANOVA: ANALYSIS OF VARIANCE ON SUMMARY STATISTICS';
TITLE2 "YOU HAVE ENTERED: N_MEAN_STD=&N_MEAN_STD";
DATA _SUMMARY;
%LET I=0;
%DO %WHILE (%SCAN(&N_MEAN_STD, &I+1, %STR( )) NE %STR( ));
%LET I=%EVAL(&I+1);
%END;
ARRAY TEMPVALUE(&I) (&N_MEAN_STD);
%DO K=1 %TO &I/3;
GROUP = "GROUP&K";
5
NESUG 2006
Data Manipulation and Analysis
N_I = TEMPVALUE(%EVAL((&K-1)*3+1));
YBAR_I = TEMPVALUE(%EVAL((&K-1)*3+2));
STD_I = TEMPVALUE(%EVAL((&K-1)*3+3));
YIS = YBAR_I + SQRT((STD_I**2)/N_I);
YNS = N_I * YBAR_I - (N_I - 1)*YIS;
Y = YIS; FREQ = N_I - 1; OUTPUT;
Y = YNS; FREQ=1;
OUTPUT;
%END;
PROC GLM DATA=_SUMMARY;
CLASS GROUP;
FREQ FREQ;
MODEL Y = GROUP;
LSMEANS GROUP/ADJ=BON STDERR PDIFF;
RUN;
TITLE;
%MEND SUM_ANOVA;
SUM_CHI(EVENT_TOTAL=);
DATA _SUMMARY;
%LET I=0;
%DO %WHILE (%SCAN(&EVENT_TOTAL, &I+1, %STR( )) NE %STR( ));
%LET I=%EVAL(&I+1);
%END;
%DO J=1 %TO &I/2;
GROUP = "GROUP&J"; EVENT = 'YES';
COUNT_NUMBER=%EVAL(%SCAN(&EVENT_TOTAL, %EVAL((&J-1)*2+1), %STR( )));
OUTPUT;
GROUP = "GROUP&J"; EVENT = 'NO';
COUNT_NUMBER= %EVAL(%SCAN(&EVENT_TOTAL, %EVAL((&J-1)*2+2), %STR( ))) %EVAL(%SCAN(&EVENT_TOTAL, (&J-1)*2+1, %STR( )));
OUTPUT;
%END;
%IF &I>4 AND %SYSFUNC(MOD(&I,2))=0 %THEN %DO;
PROC FREQ DATA=_SUMMARY;
TITLE1 'SUM_CHI: N*2 WAY CHI-SQUARE TEST ON SUMMARY STATISTICS';
TITLE2 "YOU HAVE ENTERED: EVENT_TOTAL=&EVENT_TOTAL";
WEIGHT COUNT_NUMBER;
TABLES GROUP * EVENT /CHISQ FISHER NOCOL NOPERCENT;
RUN;
%END;
%ELSE %IF &I=4 %THEN %DO;
PROC FREQ DATA=_SUMMARY;
TITLE1 'SUM_CHI: 2*2 WAY CHI-SQUARE TEST ON SUMMARY STATISTICS';
TITLE2 "YOU HAVE ENTERED: EVENT_TOTAL=&EVENT_TOTAL";
WEIGHT COUNT_NUMBER;
TABLES GROUP * EVENT /CHISQ FISHER NOCOL NOPCT RELRISK RISKDIFF;
RUN;
%END;
%ELSE %DO;
PROC PRINT DATA=_SUMMARY;
TITLE1 'ERROR: NOT VALID INPUT VALUE!! PLEASE CHECK!';
TITLE2 "YOU HAVE ENTERED: EVENT_TOTAL=&EVENT_TOTAL";
RUN;
%END;
TITLE;
%MEND SUM_CHI;
%MACRO
6