Think Outside the Box: Analysis of Categorical Data
Margaret Ann Goetz, Quintiles, Inc. Arlington, VA
ABSTRACT
Analysis of categorical data has many applications in
pharmaceutical research. Outcomes such as presence or
absence of a tumor, or a positive or negative reaction to a
particular medication, may be considered an endpoint of
interest in clinical trials. While many researchers are
familiar with the 2x2 contingency table, there are other
analyses to consider when presenting results based on a
discrete outcome.
This paper will present a brief overview of some of the
statistical distributions commonly encountered in these
analyses. Discussion of analysis strategies will include an
overview of the chi-square test, Fisher’s Exact test, and
the Cochran-Mantel-Haenszel test for binomial
proportions. In addition, an introduction to van Elteren’s
Test for ordinal variables will be presented. The discussion
of each test highlights its common applications.
Examples using SAS®, along with references for further
study, are provided.
INSIDE THE BOX
In the analysis of pharmaceutical data, the basic 2x2
contingency table can be used to present some of the
most commonly-requested presentations of categorical
data.
Questions such as “how many patients present with a
successful response to a particular drug,” or “is there a
difference in the number of patients who survive
following treatment?” can be presented using the format:
Figure 1
             Trt A    Trt B    Total
Outcome 1    n11      n12      n1+
Outcome 2    n21      n22      n2+
Total        n+1      n+2      N
The outcome is typically a response or some measure of
relative success, for which the underlying assumption is of
a mutually exclusive categorization (e.g., survival vs
death). The count in each cell is represented by nxy, where
x indexes the outcome (row) and y the treatment (column).
Row and column marginal totals (n1+, n2+, n+1, n+2) and an
overall total sample size (N) are also calculated.
However, in a typical controlled clinical trial, there may
be more than 2 treatment arms. Treatment A, for instance,
might be the group receiving study drug at dose level 1,
Treatment B might be the group receiving study drug at
dose level 2, and a third arm might represent a standard of
care or placebo treatment.
Many of the familiar statistics applied to 2x2 tables can be
readily applied to more than two treatment groups, or
more than two response categorizations. This type of
table will be referred to as an s x r table, indicating the
possibility of more than 2 treatment groups or more than 2
categories of response, although examples will focus
primarily on the analysis of a dichotomous outcome
(success/failure) when applied to more than two treatment
groups.
The following discussions of these techniques and their
underlying assumptions are far from exhaustive, but are
designed to encourage researchers to think beyond the 2x2
contingency table for analysis of their data. With the
exception of van Elteren’s test, this paper will be limited
to situations where the response levels of a s x r table need
not be ordinal, so as to avoid overextending the
discussion.
BEHIND THE BOX: SAMPLING DISTRIBUTIONS
While it is easy to visualize the proportion of patients
who fall into each of the 4 cells in a 2x2 contingency
table, it is less intuitive but equally important to consider
the underlying assumptions behind the sampling
distribution that produced these cell counts. In addition,
the most common distributions for discrete data can be
extended to s x r tables. While a complete exploration
would require more extensive discussion, the following is
an introduction to the ideas behind applying distributions
to s x r tables.
Binomial distribution: The familiar ‘heads or tails’ coin
example is often used to depict an application of the
binomial distribution. In a binomial outcome, the
underlying assumption is one of independence: each of
the individuals under consideration falls into exactly
one of two mutually exclusive outcomes or responses, with
probabilities p and 1-p, where p is the probability of success
in each of n independent Bernoulli trials, and 1-p, or q, is
the probability of failure. The binomial distribution describes
the number of successes in n trials. Each individual has the
same probability p of success, regardless of the outcomes
observed for the other individuals.
The binomial random variable can be approximated by the
normal random variable with mean np and standard
deviation (npq)^(1/2), provided npq > 5 and 0.1 ≤ p ≤ 0.9, or if
min(np, nq) > 10 [Evans, Hastings, and Peacock, 1993]. This
relationship can prove critical to a clinical trial researcher
who is considering the analysis of a binomial outcome.
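The rule of thumb above can be checked directly in a DATA step. The following is a minimal sketch, not part of the original paper: the values n = 60, p = 0.3, and m = 20 are hypothetical, and the continuity correction (m + 0.5) is an added assumption. It uses the SAS functions PROBBNML (binomial cumulative probability) and PROBNORM (standard normal cumulative probability).

```sas
/* Hypothetical values, chosen only for illustration */
data binom_check;
   n = 60;                /* number of Bernoulli trials           */
   p = 0.3;               /* probability of success               */
   q = 1 - p;             /* probability of failure               */
   npq = n*p*q;

   /* The two rules of thumb quoted in the text (1 = satisfied) */
   rule1 = (npq > 5) and (0.1 <= p <= 0.9);
   rule2 = (min(n*p, n*q) > 10);

   /* Compare P(X <= m) under the binomial and under the normal
      approximation; the +0.5 is a continuity correction */
   m = 20;
   exact  = probbnml(p, n, m);
   approx = probnorm((m + 0.5 - n*p) / sqrt(npq));
run;

proc print data=binom_check;
run;
```

When both rules are satisfied, as they are for these values, the EXACT and APPROX columns should agree closely.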
Beyond the 2x2 table: A generalization of the binomial
distribution is the multinomial distribution, which allows
patients to be categorized to more than two mutually
exclusive response groupings. The categories must
continue to be mutually exclusive and exhaustive, with
probabilities p_i (i = 1, ..., k). The marginal
distributions are also multinomial. When N is large and
all variances are large, then the multinomial will
approximate the multivariate normal distribution (see
Zelterman, 1999, for a complete description of the
properties of this distribution).
Hypergeometric distribution: The hypergeometric
distribution is frequently used in instances where data with
small sample sizes are being analyzed, and the number of
successes out of a total N is being considered. Most
frequently, the hypergeometric distribution is applied in
the generation of exact tests of hypothesis for count data
using a 2x2 table, in which every possible sample
outcome can be enumerated for a particular set of count data
[Zelterman, 1999]. The analysis is conditional on
fixed row and column marginal totals (n1+, n2+, n+1, n+2, as
shown in Figure 1). The probability of each table that is
consistent with these fixed totals can be calculated as:
$$\Pr\{n_{ij}\} = \frac{n_{1+}!\; n_{2+}!\; n_{+1}!\; n_{+2}!}{N!\; n_{11}!\; n_{12}!\; n_{21}!\; n_{22}!}$$
[Stokes, Davis, and Koch, 1995, p. 23]
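As an illustration of the formula above, the probability of a single 2x2 table can be computed in a DATA step. This is a sketch under stated assumptions: the cell counts (3, 1, 1, 3) are hypothetical and do not come from the paper, and the factorials are evaluated on the log scale with the LGAMMA function (x! = exp(lgamma(x+1))) to avoid overflow in larger tables.

```sas
/* Hypothetical 2x2 cell counts, for illustration only */
data hyper_prob;
   n11 = 3; n12 = 1;
   n21 = 1; n22 = 3;

   /* Marginal totals and overall sample size */
   r1 = n11 + n12;   r2 = n21 + n22;     /* row totals    n1+, n2+ */
   c1 = n11 + n21;   c2 = n12 + n22;     /* column totals n+1, n+2 */
   ntot = r1 + r2;                       /* N                      */

   /* log of (n1+! n2+! n+1! n+2!) / (N! n11! n12! n21! n22!) */
   logp = lgamma(r1+1) + lgamma(r2+1) + lgamma(c1+1) + lgamma(c2+1)
        - lgamma(ntot+1)
        - lgamma(n11+1) - lgamma(n12+1) - lgamma(n21+1) - lgamma(n22+1);
   prob = exp(logp);
run;

proc print data=hyper_prob;
   var n11 n12 n21 n22 prob;
run;
```

For these counts the formula reduces to 16/70, so PROB should print as approximately 0.2286.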
Beyond the 2x2 table: The multivariate hypergeometric
distribution is the extension of the hypergeometric
distribution to tables larger than 2x2, and can be used to
provide exact inference on an s x r table conditional on
marginal totals.
Poisson distribution: The Poisson distribution is often
described as the limit of the binomial distribution.
It is frequently applied in cases where N is considered
large and p is very small (e.g., the rate of a rare disease
under study in a large population). The researcher may
encounter an application of this distribution in the use of
Poisson regression, in which the response counts are assumed
to follow a Poisson distribution, rather than the normally
distributed errors assumed in linear regression. This model
can be fit using PROC GENMOD in SAS, where DIST=POISSON,
together with the log link function (LINK=LOG), is
specified to model these count data.
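A minimal sketch of such a model follows. The dataset RARE, its variables (TRTGRP, EVENTS, PERSYRS), and its values are hypothetical and are not taken from the paper. The OFFSET= option, which models a rate per unit of exposure rather than a raw count, is an added assumption that may not suit every design.

```sas
/* Hypothetical event counts and person-years of exposure */
data rare;
   input trtgrp $ events persyrs;
   logpy = log(persyrs);        /* log exposure, used as the offset */
   cards;
A 12 4100
B  7 3900
;
run;

/* Poisson regression with a log link; TRTGRP enters as a
   classification (categorical) effect */
proc genmod data=rare;
   class trtgrp;
   model events = trtgrp / dist=poisson link=log offset=logpy;
run;
```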
ANALYSIS STRATEGIES USING SAS®
As depicted in Figure 1, the analysis of clinical trial data
may involve a presumed association between a subject’s
random assignment into a treatment group (study drug or
placebo) and the outcome of the trial (success or failure).
To appropriately determine if such an association is
present, certain assumptions for the data and for the
analytic techniques being applied must be met: data must
be assumed to be drawn using appropriate randomization
techniques; distributional assumptions must be met; and
sample and individual cell size must be sufficient.
PROC FREQ: In SAS®, PROC FREQ can be used to
generate the following test statistics of interest in
conjunction with a 2x2 table. In each instance, given
certain assumptions and other criteria as stated, these
statistics can be generalized to an s x r table.
Pearson chi-square statistic: This test statistic is based on
the difference between the observed and expected values
in each cell of a 2x2 crosstabulation. A standard rule of
thumb for applying this test is that the
expected values in each cell should exceed five. The
calculation of the difference between observed and
expected values can be extended to each cell of an s x r
crosstabulation, in which the response levels do not need
to be ordinal. In the SAS® output, this statistic is labeled
‘Chi-Square’ and includes the value of the test statistic,
degrees of freedom, and the p-value associated with the
test statistic.
The following is an example of crosstabulation output for
a 2 x 3 table. “Yes” vs “No” is a response indicating a
particular outcome of interest.
In the statistical output associated with Figure 2,
there is no issue of small expected cell counts to
consider. Fisher’s exact test is associated with a p-value
of 0.470.
Figure 2
Frequency|
Row Pct  |      Treatment Group
Col Pct  |       1|       2|       3|  Total
---------+--------+--------+--------+
Yes      |     13 |     12 |     16 |     41
         |  31.71 |  29.27 |  39.02 |
         |  65.00 |  60.00 |  80.00 |
---------+--------+--------+--------+
_No      |      7 |      8 |      4 |     19
         |  36.84 |  42.11 |  21.05 |
         |  35.00 |  40.00 |  20.00 |
---------+--------+--------+--------+
Total          20       20       20       60
The statistical testing associated with the 2 x 3 output
above confirms what is clear from a review of the
crosstabulation. There is no evidence allowing the
researcher to reject a null hypothesis of no association
between treatment group and outcome for these data. The
Pearson chi-square statistic is compared to the critical
chi-square value with (s-1) x (r-1) degrees of freedom, and
has a relatively small value of 2.003 (p=0.367).
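The value of 2.003 can be reproduced by hand from the counts in Figure 2. The worked calculation below is added for illustration. The symbols Q_P (the Pearson statistic) and E_ij (the expected count in row i, column j) are notation assumed here, not the paper's.

```latex
% Expected counts: every treatment group has 20 patients
E_{ij} = \frac{n_{i+}\, n_{+j}}{N}, \qquad
E_{\mathrm{Yes},j} = \frac{41 \times 20}{60} \approx 13.667, \qquad
E_{\mathrm{No},j}  = \frac{19 \times 20}{60} \approx 6.333

% Pearson chi-square: sum over all six cells
Q_P = \sum_{i}\sum_{j} \frac{(n_{ij} - E_{ij})^2}{E_{ij}}
    = \frac{(13-13.667)^2 + (12-13.667)^2 + (16-13.667)^2}{13.667}
    + \frac{(7-6.333)^2 + (8-6.333)^2 + (4-6.333)^2}{6.333}
    \approx 0.6341 + 1.3684 \approx 2.003

% Degrees of freedom and p-value (for 2 df the upper tail is exp(-Q/2))
df = (2-1)(3-1) = 2, \qquad
p = e^{-Q_P/2} \approx e^{-1.0013} \approx 0.367
```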
STATISTICS FOR TABLE OF OUTCOME BY TRTGRP

Statistic                      DF     Value      Prob
------------------------------------------------------
Chi-Square                      2     2.003     0.367
Likelihood Ratio Chi-Square     2     2.085     0.353
Mantel-Haenszel Chi-Square      1     1.022     0.312
Fisher’s Exact Test (2-Tail)                    0.470
Phi Coefficient                       0.183
Contingency Coefficient               0.180
Cramer’s V                            0.183

Sample Size = 60

For tables larger than 2x2, use of the EXACT option
following the TABLES statement will include Fisher’s
exact test as part of the output, which is produced using
the network algorithm given by Mehta and Patel [see
SAS/STAT User’s Guide, Volume 1, page 889, for
reference information on this computational algorithm].
Fisher’s Exact Test: Fisher’s exact test utilizes the
hypergeometric distribution to output a p-value which is
the sum of the probabilities of the observed
crosstabulation and of all more extreme tables that share
the same row and column totals. The use of Fisher’s exact test
should be considered if the expected frequencies of each
cell in the crosstabulation are not at least five. Note that
Fisher’s exact test is always appropriate, even when the
sample size is large [Stokes, Davis, and Koch, 1995], and
that Fisher’s exact test is considered a non-parametric test
[Walker, 1997].
In SAS®, this test statistic and its associated probability
value will be printed for all 2x2 tables when the CHISQ
option is specified on the TABLES statement. For
2x2 tables where the expected frequencies of each cell are
not at least five, SAS will also output a warning message
indicating that chi-square may not be a valid test.
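A small sketch of this situation follows. The dataset SMALL and its cell counts are hypothetical and were chosen only so that every expected cell frequency falls below five. They are not data from the paper.

```sas
/* Hypothetical sparse 2x2 table: each expected count is 4*4/8 = 2 */
data small;
   input trtgrp $ outcome $ count @@;
   cards;
A Yes 3  A _No 1
B Yes 1  B _No 3
;
run;

/* CHISQ prints the chi-square statistics and, for a 2x2 table,
   Fisher's exact test; WEIGHT supplies the cell counts */
proc freq data=small;
   tables outcome*trtgrp / chisq nopercent;
   weight count;
run;
```

Worked by hand from the hypergeometric probabilities, the two-sided exact p-value for these counts is 34/70, or about 0.486. The chi-square results for a table this sparse should be read with the warning message in mind.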
Notes on presentation: Briefly, PROC FREQ and the
TABLES statement can be used to display the 2x2
crosstabulation table, showing the familiar combined
frequencies for the two variables TRTGRP (study drug)
and OUTCOME, which are separated by an asterisk:
proc freq;
   tables outcome*trtgrp;   /* rows: OUTCOME, columns: TRTGRP */
run;
If a two-way TABLES statement is requested with no
options specified, the default will print cell frequencies,
cell percentages of the total frequency, cell percentages of
row frequencies, and cell percentages of column
frequencies. As certain percentages are not always useful
in interpreting an s x r table, percentages can also be
suppressed: cell percentages of the total frequency
(NOPERCENT); column percentages (NOCOL option);
or row percentages (NOROW option).
The following code was used to generate the 2 x 3 table
output in Figure 2. Included in the output request is the
EXACT option, which will generate Fisher’s exact test for
the 2 x 3 table. Since no dataset is available, the
WEIGHT statement is used in conjunction with PROC
FREQ to populate the cells with counts:
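/* Enter the Figure 2 cell counts directly; the trailing @@ holds
   the input line so that two records are read from each line */
data test;
   input trtgrp outcome $ count @@;
   cards;
1 Yes 13 1 _No 7
2 Yes 12 2 _No 8
3 Yes 16 3 _No 4
;
run;

/* WEIGHT treats COUNT as the cell frequency; EXACT requests
   Fisher's exact test for the 2 x 3 table */
proc freq data=test;
   tables outcome*trtgrp / nopercent exact;
   weight count;
run;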
If analysis alone is the priority, or for very large s x r
tables, another useful option associated with PROC FREQ
is the NOPRINT option. Specified on the TABLES statement,
this will suppress printing of the table itself, but will
continue to generate the statistics requested. Cell
frequencies can also be written to a new dataset using the
OUT= option of the TABLES statement, and statistics can be
saved with the OUTPUT statement, for display in a table or
other format.
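These options can be combined as in the sketch below, which re-enters the Figure 2 counts so that it runs on its own. The output dataset names CELLFREQ and CHISTATS are hypothetical. The OUTPUT statement shown is standard PROC FREQ syntax, but its availability and the variables it writes may depend on the SAS release in use.

```sas
/* Figure 2 cell counts, entered as in the paper's example */
data test;
   input trtgrp outcome $ count @@;
   cards;
1 Yes 13 1 _No 7
2 Yes 12 2 _No 8
3 Yes 16 3 _No 4
;
run;

proc freq data=test;
   /* NOPRINT suppresses the crosstabulation; OUT= saves cell counts */
   tables outcome*trtgrp / chisq noprint out=cellfreq;
   weight count;
   /* Save the chi-square statistics requested above */
   output out=chistats chisq;
run;

proc print data=cellfreq;
run;

proc print data=chistats;
run;
```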
Other statistics generated by PROC FREQ using the
CHISQ option which have not been previously discussed
include:
• Mantel-Haenszel chi-square, a measure of
  significance for the linear relationship between two
  ordinal variables;
• Likelihood ratio chi-square, which computes chi-square
  based on maximum likelihood estimation;
• Phi coefficient, contingency coefficient, and Cramer’s V,
  all measures of association derived from the chi-square
  statistic.
If response data were ordinal, it would be important to
take this characteristic into account when selecting the
appropriate test statistic. The Mantel-Haenszel chi-square
would be better suited than the Pearson chi-square statistic
to detect changes in the means across the levels of the row
variable [Stokes, Davis, and Koch, 1995]. For more
information on these tests, please see SAS/STAT®
Software, page 866.
Cochran-Mantel-Haenszel test: Another extension of the
chi-square test and PROC FREQ, the Cochran-Mantel-Haenszel
(CMH) test can be used to assess the
association between drug treatment and a binomial
outcome or response while controlling for strata of interest to
the clinical trial. Often, researchers will need to adjust
for differences among study centers in a multi-site clinical trial.
Other strata of interest could be gender or disease status
(active, inactive) at onset of treatment.
Using the following code, and provided the dataset also
contains a SITE variable, Figure 2 can be broken into x
individual 2x3 tables, where x represents the number of
sites. In this example, each 2x3 table generated represents
treatment and response at one of 3 sites.
proc freq data=test;
   tables site*outcome*trtgrp / cmh nopercent;   /* SITE is the stratum */
   weight count;
run;
Individual crosstabulations at each site are presented,
along with a Summary Statistics section displaying three
summary test statistics controlling for site, their degrees of
freedom, and associated p-values: the Mantel-Haenszel
correlation statistic (labeled ‘Nonzero Correlation’), the
ANOVA statistic (labeled ‘Row Mean Scores Differ’)
and the general association statistic (labeled ‘General
Association’).
For a 2x2 table, all three statistics have one degree of
freedom and are interpreted in the same manner. For the
2 x 3 table in Figure 2, the general association statistic
will have (s-1) x (r-1) degrees of freedom and is always
interpretable because it does
not require an ordinal scale for either row or column
[SAS/STAT User’s Guide]. Therefore, we can look at the
test of general association to determine if there is an
association between treatment group and outcome,
controlling for site.
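The stratified analysis described above can be sketched with a self-contained example. The dataset BYSITE is hypothetical: the paper does not report site-level counts, so the split across three sites below is invented for illustration and was chosen only so that the counts add up to the Figure 2 totals.

```sas
/* Hypothetical site-level counts that sum to the Figure 2 table */
data bysite;
   input site trtgrp outcome $ count @@;
   cards;
1 1 Yes 5  1 1 _No 2  1 2 Yes 4  1 2 _No 3  1 3 Yes 6  1 3 _No 1
2 1 Yes 4  2 1 _No 3  2 2 Yes 4  2 2 _No 2  2 3 Yes 5  2 3 _No 2
3 1 Yes 4  3 1 _No 2  3 2 Yes 4  3 2 _No 3  3 3 Yes 5  3 3 _No 1
;
run;

/* The first variable in the TABLES request (SITE) defines the
   strata; CMH requests the summary statistics adjusted for SITE */
proc freq data=bysite;
   tables site*outcome*trtgrp / cmh nopercent;
   weight count;
run;
```

With a nominal outcome and nominal treatment groups, the ‘General Association’ line of the summary statistics is the one to read.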
If the response variable were on an ordinal scale, another
statistic of interest from the CMH option in PROC FREQ would
be the ANOVA statistic, which corresponds to a
stratum-adjusted ANOVA or Kruskal-Wallis test
[SAS/STAT User’s Guide]. Order of variables in the
TABLES statement must be taken into consideration when
interpreting this test.
Van Elteren’s Test: A lesser-known option for s x r tables
with ordinal response data is van Elteren’s test. Similar to
the CMH test described above, van Elteren’s test will
assess the significance of the difference between treatment
groups in the distribution of an ordinal response variable,
adjusting for study center.
Van Elteren’s test can be applied in the case where there
are more than two treatment groups (as in the previous
examples) and where the response variable is ordinally
scaled (the variable NEW_OUT in the following example).
proc freq data=test;
   tables site*new_out*trtgrp / cmh scores=modridit;
run;
The code will output individual s x r tables for each study
site, followed by a presentation of overall summary
statistics displaying each test statistic controlling for site,
their degrees of freedom, and associated p-values: the
Mantel-Haenszel correlation statistic (labeled ‘Nonzero
Correlation’), the ANOVA statistic (labeled ‘Row
Mean Scores Differ’) and the general association statistic
(labeled ‘General Association’). Total sample size is also
displayed.
As with the CMH test for ordinal variables, order of
variables in the TABLES statement must be taken into
consideration when interpreting this test. Selection of
SCORES=MODRIDIT requests modified ridit scores, which are
derived from rank scores and represent the expected values
of the within-stratum order statistics, and which yield a
nonparametric analysis
[SAS/STAT User’s Guide].
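A self-contained sketch of this analysis is given below. The dataset ORDINAL, the coding of NEW_OUT (1 = none, 2 = some, 3 = marked improvement), and all counts are hypothetical. The sketch also places treatment as the row variable and the ordinal response as the column variable, so that the ‘Row Mean Scores Differ’ statistic compares mean response scores across treatment groups. That ordering is an assumption of this sketch and differs from the code shown earlier.

```sas
/* Hypothetical counts: 2 sites x 3 treatment groups x 3 ordered
   response levels (1 = none, 2 = some, 3 = marked improvement) */
data ordinal;
   input site trtgrp new_out count @@;
   cards;
1 1 1 6  1 1 2 3  1 1 3 1
1 2 1 4  1 2 2 4  1 2 3 2
1 3 1 2  1 3 2 4  1 3 3 4
2 1 1 5  2 1 2 4  2 1 3 1
2 2 1 3  2 2 2 5  2 2 3 2
2 3 1 2  2 3 2 3  2 3 3 5
;
run;

/* SITE = stratum, TRTGRP = rows, NEW_OUT = columns;
   modified ridit scores give a rank-based analysis;
   NOPRINT suppresses the per-site tables but not the CMH results */
proc freq data=ordinal;
   tables site*trtgrp*new_out / cmh scores=modridit noprint;
   weight count;
run;
```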
For more extensive information on these procedures, and
the underlying assumptions, please see the References
section.
CONCLUSION
There are many instances where clinical trial data analysis
will yield categorical outcomes appropriate for either 2 x
2 or s x r crosstabulation procedures. PROC FREQ can
very flexibly present the association between a binary
response outcome variable and a multiple treatment group
situation, as was demonstrated above, in general tests of
association and in non-parametric exact tests. Additional
options in PROC FREQ for ordinal variables, such as
SCORES, and special situations such as controlling for
SITE, are also options available to the clinical trials
investigator - outside the traditional 2 x 2 ‘box’.
REFERENCES
Agresti A. (1990). Categorical Data Analysis. New
York: John Wiley.
Evans M, Hastings N, Peacock B. (1993). Statistical
Distributions. 2nd edition. New York: John Wiley.
Zelterman D. (1999). Models for Discrete Data. Oxford:
Oxford University Press.
SAS Institute Inc. (1997). SAS Technical Report P243. SAS/STAT® Software: The GENMOD Procedure.
Cary, NC: SAS Institute Inc.
SAS Institute Inc.
(1997).
SAS/STAT® Software:
Changes and Enhancements through Release 6.12. Cary,
NC: SAS Institute Inc.
SAS Institute Inc. (1990). SAS/STAT® User’s Guide:
Volume 1, Version 6. 4th edition. Cary, NC: SAS
Institute Inc.
Stokes M, Davis C, Koch G. (1995). Categorical Data
Analysis Using the SAS® System. SAS Institute Inc.
Walker G. (1997). Common Statistical Methods for
Clinical Research with SAS® Examples. SAS Institute
Inc.
Key words: Categorical data, binomial distribution,
chi-square statistic, PROC FREQ, Fisher’s Exact Test,
Cochran-Mantel-Haenszel.
SAS and SAS/STAT are registered trademarks or
trademarks of SAS Institute, Inc. in the USA and
other countries. ® indicates USA registration. Other
brand and product names are registered trademarks
or trademarks of their respective companies.
Margaret Ann Goetz, M.P.H.
Senior Biostatistician
Quintiles, Inc.
1300 North 17th Street, Suite 300
Arlington, VA 22209-3801
email: [email protected]