Download When Chi-Square Does Not Hold: How to Analyze Tables With Low Cell Frequencies With SAS

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Statistics wikipedia , lookup

Transcript
Statistics
WHEN cm-SQUARE DOES NOT HOLD: HOW TO ANALYZE TABLES
WITH LOW CELL FREQUENCIES WITH SAS
Lothar Tremmel, Ph.D.
Bio-Pharm Clinical Senices
Example 1: One 2x2 table
Abstract
Detennining if two proportions are significantly different is
a common problem in statistics. For low sample sizes, the
common chi-square tests may give the wrong answer,
because normal approximation is no longer valid. Exact
tests such as Fisher's exact tests need to be used instead.
The randomization model for such a test is developed, and
it is demonstrated how to obtain this test with SAS
procedures and functions.
failure
success
% success
Placebo
10
I
10%
Active
7
7
50%
The question is: Are the two success rates 1()O/O and 50%
"significantly" different? - "Significantly" meaning that the
difference cannot reasonably be explained by random
chance. To answer this question, we will develop an idea
how a random process could have generated the data, and
we will investigate the question: How likely is a difference
as the observed one to emerge just by random chance?
Oftentimes, one has a set of pairs of proportions, such as
the proportion of responders under Active treatment vs. the
proportion under Placebo for males, and the corresponding
proportions for females. 'Gender' is the stratification
variable, and 'male' and 'female' are the 'strata'. It is often
desirable not to pool these data into a single 2x2 table, but
to estimate the treatment effect in each stratum separately
and to combine the estimates. An exact test is developed,
and it is shown how to calculate the test with the new SAS
procedure MUL1TEST.
The fIXed-margin null model: A statistical model is an
idea about how the data were generated. We will build a
"null modef', i.e., a model which assumes no treatment
effect, and we will try to explain the data with it. For 2x2
tables, a common null model isfixed-margin null model: It
assumes that all the marginal sums of the table are fixed
and given before the experiment:
Extensions for sets ofk proportions are also discussed.
One 2x2 table
Suppose 25 patients were emoled in a clinical trial. Of
these, 14 were randomly chosen to receive active treatment;
the other 11 patients received placebo (sham treatment). At
the end of the trial, 7 out of the 14 actively treated patients
were cured, as compared to only one placebo patient This
result can be summarized in the 2x2 table shown in
Example I:
871
Failure
Success
Sum
Placebo
0
11-0
n.1=11
Active
17-0
14-(17-0)
n.2=14
Sum
n1.=17
n2.=8
n.. =25
Statistics
This randomization model can be described as follows:
There are 25 subjects of which 8 will be successes. We
randomly assign 11 of them to Placebo.
This is like fishing II fishes from a lake with 25 fishes of
two different kinds: 17 blue and 8 red, in the example.
Sample Space: The fIXed-margin null model imposes a
rigid structure on the table: If one cell is known, the whole
table is determined. For instance, if there are 0=10 placebo
failures, there must be 11-10=1 placebo success for the row
to add up to II, the placebo sample size. Likewise. there
must be 17-10=7 treatment failures for the column total to
be 17, and 8 trealment successes for the second row and
column to add up to their fixed margins. The set of all
possible tables is called the sample space of the model; it is
the set of all possible outcomes. An example of another
element of the samaple space is the table with 9 placebo
failures (and hence 11-9=2 placebo successes, 17-9=8
active successes, and 14-8=6 active failures). On the other
hand, a table with 12 placebo failures is not an element of
the sample space because there are only 11 placebo
patients.
.
~"
~
0.02
0.00
3458'II,OU
Figure 1: Exact probability distribution for o=the number
of placebo failures from Example I, under the null model
The formula for p(0) can be found in all basic statistical
textbooks, such as Hoel (1084, p. 69). The hypergeometric
distribution has the following properties:
Expected and Observed Value: The overall failure rate
is 17/25=0.68. Based on the null model of no treatmeilt
effect, we would expect a placebo failure rate of about 0.68
as well. This would be the case if the number of placebo
failures were 7.5, which is called the expected value E for
the number of placebo failures. However, our observed
value 0 is 10, which is considerably larger.
•
•
Point Probability: Under the null mode~ the table with E
placebo failures is the most likely table. The Null model
determines precisely how likely this outcome is. It also
determines the probability of each other possible table.
These outcome probabilities are called point probabilities.
Since one cell determines the whole table, the point
probabilities are a function of a single cell, say o=Placebo
failures. The distribution of the point probabilities over the
values of 0 is called the hypergeometric distribution of o. It
is shown in Figure I. It shows that we would expect to
observe 7 or 8 placebo failures: p(7)=P(8)=O.305.
Observing 10 placebo failures is a rather unlikely event
p(IO)=O.035.
In gen~ the distribution is not symmetric
Writing 0 for the random variable of placebo failures,
E(o)=nl. * (n.lIn), and
Var(o)=(nl.*n2.*n.1 *n.2)/(n2(n-I»
,
Tail probability or 'p value': To answer the question if
the null model is good enough to explain the data, the point
probability of exactly the table obtained is of no interest at
all. What we rea1ly want to know is: What is the probability
under the null model to get a resuh as extreme as the one
we got, or even more extreme? This will tell us how often
we would be wrong if we considered the given evidence as
sufficient to reject the null model although it is true.
This probability is the p value, or tail probability. It is the
sum of the point probabilities of the table obtained and all
more extreme tables. In our example, the p value is p(IO
placebo failures) = 0.0349 + p(11 placebo failures) =
0.00278, which adds up to 0.03768. This P value is called
Fisher's exact p value.
872
Statistics
One tailed vs. two tailed p values: Since we looked only
in one direction to sum up the more extreme tables, the
calculated p value is a one-sided p value. Two-sided p
values are obtained by defming 'more extreme' in a nondirectional way, and summing up both tails of the
distribution. Since in general the hypergeometric
distribution is not symmetric,
there are different
possibilities of defining 'more extreme' nondirectionally
with different resulting two-tailed p values. Agresti (1992,
p. 134) gives an overview and references.
given in the PROC FREQ output, simply labeled 'Chi
Square'. For the example, its value is 4.738, for which the
Chi-Square distribution with 1 d.f. returns a p value of
0.030.
Exact tests based on the exact product binomial distribution
can also be constructed (Rice 1988), but they are rarely
used in practice. The Mantel Haenzel statistic and Pearsons'
statistic differ only in a factor nI(n-I); hence, for large
sample sizes, both null models yield almost identical p
values.
Both the one-tailed and a two-tailed Fisher's exact p value
can be obtained with PROC FREQ, using the ICHISQ or
!EXACT option of the TABLES statement More directly,
the SAS-function PROBHYPR can be used to get tail
probabilities from the hypergeometric distribution:
Analysis of Example 1 with PROC FREQ:
proc freq data= one;
weight count;
tables drug*improv I chisq;
p=PROBHYPR(n, ni-, n_l, x)
p=1-PROBHYPR(25, 11, 17, 10-1)
run;
for the example. Note that PROBHYPR always gives the
lower-tail. Therefore, PROBHYPR(25, II, 17, 10) is p(o S
10), or the probability, to obtain 10 or fewer failures in
Placebo. The uppertail p to obtain 10 or more is 1 minus
PROBHYPR(25, 11, 17, 10-1). The subtraction of 1 from
lOis necessary to include the point probability of the given
table in the tail.
------------ OUTPUT: ---------------STATISTICS FOR TABLE OF DRUG BY IMPROV
Approximate tests for 2x2 tables: For large cell sizes, the
hypergeometric distribution approximates the bell curve.
This means that Z = (0 - E(o»)/ ..J(VAR(0» follows
approximately the standard normal distribution, and hence
Z2 the chi-square distribution with 1 d.f. z2 is known as the
Mante1-Haenszel Chi-Square 0.m- It can also be obtained
by PROC FREQ using the TABLES statement with the
ICHISQ option. For the example, the value of QMH is
4.548, for which the Chi-Square distribution with 1 d.f.
returns a p value of 0.033.
Statistic
Value
Prob
Chi-Square
Likelihood Ratio Chi-Square
continuity Adj. Chi-Square
4.738
5.233
3.044
4.548
0.030
0.022
Kantel-Baenszel Chi-Square
Fisher's Exact Test (Leftl
(Rightl
(2-Taill
Notes and Extensions: Besides the fIXed margin model
descnoed so far, a different data model is also common:
The product-binomial model: Each sample is considered an
independent binomial sample. Hence, only the two sample
sizes n.1 and n.2 are considered fixed; the total number of
successes n2. is not precpecified and can vary freely. In the
example, this is like fIrSt fishing 11 fishes from the sea, and
then fishing 14 fishes from the same sea - with no one
knowing the true proportion of blue and red fish in the
ocean. The nonnal-approximation test based on this idea is
the well-known Pearson-Chi-square test, which is also
873
O.OBl
0.033
0.997
0.038
0.042
Statistics
However, there are two objections to this approach:
Several 2x2 tables
Bias: Unbalanced randomization can lead to bias when data
are pooled. In the example: Females respond more easily; a
higher proportion of females was randomized to Placebo
than males. In the combined table, the Placebo group has a
higher proportion of female patients than the Treatment
group: 321 (32+11) vs. 27/(27+14). This introduces a bias
in favor of Placebo.
So far, we have investigated only a part of the sample
dataset in Koch & Edwards (1988). The complete dataset
looks as follows:
Example 2: Stratified 2x2 tables
Power: Even if randomization is balanced, there is still a
case against pooling: If the stratification variable has a main
effect, pooling leads to a power loss (For a simulation
study, see Gansky and Koch 1994).
Stratum 1: Males
failure
success
Placebo
10
1
Treatment
7
7
The alternative to pooling is to estimate the treatment effect
in the separate tables separately, to combine the estimates
into a combined test statistic to measure the overall effect.,
and to derive the distribution of this test statistic. This is
called stratifYing.
Stratum 2: Females
failure
success
Placebo
19
13
Treatment
6
21
Approximate Stratified Tests. A straightforward
combined test statistic is the sum !. 0. of all single
observed values 0, of the single tables. Assuming
independent normal distributions for the 0, in the different
strata, is easy to derive the distribution of !, 0, :
In the examples of SAS code to follow, we will assume that
these data were read in as follows:
z
=
DATA ONE;
T
T
0
6
1
2l
P
P
0
19
1
l3
M
M
M
M
T
T
0
1
7
7
P
P
0
lO
1
1
(!, 0, - E(!, 0,) 1 ..Jvar(!. 0,)
(!, 0, - !. E,) 1 ..J(!, Var(o,»
Again, Z2 is the Mantel-Haenszel Chi-Square Statistic QMH'
It can easily be obtained by PROC FREQ as shown below.
For the example, the value of OMH is 12.59, which yields a
p value of < 0.001.
INPUT
SEX $ DRUG $ IMRPOV COUNT;
CARDS;
F
F
F
F
=
RUN;
Pooling vs. Stratifying. How to derive a combined
assessment of the treatment effect? A straightforward idea
is Pooling. which simply means throwing the data
together, regardless of gender, and to analyze the resulting
single 2x2 table as above.
874
Statistics
of the point probabilities from the cells of Figure 2 over the
two-dimensional sample space.
Analysis of Example 2 with PROC FREQ:
proc freq data= one;
weight count;
tables sex*drug*improv /
run;
u
cmh;
......
OUTPUT: ---------------0.0413
Cochran-Mantel-Haenszel Statistics
(Based on Table Scores)
Statistic
1
2
3
Alternative
DF
Nonzero
1
correlation
Row Mean
1
Scores differ
General
1
AssoeiatioD
Value
PrOD
12.590
0.000
12.590
12.590
~
s:b.
&,
0.000
..
0.000
O.GOOO
k
C£
~
-
~
-....
<..0<000.
iI=t:
r=b,
EI:,
,=>,
'=>.
~7 11
-:-r
7'
~Oo~;;;r
-,: o?~
-o==.:,~
000'00.0.
Qo.
lC2
:.:::70.000
X1
-'"'9,
Figure 3: Exal:t probability distribution fOT x2=number of placebo
failures (males) and xl=number of placebo failures (females)
Problem: If some cell sizes are low « 5), the nonnal
approximation is no longer valid. Hence, the p value of the
Mantel-Haenszel test will be wrong and exact tests are
needed instead. Proc FREQ does not calculate exact p
values for stratified tests. In the next section, we will show
how to obtain exact tests with the new procedure
MULTIEST.
The p value is the sum of all point probabilities of
outcomes which are at least as extreme as the observed one
- which means all pairs of tables which score at least as
extreme as the observed table on the combined test statistic
~ 0 .. For the general case ofk strata, this test is known as
Birch's exact test of conditional independence (Agresti
Exact Stratified Tests. Exact tests for several 2x2 tables
can be developed exactly as for a single 2x2 table: The Null
Model: Every single table is generated by a separate,
independent tixed-margin-null model as described above.
With this, the sample space consists of all possible
combinations of single-table outcomes, and because of the
independence between the tables, the point probability of
each outcome is the product of single table probabilities.
The resulting exact distribution is the Product
1992, p. 128).
Hypergeometric Distribution.
For Example 2, the sample space is the set of all pairs of
possible outcome tables, and the point probabilities are just
the product of the point probabilities of the single tables, as
shown in Figure 2 on the next page. Figure 3 shows a plot
875
Statistics
Figure 2: Schematic of the sample space of the two-tables example, with point probabilities and tail probability.
Males: Failures in Placebo
Females
Xl
P
x2=3
p=0
4
.004
5
.039
6
.16
7
.31
8
.31
9
.16
10
.031
II
.003
0
1
2
3
4
5
6
7
...
....
...
16
.093
17
.041
18
.014
19
.003
20
.001
21
...
...
....
.0016
....
...
x
0
t
X
t
t
x
t
t
t
x
t
t
t
t
X
t
t
t
t
t
x
t
t
t
t
t
t
t
t
t
t
t
t
t
22
23
24
25
x
Note: The '0' denotes the observed result; x and t designate the tail area to derive the exact p value.
876
Statistics
Exact stratified p values with PROC MULITEST:
PROC MULTrEST (SAS Institute, 1992) calculates
approximate and exact p values for the COCHRANARMITAGE linear trend test which is based on data
schemes like the following:
Analysis of Example 2 with PROC MULTrEST:
PROC MULTTEST DATA=ONE OUT=PVALSj
CLASS DRUGj
CONTRAST 'TMT' -0.5 o.sj
FREQ COUNTj
STRATA SEXj
TEST CA (IMPROV /
PERM=lOOO
UPPERTAILED) j
RUNj
------------ OUTPUT: ----------------
Stratum s:
Groupg
Trend Score Sg
Successes Ogo
g=1
SI
0 1,
g=2
S2
~
g=3
S3
0 3,
...
g=k
MULTTEST TABLES
St
Class
<\.
Variable
Strata
IMPROV
f
Statistic
Count
N
Linear trend = ~ , ~ g Sg * (Ogs - Ego)
Percent
=Linear trend 1..J(Variance estimate)
m
Count
N
Percent
which is approximately nonnally distributed.
If one has only two groups and takes +0.5 and -0.5 as the
trend scores, the linear trend turns out to be the numerator
of the Mantel-Haenszel statistic 0-. The denominator
turns out to be the estimate based on the product binomial
null model, which is (n-l)/n times the hypergeometric
variance, for each stratum. Therefore, the p values from the
approximate tests will slightly differ from PROC FREQ's
Mantel-Haenszel test which is based on the hypergeometric
variance. For a single 2x2 table, the resuh will be identical
with the p value from Pearson's chi-square test.
0.0003
21.00
27.00
77.78
1.00
11.00
!1. 09
7.00
14.00
50.00
The exact stratified upper-tailed p value agrees with the
result given in Koch and Edwards (1988, p. 417).
p(exact two-tailed)=
~
tmt diff
13.00
32.00
40.63
Uppertailed keyword: If neither UPPERTAILED nor
LOWERTAILED is specified, two-tailed tests are
calculated.
p(exact uppertailed) =P(r. 0, 2: r. OJ
Us Os) + P<r. 0.
RawJ>
t
Perm= parameter: If in any stratum the marginal sum for
success or failure exceeds the value of the PERM=
parameter, normal approximation will be used. Hence, set
PARM= to a high value if you want an exact test.
Exact p values: Much more interestingly, MULTIEST is
also able to calculate exact p values as follows:
P<r. 0,
Test
p
E ~s 0, - E»
which can be shown to be the exact p values both for the
CA statistic and for the ~H statistic (See appendix).
The SAS code to obtain the exact p values is as follows:
877
Statistics
More than two proportions: kx2 tables
Analysis ofElC81tlple 3:
Exact test of global hypothesis with PROC FREQ:
You may have more than just two proportions, such as
adverse event rates in three treatment groups with rising
doses:
proc freq data=one;
weight count;
tables drug*ae / exact;
run;
Example 3: One k*2 table
Groupg
Trend Score
Headache
%
------------ OUTPUT: ----------------
Placebo (n=27)
0
14/27
52%
STATISTICS FOR TASLE OF DRUG BY AE
5mg(no:7)
5
5/7
71%
IOmg(n=7)
10
6/7
86%
Statistic
DF
Value
Prob
2
3.066
0.216
2
3.336
0.18~
1
2.976
0.084
Chi-Square
Likelihood Ratio
Chi-Square
Mantel-Haenszel
Chi-Square
Fisher's Bxact ~est
In the examples on the right hand side, we assume that
these data were read as follows:
0.263
(2-~ail)
data one;
input drug
ae
cards;
0
1.
1.4
0
1.3
0
5
5
1.
2
5
0
6
10
1.
10
1
0
count;
Analysis ofElC81tlple 3:
Exact test of trend hypothesis with PROe MULTTEST:
proc rnulttest data=one outo:pvals;
class drug;
contrast 'trend' 0 5 10;
freq count;
test CA(ae/perrn=1.0000);
Two different hypotheses can be investigated: (1) Global
hypothesis: Axe there any differences among the k
proportions? (2) Trend hypothesis: Is there a dose-related
trend among the k proportions? PROC FREQ provides
large-sample tests for both questions, but an exact test only
for (I). The second analysis example shows how an exact
trend test (2) can be obtained with PROe MULnEST.
run;
------------ OUTPUT: ---------------MULTTEST TABLBS
Class
Variable
Problems
AS
Statistic
Count
N
Problems with the new procedure MULnEST exist
concerning time and mernory. Both can be prohibitive
when analyzing datasets with moderate or large cell sizes. It
is important to set the PERM= parameter judiciously so that
MULTTEST switches to the far less demanding normal
approximation whenever this makes sense.
Percent
0
5
14.00
27.00
51.85
5.00
7.00
71.43
Exact p values for the table
878
Test
Rawy
trend
0.1011
10
6.00
7.00
85.71
Statistics
Technical comments
References
Note 1: The uppertailed exact p value is the exact p value
for the CA statistic because ZCA is nothing but a linear
transfonnation of L. O. - Hence both statistics ~ and L.
O. order the sample space in the same way. This means the
probability for L. O. to be ~ observed is the same as the
probability of the CA statistic to be greater than observed.
Agresti, A:A Survey of Exact Inference for Contingency
Tables. Statistical Science, 7, No.1, 131-177, 1992
Gansky, SA, Koch, GG: Evaluation of Alternative
Strategies for Managing Stratification Factors in the
Analysis of Randomized Clinical Trials. ASA 1994
Proceedings of the Biopharmaceutical Section
Note 2: The two-sided p value is the exact p value for the
QMH statistic because the unsquared QMH statistic is also a
linear transfonnation of :E. 0.: Writing Z for the unsquared
QMH statistic, 0 for observed L. 0., and 0 for the
corresponding random variable:
P (QMH :.!: (D-EiNar) = p (/ZI :.!: ID-EI/War)
Hoe~ Paul G.: Introduction to Mathematical Statistics. Fifth
ed., Wiley 1984
Koch GG, Edwards S: Clinical Efficacy Trials with
Categorical Data.
In Peace KE: Biopharmaceutical
Statistics for Drug Development NY and Basel 1988
=
p (Z :.!: (O-E)/War)
+
p(Z ~O-E)/War)=
p(o-E:.!:O-E)
+
p(Zo-E ~(O-E)}=
p(o:.!:O)
+
p(o~
Rice, W. R.: A New Probability Model for Detennining
Exact P-Values for 2x2 Contingency Tables When
Comparing Binomial Proportions. Biometrics ~ 1-22,
March 1988
SAS Institute: The MULTTEST procedure. In SAS
Technical Report, Release 6.07, SAS Institute 1992
2E-0)
which is what MUL'ITEST calculates (SAS 1992, The
MUL'ITEST procedure, p. 385).
Author's address:
Lothar Tremmel, Ph.D., Assistant Director, Biostatistics,
Bio-Pbarm Clinical Services, 4 Valley Square, Blue Bell,
PA 19422. Tel.: (215) 283-0770, Fax: (215) 283-0733.
Acknowledgments: The author would like to thank John
Cohen for his helpful comments to an earlier draft of this
paper.
879