Chapter 2-4. Comparison of Two Independent Groups
In this chapter, we consider the situation where we want to compare two groups of subjects.
This is called the “independent groups” situation, because any given subject is in only one group
or the other.
Different statistical tests (often called significance tests) are required for the situation where the
measurements are taken on the same subjects more than once, such as with baseline and post-intervention measurements.
Usually, a regression model is used in a study to test the research hypothesis while controlling for
confounding variables. The situation where the two-independent-group significance tests, which
are not regression models, are most frequently used is the Table 1 “Patient Characteristics” table
of an article.
Table 1. Patient Characteristics
Almost every researcher will report the descriptive statistics for a long list of variables, showing
that the study groups (e.g., active drug intervention vs. placebo) are balanced (similarly
distributed) on these variables.
For example, Brady et al (2000) include the following table (only partially shown) in their JAMA
article:

Table 1. Demographic and Clinical Data

                        Sertraline     Placebo         P
Variable                (n = 94)       (n = 93)      Value
Sex, %                                                .48
  Female                  75.5           71.0
  Male                    24.5           29.0
Age, mean (SD), y       40.2 (9.6)    39.5 (10.6)     .54
…
Referring to this table in their text, they report,
“For the total randomized sample there were no significant differences between
the treatment groups in any of the baseline demographic and clinical
characteristics (TABLE 1).”
_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2011. http://www.ccts.utah.edu/biostats/?pageId=5385
Chapter 2-4 (revision 17 Oct 2011)
p. 1
The argument that Brady presents with her Table 1, and the statement that refers to Table 1,
is that the variables listed in the table have been ruled out as potential confounders. She does
this by eliminating the confounder-exposure association (see box), where the exposure is the study
drug and the potential confounder is any variable listed in Table 1.
Properties of a confounding factor
A confounding factor must have an effect on disease and it must be imbalanced between the
exposure groups to be compared.
That is, a confounding factor must have two associations:
1) A confounder must be associated with the disease.
2) A confounder must be associated with exposure.
Diagrammatically, the two necessary associations for confounding are:

                  Confounder
                 /          \
        association        association
               /              \
        Exposure -------------- Disease
              confounded effect
There is also a third requirement.
A factor that is an effect of the exposure and an intermediate step in the causal pathway from
exposure to disease will have the above associations, but causal intermediates are not
confounders; they are part of the effect that we wish to study.
Thus, the third property of a confounder is as follows:
3) A confounder must not be an effect of the exposure.
Rothman (2002, p.164) criticizes the practice of statistically comparing baseline characteristics
in clinical trials, which researchers do to rule out confounding (i.e., to show that a variable is not a
confounder by showing that one of the associations required for confounding does not exist).
Rothman argues that the degree of confounding does not depend upon statistical significance, but
rather upon the strength of the associations between the confounder and both exposure and
disease. He proposes that a better way to evaluate confounding, in a clinical trial or with any
study design, is to statistically control for the potential confounder (using stratification or
regression analyses, discussed in a later chapter) and determine whether the un-confounded
result differs from the crude (the simple analysis without stratification or regression), potentially
confounded result.
Personally, I think it is still useful to include a Table 1, with p values. It is a convenient way to
alert readers to potential confounding variables. After that, you can go on to evaluate
confounding like Rothman suggests.
In clinical trials, where randomization is used, it is frequently argued that the p values do not
make sense. A p value is normally used to test whether a difference exists in the sampled population,
which is the usual observational study interpretation. In randomized clinical trials, bench
experiments, or animal experiments, one starts with a single group, so there is no imbalance in the sampled
population. Any observed imbalance is simply due to the randomization, so just what does the p
value mean in this situation? Still, the p value is frequently reported for such studies, because it
alerts the reader and investigator to an imbalance induced by the randomization process, which
in turn could induce confounding that should be controlled for, provided the sample size is large
enough to allow for this.
Asymptotic Tests vs. Exact Tests
Asymptotic tests give accurate p values only for large sample sizes (as n → ∞), the p value being
based on the Central Limit Theorem, which is discussed below. Exact tests give accurate p
values for any sample size, the p values not being based on the Central Limit Theorem. Thus it
can be argued that exact tests are always preferable; however, this is controversial, particularly
for the 2 × 2 crosstabulation table case, as we will see below.
Two Independent Groups Comparison of a Dichotomous Variable
Suppose we have an Active Drug group and a Placebo Group in our clinical trial. We wish to
test if the groups are balanced on our gender variable (equal distributions of males and females in
the two study groups).
The variable being tested is often referred to as the “dependent variable”, and the variable
defining the study groups is often referred to as the “independent variable”. This nomenclature is
consistent with the idea of a deterministic function in algebra, Y = f(X), where the dependent
variable (Y) depends on the value of the independent variable (X). This, however, implies Y is
caused by X, which may not be the case at all. For example, there might be an intermediate
variable, which is not recorded, that is the actual causal factor. For this reason, many
statisticians prefer the terms “outcome” and “predictor” for the Y and X variables, which allow
for simply modeling an association. (Steyerberg, 2009, p.101)
The most popular test for comparing two groups on a dichotomous dependent variable is the chi-square test (frequently called the “chi-squared” test). The second most popular test is the
Fisher’s exact test.

There is a third test found in elementary statistics textbooks, called the “Two-Sample Test for
Binomial Proportions (Normal-Theory Test)”, or two-proportions z test. It is algebraically
identical to the chi-square test (without Yates continuity correction) [see box]. Since the chi-square test is better known, you should just use that.
Equivalence of Chi-Square Test for a 2 × 2 Table and the Two-Proportions z Test (Altman,
1991, pp 257-258)

Given a 2 × 2 table,

                 Group 1      Group 2
                    a            b
                    c            d
      Total     a+c = n1     b+d = n2       N = n1 + n2

we have p1 = a/(a+c), p2 = b/(b+d), and the pooled proportion is p = (a+b)/N.

Then the z test for comparing two proportions is given by

    z = (p1 - p2) / sqrt[ p(1 - p)(1/n1 + 1/n2) ]
      = (p1 - p2) / standard error of (p1 - p2)

Substituting, this is equivalent to

    z = [ a/(a+c) - b/(b+d) ] / sqrt[ ((a+b)/N) ((c+d)/N) (1/(a+c) + 1/(b+d)) ]

which, after some manipulation, gives

    z² = N(ad - bc)² / [ (a+b)(a+c)(b+d)(c+d) ] = χ²
Thus, the chi-square statistic with 1 degree of freedom (the 2 × 2 table case) is identically the square of
the z statistic (the chi-square distribution with 1 df is the square of the standard normal distribution).
Most statistics books provide the following formula for the chi-square test:

    χ² = Σ (O - E)² / E = N(ad - bc)² / [ (a+b)(a+c)(b+d)(c+d) ]

where in the first formula (the theoretical formula) the sum is over all cells of the
crosstabulation table, with

    O = observed cell frequency
    E = expected cell frequency (defined below)

and the second formula is the quick computational formula that is algebraically
equivalent.
With this formula, it is difficult to see that the test statistic is a signal-to-noise ratio, an idea
introduced in Chapter 2. All statistical tests have the form of a signal-to-noise ratio (Stoddard
and Ring, 1993; Borenstein M, 1997). In the above box, we see that this formula is algebraically
identical to the two-proportions Z test, which is clearly a signal-to-noise ratio (effect divided by
its variability, or standard error).
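This equivalence is easy to check numerically. The following sketch (in Python rather than Stata, using an illustrative 2 × 2 table that also appears later in this chapter as tabi 30 46 \ 270 254) computes the chi-square statistic with both formulas and squares the z statistic; all three agree.

```python
import math

# illustrative 2 x 2 table: rows = outcome yes/no, columns = group 1 / group 2
a, b = 30, 46
c, d = 270, 254
N = a + b + c + d

# theoretical formula: sum of (O - E)^2 / E over the four cells
rows, cols = [a + b, c + d], [a + c, b + d]
obs = [[a, b], [c, d]]
chi2_theory = sum((obs[i][j] - rows[i] * cols[j] / N) ** 2
                  / (rows[i] * cols[j] / N)
                  for i in range(2) for j in range(2))

# quick computational shortcut: N(ad - bc)^2 / [(a+b)(a+c)(b+d)(c+d)]
chi2_short = N * (a * d - b * c) ** 2 / (rows[0] * rows[1] * cols[0] * cols[1])

# two-proportions z statistic with pooled standard error
p1, p2, p = a / (a + c), b / (b + d), (a + b) / N
z = (p1 - p2) / math.sqrt(p * (1 - p) * (1 / (a + c) + 1 / (b + d)))

print(round(chi2_theory, 4), round(chi2_short, 4), round(z * z, 4))
# all three are 3.857, the Pearson chi2(1) value Stata reports for this table
```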
The Fisher’s exact test is an example of an “exact test”. That is, it gives a legitimate p value
even for small sample sizes. Let’s begin with this test.

We will use the births dataset (see box). This dataset is from a study where the investigators
wanted to test the association between maternal hypertension and a preterm delivery outcome of
the pregnancy.
Births Dataset ( births.dta )
This dataset is distributed with the textbook: Hills M, De Stavola BL. A Short Introduction to
Stata for Biostatistics. London, Timberlake Consultants Ltd. 2002.
http://www.timberlake.co.uk
The dataset concerns 500 mothers who had singleton births in a large London hospital.
Codebook

Variable   Labels
id         subject number
bweight    birth weight (grams)
lowbw      birth weight < 2500 g (1=yes, 0=no)
gestwks    gestational age (weeks)
preterm    gestational age < 37 weeks (1=yes, 0=no)
matage     maternal age (years)
hyp        maternal hypertension (1=hypertensive, 0=normal)
sex        sex of baby (1=male, 2=female)
sexalph    sex of baby (alphabetic coding: “male”, “female”)
Start the Stata program and read in the data,
File
Open
Find the directory where you copied the course CD:
Change to the subdirectory: datasets & do-files
Single click on births.dta
OK
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\Section 2 Biostatistics\
datasets & do-files\births.dta", clear

which must be all on one line, or use:

cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\Section 2 Biostatistics\"
cd "datasets & do-files"
use births.dta, clear
In the Births Dataset, births.dta, let’s test whether or not preterm deliveries occur more
frequently for mothers with hypertension than for mothers without hypertension. We display the
two variables simultaneously using a contingency table (also called a cross-tabulation table).
Requesting a crosstabulation table with the preterm outcome as the rows and maternal
hypertension as the columns, so column percents are the most useful percentages,
Statistics
Summaries, tables & tests
Tables
Two-way tables with measures of association
Main tab: Row variable: preterm
Column variable: hyp
Cell contents: within column relative frequencies
OK
tabulate preterm hyp, column
+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

           |      hypertens
  pre-term |         0          1 |     Total
-----------+----------------------+----------
         0 |       375         52 |       427
           |     89.50      73.24 |     87.14
-----------+----------------------+----------
         1 |        44         19 |        63
           |     10.50      26.76 |     12.86
-----------+----------------------+----------
     Total |       419         71 |       490
           |    100.00     100.00 |    100.00
We observe that mothers with hypertension delivered a preterm baby more frequently
(26.76%) than mothers without hypertension (10.50%).

We can test this hypothesis,

    H0: p(hypertension present) = p(hypertension absent)
    i.e., H0: no association between preterm delivery and maternal hypertension

where p is the population proportion of preterm deliveries, with the Fisher’s exact test,
Statistics
Summaries, tables & tests
Tables
Two-way tables with measures of association
Main tab: Row variable: preterm
Column variable: hyp
Cell contents: within column relative frequencies
Test statistics: Fisher’s exact test
OK
tabulate preterm hyp, column exact
+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

           |      hypertens
  pre-term |         0          1 |     Total
-----------+----------------------+----------
         0 |       375         52 |       427
           |     89.50      73.24 |     87.14
-----------+----------------------+----------
         1 |        44         19 |        63
           |     10.50      26.76 |     12.86
-----------+----------------------+----------
     Total |       419         71 |       490
           |    100.00     100.00 |    100.00

           Fisher's exact =                 0.001   <- use this one (2-sided test)
   1-sided Fisher's exact =                 0.000
supporting the conclusion that maternal hypertension is a risk factor for pre-term delivery (p =
0.001).
For the Fisher’s exact test, there is no test statistic--only a p value. The Fisher’s exact test is
simply a direct probability calculation (a p value calculation). The first p value listed is a 2-sided
comparison. Always report the two-sided p value (we’ll see in the next chapter why we do this).
Alternatively, we could test this same hypothesis using the chi-square test.
Statistics
Summaries, tables & tests
Tables
Two-way tables with measures of association
Main tab: Row variable: preterm
Column variable: hyp
Cell contents: within column relative frequencies
Test statistics: Pearson’s chi-squared
OK
tabulate preterm hyp, chi2 column
+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

           |      hypertens
  pre-term |         0          1 |     Total
-----------+----------------------+----------
         0 |       375         52 |       427
           |     89.50      73.24 |     87.14
-----------+----------------------+----------
         1 |        44         19 |        63
           |     10.50      26.76 |     12.86
-----------+----------------------+----------
     Total |       419         71 |       490
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =  14.3254   Pr = 0.000
We would report this p value as (p < .001). It is actually p < 0.0005, since it did not round up to
0.001 in the third decimal place, but there is never a reason to show a p value to more than three decimal
places. This is because the decision about significance is made using two decimal places (a
comparison with 0.05).
Notice that Stata calls the chi-square test the “Pearson” chi-square to distinguish it from other
versions of a chi-square statistic (likelihood ratio chi-square and Cochran-Mantel-Haenszel chi-square), which can also be computed in Stata. The Pearson chi-square test is simply the test
everyone just calls the “chi-square test”, so you never need to add the “Pearson” qualifier to it
when you publish.
Stata provided two p values for the Fisher’s exact test (a two-tailed and a one-tailed p value).
For the chi-square test, Stata only provides one p value. This is the two-tailed p value. To get a
one-tailed p value (in the unlikely event you need it), you simply divide the p value by 2 (e.g., one-tailed p = 0.605/2 ≈ 0.303) (Breslow and Day, 1980, p.139). For the Fisher’s exact test, the one-tailed p value is not equal to half the two-tailed p value, as we’ll see below, so Stata provides the one-tailed p value for you.

In the box a few pages up, it was pointed out that there is another test statistic, called the two-sample test of proportions, or two-proportions z test. Since this test is algebraically identical to
the chi-square test, the chi-square test is normally reported, being a more widely recognized test.
Just for completeness, let’s compute that test using Stata.
Statistics
Summaries, tables & tests
Classical tests of hypotheses
Two-group proportions tests
Main tab: Variable name: preterm
Group variable name: hyp
OK
prtest preterm , by(hyp)
Two-sample test of proportion                      0: Number of obs =      419
                                                   1: Number of obs =       71
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           0 |   .1050119   .0149769                      .0756578    .1343661
           1 |   .2676056   .0525401                      .1646289    .3705824
-------------+----------------------------------------------------------------
        diff |  -.1625937    .054633                     -.2696725   -.0555149
             |  under Ho:   .0429586   -3.78   0.000
------------------------------------------------------------------------------
        diff = prop(0) - prop(1)                                 z =  -3.7849
    Ho: diff = 0

    Ha: diff < 0             Ha: diff != 0               Ha: diff > 0
 Pr(Z < z) = 0.0001     Pr(|Z| < |z|) = 0.0002      Pr(Z > z) = 0.9999
When we computed the chi-square test above, we got

           |      hypertens
  pre-term |         0          1 |     Total
-----------+----------------------+----------
         0 |       375         52 |       427
           |     89.50      73.24 |     87.14
-----------+----------------------+----------
         1 |        44         19 |        63
           |     10.50      26.76 |     12.86
-----------+----------------------+----------
     Total |       419         71 |       490
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =  14.3254   Pr = 0.000
We cannot tell that they are algebraically identical tests from the p values, due to insufficient decimal
places displayed. For a 2 × 2 table, which gives a one degree of freedom chi-square test, the
chi-square statistic is simply the z statistic squared. To see this,

display (-3.7849)*(-3.7849)
14.325468

which is identically the chi-square test statistic. You can be confident that the p values
are identical, as well.
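As a cross-check (in Python rather than Stata; the variable names are mine), we can re-derive both statistics from the cell counts and confirm that the two p values are the same tail area, since the chi-square(1 df) upper tail equals the two-sided standard normal tail.

```python
import math

# births data, from the crosstabulation above
preterm_0, n0 = 44, 419    # preterm deliveries among hyp = 0 mothers
preterm_1, n1 = 19, 71     # preterm deliveries among hyp = 1 mothers

# two-proportions z statistic with pooled standard error (what prtest uses)
p1, p2 = preterm_0 / n0, preterm_1 / n1
p = (preterm_0 + preterm_1) / (n0 + n1)
z = (p1 - p2) / math.sqrt(p * (1 - p) * (1 / n0 + 1 / n1))

# Pearson chi-square via the computational shortcut
a, b, c, d = 375, 52, 44, 19    # not-preterm row, then preterm row
N = a + b + c + d
chi2 = N * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# the two p values agree: two-sided normal tail = chi2(1 df) upper tail
p_from_z = math.erfc(abs(z) / math.sqrt(2))
p_from_chi2 = math.erfc(math.sqrt(chi2) / math.sqrt(2))

print(round(z, 4), round(chi2, 4))   # -3.7849 14.3254, matching the output
```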
Chi-Square Test with Continuity Correction
There is another form of the chi-square test, called either the “continuity corrected chi-square
test” or “chi-square test with continuity correction” or “chi-square test with Yates continuity
correction”. Stata does not provide this, although it is frequently advocated in statistics
textbooks. It is automatically output in the SPSS statistical software.
Showing an SPSS output for a comparison that is not so significant:
PRETERM * SEX Crosstabulation

                                     SEX
                              male      female      Total
PRETERM   0   Count            225         202        427
              % within SEX    87.9%       86.3%      87.1%
          1   Count             31          32         63
              % within SEX    12.1%       13.7%      12.9%
Total         Count            256         234        490
              % within SEX   100.0%      100.0%     100.0%

Chi-Square Tests

                                         Asymp. Sig.  Exact Sig.  Exact Sig.
                              Value  df   (2-sided)    (2-sided)   (1-sided)
Pearson Chi-Square           .268(b)  1        .605
Continuity Correction(a)     .146     1        .702
Likelihood Ratio             .267     1        .605
Fisher's Exact Test                                         .686        .351
Linear-by-Linear Association .267     1        .605
N of Valid Cases              490

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 30.09.
We see that the continuity corrected chi-square test has a larger p value (.702) than the uncorrected chi-square test (.605). We also see that the continuity corrected chi-square p value is closer to the Fisher’s
exact test p value (.686). The Fisher’s exact p value is, in turn, much larger (more conservative) than
the uncorrected chi-square p value.

This illustrates a controversy among statisticians. One camp claims that the continuity correction
should always be applied, because the p value is more accurate and because it is closer to an
exact p value (the Fisher’s exact p value). The other camp claims that the continuity correction
should not be applied, because it takes the p value closer to the Fisher’s exact test p value, which
is not a good thing, because it is known that the Fisher’s exact p value is conservative (does not
drop below alpha, 0.05, often enough). (Agresti, 1990, p.68)

Stata does not even offer a continuity corrected chi-square test. This is because the camp of
statisticians against the continuity correction has made a sufficiently compelling argument.
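For reference, the Yates correction replaces (ad - bc)² with (|ad - bc| - N/2)² in the computational formula. A quick Python sketch (not part of the manuscript) reproduces both SPSS values from the table above:

```python
# PRETERM * SEX table from the SPSS output above
a, b = 225, 202   # preterm = 0: male, female
c, d = 31, 32     # preterm = 1: male, female
N = a + b + c + d
denom = (a + b) * (c + d) * (a + c) * (b + d)

# uncorrected Pearson chi-square
chi2 = N * (a * d - b * c) ** 2 / denom

# Yates continuity correction: subtract N/2 from |ad - bc| before squaring
chi2_yates = N * (abs(a * d - b * c) - N / 2) ** 2 / denom

print(round(chi2, 3), round(chi2_yates, 3))   # 0.268 0.146, as in SPSS
```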
Let’s illustrate that the Fisher’s exact test is conservative with a Monte Carlo simulation,
computing the long-run average of 10,000 samples.
We first compare the power to detect a difference between 10% and 20% using sample sizes of
300 in each group.

*------------------------------------------------------------
* Compare uncorrected chi-square test and Fisher's exact test
*------------------------------------------------------------
*-- step 1: compare power --
*
*                |   Group A     Group B   |
*  --------------+-------------------------+-----
*   Outcome Yes  |   30 (10%)    60 (20%)  |  90
*           No   |  270         240        | 510
*  --------------+-------------------------+-----
*         Total  |  300         300        | 600
*
* sampsi .10 .20 , alpha(.05) n1(300) n2(300)  --> power = 0.9145

Times observe p<0.05 for chi-squared test: 9352 out of 10,000 samples (93.52%)
Times observe p<0.05 for Fisher's exact test: 9183 out of 10,000 samples (91.83%)
expected answer is power = .9145, or 91.45%

We see that the uncorrected chi-square test is slightly more powerful (93.5% vs 91.8%), about an
absolute 1.5% difference.
Next we will determine whether the test is conservative, by sampling from populations with 10% and
10%, so there is no difference to be detected. We expect to get significance by chance 5% of the
time.

*-- step 2: compare alpha --
*
*                |   Group A     Group B   |
*  --------------+-------------------------+-----
*   Outcome Yes  |   30 (10%)    30 (10%)  |  60
*           No   |  270         270        | 540
*  --------------+-------------------------+-----
*         Total  |  300         300        | 600

Times observe p<0.05 for chi-squared test: 496 out of 10,000 samples (4.96%)
Times observe p<0.05 for Fisher's exact test: 350 out of 10,000 samples (3.5%)
expected answer is alpha = .05, or 5%
We see that the Fisher’s exact test in this example is indeed conservative (does not show
significance frequently enough). We again see about a 1.5% absolute difference between the
Fisher’s exact test and the chi-square test, the chi-square test outperforming Fisher’s exact test.
By some trial and error, we can find the following borderline significant case, which illustrates
the frustration that can arise by limiting yourself to the Fisher’s exact test (the Fisher’s exact test is not
significant, but the chi-square test is).
tabi 30 46 \ 270 254 , chi2 exact
           |         col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        30         46 |        76
         2 |       270        254 |       524
-----------+----------------------+----------
     Total |       300        300 |       600

          Pearson chi2(1) =   3.8570   Pr = 0.050
           Fisher's exact =                 0.065
   1-sided Fisher's exact =                 0.033
Again, we see about a 1.5% absolute difference between the Fisher’s exact test p value (0.065) and the
chi-square test p value (0.050), with the chi-square test reaching significance while the Fisher’s exact test does not.
If you are curious how this simulation was run, here is the first part:

*------------------------------------------------------------
* Compare uncorrected chi-square test and Fisher's exact test
*------------------------------------------------------------
*-- step 1: compare power --
*
*                |   Group A     Group B   |
*  --------------+-------------------------+-----
*   Outcome Yes  |   30 (10%)    60 (20%)  |  90
*           No   |  270         240        | 510
*  --------------+-------------------------+-----
*         Total  |  300         300        | 600
*
* sampsi .10 .20 , alpha(.05) n1(300) n2(300)  --> power = 0.9145

clear
set seed 999
scalar chi_signif=0
scalar fish_signif=0
quietly set obs 600
quietly gen group = 0 in 1/300
quietly replace group = 1 in 301/600
quietly gen v1=.
quietly gen outcome=.
forvalues x = 1/10000 {
    quietly replace v1 = uniform()  /* random number between 0 and 1 */
    quietly replace outcome = 0
    quietly replace outcome = 1 if (v1 <= .10) in 1/300
    quietly replace outcome = 1 if (v1 <= .20) in 301/600
    quietly tab outcome group , chi2 exact
    if r(p) < 0.05 {
        scalar chi_signif = chi_signif + 1
    }
    if r(p_exact) < 0.05 {
        scalar fish_signif = fish_signif + 1
    }
}
display "Times observe p<0.05 for chi-squared test: " ///
    chi_signif " out of 10,000 samples (" chi_signif/10000*100 "%)"
display "Times observe p<0.05 for Fisher's exact test: " ///
    fish_signif " out of 10,000 samples (" fish_signif/10000*100 "%)"
display "expected answer is power = .9145, or 91.45%"
*-- end step 1 --
and here is the second part:

*------------------------------------------------------------
* Compare uncorrected chi-square test and Fisher's exact test
*------------------------------------------------------------
*-- step 2: compare alpha --
*
*                |   Group A     Group B   |
*  --------------+-------------------------+-----
*   Outcome Yes  |   30 (10%)    30 (10%)  |  60
*           No   |  270         270        | 540
*  --------------+-------------------------+-----
*         Total  |  300         300        | 600

clear
set seed 999
scalar chi_signif=0
scalar fish_signif=0
quietly set obs 600
quietly gen v1=.
quietly gen outcome=.
quietly gen group = 0 in 1/300
quietly replace group = 1 in 301/600
forvalues x = 1/10000 {
    quietly replace v1 = uniform()  /* random number between 0 and 1 */
    quietly replace outcome = 0
    quietly replace outcome = 1 if (v1 <= .10) in 1/300
    quietly replace outcome = 1 if (v1 <= .10) in 301/600
    quietly tab outcome group , col chi2 exact
    if r(p) < 0.05 {
        scalar chi_signif = chi_signif + 1
    }
    if r(p_exact) < 0.05 {
        scalar fish_signif = fish_signif + 1
    }
}
display "Times observe p<0.05 for chi-squared test: " ///
    chi_signif " out of 10,000 samples (" chi_signif/10000*100 "%)"
display "Times observe p<0.05 for Fisher's exact test: " ///
    fish_signif " out of 10,000 samples (" fish_signif/10000*100 "%)"
display "expected answer is alpha = .05, or 5%"
*-- end step 2 --
Exact Tests (Permutation Tests)
We will now see an explanation for why the Fisher’s exact test is conservative.
Exact tests are also called permutation tests. An exact p value can be computed for any
nonparametric test (we define nonparametric tests later), if you have the software available (such
as StatXact). Generally, asymptotic (large sample approximation) p values are computed for
most nonparametric tests (the chi-square test is a good example).

The way such tests work is to compute the p value by summing the probabilities of the
observed table along with the probabilities of all permutations of the data that are more extreme.
Matthews and Farewell (1985, pp. 24-26) illustrate this approach for the Fisher’s exact test. The
observed data are shown in the following table:
Tumor activity of two drugs in leukemic mice

                Complete remission
                 yes      no     Total
Methyl GAG        7        3       10
6-MP              2        7        9
Total             9       10       19
Holding the row and column totals fixed (the marginals), we construct all possible permutations
of the data, and compute their probabilities (these are called hypergeometric probabilities, which we
will omit learning about). [MStat students: see box]

All ten tables have row totals 10 and 9 and column totals 9 and 10; each is shown as
(first row / second row):

Table 0  (0,10 / 9,0):  p = 0.00001
Table 1  (1,9  / 8,1):  p = 0.0009
Table 2  (2,8  / 7,2):  p = 0.0175
Table 3  (3,7  / 6,3):  p = 0.1091
Table 4  (4,6  / 5,4):  p = 0.2864
Table 5  (5,5  / 4,5):  p = 0.3437
Table 6  (6,4  / 3,6):  p = 0.1910
Table 7  (7,3  / 2,7):  p = 0.0468   <- observed table
Table 8  (8,2  / 1,8):  p = 0.0044
Table 9  (9,1  / 0,9):  p = 0.00011

All tables at least as extreme as the observed table are those with table probabilities less than or
equal to the probability of the observed table. So the 2-sided p value is (tables 0, 1, 2, 7, 8, 9)

display 0.00001+0.0009+0.0175+0.0468+0.0044+0.00011
.06982

The 1-sided p value is (tables 7, 8, 9)

display 0.0468+0.0044+0.00011
.05131
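The same enumeration can be sketched in Python (standard library only; not part of the manuscript):

```python
import math

n1 = 10               # row total: Methyl GAG
r1 = 9                # column total: complete remissions
N = 19                # grand total

# hypergeometric probability of each possible table, indexed by the count k
# of remissions in the Methyl GAG row: C(9,k) C(10,10-k) / C(19,10)
prob = {k: math.comb(r1, k) * math.comb(N - r1, n1 - k) / math.comb(N, n1)
        for k in range(0, min(n1, r1) + 1)}

p_obs = prob[7]                                    # observed table, ~0.0468
two_sided = sum(p for p in prob.values() if p <= p_obs + 1e-12)
one_sided = sum(prob[k] for k in range(7, 10))     # tables 7, 8, 9

print(round(two_sided, 3), round(one_sided, 3))    # matches Stata's 0.070 / 0.051
```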
For Master of Statistics Students Only -- Hypergeometric Probabilities Computed in
Fisher’s Exact Test
This presentation follows closely that of Rice (1988, pp. 434-436).
We denote the permutations of the data for the Fisher’s exact test as,

    N11   N12 | n1.
    N21   N22 | n2.
    ----------+----
    n.1   n.2 | n..
and assume for purposes of probability calculation that the margins of the table are fixed.
For the specific permutation observed in Matthews and Farewell’s (1985, pp. 24-26) example
above,

Tumor activity of two drugs in leukemic mice

                Complete remission
                 yes      no     Total
Methyl GAG        7        3       10
6-MP              2        7        9
Total             9       10       19
we consider the count N11, the number of leukemic mice treated with Methyl GAG who
experience complete remission. Under the null hypothesis of no association, the distribution of
N11 is that of the number of successes in 10 draws (without replacement) from a population of 9
successes and 10 failures. That is, the distribution of N11 induced by chance is hypergeometric,
with probability

    P(N11 = n11) = [ C(n.1, n11) × C(n.2, n12) ] / C(n.., n1.)

where C(n, k) denotes the binomial coefficient (“n choose k”). For the observed table, the
probability of observing 7 in the N11 cell is

    P(N11 = 7) = [ C(9, 7) × C(10, 3) ] / C(19, 10)

Performing the calculation in Stata,

display comb(9,7)*comb(10,3)/comb(19,10)

we get .04676438
We verify this is how the p values for the Fisher’s exact test are computed in Stata.
tabi 7 3 \ 2 7 , chi2 exact
           |         col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |         7          3 |        10
         2 |         2          7 |         9
-----------+----------------------+----------
     Total |         9         10 |        19

          Pearson chi2(1) =   4.3372   Pr = 0.037
           Fisher's exact =                 0.070
   1-sided Fisher's exact =                 0.051
We see that the p values for the 2-sided and 1-sided Fisher’s exact test agree with what we computed
above. We also see that we missed significance with the Fisher’s exact test, but would get it with the
chi-square test. Unfortunately, the data are too sparse to apply the chi-square test (they violate the rule
of thumb presented below).
Where Does the Conservativeness of Fisher’s Exact Test Come From?
The conservativeness comes entirely from the discreteness of the test statistic (Cytel, 2001, pp
1058-1061).

An asymptotic test computes its p value by integrating the area under the curve of the
sampling distribution (such as the chi-square distribution), and so conceivably one can get a p
value very close to alpha = 0.05. Fisher’s exact test, on the other hand, sums up a discrete
number of probabilities. One sum might be a bit below 0.05; adding one more probability to the
sum might raise the sum above 0.05. Since this sum has to change in discrete steps, it cannot get
smoothly close to 0.05.
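The discreteness is easy to see by listing every p value the Fisher’s exact test can possibly produce for fixed margins. A Python sketch (mine, using the margins of the leukemic mice example above) shows that the largest attainable p value at or below 0.05 is about 0.023, so the test can never reject at a realized level anywhere near the nominal 5%:

```python
import math

n1, r1, N = 10, 9, 19   # row total, column total, grand total (fixed margins)
prob = {k: math.comb(r1, k) * math.comb(N - r1, n1 - k) / math.comb(N, n1)
        for k in range(0, min(n1, r1) + 1)}

# two-sided Fisher p value that would be reported if table k were observed
p_two = {k: sum(p for p in prob.values() if p <= prob[k] + 1e-12)
         for k in prob}

attainable = sorted(set(round(p, 6) for p in p_two.values()))
realized_alpha = max((p for p in attainable if p <= 0.05), default=0.0)
print(realized_alpha)   # about 0.023: the test rejects at most ~2.3% of the
                        # time at nominal 5%, hence "conservative"
```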
Minimum Expected Frequency Rule for Using the Chi-Square Test

The expected frequency of a contingency table cell is calculated as

    expected cell frequency = (row total × column total) / grand total.

There is one issue with the chi-square test, which even the continuity correction does not
remove. Being an asymptotic test, the chi-square test requires a sufficiently large sample size.
Just how large the sample size must be is determined by the expected cell frequencies, not the
observed cell counts themselves (Altman, 1991, p.253).

Daniel (1995, pp.524-526), in his statistics textbook, cites a rule attributable to Cochran (1954):

    2 × 2 table: the chi-square test should not be used if n < 20. If 20 < n < 40, the chi-square
    test should not be used if any expected frequency is less than 5. When n ≥ 40, three
    of the expected cell frequencies should be at least 5 and one expected frequency can be
    as small as 1.

    Larger than 2 × 2 table (r × c table): the chi-square test can be used if no more than 20%
    of the cells have expected frequencies < 5 and no cell has an expected frequency < 1.
Rosner (2006, pp. 396, 428), in his statistics textbook, citing Cochran (1954), proposes the
following:
No more than 20% of the cells should have expected frequencies < 5, and no cell should
have an expected frequency < 1. For a 2 × 2 table, no cell should have an expected
frequency < 5.
Altman (1991, pp. 248,253), in his statistics textbook, citing Cochran (1954), proposes the
following:
No more than 20% of the cells should have expected frequencies <5, with no cell having
expected frequency < 1; although for a 2 × 2 table, one cell can have an expected value
slightly lower than 5.
Stata provides the expected frequencies with the expect option. For the example above,
tabi 7 3 \ 2 7 , expect
+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
+--------------------+

           |         col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |         7          3 |        10
           |       4.7        5.3 |      10.0
-----------+----------------------+----------
         2 |         2          7 |         9
           |       4.3        4.7 |       9.0
-----------+----------------------+----------
     Total |         9         10 |        19
           |       9.0       10.0 |      19.0
We see that 75% of the cells (3 of 4) have an expected frequency < 5, so the data are too sparse for the chi-square test to give a sufficiently accurate p value. We are stuck with the non-significant Fisher’s exact test.
For the first cell, the expected frequency is 4.7. We can verify the calculation in Stata by applying the formula,
expected cell frequency = (row total × column total) / grand total.
display 10*9/19
4.7368421
The derivation of the expected cell frequency formula is shown in the following box.
Expected cell frequency
The expected cell frequency formula comes from the “multiplication rule for independent events” in probability. If two events, A and B, are independent (i.e., there is no association between the row and column variables), then the probability they will both occur is:
P(AB) = P(A)P(B) , where P(AB) = probability both occur
P(A) = probability A will occur
P(B) = probability B will occur
                        Column Variable
  Row Variable        Yes      No      Total
    Yes                a        b        r1
    No                 c        d        r2
    Total              c1       c2       N
A probability is just the proportion of times an event occurs, so
P(in Yes row) = r1/N , P(in Yes column) = c1/N
and
P(in Yes row and in Yes column) = (r1/N)(c1/N )
To get the expected cell frequency for cell a, we multiply the probability by the total sample size:
(r1/N)(c1/N)(N) = (row total)(column total)/(grand total), since the numerator N cancels with one of the denominator Ns.
The expected cell frequency represents the cell count that would be expected by chance, or
sampling variation.
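The formula in the box is easy to check outside Stata. Here is a small sketch in Python (my own illustration, not part of the course software; the function name expected_frequencies is invented) that reproduces the expected counts Stata printed for the 2 × 2 table above and counts how many fall below 5:

```python
def expected_frequencies(table):
    """Expected cell counts under independence:
    (row total x column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

# The 2 x 2 table from the tabi example above
observed = [[7, 3],
            [2, 7]]
expected = expected_frequencies(observed)
print([[round(e, 1) for e in row] for row in expected])
# -> [[4.7, 5.3], [4.3, 4.7]], matching Stata's expect option

sparse = sum(e < 5 for row in expected for e in row)
print(sparse / 4)  # -> 0.75, i.e. 75% of the cells are below 5
```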
To get the expected frequencies when the data are in variables, we use
Statistics
Summaries, tables & tests
Tables
Two-way tables with measures of association
Main tab: Row variable: preterm
Column variable: hyp
Cell contents: Expected frequencies
Open
tabulate preterm hyp, expected
+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
+--------------------+

           |       hypertens
  pre-term |         0          1 |     Total
-----------+----------------------+----------
         0 |       375         52 |       427
           |     365.1       61.9 |     427.0
-----------+----------------------+----------
         1 |        44         19 |        63
           |      53.9        9.1 |      63.0
-----------+----------------------+----------
     Total |       419         71 |       490
           |     419.0       71.0 |     490.0
What to use?
Occasionally, someone will advise researchers to always just use Fisher’s exact test, rather than the chi-square test, because the p value is always “accurate.” That is bad advice, since we saw above that Fisher’s exact test is conservative, so occasionally significance is needlessly lost. The generally more powerful, and more popular, approach is to use the uncorrected chi-square test if the expected frequency rule is met, and Fisher’s exact test if it is not. (In some rare cases, the relative power is reversed.)
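This decision rule can be sketched in code. The following Python illustration (my own, stdlib only, not the course software) computes the uncorrected chi-square statistic for a 2 × 2 table, its 1-df p value via math.erfc, and Rosner’s version of the rule for a 2 × 2 table (no expected count below 5), using the 30/46 vs. 270/254 table that appears in the Barnard section below:

```python
import math

def chi2_2x2(table):
    """Uncorrected Pearson chi-square for a 2 x 2 table; the p value uses
    the 1-df chi-square survival function, which equals erfc(sqrt(x/2))."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    expected = [[r * c / n for c in cols] for r in rows]
    stat = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))
    return stat, math.erfc(math.sqrt(stat / 2)), expected

def rule_met(expected):
    """Rosner's rule for a 2 x 2 table: no expected count below 5."""
    return all(e >= 5 for row in expected for e in row)

stat, p, expected = chi2_2x2([[30, 46], [270, 254]])
print(round(stat, 4), round(p, 3), rule_met(expected))  # -> 3.857 0.05 True
```

Since the rule is met here, the chi-square p value is the one to report; otherwise the sketch would fall back to Fisher’s exact test.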
Protocol/Article
Almost always you will find that authors do not give the details of expected frequencies in their
articles (to save space and because it is an elementary statistics principle) and just state:
Categorical variables were analyzed with the chi-square test or Fisher’s exact test, as appropriate.

This short version is the way I always state it.
For completeness, you could state the following; but since it is “basic statistics”, the reviewer
will not expect to see this, so I never do this:
Comparisons between the study groups for dichotomous outcomes will be performed
using the chi-square test if the minimum expected cell frequency assumption is met (80%
of the cells have expected frequencies of at least 5 and no cell has an expected frequency
less than 1). Otherwise, Fisher’s exact test will be used.
However, here is an example of some authors who mentioned the minimum expected frequency
rule in their article (Cachel et al, N Engl J Med, 2007),
“Percentages were analyzed using the chi-square test or Fisher’s exact test when expected
cell counts were less than 5.”
Barnard’s Unconditional Exact Test
There is another exact test, called Barnard’s unconditional exact test, which is available in StatXact. Using the same data from above, where we just missed significance with Fisher’s exact test:
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        30         46 |        76
         2 |       270        254 |       524
-----------+----------------------+----------
     Total |       300        300 |       600

          Pearson chi2(1) =   3.8570   Pr = 0.050
           Fisher's exact =                 0.065
   1-sided Fisher's exact =                 0.033
and entering this table into StatXact-5, we get

BARNARD'S UNCONDITIONAL TEST OF SUPERIORITY USING DIFFERENCE OF TWO BINOMIAL PROPORTIONS

Statistic based on the observed 2 by 2 table :

Results:
-------------------------------------------------------------------------
  Method        1-sided P-value        2-sided P-value
                Pr{T .GE. t}           Pr{|T|.GE.|t|}
-------------------------------------------------------------------------
  Asymp             0.0248                 0.0495
  Exact             0.0268                 0.0499
-------------------------------------------------------------------------
we see that Barnard’s test is just as powerful as the chi-square test and is clearly superior to
Fisher’s exact test.
Let’s see how Barnard’s test performs for the Matthews and Farewell example given above,
which was:
Tumor activity of two drugs in leukemic mice

                    Complete remission
                    yes      no     Total
  Methyl GAG          7       3       10
  6-MP                2       7        9
  Total               9      10       19
where the Fisher’s exact test result was p = 0.070, and the chi-square test was significant (p = 0.037) but clearly not appropriate for data this sparse (3 of 4 cells with expected frequency less than 5).
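The Fisher p value quoted here can be reproduced from first principles. A from-scratch sketch in Python (my own code, not the course software): the two-sided p is the sum of hypergeometric probabilities, over tables with the same margins, that do not exceed the observed table’s probability, which matches the p = 0.070 reported above.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of all tables with the same
    margins whose probability does not exceed the observed table's."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    def prob(x):  # P(cell (1,1) = x) under fixed margins
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)
    p_obs = prob(a)
    return sum(prob(x)
               for x in range(max(0, c1 - r2), min(r1, c1) + 1)
               if prob(x) <= p_obs + 1e-12)

# Matthews and Farewell table: 7/10 vs 2/9 complete remissions
print(round(fisher_exact_2x2(7, 3, 2, 7), 3))  # -> 0.07
```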
BARNARD'S UNCONDITIONAL TEST OF SUPERIORITY USING DIFFERENCE OF TWO BINOMIAL PROPORTIONS

Statistic based on the observed 2 by 2 table :
  Observed proportion for population <col1> : piHat_1          =  0.7778
  Observed proportion for population <col2> : piHat_2          =  0.3000
  Observed difference of proportions : piHat_2-piHat_1         = -0.4778
  Stderr (pooled estimate of stdev of piHat_2-piHat_1)         =  0.2294
  Standardized test statistic (t) : (piHat_2-piHat_1)/Stderr   =  -2.083

Results:
-------------------------------------------------------------------------
  Method        1-sided P-value        2-sided P-value
                Pr{T .LE. t}           Pr{|T|.GE.|t|}
-------------------------------------------------------------------------
  Asymp             0.0186                 0.0373
  Exact             0.0260                 0.0500
-------------------------------------------------------------------------
We see that Barnard’s test is significant (p = 0.050).
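For a table this small, Barnard’s test can also be brute-forced without StatXact. A sketch in Python (my own code, not StatXact’s algorithm): it treats the two columns as independent binomials of sizes 9 and 10, as in the StatXact output above, and maximizes the rejection-region probability over a grid of values of the nuisance parameter π.

```python
from math import comb, sqrt

def score_stat(x1, n1, x2, n2):
    """Standardized difference of proportions with pooled stderr."""
    pbar = (x1 + x2) / (n1 + n2)
    if pbar in (0.0, 1.0):
        return 0.0
    se = sqrt(pbar * (1 - pbar) * (1 / n1 + 1 / n2))
    return (x2 / n2 - x1 / n1) / se

def barnard_2sided(x1, n1, x2, n2, grid=999):
    """Barnard's unconditional exact test (two-sided, score statistic):
    p = sup over pi of P(|T| >= |t_obs|) when both groups are Binomial(pi)."""
    t_abs = abs(score_stat(x1, n1, x2, n2))
    region = [(a, b) for a in range(n1 + 1) for b in range(n2 + 1)
              if abs(score_stat(a, n1, b, n2)) >= t_abs - 1e-9]
    best = 0.0
    for k in range(1, grid):
        pi = k / grid
        p = sum(comb(n1, a) * pi**a * (1 - pi)**(n1 - a)
                * comb(n2, b) * pi**b * (1 - pi)**(n2 - b)
                for a, b in region)
        best = max(best, p)
    return best

# Columns of the Matthews and Farewell table: 7 of 9 vs 3 of 10
t_obs = score_stat(7, 9, 3, 10)
p2 = barnard_2sided(7, 9, 3, 10)
print(round(t_obs, 3))  # -> -2.083, matching StatXact's statistic
print(round(p2, 3))     # close to StatXact's exact 2-sided 0.0500
```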
Unfortunately, Barnard’s test has not received widespread use, perhaps partly because it is only available in StatXact; an example of a paper that reports using it is Gonzalez-Martinez (2006). It appears to be a perfectly fine test. Whereas Fisher’s exact test is known to be conservative for 2 × 2 tables (type I error rate actually smaller than alpha), which is entirely attributable to the discreteness of the test statistic, Barnard’s test does not have that shortcoming while still maintaining the type I error rate at no more than alpha (Cytel, 2001, pp 1058-1061).
One reason Barnard’s test has not received widespread use is that no consensus has been reached
among statisticians about whether a conditional exact test (Fisher’s exact test) or an
unconditional exact test (Barnard’s test is one of the many of these) is more appropriate, even
after half a century of debate (Greenland, 1991).
So, until Barnard’s test gains wider acceptance, you are better off avoiding it, so that you can
stay out of the debate. It is not available in Stata, anyway.
Two Independent Groups Comparison of a Dichotomous Variable
This was sufficiently discussed above. You use the chi-square test or Fisher’s exact test,
depending on the minimum expected frequency rule.
Two Independent Groups Comparison of a Nominal Variable
Here we are considering a crosstabulation table of size r × c (where r is the number of rows and c is the number of columns), which is larger than 2 × 2.
Looking this situation up in the statistical test digest making up Ch 2-3, we see that the chi-square test is suggested. The chi-square test, in this situation, still assumes a sufficiently large sample size (sufficiently large cell sizes) for the asymptotic p value to be appropriate. The minimum expected cell frequency rule of thumb, given above, again applies.
When the minimum expected frequency assumption is not met, you next use the Fisher-Freeman-Halton test, which, being an exact test, does not have that assumption.
Note: In the “old days”, before the Fisher-Freeman-Halton test was available in
statistical software, the researcher had to collapse (combine) rows or columns
until the minimum expected frequency assumption was satisfied.
In Stata, this Fisher-Freeman-Halton test is simply called Fisher’s exact test. Originally, Fisher’s exact test was only for 2 × 2 tables. Later (1951), Freeman and Halton extended the test to any size of contingency table, which became known as the Freeman-Halton test. To give proper credit, many statisticians call it the Fisher-Freeman-Halton test (the StatXact-5 manual refers to it as the Fisher-Freeman-Halton test, for example).
Note: Some researchers and editors are still in the old days and do not know this test
exists, so you should always provide a reference for it when you use it.
As an example, we will use the crosstabulation of race with study drug, taken from Brady et al.
(2000) Table 1. After computing the cell frequencies from the percents and entering these data
into Stata, we get
tabi 14 8 \ 76 82 \ 4 3 , col chi2 exact expect
+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
| column percentage  |
+--------------------+

           |          col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        14          8 |        22
           |      11.1       10.9 |      22.0
           |     14.89       8.60 |     11.76
-----------+----------------------+----------
         2 |        76         82 |       158
           |      79.4       78.6 |     158.0
           |     80.85      88.17 |     84.49
-----------+----------------------+----------
         3 |         4          3 |         7
           |       3.5        3.5 |       7.0
           |      4.26       3.23 |      3.74
-----------+----------------------+----------
     Total |        94         93 |       187
           |      94.0       93.0 |     187.0
           |    100.00     100.00 |    100.00

          Pearson chi2(2) =   2.0018   Pr = 0.368
           Fisher's exact =                 0.358
We find that 2 cells (2/6=33%) have expected frequencies less than 5. Therefore, the chi-square
test is not appropriate for these data. We should report the Fisher-Freeman-Halton p value
(p=0.358).
We can verify that this statistic (what Stata calls Fisher’s exact test) is actually the Fisher-Freeman-Halton test by testing it in StatXact-5.
!StatXact-5 (5.0.3)
!Unordered R x C Table: Fisher-Freeman-Halton Test

FISHER'S EXACT TEST

Statistic based on the observed 3 x 2 table(x) (rows/cols with 0 totals are ignored):
  P(X)  : Hypergeometric Prob. of the table =  0.0203
  FI(X) : Fisher statistic                  =   2.023

Asymptotic p-value: (based on Chi-Square distribution with 2 df )
  Pr { FI(X) .GE. 2.023 } =  0.3637

Exact p-value and point probability :
  Pr { FI(X) .GE. 2.023 } =  0.3583
  Pr { FI(X) .EQ. 2.023 } =  0.0406
Indeed, we get the same p value. (Note: even StatXact calls it Fisher’s exact test in the output,
calling it the Fisher-Freeman-Halton test only in the heading.)
Besides being more correct, it is a good idea not to call this test Fisher’s exact test. That way, you prevent the editor, reviewer, or reader from saying, “What are you talking about? Fisher’s exact test is only available for 2 × 2 tables.”
Protocol
You could state:
Comparisons between the study groups for unordered categorical variables will be
performed using the chi-square test if the minimum expected cell frequency assumption
is met (80% of the cells have expected frequencies of at least 5 and no cell has an
expected frequency less than 1). Otherwise, Fisher’s exact test will be used for variables
with two categories and the Fisher-Freeman-Halton test for variables with three or more
categories. The Fisher-Freeman-Halton test is the Fisher’s exact test generalized by
Freeman and Halton to greater than 2 × 2 crosstabulation tables (Conover, 1980).
Fishcer, et al (N Engl J Med, 2009) used something similar, but briefer, in their statistical methods section,

“The total number of thoracotomies and the number of futile thoracotomies in each group were compared by means of a chi-square test with a two-sided significance level of 0.05. When the expected number in any cell was less than five, a Fisher’s exact test for two-by-two tables and a Fisher-Freeman-Halton test for two-by-k tables for binary comparisons were used….”
Mid-P Exact Test
Occasionally you will see an “exact mid-p test” reported. For example, you can get this using
the PEPI 4.0 program EXACT2XK.EXE (Abramson and Gahlinger, 2001) when either the row or
column variable has only 2 categories.
Running EXACT2XK.EXE for the above 2 × 3 table, we get
Exact           p = 0.358
Exact (mid-P)   p = 0.338
The Exact row is the Fisher-Freeman-Halton test, which agrees with Stata and StatXact.
The Exact (mid-P) row is a variation of the test, where only 1/2 of the middle probability is added to the sum. The middle probability is the probability of the permutation that was observed. This test was originally introduced to address the problem of the Fisher exact test being conservative. The test is legitimate, and you could use it if you wanted to. It never became completely accepted by statisticians because this approach does not guarantee that the test maintains alpha at 0.05 (it may give significant results too often) (Cytel, 2001, pp.1059-1061).
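For a 2 × 2 table, the mid-p idea is a one-line change to the Fisher computation: subtract half of the observed table’s point probability from the ordinary two-sided exact p. A sketch in Python (my own illustration, not the PEPI code), applied to the Matthews and Farewell table from earlier:

```python
from math import comb

def fisher_and_midp(a, b, c, d):
    """Two-sided Fisher exact p and its mid-p version for [[a, b], [c, d]].
    Mid-p counts only half of the observed table's point probability."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    def prob(x):  # hypergeometric probability of cell (1,1) = x
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)
    p_obs = prob(a)
    p = sum(prob(x)
            for x in range(max(0, c1 - r2), min(r1, c1) + 1)
            if prob(x) <= p_obs + 1e-12)
    return p, p - 0.5 * p_obs

p, midp = fisher_and_midp(7, 3, 2, 7)
print(round(p, 3), round(midp, 3))  # -> 0.07 0.046
```

As expected, the mid-p version is smaller (less conservative) than the ordinary exact p.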
Two Independent Groups Comparison of an Ordinal Variable
As listed in the Statistical Test Digest, for this comparison we use the Wilcoxon-Mann-Whitney test. In statistics textbooks, you will find two tests for this application: 1) the Wilcoxon rank-sum test, and 2) the Mann-Whitney U test.
Rosner (1995, p. 566) points out
“The Mann-Whitney U test and the Wilcoxon rank-sum test are completely equivalent,
since the same p-value is obtained by applying either test.”
In Siegel and Castellan (1988, p. 128) the test is called the Wilcoxon-Mann-Whitney test. Many
statisticians are now calling it this in order to give all three test developers credit.
In the Stata help for the command ranksum, you will find
“ranksum tests the hypothesis that two independent samples (i.e.,
unmatched data) are from populations with the same distribution by using
the Wilcoxon rank-sum test, which is also known as the Mann-Whitney
two-sample statistic (Wilcoxon 1945; Mann and Whitney 1947).”
---------Mann, H. B., and D. R. Whitney. 1947. On a test whether one of two random
variables is stochastically larger than the other. Annals of Mathematical Statistics
18: 50-60.
Wilcoxon, F. 1945. Individual comparisons by ranking methods. Biometrics
1: 80-83.
Note: Although it is more correct to call it the Wilcoxon-Mann-Whitney test, not everyone has
heard of the test being referred to by this name. Therefore, you might consider providing a
reference.
The Wilcoxon-Mann-Whitney test is always described as “a test of whether two independent
groups have been drawn from the same population” (another test, called the median test, is
specifically a comparison of medians, but is not as powerful as the Wilcoxon-Mann-Whitney
test). By comparing ranks, it is comparing if the “bulk” of the values in the population in one
group are larger than those of the other group, which equates to H0: P(Group 1 > Group 2) = 1/2.
Because of this construction, Siegel and Castellan (1988, p.129) point out that the test equates to
a comparison of medians.
____________________________________________________________________________
Aside, on what was just said.
Just for sake of completeness, not everyone agrees with Siegel and Castellan that the WilcoxonMann-Whitney (WMW) test equates to a comparison of medians. Bergmann et al (2000) insist
on being strictly precise about what the test does,
“The WMW procedure tests for equality of group mean-ranks, not of group medians.
This is evident from our experimental data (Table 1). However, by providing group
medians or their differences in their outputs, statistics package such as SigmaStat,
Unistat, Stata, and even Arcus QuickStat may mislead investigators into supposing that
the p values refer to the hypothesis that group medians are equal. This common
misapprehension is not unique to statistics packages. It appears in Siegel and Castellan
(1988) and many other elementary texts on statistics.”
____________________________________________________________________________
Exercise Notice in the Sulkowski (2000) article, the Table 1 laboratory variables were
compared using the Wilcoxon-Mann-Whitney test, which Sulkowski refers to as the
“nonparametric Mann-Whitney test” in the Methods Section paragraph just above the
table on page 76.
Although the name Wilcoxon-Mann-Whitney test has been proposed and used for decades, it is
still frequently referred to as the Wilcoxon test or the Mann-Whitney test. An example of a
paper that uses the more correct name that gives all three developers credit is Brown et al (N
Engl J Med 2006) who state in their Statistical Methods,
“Continuous variables were compared with the use of a two-tailed unpaired t-test ... and
ordinal variables with the use of the Wilcoxon-Mann-Whitney test.”
Cytel (2001, p. 709) provides the following example of a two-sample comparison of an ordinal
variable.
“A randomized clinical trial of Interferon versus placebo was conducted on 44 children
infected with childhood chicken pox (varicella)(Arvin, et al., 1982). One of the end
points of the study was to determine whether Interferon is more effective than placebo in
preventing adverse effects. There are four ordinal categories of adverse effects. The
number of children falling in each category, by treatment, is:
Adverse Effect                   Placebo    Interferon
______________________________________________________
None                                15          21
Life Threatening                     3           0
Death in 2-3 Weeks                   1           2
Death in Less Than 1 Week            2           0

Ref: Arvin AM, Kushner JH, Feldman S, et al. (1982). Human leukocyte interferon
for the treatment of varicella in children with cancer. NEJM 306:761-765.”
We can quickly enter these data using the “expand” trick, by copying the following into the do-file editor and then executing it:
clear
input ae drug count
1 0 15
2 0 3
3 0 1
4 0 2
1 1 21
2 1 0
3 1 2
4 1 0
end
expand count
drop if count==0 // must do this, otherwise it leaves that line in the file
drop count
tab ae drug // check that data match original table
which creates the number of rows of data based on the variable count. We then drop (delete) the
variable count, which we only used as an intermediate variable for the expand command.
We now compute the Wilcoxon-Mann-Whitney test using
Statistics
Summaries, tables & tests
Nonparametric tests of hypotheses
Wilcoxon rank-sum test
Main tab: Variable: ae
Grouping variable: drug
Open
ranksum ae, by(drug)
(Note: you ask for the “Mann-Whitney” test from the menu, and it shows “Wilcoxon rank-sum test” on the menu dialog box. They are the same test.)
Two-sample Wilcoxon rank-sum (Mann-Whitney) test

        drug |      obs    rank sum    expected
-------------+---------------------------------
           0 |       21       519.5       472.5
           1 |       23       470.5       517.5
-------------+---------------------------------
    combined |       44         990         990

unadjusted variance      1811.25
adjustment for ties      -992.93
                        ----------
adjusted variance         818.32

Ho: ae(drug==0) = ae(drug==1)
             z =   1.643
    Prob > |z| =   0.1004     <-- report this (which is a two-sided p value)
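The z = 1.643 above can be reproduced by hand, which also makes the tie correction concrete. A sketch in Python (my own code, not Stata’s): the data are rebuilt from the counts in the Cytel table, with placebo coded 0 and Interferon coded 1; midranks are assigned to ties and the tie-corrected variance formula is used.

```python
from collections import Counter
from math import sqrt

# Adverse-effect grade (1-4) expanded from the counts in the table above
placebo    = [1]*15 + [2]*3 + [3]*1 + [4]*2      # n = 21
interferon = [1]*21 + [2]*0 + [3]*2 + [4]*0      # n = 23

def ranksum_z(x, y):
    """Wilcoxon-Mann-Whitney z using midranks, with the tie-corrected
    variance  n1*n2/12 * [(N+1) - sum(t^3 - t)/(N(N-1))]."""
    pooled = sorted(x + y)
    n1, n2 = len(x), len(y)
    N = n1 + n2
    # midrank for each distinct value: average of the tied positions
    rank, i = {}, 0
    while i < N:
        j = i
        while j < N and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2   # average of ranks i+1 .. j
        i = j
    w = sum(rank[v] for v in x)             # rank sum for the first group
    expected = n1 * (N + 1) / 2
    ties = sum(t**3 - t for t in Counter(pooled).values())
    var = n1 * n2 / 12 * ((N + 1) - ties / (N * (N - 1)))
    return w, expected, var, (w - expected) / sqrt(var)

w, e, var, z = ranksum_z(placebo, interferon)
print(w, e, round(var, 2), round(z, 3))  # -> 519.5 472.5 818.32 1.643
```

The rank sum, its expectation, the adjusted variance, and z all match the Stata output.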
Protocol Suggestion
If you wanted to be complete, you could state,
Comparisons of two groups for ordered categorical variables (ordinal scale) will be performed using the Wilcoxon-Mann-Whitney test (many statisticians now refer to the test by this name since the Wilcoxon rank-sum test and the Mann-Whitney U test are essentially the same and give identical p values) (Siegel and Castellan, 1988, p. 128).
However, the following shorter version should be sufficient and is recommended since the test
name Wilcoxon-Mann-Whitney is sufficiently common now:
Two group comparisons for ordered categorical variables will be performed using the
Wilcoxon-Mann-Whitney test.
Definition of Parametric and Nonparametric Tests
The tests introduced thus far are called nonparametric tests. The next test we will introduce is
the Student’s t test, which is an example of a parametric test. Now, then, is a good time to
formally define what parametric and nonparametric tests are. Paraphrasing Siegel and Castellan
(1988, pp.33-34):
A parametric statistical test specifies certain conditions about the distribution of the
dependent variable in the population from which the research sample was drawn. [The
term “parametric” comes from the statistical jargon of referring to population means and
standard deviations as “parameters”, in order to avoid confusion with sample means and
standard deviations, which are referred to as “statistics”.] The most frequent condition is
“normally distributed”. Parametric tests based on the normal distribution require that the
dependent variable is measured in at least an interval scale.
A nonparametric statistical test is based on a model that specifies only very general
conditions and none regarding the specific form of the distribution from which the
sample was drawn. Nonparametric tests do not require that the dependent variable is
measured in at least an interval scale (some requiring an ordinal scale, and some
requiring only a nominal scale).
Nonparametric tests, then, are used when you have nominal or ordinal level variables. They are also useful when you have a highly skewed interval level variable, particularly with small sample sizes (since your data do not look anything like a normally distributed variable).
Central Limit Theorem
In Chapter 2 is a presentation of the concept called statistical regularity (which statisticians also
call the Strong Law of Large Numbers). It was illustrated by a simulation involving increasing
sample sizes from a dichotomous variable with a population proportion of 0.5.
*-----------------------------------------------------------------
* Demonstrate statistical regularity by plotting proportion of 1's
* from a dichotomous variable for increasingly large sample sizes
* when population proportion is 0.5
*-----------------------------------------------------------------

[Figure: “Statistical Regularity for Binomial Variable (p=0.5)” — the
proportion of one’s (y-axis, 0 to 1) plotted against sample size on a
log scale (x-axis, 1 to 400), settling toward 0.5 as the sample size
grows.]
In statistics, there is an important second form of regularity that occurs with means, which is called the central limit theorem. Rosner (1995, p.158) provides a simple version of it,

Central-Limit Theorem
The distribution of means from samples of size n from some population with mean μ and variance σ² will have an approximate normal distribution with mean μ and variance σ²/n (standard error = σ/√n), even if the sampled distribution is not normal.

What is remarkable is how fast, that is, requiring only small sample sizes, the distribution of means approaches the normal distribution.
This is illustrated by a Monte Carlo simulation, where we choose samples of size n=10 from a
dichotomous variable (with values 0 and 1) with population parameter p=0.5. Doing this for
1,000 samples, we get the following distribution of means:
*-------------------------------------------------------------------
* Demonstrate the central limit theorem by taking samples of
* size n=10 from a dichotomous variable with p=.5
*-------------------------------------------------------------------

[Figure: “Means Computed From 1000 Samples of Size n=10 (Sampled From
Dichotomous Distribution With p=0.5)” — histogram of the 1,000 sample
means; x-axis: mean, 0 to 1; y-axis: frequency, 0 to 250.]
We see that the distribution of means is remarkably close to a normal distribution.
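The same Monte Carlo experiment is easy to rerun outside Stata. A sketch in Python (stdlib only; the seed value is arbitrary): draw 1,000 samples of size n=10 from a Bernoulli(0.5) variable; the sample means should center near 0.5 with standard deviation near the theoretical standard error sqrt(0.25/10) ≈ 0.158.

```python
import random
from math import sqrt

random.seed(1)
n, reps = 10, 1000

# 1,000 sample means from samples of size n=10, Bernoulli(p=0.5)
means = [sum(random.random() < 0.5 for _ in range(n)) / n
         for _ in range(reps)]

grand_mean = sum(means) / reps
sd = sqrt(sum((m - grand_mean) ** 2 for m in means) / (reps - 1))
print(round(grand_mean, 2), round(sd, 2))
# close to the theoretical mean 0.5 and standard error 0.158
```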
When we increase the sample size to n=100 in the otherwise same Monte Carlo experiment, we
get:
*-------------------------------------------------------------------
* Demonstrate the central limit theorem by taking samples of
* size n=100 from a dichotomous variable with p=.5
*-------------------------------------------------------------------

[Figure: “Means Computed From 1000 Samples of Size n=100 (Sampled From
Dichotomous Distribution With p=0.5)” — histogram of the 1,000 sample
means; x-axis: mean, 0.3 to 0.7; y-axis: frequency, 0 to 100; the
spread is much narrower than for n=10.]
This simulation illustrates the “even if the sampled distribution is not normal” phrase stated above in the Central Limit Theorem definition. In the population, the histogram of individual values is simply two bars, each of equal height, which is a long way from being a normal distribution.
[Figure: “Population Distribution for the Above CLT Simulation (Binomial
Distribution With p=0.5)” — density histogram of the individual
observations: two equal bars, at 0 and at 1.]
There are many parametric tests, such as the t test and linear regression, which have the assumption that the data come from a normal distribution. That is simply a convenient way to express it in introductory statistical texts. The real assumption involves the form of the sampling distribution (and, in linear regression, the distribution of residuals). Rather than go into a precise description, suffice it to say that the above stated Central Limit Theorem, as well as other versions of this theorem, assures us that the actual assumption for what needs to be normally distributed is taken care of if the sample size is “large enough.” It turns out that the Central Limit Theorem “kicks in” with even small sample sizes. Another way to state this is that the t test (as well as analysis of variance and linear regression) is very robust to the normality assumption, providing sufficiently accurate p values regardless of how the data are distributed in the sampled population. Therefore, you can basically just ignore the normality assumption. This robustness topic is covered in Chapter 5-10.
Two Independent Groups Comparison of an Interval Variable
The comparison of two groups on an interval scaled variable is done using the independent
sample Student’s t test (the “Student’s” is generally dropped, referring to the test as the
independent sample t test).
There are two versions of the t test for two independent groups. The test has the assumption that
the variances (and thus the standard deviations) of the two groups being compared are equal.
The alternate version does not have this assumption. The added assumption gives the first
version greater power, and so it is more widely used.
The equal variance assumption is one reason the t test is a parametric test (the assumption is
referring to the variances, which are parameters of the sampled populations).
It is advocated by some to test the assumption of equal variances (also called the homogeneity of
variance assumption) using Levene’s test for equality of variances. If the assumption holds
(Levene’s test is not statistically significant) then the equal variance t test is used. If the
assumption fails, the unequal variance t test is used. This approach is not necessary, though,
since the t test is “robust” to the equal variances assumption. This robustness topic is covered in
Chapter 5-10.
Although I do not advocate it as a needed step, I will now show how to test the homogeneity of
variance assumption, just so you know what others are talking about when they report doing it.
In SPSS, both tests are output at the same time, along with the Levene’s test for equality of
variance, just to make the “advocated” process easier. In Stata, you have to ask for all three tests
separately.
An example dataset, coronary artery data, which is on the SPSS distribution CD, contains the
following variables:
Variable    Label
Time        Treadmill Time
Group       Study Group  1=healthy 2=disease
Comparing treadmill time between the two study groups results in the following independent
sample t test output in SPSS.
Group Statistics

          GROUP     N      Mean    Std. Deviation    Std. Error Mean
TIME      1          8    928.50         138.121              48.833
          2         10    764.60         213.750              67.594

Independent Samples Test

                          Levene's Test for
                          Equality of Variances             t-test for Equality of Means
                          ---------------------   ------------------------------------------------------------------------
                                                                    Sig.        Mean        Std. Error   95% Confidence
                            F        Sig.           t       df      (2-tailed)  Difference  Difference   Interval of the
                                                                                                         Difference
                                                                                                          Lower      Upper
TIME  Equal variances      .137      .716          1.873    16        .080        163.90      87.524     -21.642    349.442
      assumed
      Equal variances                              1.966    15.439    .068        163.90      83.388     -13.398    341.198
      not assumed
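Both t statistics in the SPSS output can be reproduced from the summary statistics alone. A sketch in Python (my own code, not SPSS or Stata; the group sizes, means, and standard deviations are copied from the Group Statistics table):

```python
from math import sqrt

def two_sample_t(n1, m1, s1, n2, m2, s2):
    """Pooled (equal-variance) and Welch (unequal-variance) t statistics,
    with Satterthwaite's degrees of freedom for the Welch version."""
    diff = m1 - m2
    # equal variances: pooled variance estimate
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    t_pooled = diff / sqrt(sp2 * (1 / n1 + 1 / n2))
    # unequal variances: Welch statistic and Satterthwaite df
    v1, v2 = s1**2 / n1, s2**2 / n2
    t_welch = diff / sqrt(v1 + v2)
    df_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t_pooled, t_welch, df_welch

# Treadmill-time summary statistics from the SPSS table above
t_p, t_w, df_w = two_sample_t(8, 928.5, 138.121, 10, 764.6, 213.750)
print(round(t_p, 3), round(t_w, 3), round(df_w, 3))  # -> 1.873 1.966 15.439
```

These agree with both the SPSS output above and the Stata output below.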
To perform the same analysis in Stata, we use the following commands:

ttest depvar, by(groupvar)            -- independent groups t test with equal variances
                                         assumption, and confidence intervals
robvar depvar, by(groupvar)           -- Levene’s test for equality of variances
ttest depvar, by(groupvar) unequal    -- independent groups t test without equal
                                         variances assumption (uses Satterthwaite's
                                         degrees of freedom approximation), and
                                         confidence intervals
Reading in the data,
File
Open
Find the directory where you copied the course CD:
Change to the subdirectory: datasets & do-files
Single click on coronary artery data.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
    Biostats & Epi With Stata\Section 2 Biostatistics\
    datasets & do-files\coronary artery data.dta", clear
* which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\Section 2 Biostatistics\"
cd "datasets & do-files"
use "coronary artery data.dta", clear
We might first verify the t test’s assumption of equal variances, using Levene’s test for equality of variances. In introductory statistics textbooks, you will find the F test for equality of variances (the sdtest command in Stata). The F test is sensitive to the normality assumption, so if the data are skewed, it gives an inaccurate p value. Levene’s test, on the other hand, is robust to the normality assumption, so it provides an accurate p value even if the data are skewed. Therefore, always use Levene’s test rather than the F test.
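Levene’s W0 is nothing more than a one-way ANOVA F statistic computed on the absolute deviations of each observation from its group mean. A sketch in Python (my own code, with made-up illustrative data since the raw treadmill times are not listed here; only the W statistic is computed, not its F-distribution p value):

```python
def levene_w0(*groups):
    """Levene's test statistic: a one-way ANOVA F computed on the absolute
    deviations |x - group mean| (the W0 version, which uses means)."""
    z = []
    for g in groups:
        m = sum(g) / len(g)
        z.append([abs(x - m) for x in g])
    k = len(z)
    n = sum(len(g) for g in z)
    grand = sum(sum(g) for g in z) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in z)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in z for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical data: the second group is far more spread out
w0 = levene_w0([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])
print(round(w0, 3))  # -> 8.249
```

A large W0 (referred to an F distribution with k-1 and n-k df) is evidence against equal variances.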
Statistics
Summaries, tables & tests
Classical tests of hypotheses
Robust equal variance test
Main tab: Variable: time
Variable defining two comparison groups: group
OK
robvar time, by(group)
            |        Summary of TIME
      GROUP |       Mean   Std. Dev.       Freq.
------------+------------------------------------
          1 |      928.5   138.12106          8
          2 |      764.6    213.7497         10
------------+------------------------------------
      Total |  837.44444   197.65306         18

W0  = .1368483    df(1, 16)    Pr > F = .71628551    <- W0 is Levene’s test
W50 = .17792242   df(1, 16)    Pr > F = .67877762    <- ignore this test
W10 = .0650524    df(1, 16)    Pr > F = .80193108    <- ignore this test
Notice that the robvar command gives two alternative tests for equality of variance (W50 and
W10), which you can ignore.
Just by visual inspection, the standard deviations (and hence the variances) seem quite different (138 vs. 214). Still, Levene’s test for equality of variances was not significant (p = 0.716), so we cannot reject the hypothesis of equal variances (there is not sufficient evidence in the data to conclude that the equal variances assumption is not justified).
Using the “advocated” approach of confirming the assumptions, we have justification, then, to
use the equal variances t test, which we compute next.
Statistics
Summaries, tables & tests
Classical tests of hypotheses
Two-group mean-comparison test
Main tab: Variable name: time
Group variable name: group
OK
ttest time, by(group)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |       8       928.5    48.83317    138.1211    813.0279    1043.972
       2 |      10       764.6    67.59359    213.7497    611.6927    917.5073
---------+--------------------------------------------------------------------
combined |      18    837.4444    46.58727    197.6531    739.1539     935.735
---------+--------------------------------------------------------------------
    diff |           163.9        87.52394               -21.64246    349.4425
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   1.8726
Ho: diff = 0                                     degrees of freedom =       16

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9602         Pr(|T| > |t|) = 0.0795          Pr(T > t) = 0.0398
We could stop at this point if we wanted. However, since the p value is so close to 0.05 and was
not significant, we might be especially nervous about the equal variances assumption. So, we
next compute the unequal variances t test just to see what we get.
Statistics
Summaries, tables & tests
Classical tests of hypotheses
Two-group mean-comparison test
Main tab: Variable name: time
Group variable name: group
Unequal variances
OK
ttest time, by(group) unequal
Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |       8       928.5    48.83317    138.1211    813.0279    1043.972
       2 |      10       764.6    67.59359    213.7497    611.6927    917.5073
---------+--------------------------------------------------------------------
combined |      18    837.4444    46.58727    197.6531    739.1539     935.735
---------+--------------------------------------------------------------------
    diff |           163.9       83.38808               -13.39825    341.1983
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   1.9655
Ho: diff = 0                      Satterthwaite's degrees of freedom =  15.4391

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9662         Pr(|T| > |t|) = 0.0676          Pr(T > t) = 0.0338
We still did not get statistical significance, but notice that the p value is smaller. That is not
supposed to happen, in general, since the equal variances t test is a more powerful test.
A close look at the reported standard deviations reveals that the larger standard deviation goes
with the smaller mean. Normally, we would expect the mean to standard deviation ratio to be
similar for both groups, which would lead us to suspect that one group is more skewed than the
other.
The t test itself has another assumption, the assumption that the data for each group have an
approximately Normal distribution. However, this assumption is not as critical for the t test
because the distribution of means (and similarly mean differences), which is what is actually
being compared, is approximately normally distributed by the Central Limit Theorem for
“sufficiently” large sample sizes.
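Both t tests above can likewise be reproduced in Python with SciPy (again a sketch using the transcribed treadmill times):

```python
# Equal-variance (pooled) and unequal-variance (Welch-Satterthwaite)
# t tests, mirroring the two Stata ttest runs above.
from scipy import stats

healthy = [1014, 684, 810, 990, 840, 978, 1002, 1110]
diseased = [864, 636, 638, 708, 786, 600, 1320, 750, 594, 750]

t_eq, p_eq = stats.ttest_ind(healthy, diseased, equal_var=True)
t_uneq, p_uneq = stats.ttest_ind(healthy, diseased, equal_var=False)

print(round(t_eq, 4), round(p_eq, 4))      # ~1.8726, ~0.0795
print(round(t_uneq, 4), round(p_uneq, 4))  # ~1.9655, ~0.0676
```

The unequal-variance p value is the smaller of the two here, just as in the Stata output.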
Altman (1991, p.199) states these two assumptions and advises,
“The use of the t test is based on the assumption that the data for each group (with
independent samples) or the differences (with paired samples) have an approximately
Normal distribution, and for the two sample case we also require the two groups to have
similar variances. We sometimes find that at least one requirement is not met. When the
data are skewed we can either use a non-parametric method, or try a transformation of the
raw data.”
Not all statisticians agree with Altman, who is taking a very conservative approach to the
assumptions. Other statisticians simply trust in the robustness of the t test to both non-normality
and unequal variances (see Chapter 5-10; I am also nearly finished with a new chapter on this
specific subject, with far more citations and some simulations).
Although I do not personally advocate bothering with the normality assumption, it
can be tested using the Shapiro-Wilk test for normality.
NOTE: always test for normality separately for each group (if there is a
difference in means, then the total sample distribution will look like a
bimodal distribution, having two modes, which is clearly not normal).
Statistics
Summaries, tables & tests
Distributional plots & tests
Shapiro-Wilk normality test
Main tab: Variables: time
by/if/in tab: Repeat command by groups:
Variables that define groups: group
OK
by group, sort : swilk time
<or>
bysort group: swilk time
_______________________________________________________________
-> group = 1

                   Shapiro-Wilk W test for normal data

    Variable |    Obs         W          V         z       Prob>z
-------------+---------------------------------------------------
        time |      8    0.92428      1.055     0.086     0.46559

_______________________________________________________________
-> group = 2

                   Shapiro-Wilk W test for normal data

    Variable |    Obs         W          V         z       Prob>z
-------------+---------------------------------------------------
        time |     10    0.74104      3.991     2.773     0.00278
We see that the Shapiro-Wilk test detected non-normality in the second group (p = 0.003).
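SciPy's implementation of the Shapiro-Wilk test gives essentially the same result (a sketch; SciPy uses a similar Royston approximation for the p value, so it may differ slightly from swilk):

```python
# Shapiro-Wilk normality test, run separately for each group as advised.
from scipy import stats

healthy = [1014, 684, 810, 990, 840, 978, 1002, 1110]
diseased = [864, 636, 638, 708, 786, 600, 1320, 750, 594, 750]

w1, p1 = stats.shapiro(healthy)    # W ~ 0.924, not significant
w2, p2 = stats.shapiro(diseased)   # W ~ 0.741, significant non-normality
```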
Examining the data with a boxplot,
Graphics
Box plot
Main tab: Variables: time
By tab: Draw subgraphs for unique values of variables: group
OK
graph box time, by(group)

[Box plot of TIME for groups 1 and 2; the group 2 box shows a single high outlier.]
From the boxplot, we have graphically identified an outlier in the second group (boxplots are
explained in Chapter 1). An outlier is a data value that appears not to have come from the same
population as the rest of the sample.
In our example, the outlier is for the unhealthy group, which had a treadmill time even higher
than the maximum value in the healthy group. Either it was a sick marathon runner, or a data
coding error (or perhaps a patient attempting to impress the clinician conducting the treadmill
test, even if it meant having a heart attack trying).
Let’s see what happens if we use a Wilcoxon-Mann-Whitney test. After all, that test does not
have the assumption of normally distributed data. Similarly, it is not affected by outliers, since it
simply compares ranks, so that the outlier simply looks like one unit larger (next higher rank)
than the next largest value.
To see this (just for illustration, you would never do this as part of your analysis),
egen timerank = rank(time)
list group time timerank
     +-------------------------+
     | group   time   timerank |
     |-------------------------|
  1. |     1   1014         16 |
  2. |     1    684          5 |
  3. |     1    810         10 |
  4. |     1    990         14 |
  5. |     1    840         11 |
     |-------------------------|
  6. |     1    978         13 |
  7. |     1   1002         15 |
  8. |     1   1110         17 |
  9. |     2    864         12 |
 10. |     2    636          3 |
     |-------------------------|
 11. |     2    638          4 |
 12. |     2    708          6 |
 13. |     2    786          9 |
 14. |     2    600          2 |
 15. |     2   1320         18 |   <-- we see that the outlier's new score (its
     |-------------------------|       rank) is simply one unit larger than the
 16. |     2    750        7.5 |       rank of the next largest value
 17. |     2    594          1 |
 18. |     2    750        7.5 |
     +-------------------------+
Computing a Wilcoxon-Mann-Whitney test,
Statistics
Summaries, tables & tests
Nonparametric tests of hypotheses
Wilcoxon rank-sum test
Main tab: Variable: time
Grouping variable: group
Open
ranksum time, by(group)
Two-sample Wilcoxon rank-sum (Mann-Whitney) test

       group |      obs    rank sum    expected
-------------+---------------------------------
           1 |        8         101          76
           2 |       10          70          95
-------------+---------------------------------
    combined |       18         171         171

unadjusted variance      126.67
adjustment for ties       -0.13
                     ----------
adjusted variance        126.54

Ho: time(group==1) = time(group==2)
             z =   2.222
    Prob > |z| =   0.0263
We see that this test gives a statistically significant result. That is because it is treating the
outlier the same as if it is just barely larger than the next largest value, thus shrinking the tail of
the distribution back towards the rest of the distribution.
The Wilcoxon-Mann-Whitney test is actually a comparison of ranks, which is why it shows the
“rank sum” column in the above output. We can verify this by,
Statistics
Summaries, tables & tests
Summary and descriptive statistics
Summary statistics
Main tab: Variables: timerank
Options: Display additional statistics
by/if/in tab: if (expression): group==1
OK
summarize timerank if group==1, detail
                        rank of (time)
-------------------------------------------------------------
      Percentiles      Smallest
 1%            5              5
 5%            5             10
10%            5             11       Obs                   8
25%         10.5             13       Sum of Wgt.           8

50%         13.5                      Mean             12.625
                        Largest       Std. Dev.      3.889087
75%         15.5             14
90%           17             15       Variance         15.125
95%           17             16       Skewness      -.8502071
99%           17             17       Kurtosis       2.830671
return list
scalars:
                  r(N) =  8
              r(sum_w) =  8
               r(mean) =  12.625
                r(Var) =  15.125
                 r(sd) =  3.889087296526011
           r(skewness) =  -.8502070865436711
           r(kurtosis) =  2.830671207079922
                r(sum) =  101
                r(min) =  5
                r(max) =  17
                 r(p1) =  5
                 r(p5) =  5
                r(p10) =  5
                r(p25) =  10.5
                r(p50) =  13.5
                r(p75) =  15.5
                r(p90) =  17
                r(p95) =  17
                r(p99) =  17
The summarize command does not display the sum, but we got it using return list.
We see that the sum for group 1 is 101, which agrees exactly with the sum for group 1 shown in
the Wilcoxon-Mann-Whitney output.
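The same verification can be done in Python with scipy.stats.rankdata, which assigns midranks to ties just as Stata's egen rank() does (a sketch):

```python
# Rank the pooled sample and sum the ranks belonging to group 1.
from scipy import stats

healthy = [1014, 684, 810, 990, 840, 978, 1002, 1110]
diseased = [864, 636, 638, 708, 786, 600, 1320, 750, 594, 750]

ranks = stats.rankdata(healthy + diseased)   # midranks for the tied 750s
print(ranks[:8].sum())   # 101.0, group 1's rank sum in the ranksum output
```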
One might wonder what would happen if we omitted the outlier from the data and then
performed a t test. Let’s find out.
We could drop the outlier from the dataset by using:
drop if time==1320
However, we might find we need that observation later, so let's keep it for now. Instead, we
will use:
Statistics
Summaries, tables & tests
Classical tests of hypotheses
Two-group mean-comparison test
Main tab: Variable name: time
Group variable name: group
by/if/in tab: If (expression): time~=1320
OK
ttest time if time~=1320, by(group)
Note: The "~=" (or, equivalently, "!=") is the Stata symbol for "not equal to".
We could also put the “if” expression at the end by adding an extra comma
ttest time, by(group), if time~=1320
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |       8       928.5    48.83317    138.1211    813.0279    1043.972
       2 |       9    702.8889    30.83488    92.50465    631.7835    773.9943
---------+--------------------------------------------------------------------
combined |      17    809.0588    39.18175    161.5505    725.9972    892.1204
---------+--------------------------------------------------------------------
    diff |        225.6111    56.38805                   105.4228    345.7994
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   4.0010
Ho: diff = 0                                     degrees of freedom =       15

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9994         Pr(|T| > |t|) = 0.0012          Pr(T > t) = 0.0006
We see that we now get significance with the t test.
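The same comparison without the outlier can be sketched in Python (the outlier is simply left out of the second group's list):

```python
# Equal-variance t test after dropping the 1320 outlier from group 2.
from scipy import stats

healthy = [1014, 684, 810, 990, 840, 978, 1002, 1110]
diseased_no_outlier = [864, 636, 638, 708, 786, 600, 750, 594, 750]  # 1320 dropped

t, p = stats.ttest_ind(healthy, diseased_no_outlier, equal_var=True)
print(round(t, 4), round(p, 4))   # ~4.0010, ~0.0012, matching the Stata output
```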
Just for illustration, let’s also look at the t-test with unequal variances,
ttest time, unequal by(group), if time~=1320
Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |       8       928.5    48.83317    138.1211    813.0279    1043.972
       2 |       9    702.8889    30.83488    92.50465    631.7835    773.9943
---------+--------------------------------------------------------------------
combined |      17    809.0588    39.18175    161.5505    725.9972    892.1204
---------+--------------------------------------------------------------------
    diff |        225.6111    57.75352                   99.80301    351.4192
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   3.9064
Ho: diff = 0                      Satterthwaite's degrees of freedom =  12.0224

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9990         Pr(|T| > |t|) = 0.0021          Pr(T > t) = 0.0010
Comparing the two t-test results, we see that the equal variance t test is more powerful than the
unequal variance t test, as it should be if the assumptions are met. Indeed, the data are now
sufficiently normal:
bysort group: swilk time if time~=1320
_______________________________________________________________
-> group = 1

                   Shapiro-Wilk W test for normal data

    Variable |    Obs         W          V         z       Prob>z
-------------+---------------------------------------------------
        time |      8    0.92428      1.055     0.086     0.46559

_______________________________________________________________
-> group = 2

                   Shapiro-Wilk W test for normal data

    Variable |    Obs         W          V         z       Prob>z
-------------+---------------------------------------------------
        time |      9    0.93563      0.946    -0.092     0.53677
Let’s suppose that we could not justify eliminating the outlier. How should we report these data?
As seen by examining the frequency table, reporting the mean does a poor job of describing the
central tendency (average) of the data for group two, since 70% of the data are below the mean.
tab time if group==2
       TIME |      Freq.     Percent        Cum.
------------+-----------------------------------
        594 |          1       10.00       10.00
        600 |          1       10.00       20.00
        636 |          1       10.00       30.00
        638 |          1       10.00       40.00
        708 |          1       10.00       50.00
        750 |          2       20.00       70.00
        786 |          1       10.00       80.00
        864 |          1       10.00       90.00
       1320 |          1       10.00      100.00
------------+-----------------------------------
      Total |         10      100.00
summarize time if group==2
    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+------------------------------------------------------
        time |      10       764.6    213.7497        594       1320
No matter what average we report, median or mean, we should report the p value from the
Wilcoxon-Mann-Whitney test if we don’t eliminate the outlier.
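The median and interquartile range for the skewed group are easy to compute in Python (a sketch; note that NumPy's default percentile interpolation differs from Stata's centile, so the quartiles may differ slightly from the centile output shown later):

```python
# Median and IQR for the diseased group, as we would report them.
import numpy as np

diseased = [864, 636, 638, 708, 786, 600, 1320, 750, 594, 750]

med = np.median(diseased)                  # 729.0
q25, q75 = np.percentile(diseased, [25, 75])
print(med, q25, q75)
```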
Truncation Approach to Outliers
Another approach to outliers is to set them to the highest biologically plausible value on the high
side, and to the lowest biologically plausible value on the low side, leaving all
plausible values of the variable unchanged. In the treadmill example, the researcher might feel
that the highest plausible value for this population of patients is 900, for example. The value of
1320 would simply be recoded to 900 before statistical analysis.
Steyerberg (2009, p.168) describes this approach,
“…Another check is on biological plausibility. This judgment requires expert
opinion, and depends on the setting. For example, a systolic blood pressure of 250
mmHg is biologically plausible in the acute care situation for traumatic brain injury
patients, but may not be plausible in an ambulatory care situation. Implausible values
may best be considered as errors and hence set to missing.315
For biologically possible values, various statistical approaches are subsequently
possible. To reduce the influence on the regression coefficients (‘leverage’), we may
consider to transform the variable by ‘truncation.’ Very high and very low values are
shifted to truncation points:
If X > Xmax then X = Xmax;
If X < Xmin then X = Xmin;
else X = X
Here, xmax and xmin are the upper and lower truncation points. These may be defined from
examining distributions, e.g., with box plots and histograms, and the predictor-outcome
relationship.
------------
315
Osborne JW, Overby A. The power of outliers (and why researchers should always
check for them). Pract Assess Res Eval 2004;9(6).
If the truncation approach is used, you could state it like the following in your article:
Suggestion for Statistical Methods Section
For the outcome of treadmill time, we set the one outlier of 1320 to 900, based on
our judgment that a value above 900 was not biologically plausible for this type of
patient. The next largest value in our data for this patient group was 864. This is
known as the truncation approach to outliers (Steyerberg, 2009, p.168), which is
less extreme than simply eliminating outliers from the analysis.
Returning to the example, we will set the 1320 to 900, and then re-run the analysis,
gen time2 = time // make a copy of the variable
replace time2=900 if time>900 & time~=. // truncate to 900
bysort group: sum time time2 // check our work
ttest time, unequal by(group) // with outlier included
ttest time2, unequal by(group) // with outlier truncated to 900
-> group = 1

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+------------------------------------------------------
        time |       8       928.5    138.1211        684       1110
       time2 |       8      854.25    77.10058        684        900

-> group = 2

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+------------------------------------------------------
        time |      10       764.6    213.7497        594       1320
       time2 |      10       722.6    107.1989        594        900
We see that in Group 2, the 1320 maximum was correctly set to 900.
. ttest time, unequal by(group) // with outlier included
Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |       8       928.5    48.83317    138.1211    813.0279    1043.972
       2 |      10       764.6    67.59359    213.7497    611.6927    917.5073
---------+--------------------------------------------------------------------
combined |      18    837.4444    46.58727    197.6531    739.1539     935.735
---------+--------------------------------------------------------------------
    diff |           163.9       83.38808               -13.39825    341.1983
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   1.9655
Ho: diff = 0                      Satterthwaite's degrees of freedom =  15.4391

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9662         Pr(|T| > |t|) = 0.0676          Pr(T > t) = 0.0338

. ttest time2, unequal by(group)   // with outlier truncated to 900

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |       8      854.25    27.25917    77.10058    789.7923    918.7077
       2 |      10       722.6    33.89926    107.1989    645.9145    799.2855
---------+--------------------------------------------------------------------
combined |      18    781.1111    26.93892    114.2921     724.275    837.9473
---------+--------------------------------------------------------------------
    diff |          131.65       43.49968                39.37361    223.9264
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   3.0265
Ho: diff = 0                      Satterthwaite's degrees of freedom =  15.8705

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9960         Pr(|T| > |t|) = 0.0081          Pr(T > t) = 0.0040
The t test p = 0.0676 before the truncation of the outlier is now p = 0.0081 after the outlier
truncation.
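The truncation step and the re-run Welch t test can be sketched in Python with NumPy and SciPy:

```python
# Truncate every value above 900 down to 900 (both groups, as in the
# Stata replace command, which does not condition on group), then re-run
# the unequal-variance t test.
import numpy as np
from scipy import stats

healthy = np.array([1014, 684, 810, 990, 840, 978, 1002, 1110])
diseased = np.array([864, 636, 638, 708, 786, 600, 1320, 750, 594, 750])

healthy2 = np.minimum(healthy, 900)
diseased2 = np.minimum(diseased, 900)

t, p = stats.ttest_ind(healthy2, diseased2, equal_var=False)
print(round(t, 4), round(p, 4))   # ~3.0265, ~0.0081, matching the Stata output
```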
How Many Decimal Places to Report
This is described very nicely in the American Medical Association (AMA) Manual of Style
(Iverson et al, AMA Manual of Style, 2007, p.851).
The number of decimal places in reported outcomes should match the original precision of the
variable. If the variable has no decimal places, then reported measurements should be rounded
to the nearest integer. If one decimal place exists in the data, which reflects the precision of that
variable, then reported numbers should be rounded to one decimal place. Similarly, for
mathematical calculations, the results should be rounded to the same digit of accuracy as the
original variable.
For means and standard deviations, no more than one significant digit beyond the accuracy of the
measurement should be used.
Decimal Places for P Values
This is described very nicely in the American Medical Association (AMA) Manual of Style
(Iverson et al, AMA Manual of Style, 2007, p.851-52):
“Briefly, P values should be expressed to 2 digits to the right of the decimal point
(regardless of whether the P value is significant), unless P < .01, in which case the P
value should be expressed to 3 digits to the right of the decimal point. (One exception to
this rule is when rounding P from 3 digits to 2 digits would result in P appearing
nonsignificant, such as P = 0.046. In this case, expressing the P value to 3 places may be
preferred by the author. The same holds true for rounding confidence intervals that are
significant before rounding but nonsignificant after rounding.) The smallest P value that
should be expressed is P < .001, since additional zeros do not convey useful
information.37
P values should never be rounded up to 1.0 or down to 0. While such a procedure
might be justified arithmetically, the results are misleading. Statistical inference is based
on the assumption that events occur in a probabilistic, rather than deterministic, universe.
P values may approach infinitely close to these upper and lower bounds, but never close
enough to establish that the associated observation was either absolutely predestined (P =
1.0) or absolutely impossible (P = 0) to occur. Thus, very large and very small P values
should be expressed as P > .99 and P <.001, respectively.”
----37
Bailar JC, Mosteller F. Medical Uses of Statistics. 2nd ed. Boston, MA: NEJM
Books,1992.
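The AMA rounding rule is mechanical enough to put in a small helper function (a sketch; the function name and the handling of the rounding boundaries are my own, and the exception for values like P = .046 still needs the author's judgment):

```python
# Format a p value per the AMA Manual of Style rule quoted above.
def format_p(p):
    if p < 0.001:
        return "P<.001"          # smallest p value that should be expressed
    if p > 0.99:
        return "P>.99"           # never round up to 1.0
    digits = 3 if p < 0.01 else 2   # 3 decimals only when P < .01
    return "P=" + f"{p:.{digits}f}".lstrip("0")

print(format_p(0.04), format_p(0.003), format_p(0.0002), format_p(0.995))
# P=.04 P=.003 P<.001 P>.99
```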
Exercise. Look at the article by Brady et al (JAMA, 2000). In their Table 3, you will see
examples of this style of reporting p values, p=.003, p=.01, p=.04, p=.07. In their Table 1, you’ll
see p>.99. In their text you will find p<.001.
Reporting Styles For Two Sample Continuous Outcome Comparisons
Here are some example reporting styles (rounding p=0.026 to p=0.03):

1. The diseased group had a significantly shorter treadmill time than the healthy group
(mean±SEM seconds; diseased: 765±65, healthy: 928±49, p=0.03).

2. The diseased group had a significantly shorter treadmill time than the healthy group
(diseased: mean 765 seconds, 95% CI, 612-918; healthy: mean 929 s, 95% CI, 813-1044;
p=0.03).

3. The diseased group had a significantly shorter treadmill time than the healthy group
[mean difference, 164 seconds, 95% CI (-22, 349), p=0.03].

For examples 1, 2, and 3, all statistics can be found on the t test output.

4. The diseased group had a significantly shorter treadmill time than the healthy group
[median (interquartile range) seconds; diseased: 729 (627 - 806), healthy: 984 (818 -
1011), p=0.03].

5. The diseased group had a significantly shorter treadmill time than the healthy group
[diseased: median 729, 95% CI, 612-839; healthy: median 984, 95% CI, 769-1045;
p=0.03].

6. The diseased group had a significantly shorter treadmill time than the healthy group
[median difference, 222 seconds, 95% CI, 48-360, p=0.03].
For examples 4 and 5, the p value comes from the Wilcoxon-Mann-Whitney output, and the
median, interquartile range, and CI for the median come from the following centile command:
Statistics
Summaries, tables & tests
Summary and descriptive statistics
Centiles with CIs
Main tab: Variables: time
Centiles: 25 50 75
by/if/in tab: Repeat command by groups:
Variables that define groups: group
OK
by group, sort : centile time, centile(25 50 75)
<or>
bysort group: centile time, centile(25 50 75)
-------------------------------------------------------------------------------
-> group = 1
                                                      -- Binom. Interp. --
    Variable |     Obs  Percentile      Centile       [95% Conf. Interval]
-------------+-------------------------------------------------------------
        time |       8          25        817.5          684      991.1952*
             |                  50          984       769.05        1045.2
             |                  75         1011     964.2548          1110*

* Lower (upper) confidence limit held at minimum (maximum) of sample

-------------------------------------------------------------------------------
-> group = 2
                                                      -- Binom. Interp. --
    Variable |     Obs  Percentile      Centile       [95% Conf. Interval]
-------------+-------------------------------------------------------------
        time |      10          25          627          594      746.2082*
             |                  50          729       611.68      838.6933
             |                  75        805.5     711.7918          1320*

* Lower (upper) confidence limit held at minimum (maximum) of sample
In example 6, the 95% CI for the median comes from the following cendif command:

cendif time, by(group)

but you have to first update your Stata to include the commands somersd and cendif before you
can use the command. Do this, while connected to the internet, using,
findit somersd
which will display
SJ-6-4  snp15_7 . CIs for rank stat: Percentile slopes, differences, & ratios
        . . . . . . . . . . . . . . . . . . . . . . . . . . . .  R. Newson
        (help cendif, censlope, censlope_iteration,
        mata bcsf_bracketing(), mata blncdtree(), mata somdtransf(),
        mata u2jackpseud(), somersd, somersd_mata if installed)
        Q4/06   SJ 6(4):497--520
        calculates confidence intervals for generalized Theil-Sen
        median (and other percentile) slopes (and per-unit ratios)
        of Y with respect to X; help files also document supporting
        Mata functions

Then click on the snp15_7 link to see

INSTALLATION FILES   (click here to install)

and click on this link to install it.
If it crashes on you the first time you run it, then repeat this installation step and you will see

-----------------------------------------------------------------------------------
package installation
-----------------------------------------------------------------------------------
package name:  snp15_7.pkg
        from:  http://www.stata-journal.com/software/sj6-4/

checking snp15_7 consistency and verifying not already installed...

the following files already exist and are different:
    c:\ado\plus\c\cendif.ado
    c:\ado\plus\c\cendif.hlp

no files installed or copied

Possible things to do:

1.  Forget it
    (best choice if any of the above files were written by you and just
    happen to have the same name or you do not want the originals changed)
2.  Look for an already-installed package of the same name
    (which you might then choose to uninstall)
3.  Search installed packages for the duplicate file(s) by clicking
    on the file names above
4.  Force installation replacing already-installed files
    (if this is an update, this would be a safe choice; you will end
    up with the original and the update apparently installed, but it
    doesn't matter; you can even uninstall the original later)

Choose option 4. This will overwrite the couple of files that caused it to crash.
After installing the update and running the following to get the 95% CI for the median in
Example 6
cendif time ,by(group)
Y-variable: time (TIME)
Grouped by: group (GROUP)
Group numbers:

      GROUP |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          8       44.44       44.44
          2 |         10       55.56      100.00
------------+-----------------------------------
      Total |         18      100.00

Transformation: Fisher's z
95% confidence interval(s) for percentile difference(s)
between values of time in first and second groups:

Percent    Pctl_Dif     Minimum     Maximum
     50         222          48         360
The command cendif gives a difference in medians of 222, which is different from the difference
of the observed medians, 984 – 729 = 255. This is because its computation of the median
difference follows a different formula than subtracting the ordinary medians, which is too
complex to go into here. If you use this median difference and confidence interval, then, it is
best not to report the individual group medians, because readers will think you made a mistake
if they notice the inconsistency.
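For intuition, cendif's 50th-percentile difference behaves like the Hodges-Lehmann estimator, the median of all pairwise differences between the two groups (my own framing, not cendif's documentation); a direct NumPy check on these data reproduces the 222:

```python
# Median of all 8 x 10 = 80 pairwise differences between the groups.
import numpy as np

healthy = np.array([1014, 684, 810, 990, 840, 978, 1002, 1110])
diseased = np.array([864, 636, 638, 708, 786, 600, 1320, 750, 594, 750])

pairwise = healthy[:, None] - diseased[None, :]   # all 80 differences
print(np.median(pairwise))   # 222.0, matching cendif's Pctl_Dif
```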
Protocol
You could state (although I do not recommend it),
Comparisons between two groups for continuous variables will be performed using
independent groups t tests if the equal variance assumption is met; otherwise, independent
groups t tests with unequal variances (Satterthwaite's method) will be used (Rosner,
1995). If the data for either group are skewed sufficiently to not meet the normality
assumption, then a Wilcoxon-Mann-Whitney test will be used. The equality of variance
assumption will be tested using Levene's test for equality of variances, and the normality
assumption will be tested using the Shapiro-Wilk test for normality.
The t test is very robust to the assumptions of normality and homogeneity of variance, so this
approach is unnecessary. It is sufficient to just say,
Comparisons between two groups for continuous variables will be performed using
an independent groups t test.
Fisher-Pitman Permutation Test for Independent Samples
In the above example, we found that the distribution for treadmill time (a continuous variable)
was skewed for one of the two groups being compared. Thus, the normality assumption was not
met for the independent samples t test. In this example, the sample size was too small to rely on
the central limit theorem to provide asymptotic normality of the sample means. The approach
we took was to use the nonparametric Wilcoxon-Mann-Whitney test, which only requires the
data to be on an ordinal scale. A more powerful approach is to use a nonparametric test that
requires an interval scale. It would be more powerful because it uses an additional
property of the measurement, the equal interval property.
Such a test is the Fisher-Pitman permutation test for independent samples (Siegel and Castellan,
1988, pp.151-155; Kaiser, 2007), which just became available in Stata. It uses all the
information in the data, all of the interval scale properties, but does not have any assumptions
about the distribution. That is, it does not assume a normal distribution or equal variances,
which the t test assumes.
The first time you use it, you have to update your Stata to include it, since it is a user-
contributed procedure. Use the following command, and then click on the st0134 link to install.
findit permtest2
SJ-7-3  st0134 . . Fisher-Pitman perm. tests for paired rep. & indep. samples
        (help permtest1, permtest2 if installed) . . . . . . . .  J. Kaiser
        Q3/07   SJ 7(3):402-412
        exact and Monte Carlo proposals to the nonparametric
        Fisher-Pitman tests for paired replicates and independent
        samples

INSTALLATION FILES                               (click here to install)
st0134/permtest1.ado
st0134/permtest1.hlp
st0134/permtest2.ado
st0134/permtest2.hlp
This installs two commands,
permtest1 <- Fisher-Pitman permutation test for paired replicates
permtest2 <- Fisher-Pitman permutation test for two independent samples
You can easily verify you are using the right version of the test by looking at the help file,
help permtest2
Using the “Fisher-Pitman permutation test for independent samples” to analyze the treadmill
data,
permtest2 time , by(group)
Fisher-Pitman permutation test for two independent samples

       group |      obs        mean    std.dev.
-------------+---------------------------------
           1 |        8       928.5   138.12106
           2 |       10       764.6    213.7497
-------------+---------------------------------
    combined |       18   837.44444   197.65306

mode of operation:  Montecarlo simulation (200000 runs)
Progress: |........................................|

Test of hypothesis Ho: time(group==1) >= time(group==2) :  p=.95916  (one-tailed)
Test of hypothesis Ho: time(group==1) <= time(group==2) :  p=.041865 (one-tailed)
Test of hypothesis Ho: time(group==1) == time(group==2) :  p=.08373  (two-tailed)
In general, this test is more powerful than the Wilcoxon-Mann-Whitney test, because it does
arithmetic directly on the observations themselves, rather than the ranks. Using additional
information in the data, it has greater power. In this example, however, it had a larger p value than the
Wilcoxon-Mann-Whitney test (Fisher-Pitman, p = 0.08373; Wilcoxon-Mann-Whitney, p =
0.0263). The outlier was in a direction that made the two groups look more alike, so the
Wilcoxon-Mann-Whitney test was to our advantage. The Fisher-Pitman test treated the outlier like a
much larger number, making the two groups more alike on average, while the Wilcoxon-Mann-
Whitney test treated the outlier as if it were just a tiny bit bigger than the next smaller value,
keeping the two groups separated.
Like Fisher’s exact test, the Fisher-Pitman test is a permutation test. It constructs all the
ways to combine the data into two groups, with the same sample sizes as the original sample, and
then defines the p value as the proportion of all combinations that are more extreme than the
observed data.

That is, letting X be the variable that contains the observations for the first group and Y be the
variable that contains the observations for the second group, it computes the difference of the
group sums:

    Σi Xi − Σj Yj

It does this for every possible way to combine the observations into the two groups.
Observations that were originally in group X can switch to group Y, and vice versa. The p
value is the proportion of times that this difference of the sums is more extreme than the
observed difference of the sums. It is intuitive that an outlier contributes more to this test than
to a rank-based test like the Wilcoxon-Mann-Whitney test.
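The enumeration just described is simple enough to sketch in a few lines of Python (shown here instead of Stata so the logic is explicit); the toy data below are hypothetical and chosen only to keep the enumeration small:

```python
from itertools import combinations

def fisher_pitman_two_sided(x, y):
    """Exact two-sided Fisher-Pitman permutation test for independent samples.

    Enumerates every way to assign the pooled observations to two groups of
    sizes len(x) and len(y), and returns the proportion of assignments whose
    absolute difference of group sums is at least as extreme as the observed one.
    """
    pooled = list(x) + list(y)
    total = sum(pooled)
    observed = abs(sum(x) - sum(y))
    count = 0
    n_assignments = 0
    for idx in combinations(range(len(pooled)), len(x)):
        sum_x = sum(pooled[i] for i in idx)
        # difference of sums, sum(X) - sum(Y), for this reassignment
        if abs(sum_x - (total - sum_x)) >= observed - 1e-12:
            count += 1
        n_assignments += 1
    return count / n_assignments

# Toy data (hypothetical, not the chapter's data set):
p = fisher_pitman_two_sided([8.0, 9.0, 10.0], [1.0, 2.0, 3.0])  # p = 0.1
```

For real sample sizes the number of assignments grows combinatorially, which is why the Stata implementation above falls back on Monte Carlo sampling of the assignments rather than full enumeration.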
Not many researchers are familiar with this test, so if you use it, always provide a citation.
Article Statistical Methods Section Suggestion
Here is some suggested wording for your statistical methods section, when you use the
Fisher-Pitman permutation test for independent samples. You should always provide a citation
for this test, since it is not well known. (The blue sentence is the minimum you should say.
Adding the green line is recommended, since the reader will likely be unfamiliar with the test and
will want to know why you chose to use it.)
For univariable group comparisons, unordered categorical variables were compared using
a chi-square test, or Fisher's exact test, as appropriate. For ordered categorical variables, a
Wilcoxon-Mann-Whitney test was used. For continuous variables, an independent
samples Student t test was used if the data were approximately normally distributed. If
the continuous variable was skewed, a nonparametric Fisher-Pitman permutation test for
independent samples was used (Siegel and Castellan, 1988; Kaiser, 2007). The Fisher-Pitman
test, which assumes a continuous scaled variable, is as powerful as the
independent groups t test, but without the distributional assumptions. In contrast, a
Wilcoxon-Mann-Whitney test is less powerful, since it assumes only an ordered
categorical scale, and thus discards information in continuous scaled data (Siegel and
Castellan, 1988; Kaiser, 2007). For skewed continuous variables, medians and
interquartile ranges (25th and 75th percentiles) are reported in place of means and
standard deviations.
------------
Kaiser J. An exact and a Monte Carlo proposal to the Fisher-Pitman permutation tests for
paired replicates and for independent samples. The Stata Journal 2007;7(3):402-412.
Siegel S and Castellan NJ Jr. Nonparametric Statistics for the Behavioral Sciences, 2nd
ed. New York, McGraw-Hill, 1988, pp.151-155.
A shorter version is:

For skewed continuous variables, the two groups were compared using the Fisher-Pitman
permutation test for independent samples. The Fisher-Pitman test, which assumes a
continuous, or interval, scaled variable, is as powerful as the independent groups t test,
but without the distributional assumptions. In contrast, a Wilcoxon-Mann-Whitney test
is less powerful, since it assumes only an ordered categorical, or ordinal, scale and thus
discards information in continuous scaled data (Siegel and Castellan, 1988; Kaiser,
2007).
------------
Kaiser J. An exact and a Monte Carlo proposal to the Fisher-Pitman permutation tests for
paired replicates and for independent samples. The Stata Journal 2007;7(3):402-412.
Siegel S and Castellan NJ Jr. Nonparametric Statistics for the Behavioral Sciences, 2nd
ed. New York, McGraw-Hill, 1988, pp.151-155.
Confidence Intervals
Confidence intervals were used in the reporting styles shown above, which will now be defined.
When we compute an effect, such as the difference between two means, we call that the point
estimate of the effect. The point estimate is our best guess of what the true population effect is.
A confidence interval is called an interval estimate, which is an interval
(lower bound , upper bound)
that we can be confident covers, or straddles, the true population effect with some level of
confidence.
How to Interpret a Confidence Interval
A subtlety that statisticians are careful to observe is that the population effect, or
parameter, is fixed, remaining the same from sample to sample, while the endpoints of the
confidence interval are subject to sampling variation. Statisticians are careful, then, to refer to
the 95% confidence interval as “covering” or “containing” the population effect or parameter,
thereby implying that only the endpoints vary from sample to sample. Statisticians avoid saying
there is a 95% probability that the population effect is contained within the interval, which would
imply that the population effect varies from sample to sample, while the interval is fixed.
Next, explanations of a confidence interval from three different statistics textbooks are
provided.
Meyer (1970, pp.303-304) gives a formula for a confidence interval around a population mean,
the formula being based on the standard normal distribution, which is a normal distribution with
mean 0 and standard deviation equal to 1. In this formula, you can think of 2Φ(z) − 1 as being
equal to 0.95, or 95%, the usual case,

    2Φ(z) − 1 = ⋯ = P( X̄ − zσ/√n < µ < X̄ + zσ/√n )

“This last probability statement must be interpreted very carefully. It does not mean the
probability of the parameter µ falling into the specified interval equals 2Φ(z) − 1; µ is a
parameter and either is or is not in the above interval. Rather, the above should be
interpreted as follows: 2Φ(z) − 1 equals the probability that the random interval
( X̄ − zσ/√n , X̄ + zσ/√n ) contains µ. Such an interval is called a confidence interval
for the parameter µ. Since z is at our disposal, we may choose it so that the above
probability equals, say 1 – α.”
Chow and Liu (2000, pp.83-84), using the subscript T for test group and R for referent group,
with α = 0.025 for a traditional 95% confidence interval around a mean difference, explain,
“A (1 – 2α) × 100% confidence interval for µT - µR is a random interval and its associated
confidence limits are, in fact, random variables. The fundamental concept of a (1 – 2α) ×
100% confidence interval for µT - µR is that if the same study can be repeatedly carried
out many times, say B, then (1 – 2α) × 100% times of the B constructed random intervals
will cover µT - µR (Bickel and Doksum, 1977). In other words, in the long run, a (1 –
2α) × 100% confidence interval will have at least a 1 – 2α chance to cover the true mean
difference….”
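The long-run coverage that Chow and Liu describe can be illustrated with a small simulation. The sketch below is in Python rather than Stata, and the population mean, standard deviation, and sample size are assumed values chosen only for the demonstration (σ is treated as known, so the simple z-based interval applies):

```python
import random
import statistics

random.seed(1)

# Hypothetical population (assumed values for this demonstration only)
TRUE_MEAN = 100.0
SIGMA = 15.0
N = 30          # sample size of each repeated study
Z = 1.96        # normal critical value for 95% confidence (sigma known)
B = 2000        # number of repeated studies

covered = 0
for _ in range(B):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    xbar = statistics.fmean(sample)
    half_width = Z * SIGMA / N ** 0.5
    # Does this study's random interval cover the fixed population mean?
    if xbar - half_width <= TRUE_MEAN <= xbar + half_width:
        covered += 1

coverage = covered / B   # close to 0.95 in the long run
```

The fixed quantity here is TRUE_MEAN; only the interval endpoints change from study to study, which is exactly the point of the "random interval" language in the quoted passages.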
Bain and Engelhardt (1992, p.359) give further clarification of why statisticians use “confidence”
rather than “probability” to describe this random interval, in order for the terminology to be
precise. Using an example of a 95% confidence interval around a population parameter, denoted
as θ, which you can think of as the population mean, or else the population mean difference, for
purposes of the present discussion, where the interval has already been computed to be (69.9 ,
130.3),
“…We will refer to this interval as a 95% confidence interval for θ. Because the
estimated interval has known endpoints, it is not appropriate to say that it contains the
true value of θ with probability 0.95. That is, the parameter θ, although unknown, is a
constant, and this particular interval either does or does not contain θ. However, the fact
that the associated random interval had probability 0.95 prior to estimation might lead us
to assert that we are ‘95% confident’ that 69.9 < θ < 130.3.”
Relationship of Confidence Interval and Significance Testing
There is a direct relationship between testing at the 0.05 level (looking for p<0.05) and
constructing a 95% confidence interval.
If the null effect (mean difference = 0 in this case) is contained within the
confidence interval, then the test statistic will not be statistically significant.

This is true because the confidence interval is algebraically equivalent to the inequality that
compares the test statistic with its “reference range” endpoints.
The formula for this t test (when equal variances are assumed) is:

    t = (x̄1 − x̄2) / s.e.(x̄1 − x̄2) = (x̄1 − x̄2) / √( sp²/n1 + sp²/n2 ),

    where sp² = [ (n1 − 1)s1² + (n2 − 1)s2² ] / (n1 + n2 − 2)
Statisticians know how this test statistic will be distributed for any given sample size (by
applying the central limit theorem).
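As a numerical check on the formula, the sketch below (in Python rather than Stata) plugs in the group summary statistics from the t test output shown later in this chapter (n = 8, mean 928.5, SD 138.1211 versus n = 10, mean 764.6, SD 213.7497):

```python
import math

# Group summary statistics taken from the chapter's Stata t test output
n1, mean1, sd1 = 8, 928.5, 138.1211
n2, mean2, sd2 = 10, 764.6, 213.7497

# Pooled variance: sp^2 = [(n1-1)s1^2 + (n2-1)s2^2] / (n1 + n2 - 2)
sp2 = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)

# Standard error of the mean difference, and the t statistic itself
se_diff = math.sqrt(sp2 / n1 + sp2 / n2)
t_stat = (mean1 - mean2) / se_diff
# se_diff reproduces the reported s.e. 87.52394, and t_stat the reported t = 1.8726
```

This confirms that the equal-variance t statistic can be computed entirely from the published summary statistics, without access to the raw data.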
Using a “reference range” logic, we construct an inequality:

    −t(1−α/2), n−2 df  <  tobserved  <  t(1−α/2), n−2 df

    −t(1−α/2), n−2 df  <  (x̄1 − x̄2) / s.e.(x̄1 − x̄2)  <  t(1−α/2), n−2 df

Multiplying each term by the denominator of the middle term,

    −t(1−α/2), n−2 df × s.e.(x̄1 − x̄2)  <  x̄1 − x̄2  <  t(1−α/2), n−2 df × s.e.(x̄1 − x̄2)

which is equivalent to the statement that the confidence interval around the mean difference,
(x̄1 − x̄2) ± t(1−α/2), n−2 df × s.e.(x̄1 − x̄2), contains 0.
When we choose the criterion p<0.05 for significance, we say we are testing at the alpha = 0.05
level of significance. To construct the confidence interval, then, we need to know the value of t
for (1 − 0.05/2) = 0.975 and for n − 2 degrees of freedom. For n = 18, this is
display abs(invttail(16, .975))
2.1199053
Above, we calculated an independent groups t test:
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |       8       928.5    48.83317    138.1211    813.0279    1043.972
       2 |      10       764.6    67.59359    213.7497    611.6927    917.5073
---------+--------------------------------------------------------------------
combined |      18    837.4444    46.58727    197.6531    739.1539     935.735
---------+--------------------------------------------------------------------
    diff |               163.9    87.52394               -21.64246    349.4425
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   1.8726
Ho: diff = 0                                     degrees of freedom =       16

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9602         Pr(|T| > |t|) = 0.0795          Pr(T > t) = 0.0398
The 95% confidence interval (-21.64246 , 349.4425) around the mean difference 163.9, where
the standard error for this difference was 87.52394, is computed as
display "( " 163.9-2.1199053*87.52394 " , " 163.9+2.1199053*87.52394 " )"
( -21.642464 , 349.44246 )
This interval covers the null effect value of a 0 difference (H0: µ1 = µ2, or µ1 − µ2 = 0), so our
p value is likewise non-significant (p > 0.05).
Note that we do not have to compute this 95% confidence interval with the display command,
which was only done here to demonstrate the formula. Just take the confidence interval from the
t test output.
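The same arithmetic can be sketched in Python as well; the mean difference, standard error, and critical value below are the ones reported in the output and the invttail calculation above:

```python
# Reproducing the chapter's confidence-interval arithmetic
diff = 163.9         # mean difference from the t test output
se = 87.52394        # standard error of the difference
t_crit = 2.1199053   # t quantile for 0.975 with 16 degrees of freedom

lower = diff - t_crit * se
upper = diff + t_crit * se          # interval: (-21.642464, 349.44246)

# The interval covers 0, so the two-sided test cannot reject at alpha = 0.05,
# consistent with the reported two-sided p = 0.0795
contains_null = lower <= 0.0 <= upper
```

As with the display command, this is only a demonstration of the formula; in practice you would read the interval directly from the t test output.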
Statistical Tests to Identify Outliers
There is an entire class of statistical tests for outlier detection. For example, Rosner
(1995, pp.277-282) describes the Extreme Studentized Deviate (or ESD statistic), which is not
available in Stata. Statisticians generally avoid these tests, because it is difficult to argue that the
rules used in these tests are not to some extent arbitrary.
If you can verify that an outlier was due to a data coding error, or a laboratory error, then you
can absolutely remove it. You can eliminate the outlier without any mention of it, because it was
in actuality an error rather than an outlying value.
The FDA Guidance Document provides some excellent approaches to dealing with outliers.
Exercise Look at the FDA Guidance Document E9 Statistical Principles for Clinical Trials,
section 5.3 paragraph 2.
Discussing Outliers in Articles
It is rare to see someone discuss outlier exclusion in an article, because the author is concerned
about making readers and editors uncomfortable with the analysis.
This practice now seems to be reversing somewhat in the N Engl J Med. McWilliams et al. (N Engl
J Med, 2007) analyzed biennial, or every-two-year, survey data from the Health and Retirement
Study. The study hypothesis was that previously uninsured adults who enroll in Medicare
programs at the age of 65 years may have greater morbidity, requiring more intensive and
costlier care over subsequent years, than they would if they had been previously insured. In their
Statistical Methods section they state,
“We excluded a small number (<0.1%) of biennial observations that were extreme
outliers (≥50 hospitalizations, ≥300 doctor visits, or total expenditures ≥$2 million).”
Another paper, investigating needle stick injuries among surgical residents, reported in their
Results Section (Makary et al, N Engl J Med, 2007),
“One respondent was excluded from the analysis as an outlier for reporting a range of
more than 100 injuries, and two did not report the number of needlestick injuries.”
Prespecification of Analysis
The exercise for the interval variable was a good example of the issue of prespecification of
analysis, including the Protocol Suggestion (which was a prespecification of analysis).
Exercise Look at the FDA Guidance Document E9 Statistical Principles for Clinical Trials,
section 5.1 Prespecification of Analysis.
Exercise Look at the Fonseca reprint, Statistical Methods section, second paragraph where they
talk about parametric and nonparametric analysis. In this paragraph, Fonseca is stating a
“prespecification of analysis” to convince the reader that bias was not introduced by the choice
of statistics. (That is, without this presentation, the reader might think that Fonseca is simply
choosing to use those tests which produced a significant result.)
References
Abramson JH, Gahlinger PM. (2001). Computer Programs for Epidemiologists: PEPI Version
4.0. Salt Lake City, UT, Sagebrush Press.
Agresti A. (1990). Categorical Data Analysis. New York, John Wiley & Sons.
Altman DG. (1991). Practical Statistics for Medical Research. New York, Chapman &
Hall/CRC.
Bain LJ, Engelhardt M. (1992). Introduction to Probability and Mathematical Statistics. 2nd ed.
Pacific Grove CA, Duxbury.
Bergmann R, Ludbrook J, and Spooren WPJM. (2000). Different outcomes of the Wilcoxon-Mann-Whitney
test from different statistics packages. The American Statistician
54(1):72-77.
Bickel PJ, Doksum AD. (1977). Mathematical Statistics. Holden-Day, San Francisco, CA.
Borenstein M. (1997). Hypothesis testing and effect size estimation in clinical trials. Annals
Allergy, Asthma, & Immunology 78:5-11.
Borenstein M, Rothstein H, Cohen J. (2001). SamplePower® 2.0. Chicago, SPSS Inc.
software can be purchased at http://www.spss.com
Brady K, Pearlstein T, Asnis GM, et al. (2000). Efficacy and safety of sertraline treatment of
posttraumatic stress disorder: a randomized controlled trial. JAMA 283(14):1837-1844.
Breslow NE, Day NE. (1980). Statistical Methods in Cancer Research: Volume 1 – The Analysis
of Case-Control Studies. Lyon, France, International Agency for Research on Cancer
(IARC Scientific Publications No. 32).
Brown KM, Kondeatis E, Vaughan RW, et al (2006). Influence of donor C3 allotype on late
renal-transplantation outcome. N Engl J Med 354;19:2014-23.
Chow S-C, Liu J-P. (2000). Design and analysis of bioavailability and bioequivalence studies.
2nd edition, New York, Marcel Dekker.
Cochran WG. (1954). Some methods for strengthening the common χ2 tests. Biometrics 10:417-451.
Conover WJ (1980). Practical Nonparametric Statistics, 2nd ed. New York, John
Wiley & Sons, pp.165-169.
Cuchel M, Bloedon LT, Szapary PO, et al (2007). Inhibition of microsomal triglyceride transfer
protein in familial hypercholesterolemia. N Engl J Med 356(2):148-156.
Cytel. (2001). StatXact 5® Statistical Software for Exact Nonparametric Inference User Manual
Volume 2. Cambridge MA, CYTEL Software Corporation.
Daniel WW. (1995). Biostatistics: A Foundation for Analysis in the Health Sciences. 6th ed.
New York, John Wiley & Sons.
Fischer B, Lassen U, Mortensen J, et al. (2009). Preoperative staging of lung cancer with
combined PET-CT. N Engl J Med 361(1):32-39.
Gonzalez-Martinez JA, Gupta A, Kotagal P, et al. (2005). Hemispherectomy for catastrophic
epilepsy in infants. Epilepsia 46(9):1518-25.
Greenland S. (1991). On the logical justification of conditional tests for two-by-two contingency
tables. The American Statistician 45(3):248-251.
International Conference on Harmonisation E9 Expert Working Group. (1999). ICH harmonised
tripartite guideline: statistical principles for clinical trials. Stat Med 18(15):1905-42.
Freely available as a guidance document on the FDA website (word for word same
content): Guidance for industry: E9 statistical principles for clinical trials.
http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm073137.pdf
Iverson C, Christiansen S, Flanagin A, et al. (2007). AMA Manual of Style: A Guide for Authors
and Editors, 10th ed. New York, Oxford University Press.
Kaiser J. (2007). An exact and a Monte Carlo proposal to the Fisher-Pitman permutation tests for
paired replicates and for independent samples. The Stata Journal. 7(3):402-412.
Makary MA, Al-Attar A, Holzmueller GC, et al. (2007). Needlestick injuries among surgeons in
training. N Engl J Med 356(26):2693-9.
Matthews DE, Farewell V. (1985). Using and understanding medical statistics. New York,
Karger.
McWilliams JM, Meara E, Zaslavsky AM, Ayanian JZ. (2007). Use of health services by
previously uninsured Medicare beneficiaries. N Engl J Med 357(2):143-53.
Meyer PL (1970). Introductory Probability and Statistical Applications, 2nd ed. Reading MA,
Addison-Wesley Publishing Company.
Rice JA. (1988). Mathematical Statistics and Data Analysis. Pacific Grove, California,
Wadsworth & Brooks/Cole.
Rosner B. (1995). Fundamentals of Biostatistics, 4th ed. Belmont CA, Duxbury Press.
Rosner B. (2006). Fundamentals of Biostatistics, 6th ed. Belmont CA, Duxbury Press.
Rothman KJ. (2002). Epidemiology: an Introduction. New York, Oxford University Press.
Siegel S and Castellan NJ Jr (1988). Nonparametric Statistics for the Behavioral Sciences, 2nd
ed. New York, McGraw-Hill.
Steyerberg EW. (2009). Clinical Prediction Models: A Practical Approach to Development,
Validation, and Updating. New York, Springer.
Stoddard GJ, Ring WH. (1993). How to evaluate study methodology in published clinical
research. J Intravenous Nursing 16(2):110-117.