Chapter 2-5. Basics of Power Analysis
Statistical Regularity and Signal-to-Noise Ratio
In Chapter 2-3, we saw how the significance test is designed to be consistent with statistical
regularity.
Using Stata to perform a statistical regularity simulation for two groups, we take random samples
of increasing size (ranging from 1 to 200) from two normal populations (1st, mean = 5, standard
deviation = 2.5) (2nd, mean = 4, std. dev. = 2.5), and then plot the sample mean.
The plot of the sample means is:
*--------------------------------------------------------------
* Demonstrate statistical regularity by plotting the mean from
* two normal distributions (1st: mean=5, std.dev=2.5;
* 2nd: mean=4, std.dev=2.5) for increasingly larger sample sizes
*---------------------------------------------------------------
[Figure. Statistical Regularity for N(5,2.5) and N(4,2.5) Variables: the sample mean of each group (y-axis, 2 to 10) plotted against Sample Size (x-axis, log scale, 1 to 200).]
Here we see the signal/noise ratio becoming more distinct, so we can more confidently claim a
difference between the two groups, as the sample size increases.
All test statistics are constructed as a ratio of the effect to the variability of the effect, or signal-to-noise ratio (Stoddard and Ring, 1993; Borenstein M, 1997), in perfect consistency with statistical regularity, as displayed in this graph.
_______________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript]. University of Utah School of Medicine, 2010.
For example, the formula for a two-sample t test, for comparing two means, is
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
where the numerator is the effect, the mean difference, and the denominator is the variability of
the mean difference, which is called the standard error of the mean difference.
The standard error is simply an estimate of the standard deviation of the mean difference that we would obtain had we taken a large number of samples, by repeating the clinical trial, and so had a collection of mean differences from which to compute a standard deviation.
Sample size determination is performed to ensure that we take a large enough sample to obtain
enough data to detect a difference if it exists. We can detect a difference in means provided we
are far enough to the right on the statistical regularity graph. By far enough to the right, we
mean a sufficiently large sample, since sample size is the x-axis on this graph.
Power analysis is performed to discover the probability of detecting a difference, if it exists, for
the sample size we have planned.
Definition of Power
Power = Prob(our sample will achieve statistical significance | $H_A: \mu_1 \neq \mu_2$)
for a given sample size and a given difference in means
Power Increases as Sample Size Increases
The power of a statistic increases monotonically (continues to go up) as sample size increases.
In fact, if you make your sample size large enough, you will eventually get a statistically
significant p value every time, regardless of how small the population difference is (as long as
the difference in means or proportions is not zero).
Let’s look again at the statistical regularity graph shown above.
[Figure (repeated from above). Statistical Regularity for N(5,2.5) and N(4,2.5) Variables: the sample mean of each group (y-axis, 2 to 10) plotted against Sample Size (x-axis, log scale, 1 to 200).]
We know that this simulation is sampling from two distributions where the means are different, 5
and 4, so the null hypothesis is actually false. Looking at the graph, we observe that with sample
sizes < 50 or so, given the standard deviations of 2.5, we do not seem to get a clear signal-to-noise ratio in this graph. That is, sufficient statistical regularity has not been achieved.
Regardless of the value of power, for this example, the t test does not achieve statistical
significance until n=50 in each group, as seen by the following Stata output.
Using the immediate form of the ttest command
Immediate form of two-sample mean comparison test:
ttesti #obs1 #mean1 #sd1 #obs2 #mean2 #sd2 [, options]
ttesti 49 5 2.5 49 4 2.5
ttesti 50 5 2.5 50 4 2.5
Two-sample t test with equal variances
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |      49           5    .3571429         2.5    4.281916    5.718084
       y |      49           4    .3571429         2.5    3.281916    4.718084
---------+--------------------------------------------------------------------
combined |      98         4.5     .256311     2.53735    3.991294    5.008706
---------+--------------------------------------------------------------------
    diff |                   1    .5050763              -.0025685    2.002568
------------------------------------------------------------------------------
    diff = mean(x) - mean(y)                                      t =   1.9799
Ho: diff = 0                                     degrees of freedom =       96

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9747         Pr(|T| > |t|) = 0.0506          Pr(T > t) = 0.0253
Two-sample t test with equal variances
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |      50           5    .3535534         2.5    4.289508    5.710492
       y |      50           4    .3535534         2.5    3.289508    4.710492
---------+--------------------------------------------------------------------
combined |     100         4.5    .2537596    2.537596    3.996486    5.003514
---------+--------------------------------------------------------------------
    diff |                   1          .5                .0077663    1.992234
------------------------------------------------------------------------------
    diff = mean(x) - mean(y)                                      t =   2.0000
Ho: diff = 0                                     degrees of freedom =       98

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9759         Pr(|T| > |t|) = 0.0483          Pr(T > t) = 0.0241
This property of a test statistic not being significant until a sufficient sample size is achieved, for
a given mean difference and standard deviations of groups, is “designed into” the p value
computation by the developers of the test statistic to be consistent with the statistical regularity
property. Actually, it just works out that way if the development of the test statistic is logically
consistent with statistical theory.
In this example, for n=49, insufficient statistical regularity is achieved. For n=49, the value of
the signal-to-noise ratio, the t test statistic, is still inside the inner 95% of values of the t test
statistic that can occur when the null hypothesis, $H_0: \mu_1 = \mu_2$, is true (p = 0.0506 > 0.05).
We can immediately see how sample size is designed into the size of the test statistic by looking
at the t test statistic formula.
To simplify the illustration, we will assume the two groups have the same sample size. Then, the
t test statistic (the unequal variances version) simplifies to:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
    = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2 + s_2^2}{n}}}
    = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2 + s_2^2}} \times \sqrt{n}$$
Thus t always increases as n increases.
Examining the formula, you can see that the same relationship holds even for unequal n’s, since
you are dividing by a smaller denominator as either or both of the two n’s increase.
Similarly, sample size directly affects the p value, since a larger test statistic always produces a
smaller p value (you go further into the tail of the distribution of t).
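For instance, the two-sided p values in the Stata output above can be reproduced from the t statistic and its degrees of freedom with Stata's ttail() function, and a larger t gives a smaller p:

display 2*ttail(96, 1.9799)   // approximately 0.0506, the n=49 result above
display 2*ttail(98, 2.0000)   // approximately 0.0483, the n=50 result above
display 2*ttail(98, 3.0000)   // a larger t statistic gives a smaller p value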
Using SamplePower® 2.0, we can quickly generate the power curve for our example population parameters of $\mu_1 = 5,\ \mu_2 = 4;\ \sigma_1 = \sigma_2 = 2.5$. With a little bit of programming, you can get such a curve in Stata, as well.
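For example, here is a minimal Stata sketch of such a power curve (an assumed illustration, not taken from this chapter; it presumes, as in older Stata releases, that the sampsi command stores the computed power in r(power)):

* Power for the example parameters at per-group sample sizes 10, 20, ..., 200
clear
set obs 20
gen n = 10*_n              // per-group sample sizes 10, 20, ..., 200
gen power = .
forvalues i = 1/20 {
    local nn = 10*`i'
    quietly sampsi 5 4, sd1(2.5) sd2(2.5) n1(`nn') n2(`nn')
    quietly replace power = r(power) in `i'
}
graph twoway line power n, ytitle(Power) xtitle(Sample size per group)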
We see that power goes up with increasing sample size, which is always the case.
A different curve could be drawn for any different combination of population parameters (the
means and standard deviations), but it will always have this monotonically increasing shape.
Decision Errors of Significance Tests
When we use the p value to make a decision about the null hypothesis of a test statistic, we
conceive of probabilities of making errors using a layout analogous to diagnostic test (see box).
Sensitivity/Specificity Layout
For a diagnostic test, we can construct the following table of agreements/disagreements between
the diagnostic test decision and the true state, or gold standard.
                                   Test “probable value”
Gold Standard “true value”         disease present ( + )     disease absent ( - )     Total
disease present ( + )              a (true positives)        b (false negatives)      a + b
disease absent ( - )               c (false positives)       d (true negatives)       c + d
Total                              a + c                     b + d
For the cells a through d, we recognize the familiar terminology (Lilienfeld, 1994, p. 118-124)
for agreements, expressed as percents:
sensitivity = (true positives)/(all those with the disease) = a / (a + b) × 100
specificity = (true negatives)/(all those without the disease) = d / (c + d) × 100
For a significance test conclusion, we have the same layout as shown in the sensitivity/specificity
box. Here, H0 denotes the hypothesis of no effect.
                                          Conclusion From Test
Reality                                   H0 is false ( + )            H0 is true ( - )
H0 is false                               no error                     beta
(population difference present, +)        1-beta                       (type II error)
                                                                       (false negatives)
H0 is true                                alpha                        no error
(population difference absent, -)         (type I error)               1-alpha
                                          (false positives)
We describe these probabilities of a decision error as:
alpha = Prob(reject H0 | H0 is true) = Prob(Type I error)
= Prob(false positive conclusion)
and
beta = 1 – Power = Prob(accept H0 | H0 is false) = Prob(Type II error)
= Prob(false negative conclusion)
The quantity, 1-alpha, is referred to as the confidence level. Chow et al. (2008, p.15) describe it,
“In practice, the maximum probability of committing a type I error that one can tolerate is
usually considered as the level of significance. The confidence level, 1 – α , then reflects
the probability or confidence of not rejecting the true null hypothesis.”
Type II Error and Sample Size Paragraph in Journal Article
There are two reasons to report a sample size paragraph in your research article.
First, it adds credibility to your paper. Since a sample size determination is a necessary part of
good study design, reporting one suggests that you understand study design principles.
Second, it helps the reader interpret nonsignificant p values. When the reader sees a
nonsignificant p value, the question comes up of whether this is due to an effect not being
present in the sampled population, or whether it was the result of an inadequately powered study.
In orthopaedics, research is frequently done on limited sample sizes, such as explanted artificial
joints. Probably largely motivated by the necessity of small sample sizes, although it is
appropriate for any sample size, the 2009 Instructions to Authors for The Journal of Bone and
Joint Surgery state the following,
“For hypothesis testing scenarios the statement ‘no significant difference was found
between two groups’ must be accompanied by a value describing the power of the study
to detect a Type II error (Designing Clinical Research, eds. Stephen Hulley, Steven
Cummings, 1988, Williams and Wilkins, Baltimore pp 128-49).”
That statement does not make sense, but they are trying to say that reporting the power helps the
reader to determine if the lack of significance is explainable simply by an insufficient sample
size.
(see the next section for the correct way to state “conclusions of no difference”)
Conclusions of Equivalence
Given the definition,
beta = 1 – Power = Prob(accept H0 | H0 is false) = Prob(Type II error)
= Prob(false negative conclusion)
It would seem that we could make a probability argument in support of the null hypothesis of
“no difference” or “no effect” using beta.
That is, suppose we have 95% power to detect a minimal difference of biological significance, say
of 1 to be consistent with our example, because our sample size is 165 in each group. Then, we
take our sample and get a nonsignificant p value. Shouldn’t we be able to say that we
demonstrated there is no effect at the beta=1-0.95 = 0.05 level of significance?
This seems reasonable enough, and it is advocated by Jacob Cohen in his classic textbook,
Statistical Power Analysis for the Behavioral Sciences. He first proposed it in a chapter of
another textbook published in 1965 (Cohen, 1965), so the idea has been around for a while. Until
2001, no one, including Cohen, had ever provided a proof that this approach is logically
consistent.
Cohen’s proposition was later shown to be fallacious, however, by a logic proof presented by
Hoenig and Heisey (2001). Hoenig and Heisey’s proof is described in Chapter 2-15 of the
course manual.
To demonstrate equivalence, or no effect, requires a statistical approach called equivalence
testing. Similarly, to demonstrate “no worse than” requires an analogous approach called
noninferiority testing.
Otherwise, when you get a nonsignificant result, all you can conclude is that there is not
sufficient evidence in your study to demonstrate a difference.
Power of a Significance Test
We saw above that a simplified definition of power for our example is:
power = Prob(draw a sample that achieves statistical significance | $H_1: \mu_1 = 5,\ \mu_2 = 4$)

Power is different for every specified effect ($H_1: \mu_1 = 5,\ \mu_2 = 4$, for example)
Power actually depends, or is conditional upon, five things:
1) minimum size of effect one wishes to detect
2) whether you are using a one- or two-sided comparison
3) the choice of alpha
4) the sample size
5) the test statistic you are using
We will now look at each of these five things.
Effect of One- or Two-Sided Comparison on Power
For a two-sided comparison, p < 0.05 when the test statistic falls in the shaded rejection region in
either tail of the sampling distribution for the test statistic. For a one-sided comparison, p < 0.05
when the test statistic falls in the shaded rejection region of one tail of sampling distribution for
the test statistic.
[Figure. Top panel, “two-tail test critical values”: a normal curve with 2.5% rejection regions in each tail, beyond -1.96 SE and +1.96 SE of 0. Bottom panel, “one-tail test critical values”: a normal curve with a single 5% rejection region beyond +1.645 SE of 0.]
Since the test statistic does not have to be as large to fall in the 5% rejection region, a one-sided
comparison is more powerful. That is, the one-sided comparison test statistic can exceed the
critical value using a smaller sample size than a two-sided comparison test for any fixed power.
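As a rough check under the same assumptions as the running example (means of 5 and 4, SDs of 2.5), and assuming sampsi's onesided option, the smaller required sample size of a one-sided comparison can be seen directly; this is an added illustration, not part of the original chapter:

* Required n per group for 80% power, two-sided versus one-sided comparison
sampsi 5 4, sd1(2.5) sd2(2.5) power(.80)            // two-sided: roughly 99 per group
sampsi 5 4, sd1(2.5) sd2(2.5) power(.80) onesided   // one-sided: roughly 78 per group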
The tail-area graphs just shown were created using the following Stata commands:
* -- tail area graphs --
* normal tail area graph -- p. 187 of Stata version 9 graphics manual
*
#delimit ;
graph twoway (function y=normden(x), range(-4 -1.96) bcolor(red)
recast(area) plotregion(style(none)))
(function y=normden(x), range(1.96 4) bcolor(red)
recast(area) plotregion(style(none)))
(function y=normden(x), range(-4 4) clstyle(foreground)
plotregion(style(none)))
,yscale(off) legend(off)
xlabel( -1.96 "-1.96 SE" 0 "0" 1.96 "+1.96 SE") xtitle("")
title(two-tail test critical values)
text(.06 -2.5 "2.5%") text(.06 2.5 "2.5%")
saving(twotail, replace)
;
#delimit cr
*
#delimit ;
graph twoway (function y=normden(x), range(1.645 4) bcolor(red)
recast(area) plotregion(style(none)))
(function y=normden(x), range(-4 4) clstyle(foreground)
plotregion(style(none)))
,yscale(off) legend(off)
xlabel( 0 "0" 1.645 "+1.645 SE") xtitle("")
title(one-tail test critical values)
text(.06 2.5 "5%")
saving(onetail, replace)
;
#delimit cr
graph combine twotail.gph onetail.gph , col(1) saving(bothtail.gph)
For manuscripts that will be published in the medical literature, you should always use a two-tailed (two-sided) comparison, because that is what readers expect to see in publications. Fleiss
(1973) gives a good explanation why:
“…a one-tailed test is called for only when the investigator is not interested in a
difference in the reverse direction from that hypothesized. For example, if he
hypothesizes that P2 > P1, [Fleiss’ example is comparing proportions.] then it will make
no difference to him if either P2 = P1 or P2 < P1. Such an instance is assuredly rare. One
example where a one-tailed test is called for is when an investigator is comparing the
response rate for a new treatment (p2) with the response rate for a standard treatment (p1),
and when he will substitute the new treatment for the standard in his own practice only if
p2 is significantly greater than p1. It will make no difference to him if the two treatments
are equally effective or if the new treatment is actually worse than the standard; in either
case, he will stick with the standard.
If however, the investigator intends to report his results to his professional
colleagues, he is ethically bound to perform a two-tailed test. For if his results indicate
that the new treatment is actually worse than the standard--an inference possible only
with a two-tailed test--he is obligated to report this as a warning to others who might plan
to study the new treatment.
In the vast majority of research undertakings, two-tailed tests are called for. Even
if a theory or a large accumulation of published data suggests that the difference being
studied should be in one direction and not the other, the investigator should nevertheless
guard against the unexpected by performing a two-tailed test. Especially in such cases,
the scientific importance of a difference in the unexpected direction may be greater than
yet another confirmation of the difference being in the expected direction.”
Petrie (2006) offers similar advice for researchers submitting manuscripts to J Bone Joint Surg,
“One-tailed alternatives are rarely used because we have to be absolutely certain for
biological or clinical reasons, in advance of collecting the data, that if H0 is not true that
direction of the difference is known (e.g., that the mean of treatment A is greater than that
of B), and this is rarely possible.”
Although very rarely done, researchers occasionally advance a justification for using a one-tailed
test. Miller et al (2001, p.844, 3rd paragraph) stated the following in their paper published in
Neurology:
“We used a one-tailed t-test (significance = p < 0.05) to examine the primary outcome
measure because the phase II study documented a decline in arm strength, but a two-tailed test was used with all secondary measures.”
This is a weak justification, however, even though it is taught in introductory statistics textbooks,
because they miss the point of why two-tailed tests are used in the medical literature. If the
outcome had come out in the unexpected direction, where arm strength increased, the
reader would be interested in knowing this. It would be hard to imagine that a result in the
opposite direction in this study would be clinically equivalent to “no difference”, which is
what a one-tailed test assumes.
Noninferiority Studies The one exception to the use of a two-tailed test for manuscripts
published in the medical literature is the noninferiority study. For that study, a one-tail test is
appropriate, although many still advocate a two-tailed test. This is discussed in Chapter 2-15.
Effect of Choice of Alpha on Power
If we let alpha = .01, instead of .05, we have to go further out in the tails, which requires a larger
t. If we let alpha = 0.10, we do not have to go as far into the tails. Therefore, the choice of alpha
affects our power and how large of a sample size we need to achieve power.
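A minimal sampsi sketch (an added illustration using the running example, not from the original chapter) of how the choice of alpha changes the required sample size for 80% power; the per-group numbers in the comments are approximate:

sampsi 5 4, sd1(2.5) sd2(2.5) power(.80)               // alpha = 0.05 (default): roughly 99 per group
sampsi 5 4, sd1(2.5) sd2(2.5) power(.80) alpha(0.01)   // stricter alpha: roughly 146 per group
sampsi 5 4, sd1(2.5) sd2(2.5) power(.80) alpha(0.10)   // looser alpha: roughly 78 per group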
Effect of Choice of Minimum Detectable Effect Size on Power
All test statistics are a sample size weighted signal-to-noise ratio. Examining the formula for a t
test, assuming equal sized groups,
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
    = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2 + s_2^2}{n}}}
    = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2 + s_2^2}} \times \sqrt{n}$$
we can see that the larger the effect, the difference, in the numerator, the larger the value of the t
test statistic. The larger the value of the test statistic, the further it lies in the tails of the sampling
distribution we use to compute the p value.
Thus, the larger the effect we are willing to limit significance to, the greater the power of the test.
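As an added illustration (not from the original chapter), holding n = 50 per group and SD = 2.5 fixed, a larger mean difference gives higher power; the power values in the comments are approximate:

sampsi 5 4, sd1(2.5) sd2(2.5) n1(50) n2(50)      // difference = 1.0: power roughly 0.52
sampsi 5.5 4, sd1(2.5) sd2(2.5) n1(50) n2(50)    // difference = 1.5: power roughly 0.85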
Effect of Standard Deviation Assumption on Power
Again, recall that all test statistics are a sample size weighted signal-to-noise ratio.
Examining the formula for a t test, assuming equal sized groups,
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
    = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2 + s_2^2}{n}}}
    = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2 + s_2^2}} \times \sqrt{n}$$
we can see that the smaller the standard deviations (SD, or s) in the denominator, the larger the
value of the t test statistic. The larger the value of the test statistic, the further it lies in the tails
of the sampling distribution we use to compute the p value.
Thus, the smaller the SD we can assume, the greater the power of the test.
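As an added illustration (not from the original chapter), holding the mean difference of 1 and n = 50 per group fixed, a smaller SD gives higher power; the power values in the comments are approximate:

sampsi 5 4, sd1(2.5) sd2(2.5) n1(50) n2(50)   // SD = 2.5: power roughly 0.52
sampsi 5 4, sd1(2) sd2(2) n1(50) n2(50)       // SD = 2.0: power roughly 0.71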
Effect of Sample Size on Power
Given that the other four things that affect power are in the equation, sample size is uniquely
determined by power, and power is uniquely determined by sample size (N).
power = function of (other 4 things) and N
N = function of (other 4 things) and power
For example, 1) provide sample size and get power
Statistics
Power and sample size
Tests of means and proportions
Main tab: Two-sample comparison of means
Mean one: 5
Mean two: 4
Std. deviation one: 2.5
Std. deviation two: 2.5
Options tab: Output: Compute power
Sample based calculations: Sample size one: 50
Sample size two: 50
OK
sampsi 5 4, sd1(2.5) sd2(2.5) n1(50) n2(50)
Estimated power for two-sample comparison of means
Test Ho: m1 = m2, where m1 is the mean in population 1
and m2 is the mean in population 2
Assumptions:

         alpha =   0.0500  (two-sided)     <- defaults to alpha = .05
            m1 =        5                     and a two-sided comparison
            m2 =        4
           sd1 =      2.5
           sd2 =      2.5
sample size n1 =       50
            n2 =       50
         n2/n1 =     1.00

Estimated power:

         power =   0.5160
For example, 2) provide power and get sample size
Statistics
Power and sample size
Tests of means and proportions
Main tab: Two-sample comparison of means
Mean one: 5
Mean two: 4
Std. deviation one: 2.5
Std. deviation two: 2.5
Options tab: Output: Compute sample size
Power of the test: 0.80
OK
sampsi 5 4, sd1(2.5) sd2(2.5) power(.80)
Estimated sample size for two-sample comparison of means
Test Ho: m1 = m2, where m1 is the mean in population 1
and m2 is the mean in population 2
Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.8000
            m1 =        5
            m2 =        4
           sd1 =      2.5
           sd2 =      2.5
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =       99
            n2 =       99
Sample Size and Power Calculations for an Interval Scaled Outcome Variable
As discussed above, for a given test statistic, power and required sample size are determined by
the following five things:
Five items required to compute power or required sample size
     Power                                        Required Sample Size
  1. effect size in the population                effect size in the population
  2. standard deviation in the population         standard deviation in the population
  3. choice of alpha                              choice of alpha
  4. choice of one- or two-sided comparison       choice of one- or two-sided comparison
  5. sample size used                             choice of power
Effect Size. For an independent samples t test, which is a comparison of two means, the effect
size is a choice of the two means (or similarly, the difference of the two means). We choose this
to be the minimal clinically relevant difference that we want sufficient power to be able to detect.
It is also a good idea to use a value that is congruent with the true effect in the
population, or it is unlikely we will be able to achieve it; previously published results are useful for
determining this.
Standard Deviation. This is the difficult part. We must have an estimate of the SD for each of
the two groups. With luck, we can get this from already available pilot data or from previously
published research. If we are unsure, we could estimate this on the high end (the largest
reasonable SD); this will require us to use a larger sample size to ensure our power, but we will
have adequate power for all values of SD that are smaller in the population. We refer to this as a
conservative estimate of the SD.
Alternatively, we might consider what the biologically meaningful minimum and maximum are, and then divide
this range by 6 (since we know that 99.7% of all values fall within 3 SDs of the mean, a total span of 6 SDs, for a
normal distribution). We might also consider the reference range and then divide by 4 (since the
reference range spans about 1.96 SDs on either side of the mean, or a total of about 4 SDs, for this distribution). Both of these strategies
require an approximately symmetrical distribution to apply these normal-distribution interpretations
of the SD. If you use either the range/6 or the range/4, you can cite Browne (2001).
If you know the interquartile range (25th , 75th percentiles), you can make use of the following.
The interquartile range bounds the inner 50% of the area under the curve of any distribution. For
a standard normal distribution, which has mean=0 and standard deviation (SD) = 1, the inner
50% of the area is bounded by [-0.6745 , +0.6745]. The interval [0 , 1] between the mean and 1
SD has length 1. The interval of IQR = [-0.6745 , +0.6745] has length 2(.6745) = 1.3490. Thus,
the relationship,
$$\frac{SD}{1} = \frac{IQR}{1.3490} \qquad\Rightarrow\qquad SD = \frac{IQR}{1.3490}$$
which provides a good estimate for SD if you can assume the variable has a normal distribution.
Example: Suppose the IQR = [3, 7] for a normally distributed variable.
Then SD = (7 − 3)/1.3490 = 2.97.
Example: Suppose the IQR = [3, 7], with median = 4.5 for a skewed distribution. If we plan to
take a log transformation of this variable to achieve normality, we would get
ln(3) = 1.10
ln(4.5) = 1.50
ln(7) = 1.95
using
disp ln(3)
disp ln(4.5)
disp ln(7)
For a normal distribution, the median is right in the center of the distribution, with a symmetrical
shape. Since the log transformation preserves order, we would expect the median to be near the
center of the distribution after the transformation. We notice that (1.50 – 1.10 ) = 0.40 on the left
of the median, which is approximately the same as (1.95 – 1.50) = 0.45 on the right of the
median, so the distribution is nearly symmetrical.
Now, applying the formula,
$$SD = \frac{IQR}{1.3490}$$

disp (1.95 - 1.10)/1.3490
we get 0.63 as the estimate of the SD for a log-transformed variable.
Choice of Alpha. We choose the alpha that is appropriate for our purposes, which is the
probability of false positive results, or type I error, that we are willing to accept. Usually we just
use the conventional =0.05 if the results are to be published in the medical literature.
Choice of 1- or 2-Sided Comparison. We can achieve our desired power with a smaller sample
size if we use a one-sided comparison (one-tailed test). However, as presented in the above
Fleiss (1973) quote, we ethically, and conventionally, choose a two-sided comparison if we are
going to publish in the medical literature.
Choice of Power. It is generally accepted that a power of 0.80, or 80%, is the smallest power a
study should have. If it is feasible to conduct the study with a larger sample size, a power of
90% should be used, or perhaps even 95%. Since this is our probability of obtaining a
statistically significant result with our study (conditional upon the effect size actually existing in
the population), it makes sense to use a larger power if the cost of the study relative to the benefit
of the significance demonstration is justified.
Illustration
For illustration, let’s assume we are planning a clinical trial of two therapies, Therapy A and
Therapy B. We have:
Effect Size: mean A = 5, mean B = 4
SD: SDA = 2.5, SDB = 2.0
Alpha: 0.05
Comparison: 2-sided
Desired Power: 0.90
What sample size is required for this study?
Sample Size Calculation
Using Stata,
sampsi 5 4 , sd1(2.5) sd2(2) power(.90)
we get n=108 required for each group.
Suppose we know we can only collect 75 in each group, due to budget and availability of
consenting study subjects. What power would that sample size provide for us?
Power Calculation
Using Stata,
sampsi 5 4 , sd1(2.5) sd2(2) n1(75) n2(75)
we get power = 77.19%.
Notice that when we omitted the power option and include the sample sizes in the sampsi
command, power was calculated. Up above, when we omitted the sample sizes and included the
power option, the sample size was calculated.
In this situation, you might consider how you might design your study to improve precision, so
that you can get smaller standard deviations. Restricting the sample to a more homogeneous
group is one possibility to achieve this.
What to do if you don’t know anything (no effect size or standard deviation estimates)
If you don’t know anything, you can still do a power analysis based on a standardized variable with
the effect size expressed in standard deviation units.
A standardized score, or z-score, is computed for each variable using,
$$z = \frac{Y - \bar{Y}}{SD_Y}$$
where the variable’s mean is subtracted from the observation, and then this difference is divided by
the variable’s standard deviation. A z-score, then, is in units of the number of standard
deviations an observation is from its mean. A standardized variable always has a mean of 0 and
standard deviation (SD) of 1.
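As a small added illustration (the variable name y is hypothetical, not from the chapter), a z-score can be computed by hand or with egen's std() function in Stata:

summarize y                          // stores the mean in r(mean) and SD in r(sd)
gen zy = (y - r(mean)) / r(sd)       // z = (Y - Ybar) / SD_Y
egen zy2 = std(y)                    // built-in standardization: mean 0, SD 1
summarize zy zy2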
It is best if you can say something about your choice of effect size, so that it sounds reasonable. Also,
this approach works best when you present it as a way to avoid a separate calculation for a
long list of outcome variables (see the example paragraph below).
For a 1/3 SD effect size, or standardized mean difference, you would use
sampsi 0 .33, sd1(1) sd2(1) power(.95)
Estimated sample size for two-sample comparison of means
Test Ho: m1 = m2, where m1 is the mean in population 1
and m2 is the mean in population 2
Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.9500
            m1 =        0
            m2 =      .33
           sd1 =        1
           sd2 =        1
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =      239
            n2 =      239
Expressing the effect size in SD units has a long history. Cohen (1988) popularized the idea in
his textbook on power analysis. The book contains look-up tables based on effect size expressed
this way. For a two group comparison of means, the effect size is (mean difference)/SD, where
SD is for either group, since the ordinary t test formula assumes the SDs are equal (Cohen, 1988,
p. 20). Cohen denoted such an effect size as d, and it has since become popularly known as
Cohen's d. Expressed in Cohen's notation, d = (mA – mB)/σ. To assist the investigator who does not
know what the means and standard deviations, or correlation coefficients, are in advance, but does
have a sense that the effect will be small, medium, or large, Cohen (1988, pp.25-27) proposed the
effect sizes of
d = 0.2 is a small effect size
d = 0.5 is a medium effect size
d = 0.8 is a large effect size
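Because a standardized variable has SD = 1, Cohen's benchmark d values can be entered directly into sampsi as the mean difference. The following is an added sketch, not from the chapter, and the per-group sample sizes in the comments (for 80% power, two-sided alpha = 0.05) are approximate:

sampsi 0 0.2, sd1(1) sd2(1) power(.80)   // small effect:  roughly 393 per group
sampsi 0 0.5, sd1(1) sd2(1) power(.80)   // medium effect: roughly 63 per group
sampsi 0 0.8, sd1(1) sd2(1) power(.80)   // large effect:  roughly 25 per group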
Explaining further, Cohen states (1988, pp.24-25),
“Take, for example, an especially constructed test of learning ability appropriate
for use with phenylpyruvic mental deficients. The investigator may well be satisfied with
the relevance of the test to his purpose, yet may have no idea of either what the σ is or
how many points of difference on Y between means of treated and untreated populations
he can expect. Thus, he has neither the numerator (mA – mB) nor the denominator (σ )
needed to compute d.
It is precisely at this point in the apparent dilemma that the utility of the d concept
comes to the fore. It is not necessary to compute d from a posited difference between
means and an estimated standard deviation; one can posit d directly. Thus, if the
investigator thinks that the effect of his treatment method on learning ability in
phenylpyruvia is small, he might posit a d value such as .2 or .3. If he anticipates it to
be large, he might posit d as .8 or 1.0. If he expects it to be medium (or simply seeks to
straddle the fence on the issue), he might select some such value as d = .5.”
Chow, Shao and Wang (2008, p.13) advocate expressing the effect size in standard
deviation units when no prior knowledge is available. Using δ to denote an effect of clinical
importance, they state:
“In clinical trials, the choice of δ may depend upon absolute change, percent change, or
effect size of the primary study endpoint. In practice, a standard effect size (i.e., effect
size adjusted for standard deviation) between 0.25 and 0.5 is usually chosen as δ if no
prior knowledge regarding clinical performance of the test drug is available. This
recommendation is made based on the fact that the standard effect size of clinical
importance observed from most clinical trials is within the range of 0.25 and 0.5.”
Study Protocol Suggestion
Here is an example power analysis for a study protocol when z scores are used in the power
analysis.
To compute a single power analysis that applies to each outcome, a z-score, or
standardized score, approach is used. Expressed as z-scores, every outcome has the same
distribution, which is a mean of 0 and standard deviation (SD) of 1. Significance test p
values and power calculations using z-scores are identical to those using original scales.
Thus, one power analysis based on z-scores applies to all outcomes, as long as the same
effect size is used. An effect size of 1/3 SD is assumed for all outcomes. For the PAM
scale, this is equivalent to 1 point on the original scale (Hibbard, 2005), so a 1/3 SD is not
excessively large. Assuming the control group outcomes will be unchanged from
baseline to post-intervention, while the intervention group outcomes will be increased by
1/3 SD, the sample size of N=239 per group provides 95% power to detect these
differences, using a two-sided, alpha = 0.05 comparison.
Example 1. In the sample size paragraph of their article, Cahen et al. (N Engl J Med, 2007)
state,
“…We determined that a study with 23 patients per group would have 90% power to
detect a difference of 1 SD with the use of a two-group t-test at a two-sided significance
level of 0.05….”
Example 2. In the sample size paragraph of their article, Hovi et al. (N Engl J Med, 2007) state,
“We calculated that we would need to enroll at least 140 subjects in each group to have a
statistical power of 90% to detect a between-group difference of 0.40 in the standard-deviation score with an alpha level of 0.05 in a two-sided analysis.”
They took advantage of the fact that p values from statistical comparisons are unchanged
by switching between original units and standard deviation units of a variable. Using
standard deviation units is particularly intuitive for defining cutpoints. For example, in
their Methods section they state, “For each very-low-birth-weight survivor, we selected
the next available singleton infant born at term (gestational age, ≥37 weeks) of the same
sex who was not small for gestational age (standard-deviation score for birth weight, ≥ −2).” They used the fact that for a normally distributed variable, about 95% of the values
are between -2SD and +2SD from the mean, applying the same logic as used with
laboratory reference ranges to define “normal” lab values. In their Results section, they
stated, “A post hoc analysis in which we used the 10th percentile (standard-deviation
score, -1.3) as a cutoff point to define small and appropriate for gestational age also
showed no significant differences (range of P values, 0.25 to 0.88).” This time, they used
the fact that for a normally distributed variable, 10% of the values are below -1.3 SDs
from the mean.
Example 3. In the sample size paragraph of their article, Papi et al. (N Engl J Med, 2007) state,
“Using a total of 480 patients, a two-sided test, and an alpha of 0.05, we estimated the
statistical power to be more than 80% to detect a significant difference between these
treatments and an effect size of 0.42. The effect size is a dimensionless variable
expressing the standardized difference (i.e., the mean difference divided by standard
deviation) between, in our study, the mean morning peak expiratory flow rate after as-needed combination therapy and after as-needed albuterol therapy.”
Sample Size Calculation When a Multiple Comparison Adjustment is Planned
When more than two groups are compared, a multiple comparison procedure is frequently
appropriate. Usually, we will not be interested in the simultaneous comparison of the means
(e.g., one-way ANOVA) but rather will be interested in the pairwise comparisons. Therefore, a
good approach to sample size calculation for the k sample comparison is to think of the smallest
pairwise difference you want to detect, and then compute your sample size calculation using a
two group comparison method. When p value adjustments will be used (or any other class of
multiple comparison procedure), the required sample size is larger than when p value
adjustments are not needed (since the actual alpha used in the pairwise comparisons is smaller).
Therefore, instead of using alpha = 0.05, use alpha = 0.05/k, where k is the number of
comparisons to be adjusted with a p value adjustment procedure (Witte et al, 2000). The
resulting sample size will provide the desired power for all of the adjusted comparisons. This
approach works for any of the multiple comparisons procedures, since the smallest p value is
compared to 0.05/k at a minimum.
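A minimal sampsi sketch of this adjustment (an added illustration using the running example, not from the original chapter), with k = 3 pairwise comparisons and therefore alpha = 0.05/3 ≈ 0.0167; the sample size in the comment is approximate:

sampsi 5 4, sd1(2.5) sd2(2.5) power(.80) alpha(0.0167)   // roughly 131 per group, versus 99 at alpha = 0.05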
Overfitting
If you only require a small sample size, but you want to use a large number of predictor
variables, you must increase the sample size to accommodate this. Otherwise, you can introduce
unreliable correlations into the model that will not hold up in larger or future datasets.
This problem of detecting unreliable correlation, due to having too many predictor variables for
the available sample size, is called “overfitting” (Harrell et al, 1996).
The following rule of thumb for interpreting the size of a correlation coefficient will be useful for
this discussion.
Rule of Thumb for Interpreting the Size of a Correlation Coefficient
The Pearson correlation coefficient has the range -1 to 1, with -1 or +1 being perfect correlation
and 0 being no linear association. A rule of thumb for interpreting the size of the
correlation coefficient is presented by Hinkle et al (1998, p.120):
Rule of Thumb for Interpreting the Size of a Correlation Coefficient
   Size of Correlation                 Interpretation
   0.90 to 1.00 (-0.90 to -1.00)       Very high correlation
   0.70 to 0.90 (-0.70 to -0.90)       High correlation
   0.50 to 0.70 (-0.50 to -0.70)       Moderate correlation
   0.30 to 0.50 (-0.30 to -0.50)       Low correlation
   0.00 to 0.30 ( 0.00 to -0.30)       Little if any correlation
Imagine taking two pairs of random numbers, which are clearly not correlated except perhaps by
chance. If you plot these two ordered pairs (x, y), you can fit a straight line through them
perfectly. If you do this for three ordered pairs, you will most likely have a straight line that fits
them quite well, even though they should be uncorrelated. It is not until you get up to over 10
such pairs that the pattern begins to look random upon successive repetitions of this experiment.
Keep in mind that when you have only one predictor variable (and one outcome variable), the
multiple R from linear regression is exactly the Pearson correlation coefficient (similarly,
R² = r²). We can therefore do our experiment using the Pearson correlation coefficient, and the result
is the same for linear regression.
*-- overfitting simulation (correlate random numbers
*   using sample sizes of 2 through 10)
set obs 54
set seed 999
capture drop rand1 rand2 frame
gen rand1=invnorm(uniform()) in 1/54
gen rand2=invnorm(uniform()) in 1/54
gen frame=2 in 1/2          // frame 1: n=2
replace frame=3 in 3/5      // frame 2: n=3
replace frame=4 in 6/9      // frame 3: n=4
replace frame=5 in 10/14    // frame 4: n=5
replace frame=6 in 15/20    // frame 5: n=6
replace frame=7 in 21/27    // frame 6: n=7
replace frame=8 in 28/35    // frame 7: n=8
replace frame=9 in 36/44    // frame 8: n=9
replace frame=10 in 45/54   // frame 9: n=10
bysort frame: pwcorr rand1 rand2, obs sig
*
capture label drop framelab
#delimit ;
label define framelab 2 "n=2 (r= 1.00)" 3 "n=3 (r = -0.40)" 4 "n=4 (r=0.29)"
5 "n=5 (r= -0.05)" 6 "n=6 (r= -0.66)"
7 "n=7 (r=0.87)" 8 "n=8 (r= -0.41)" 9 "n=9 (r= -0.16)" 10 "n=10 (r=-0.22)"
;
#delimit cr
label values frame framelab
label variable frame "sample size"
#delimit ;
twoway (scatter rand1 rand2)(lfit rand1 rand2)
, by(frame,legend(off) title(Spurious Correlation of Pairs of Random
Numbers))
xtitle("")
;
#delimit cr
The result is:
             Correlation of Normal Random Numbers
                  for Sample Sizes n=2 to 10

   sample size    Pearson r    p value    Interpretation of Strength
        2            1.00       1.000     very high
        3           -0.40        .735     low
        4            0.29        .708     little
        5           -0.05        .940     little
        6           -0.66        .158     moderate
        7            0.87        .011     high
        8           -0.41        .316     low
        9           -0.16        .681     little
       10           -0.22        .540     little
[Figure. Spurious Correlation of Pairs of Random Numbers: nine scatterplots with fitted lines, one panel per sample size: n=2 (r=1.00), n=3 (r=-0.40), n=4 (r=0.29), n=5 (r=-0.05), n=6 (r=-0.66), n=7 (r=0.87), n=8 (r=-0.41), n=9 (r=-0.16), n=10 (r=-0.22). Graphs by sample size.]
In this example, the term “spurious” correlation is used, rather than “unreliable” correlation, to
emphasize that we are correlating two variables that are just random numbers (thus, the
correlation is 0 in the population). In actual datasets, a correlation might exist but be
overestimated, thus “unreliable”, since it would not hold up in future larger datasets.
Now, let’s repeat this simulation for sample sizes 2 to 100, using the following commands
clear all
set seed 888
quietly set obs 100
gen n = _n                                 // numbers 1 to 100
gen r = .
gen v1 = .
gen v2 = .
forvalues i=2(1)100 {
   quietly replace v1=invnorm(uniform())   // use a new sample on each iteration
   quietly replace v2=invnorm(uniform())
   quietly corr v1 v2 in 1/`i'             // correlation for incrementing sample size
   * return list                           // correlation coefficient saved in r(rho)
   quietly replace r = r(rho) in `i'/`i'
}
gen xval = 0 in 1/1                        // data to fit reference line
replace xval = 100 in 2/2
gen yval = 0 in 1/1
replace yval = 0 in 2/2
#delimit ;
graph twoway (line r n ) (line yval xval)
, title(Spurious Correlation)
xtitle(Sample Size) ytitle(Pearson r) legend(off)
xlabel(0(10)100) ylabel(-1(0.1)1,format(%3.1f)angle(horizontal))
;
#delimit cr
[Figure. Spurious Correlation: line plot of Pearson r (y-axis, -1.0 to 1.0) against Sample Size (x-axis, 0 to 100), with a horizontal reference line at r = 0.]
We see that the spurious correlation starts to settle down just over n=10 (although spurious
correlations as high as r=0.30 are still common even with sample sizes up to n=70). That is why
it is suggested that you use n=10 subjects for every predictor variable.
This is not to say that correlations < 0.30 should be ignored. Some things simply cannot be
explained to a large degree by a single variable, because of their complexity. Knoke et al (2002,
p.132) state,
“Typically, a single independent variable in social research seldom accounts for more
than 25% to 30% of the variance in a dependent variable, and often for as little as 2% to
5%.”
Fortunately, the sample sizes at which correlations of these sizes achieve statistical significance are about what
you would hope for, after observing the above simulation.
    r²      r     frequency occurs        statistically significant
                  in social research      (p < 0.05) when
   0.25    0.50   seldom                  n ≥ 16
   0.30    0.55   seldom                  n ≥ 14
   0.02    0.14   often                   n ≥ 196
   0.05    0.22   often                   n ≥ 80
Rule of Thumb for Sample Size Required to Avoid Overfitting
Harrell (2001, pp. 60-61) states:
“When a model is fitted that is too complex, that is it has too many free parameters to
estimate for the amount of information in the data, the worth of the model (e.g., R2) will
be exaggerated and future observed values will not agree with predicted values. In this
situation, overfitting is said to be present, and some of the findings of the analysis come
from fitting noise or finding spurious associations between X and Y. In this section
general guidelines for preventing overfitting are given. Here we concern ourselves with
the reliability or calibration of a model, meaning the ability of the model to predict future
observations as well as it appeared to predict the responses at hand. For now we avoid
judging whether the model is adequate for the task, but restrict our attention to the
likelihood that the model has significantly overfitted the data.
Studies in which models are validated on independent datasets [184, 186, 391] have shown
that in many situations a fitted regression model is likely to be reliable when the number
of predictors (or candidate predictors if using variable selection) p is less than m/10 or
m/20, where m is the ‘limiting sample size’ given in Table 4.1. For example, Smith et al.
[391] found in one series of simulations that the expected error in Cox model predicted five-year
survival probabilities was below 0.05 when p < m/20 for “average” subjects and
below 0.10 when p < m/20 for “sick” subjects, where m is the number of deaths. For
“average” subjects, m/10 was adequate for preventing expected errors > 0.1. Narrowly
distributed predictor variables (e.g., if all subjects’ ages are between 30 and 45 or only
5% of subjects are female) will require even higher sample sizes. Note that the number
of candidate variables must include all variables screened for association with the
response, including nonlinear terms and interactions.
TABLE 4.1: Limiting Sample Sizes for Various Response Variables

Type of Response Variable      Limiting Sample Size m
Continuous                     n (total sample size)
Binary                         min(n1, n2) [c]
Ordinal (k categories)         $n - \frac{1}{n^{2}}\sum_{i=1}^{k} n_i^{3}$  [d]
Failure (survival time)        number of failures [e]
__________
[c] See [329]. If one considers the power of a two-sample binomial test compared with a
Wilcoxon test if the response could be made continuous and the proportional odds
assumption holds, the effective sample size for a binary response is 3n1n2/n ≈ 3 min(n1,
n2) if n1/n2 is near 0 or 1 [452, Eq. 10, 15]. Here n1 and n2 are the marginal frequencies of
the two response levels.
[d] Based on the power of a proportional odds model two-sample test when the marginal
cell sizes for the response are n1, ..., nk, compared with all cell sizes equal to unity
(response is continuous) [452, Eq. 3]. If all cell sizes are equal, the relative efficiency of
having k response categories compared to a continuous response is 1 - 1/k² [452, Eq. 14];
for example, a five-level response is almost as efficient as a continuous one if
proportional odds holds across category cutoffs.
[e] This is approximate, as the effective sample size may sometimes be boosted somewhat
by censored observations, especially for nonproportional hazards methods such as
Wilcoxon-type tests [34].
___________
34. J. K. Benedetti, P. Liu, H.N. Sather, J. Seinfeld, and M.A. Epton. Effective sample
size for tests of censored survival data. Biometrika, 69:343-349, 1982.
184. F.E. Harrell, K.L. Lee, R.M. Califf, D.B. Pryor, and R.A. Rosati. Regression
modeling strategies for improved prognostic prediction. Statistics in Medicine,
3:143-152, 1984.
186. F.E. Harrell, K.L. Lee, D.B. Matchar, and T.A. Reichert. Regression models for
prognostic prediction: Advantages, problems, and suggested solutions. Cancer
Treatment Reports, 69:1071-1077, 1985.
329. P. Peduzzi, J. Concato, E. Kemper, T.R. Holford, and A.R. Feinstein. A simulation
study of the number of events per variable in logistic regression analysis. Journal of
Clinical Epidemiology, 49:1373-1379, 1996.
391. L.R. Smith, F.E. Harrell, and L.H. Muhlbaier. Problems and potentials in modeling
survival. In M.L. Grady and H.A. Schwartz, editors, Medical Effectiveness
Research Data Methods (Summary Report), AHCPR Pub. No. 92-00576, pages
151-159. US Dept. of Health and Human Services, Agency for Health Care Policy
and Research, Rockville MD, 1992.
452. J. Whitehead. Sample size calculations for ordered categorical data. Statistics in
Medicine, 12:2257-2271, 1993.”
Harrell (1996) states the m/10 rule for linear regression, logistic, and Cox in a much more
abbreviated form.
Protocol Suggestion
Here is some suggested wording for a study protocol:
Sample Size Determination
A sample size of n=30 is required for 90% power. However, we intend to include 10
predictor terms in the model (counting the number of indicator terms needed for the
categorical variables). [It is a more impressive presentation to list these.] To avoid
“overfitting,” where unreliable correlation is introduced by violating the rule of n=10
subjects for every predictor term, we will collect a sample size of n=100 (Harrell, 2001).
Overfitting: 5 to 9 events per variable in models to assess an effect while adjusting for
confounding
If you can only achieve 5 to 9 events per variable (EPV) in a logistic regression or Cox
regression, however, you are still okay if you cite Vittinghoff and McCulloch (2007) to support
this relaxed rule. In a large simulation study to investigate this rule, Vittinghoff and McCulloch
concluded,
“Our simulation study shows that the rule of thumb of 10 or more EPV in logistic and
Cox models is not a well-defined bright line. If we (somewhat subjectively) regard
confidence interval coverage less than 93 percent, type I error greater than 7 percent, or
relative bias greater than 15 percent as problematic, our results indicate that problems are
fairly frequent with 2-4 EPV, uncommon with 5-9 EPV, and still observed with 10-16
EPV. Cox models appear to be slightly more susceptible than logistic. The worst
instances of each problem were not severe with 5-9 EPV and usually comparable to
those with 10-16 EPV.”
Note: This 5 to 9 events per variable rule only applies to the situation where the aim is to examine
the association of an exposure while adjusting for confounding (Vittinghoff and McCulloch,
2007;Steyerberg, 2009, p.51). Specifically, it should not be used for developing prediction, or
prognostic, models (Steyerberg, 2009, p.50-51).
The final touch to the discussion on reporting the sample size determination in your study
protocol (grant application) is when you determine that you need a small sample size for the
number of predictor variables you plan to model. In that case, you need to increase your sample
size to avoid overfitting.
Protocol Suggestion
Here is some suggested wording for using 5 to 9 events per predictor variable in a logistic
regression or Cox regression in an article.
Sample Size Determination
We included 8 predictor variables in our logistic regression model, with 40 observed
events, resulting in 40/8, or 5 events per predictor variable. A suggested rule-of-thumb is
to have 10 events per predictor variable in a logistic regression model to avoid overfitting
(Harrell, 1996). However, it has since been shown that as few as 5 events per predictor
variable in logistic and Cox regression models are sufficient to avoid overfitting when the
aim of the model is to adjust for confounding (Vittinghoff and McCulloch, 2007).
Switching the Dependent and Independent Variables
It is sometimes easier to reverse the roles of dependent and independent variable for sample size
determination.
Discussing the arbitrariness of which is considered the dependent and which is independent
variable, Chinn (2001, p.394, last paragraph) states,
“Before presentation and analysis can be discussed, the distinction between outcome, or
dependent, variables and explanatory variables, also called independent or exposure
variables, needs to be clarified. Usually there will be no confusion. In a randomized
controlled trial, survival or recovery of the patient may be the outcome of interest and the
treatment group is the explanatory variable. There may be additional explanatory
variables, such as age and sex, and these should include any variable used to stratify the
patients in the RCT. However, in some circumstances there is ambiguity. In a case-control study, subjects are selected as having the disease, the cases, or not having the
disease, the controls, and the measured potential risk factors are the outcomes of the
study. The data analysis proceeds by treating “caseness” as the outcome and the risk
factors as explanatory variables, but strictly speaking the opposite is true. In a study of
asthmatic patients presenting in Accident and Emergency it is possible to compare the
ages of patients that do or do not require admission or to analyse the risk of admission by
age. In the first analysis, age is treated as the outcome and admission the independent
variable, but more logically in the second, admission is the outcome. Although a
conclusion that increasing age is associated with lower risk of admission might be found
from either analysis, the second leads to results in a more useful form and also enables
adjustment for risk factors other than age.”
Protocol suggestion (for a case-control study design, for example)
The independent and dependent variables can be reversed for sample size determination,
just as in tests of significance, since an association is invariant to the choice (Chinn,
2001). As a simple illustration, it makes no difference which is the Y and which is the X
variable when computing a correlation coefficient, and this practice of switching
dependent and independent variables is commonplace in the patients characteristics table
of journal articles. To test the association between the cases and controls on the
continuous variable X, assuming mean±SDs of ….
Excessive Power (Sample Size Very Large)
If your sample size is too large, even trivial differences are statistically significant. When this
happens, p values are no longer useful. You should then not bother with p values, and instead
just show effects, such as mean differences, along with confidence intervals. State whether or
not the difference observed is thought to be clinically significant or not. A good discussion, and
citation, for this is section called “Studies With Excessive Power That Detect Differencs That
Are Not Clinically Meaningful” on page 1521 in Bhardwaj et al (2004).
Sample Size Based on Precision (Desired Width of Confidence Interval)
Sometimes the goal of a study is to demonstrate an effect with a desired precision. That is, the
investigator wants to report an effect with a desired width of the confidence interval around the
effect. A good example is computing a reliability coefficient, such as kappa, with “good”
precision, such as a 95% CI of Kappa±0.05. Studies designed to report test characteristics of a
diagnostic test, such as sensitivity and specificity, can also benefit from the precision approach to
sample size determination.
If the investigator desires a statistically significant p value, as well as desired precision around the
effect estimate, then sample size is determined both for adequate power to detect the effect, as
well as determined to achieve desired precision. The sample size is then selected as the larger of
the two, so that both goals are achievable. (Bristol, 1989).
The approach to computing sample size for a desired precision is based only on the formula for
the confidence interval, rather than the power of a significance test. At first, this seems counter-intuitive, since it seems we want to be 80% sure, for example, that the CI will be no wider than
the desired width. In significance testing, where we want to test if the mean difference is
different from zero, the CI can be used as the significance test, where we conclude a difference
only if the CI does not cover zero. We want to be 80% sure, or have 80% power, that the CI does not
cover 0. However, that concerns what happens on the outside of the CI. The precision is only based on
the inside of the CI, where precision for a CI of (a, b) is the width, or b-a. For precision, we
don’t actually care if the interval includes 0.
For sample size determination for a significance test, we assume the means and standard
deviations and then determine n for a desired power, which is only as reliable as our assumptions
about the means and standard deviations. For precision, given the means, standard deviations,
and sample size, the width of the interval is determined directly from the CI formula.
Given any three of the quantities (means, SDs, n, width), the fourth is solvable by algebra.
So, if our assumptions about the means and SDs are reliable, and the width is set by what we
desire, then n is determined simply by algebra. If our assumptions are off, then our n will be off.
However, we have the same problem in ordinary sample size determination for significance
testing, so we are in no worse a situation.
Precision for a single mean
Rosner (2006, p.258-59) gives the formula for precision of a single mean. Using the asymptotic,
or large sample, formula for a CI around a mean, the 100(1-α)% CI for the population mean, μ, is
given by
\[
\bar{X} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}
\]
The width of this interval (a, b) is b - a, or \(2(z_{1-\alpha/2}\,\sigma/\sqrt{n})\). If we wish the interval to be no wider
than L, then
\[
2(z_{1-\alpha/2}\,\sigma/\sqrt{n}) \le L
\]
Multiplying both sides by \(\sqrt{n}/L\), we obtain
\[
2 z_{1-\alpha/2}\,\sigma/L \le \sqrt{n}
\]
Then squaring both sides gives us the number of subjects needed to obtain a CI no wider than L,
\[
n \ge 4 z_{1-\alpha/2}^{2}\,\sigma^{2}/L^{2}
\]
For a two-sided 95% CI,
\[
z_{1-\alpha/2} = z_{1-0.05/2} = z_{0.975} = 1.96
\]
In Rosner’s example (Rosner, 2006, pp.258-59), where we want to estimate the mean change in
heart rate with a 95% CI no wider than 5 beats per minute, and with SD = 10 beats per minute, we
require
\(n = 4(1.96)^{2}(10)^{2}/(5)^{2} = 61.5\), or 62 patients.
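As a check, the same arithmetic can be done in Stata (a minimal sketch; invnormal(0.975) returns the 1.96 critical value used above):
* Sample size for a 95% CI no wider than L=5, with SD=10 (Rosner's example)
display 4*invnormal(0.975)^2*10^2/5^2    // about 61.5, round up to 62 patients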
Precision for a difference between two means
Bristol (1989) gives the formula for estimating the required sample size to obtain a desired
precision around the difference between two means.
Beginning with the z test for the difference between two means,
\[
z = \frac{\bar{X} - \bar{Y} - (\mu_x - \mu_y)}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}
  = \frac{\bar{X} - \bar{Y} - 0}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}
  = \frac{\bar{X} - \bar{Y}}{\sigma\sqrt{\dfrac{2}{n}}}
  \quad\text{when } n_1 = n_2 = n
\]
The CI around the mean difference, then, is given by
\[
(\bar{X} - \bar{Y}) - z_{\alpha/2}\,\sigma\sqrt{\tfrac{2}{n}}
\;\;\text{to}\;\;
(\bar{X} - \bar{Y}) + z_{\alpha/2}\,\sigma\sqrt{\tfrac{2}{n}}
\]
The width of this interval is
\[
2 z_{\alpha/2}\,\sigma\sqrt{\tfrac{2}{n}}
\]
Setting this width equal to the desired width L and solving for n gives
\[
n = \left(\frac{2\sqrt{2}\,\sigma\, z_{\alpha/2}}{L}\right)^{2}
  = 8\left(\frac{\sigma\, z_{\alpha/2}}{L}\right)^{2}
\qquad\text{(Bristol’s equation 3)}
\]
Assuming the standard deviation, σ, is 1, and we desire L = 1, the required n is
\[
n = 8\left(\frac{1 \times 1.96}{1}\right)^{2} = 8(1.96)^{2} = 30.73,\ \text{or 31 subjects,}
\]
which agrees with the value shown in Bristol’s Table I in the row \(L_0/\sigma = 1/1 = 1.0\) and column \(n_L\).
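The same check in Stata (a minimal sketch of the arithmetic above, with σ = 1 and L = 1):
* Sample size for a 95% CI of width L=1 around a mean difference, sigma=1
display 8*(1*invnormal(0.975)/1)^2       // about 30.7, round up to 31 subjects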
These two citations, Rosner and Bristol, both illustrate that the required sample size for a desired
precision comes from simply applying the CI formula. The same point is illustrated in the Chow
et al (2008, pp.15-16) textbook on sample size calculation.
In practice, you usually have a desired sample size and simply want to show that it is large enough
to achieve a desired precision for some effect. For your power analysis, then, you simply compute
the CI for your effect and point out that the width of the CI represents adequate precision. In fact,
there is no real standard for what a desired precision is, so just show what you have and state that
“it is narrow enough to be informative.”
The width is usually expressed as one-half of the interval, since it generally does not matter if the
lack of precision is above or below an estimate in a two-sided 95% CI. Chow et al (2008, p.15)
describes the approach,
“For a (1 – α)100% confidence interval, the precision of the interval depends on its width.
The narrower the interval is, the more precise the inference is. Therefore, the precision
analysis for sample size determination is to consider the maximum half width of the (1 –
α)100% confidence interval of the unknown parameter that one is willing to accept. Note
that the maximum half width of the confidence interval is usually referred to as the
maximum error of an estimate of the unknown parameter.”
Protocol Suggestion
Suppose you are designing a study where you wish to show that first year orthopaedic surgical
residents can reliably determine that a knee joint requires revision due to infection. To be
conservative, you will assume that the reliability coefficient, kappa, is 0.70. You would like to
keep the number of residents to 4 and the number of patients to 30. You might state something
like the following:
To demonstrate that first year orthopaedic surgical residents can reliably determine that a
knee joint requires revision due to infection, the sample size estimation will be based on
acceptable precision around the kappa interrater reliability coefficient (Bristol, 1989;
Chow et al, 2008). It is anticipated that the kappa will be no lower than 0.70. A 95%
confidence interval no wider than ±0.15 would be considered sufficiently informative.
The following table shows the expected precision of the 95% confidence interval around
an expected kappa of 0.70.
Sample Size Determination Based on Precision of Kappa Estimate
Two-sided 95% Confidence Interval Around Kappa=0.70 (shown as percent)

                         Number of Patient Charts Reviewed
Number of
Reviewers      10      15      20      25      30      35      40      45      50
    3        41-90   47-87   50-85   52-84   54-83   54-82   56-81   57-81   58-81
    4        44-88   49-86   52-84   54-83   55-82   57-82   57-80   58-80   59-79
    9        48-87   52-84   54-82   56-81   57-80   58-80   59-80   60-79   60-78
   15        49-86   52-84   54-82   56-81   58-80   59-79   60-79   60-79   61-78
   20        49-86   52-83   55-82   57-81   58-80   59-79   60-79   60-78   61-78
   30        49-85   53-84   56-82   57-81   58-80   59-79   59-78   60-78   61-78
Adequate precision is obtainable with n=4 residents and 30 patients. We see that
increasing the number of residents above 4 and the number of patients above 30 does not
provide a large enough gain in precision to justify the increase in effort and cost of the
project.
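After the data are collected, the observed precision could be reported along the same lines. A minimal sketch in Stata, assuming hypothetical variables res1-res4 holding each resident’s rating (1 = revision required, 0 = not required) for the 30 charts, and a hypothetical standard error of 0.075 used purely for illustration:
* Multi-rater (Fleiss) kappa for the four residents' ratings
kap res1 res2 res3 res4
* An approximate 95% CI can then be formed as kappa +/- 1.96*SE,
* using the standard error reported in the output; for example, with
* kappa = 0.70 and a hypothetical SE of 0.075:
display 0.70 - 1.96*0.075    // lower limit, about 0.55
display 0.70 + 1.96*0.075    // upper limit, about 0.85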
Two-Group Comparison of an Interval-Scale Outcome: Sample Size Presentation in a Study
Protocol
Although most articles are published without a sample size paragraph in the Statistical Methods
section, for a grant application it is mandatory. This is because a granting agency does not
want to invest money in your study if it has little chance of success (that is, if the sample size is
not sufficient to demonstrate a true effect).
In your sample size paragraph, you should provide the five elements required for sample size
calculation, as well as the test procedure it is based on, so the reviewer can check your
calculation. This also informs the reviewer that you knew what was needed for the calculation,
which suggests you actually did one rather than guessing and bluffing your
way through. You should also describe padding the sample size to allow for subject dropouts
and losses-to-follow-up.
Protocol Suggestion
Something to the effect of the following is appropriate:
Our sample size calculation is based on the primary study aim (Study Aim 1). We
consider the minimal detectable effect to be a mean difference of 1.0 (means: Group A, 5,
Group B, 4). A smaller effect than 1.0 would likely translate into a negligible clinical
outcome, whereas a difference of 1.0 would likely be a noticeable improvement in the
patient’s well-being. Previous research suggests the mean difference will be at least 1.0
in this patient population. references We estimated the population standard deviations from
previous research (standard deviations: Group A, 2.5, Group B, 2.0).references Based on an
independent samples t test, we require n=109 in each group (total n=218) to have 90%
power, with alpha of 0.05, using a two-sided comparison. To provide for a 10%
reduction in evaluative subjects, due to dropouts and losses-to-follow-up, we will collect
a sample of n=121 in each group (total n = 242).
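The calculation behind the n=109 figure can be checked with a command of the following form (a sketch; Stata’s sampsi uses a normal approximation, so its answer may differ by a subject or two from software that applies a t-distribution correction):
* Required n per group for means 5 vs 4, SDs 2.5 and 2.0, 90% power
sampsi 5 4, sd1(2.5) sd2(2.0) power(0.9)    // two-sided alpha=0.05 is the default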
An even more impressive presentation (and good research practice as well) is to provide a
sensitivity analysis of your sample size calculation. This is simply a table showing the power
provided by the selected sample size for a reasonable range of deviations from both the assumed
effect size and assumed standard deviations.
It is a good idea to add a sensitivity analysis paragraph after the above paragraph, something to
the effect of:
Our evaluative sample size of n=109 in each group appears to be sufficiently robust to
reasonable deviations from our assumed effect size and standard deviation estimates, as
the power does not drop below 80% for any reasonable deviation.
Power for n=109 in each group for reasonable combinations
of deviations in assumed effect size and standard deviation estimates

Effect Size                        Standard Deviations
(mean difference)      2.5 & 2.0      2.75 & 2.25      3.0 & 2.50
      1.0                 90*              83               76
      0.9                 83               75               67
      0.8                 74               65               57
*Estimate used for our sample size calculation
Note: The above sentence does not match the table, since power does drop below 80%. We need
to either change the deviations in the table, qualify our statement by saying that some of these
deviations are very unlikely, or adjust our sample size to allow for the reasonable deviations in
the table. If we really suspect that deviations as large as these are possible, we should use a
sample size that gives us 80% power to detect a mean difference of 0.8 with standard deviations
of 3.0 & 2.50 (the lower-right cell of the table).
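The power entries in such a table can be generated by holding the sample size fixed and varying the assumptions (a sketch; the reported powers may differ from the table by a point or so depending on the software’s approximation):
* Power at n=109 per group under deviations from the assumed effect and SDs
sampsi 5 4, sd1(2.5) sd2(2.0) n1(109) n2(109)      // assumed effect and SDs
sampsi 5 4, sd1(2.75) sd2(2.25) n1(109) n2(109)    // larger SDs
sampsi 5 4, sd1(3.0) sd2(2.5) n1(109) n2(109)
sampsi 4.9 4, sd1(2.5) sd2(2.0) n1(109) n2(109)    // effect size 0.9
* ...and so on for the remaining combinations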
Sample Size Presentation in Article
A condensed version of a sample size determination, or power analysis, paragraph is frequently
reported in a research article. Horton and Switzer (2005) surveyed what statistical methods are
used in research articles published in NEJM. They found that 39% of research articles published
in 2004-2005 provided a power analysis.
In an article, a much shorter presentation is appropriate. Here is an example for a survival
analysis (Calverley et al., 2007),
“Assuming a 17% mortality rate in the placebo group at 3 years,17 we estimated that 1510
patients would be needed for each study group to detect a reduction in mortality of 4.3
percentage points in the combination-therapy group, as compared with the placebo group
(hazard ratio for death, 0.728), at a two-sided alpha level of 0.05 with 90% power.”
Example of Working Through a Sample Size for Comparing Two Means (Smoking
Cessation Program Example)
Suppose you have the following study aim in your study protocol, or grant proposal:
Aim 1. Assess whether a smoking cessation program designed around physicians
presenting health benefits is less effective for young adults (ages
21 to 39) than for not-so-young adults (ages 40 and above).
When preparing our statistical methods section, we would appear naïve and fail to get our grant
funded if we proposed a t test as our statistical method for testing the Aim 1 hypothesis. The
reviewer would immediately recognize that the effect is probably confounded, possibly by
gender and certainly by habit (cigarettes/day) upon entering the cessation program.
We might use something like,
Statistical Methods
To test our Aim 1 hypothesis [a smoking cessation program designed around physicians
presenting health benefits works less well for young adults (ages 21 to 39) than for
not-so-young adults (ages 40 and above)], we will use multivariable linear regression.
The outcome will be days remaining smoke free within the first year. The study factor (age
group) will be adjusted for potential confounding variables, such as gender and daily
smoking habit (cigarettes/day) at the start of the smoking cessation program.
We next need to prepare the “sample size justification” section for the grant application, where a
multivariable linear regression model is planned.
Since we know that the t test and linear regression are equivalent, we can base our sample size
calculation on the same formula that we would use for a t test.
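That equivalence is easy to verify once data are in hand. A minimal sketch, using hypothetical variable names (days = days remaining smoke free, agegrp = 0/1 indicator for the not-so-young group):
* The equal-variance t test and simple linear regression on a 0/1 group
* indicator give the same two-sided p-value for the group difference
ttest days, by(agegrp)
regress days agegrp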
Suppose we have our own preliminary data, or a previously published study with the same study
aim. In those data, we have the following mean number of days before study subjects returned to
smoking:
group 1: mean = 85, SD = 137
group 2: mean = 71, SD = 119
Calculating the sample size for powers of 80, 85, 90, and 95:
sampsi 85 71, sd1(137) sd2(119) power(0.8)
sampsi 85 71, sd1(137) sd2(119) power(0.85)
sampsi 85 71, sd1(137) sd2(119) power(0.9)
sampsi 85 71, sd1(137) sd2(119) power(0.95)
. sampsi 85 71, sd1(137) sd2(119) power(0.8)

Estimated sample size for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2
Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.8000
            m1 =       85
            m2 =       71
           sd1 =      137
           sd2 =      119
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =     1319
            n2 =     1319
Realistically, we can probably stop here. It is not likely that we can get n=1509 subjects in each
group to enroll in our smoking cessation program and effectively follow them for one year. For
80% power, we need n=1319 in each group (as shown above); for 85% power, n=1509; for 90%
power, n=1766; and for 95% power, n=2184 in each group. Let’s keep going, however, for
illustration.
A clever sample size determination paragraph will include a sensitivity analysis, which is a table
that shows the power for various effect sizes for the sample size you are requesting in your grant.
sampsi 78 71, sd1(137) sd2(119) n1(1500) n2(1500)    // 1 week diff
sampsi 85 71, sd1(137) sd2(119) n1(1500) n2(1500)    // 2 week diff
sampsi 92 71, sd1(137) sd2(119) n1(1500) n2(1500)    // 3 week diff
sampsi 101 71, sd1(137) sd2(119) n1(1500) n2(1500)   // 4 week diff
Estimated power for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2
Assumptions:

         alpha =   0.0500  (two-sided)
            m1 =       78
            m2 =       71
           sd1 =      137
           sd2 =      119
sample size n1 =     1500
            n2 =     1500
         n2/n1 =     1.00

Estimated power:

         power =   0.3209
For 1 week, the power is 32%; for 2 weeks, the power is 85%; for 3 weeks, the power is 99%; for
4 weeks, the power is 100%.
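As a rough hand check of the first of these figures (a sketch using the large-sample normal approximation, which ignores the negligible opposite tail):
* Approximate power for a 1-week (7-day) difference with n=1500 per group
display normal(7/sqrt((137^2 + 119^2)/1500) - invnormal(0.975))   // about 0.32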
We are now ready to write our sample size justification paragraph.
Sample Size Justification
To test our Aim 1 hypothesis, we require n=1509 in each of two groups (total N=3018) to
detect a difference in days remaining smoke free of 2 weeks. This sample size gives us
85% power, for a two-sided comparison, using alpha=0.05 and standard deviation
estimates from our pilot study (control group SD=137, intervention group SD=119).
From our pilot study, we feel confident that no less than a two-week difference will be
observed, and this difference seems to us to be a meaningful effect given the difficulty
that smokers have in quitting smoking. As shown in the following sensitivity analysis, a
larger effect would be nearly certain to be detected if it exists.
Sensitivity of power for various effect sizes
(n=1509 for each of two groups, total N=3018)
Difference in days remaining smoke free     1 week    2 weeks    3 weeks    4 weeks
Power                                         32%       85%        99%       100%
To allow for a 10% dropout and losses to follow-up, we will collect n=1.1 x 1509, or
1660 in each group (total N=3,320).
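The padding arithmetic can be checked directly (a trivial sketch):
display ceil(1.1*1509)      // 1660 per group
display 2*ceil(1.1*1509)    // total N = 3320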
References
Bhardwaj SS, Camacho F, Derrow A, et al. (2004). Statistical significance and clinical relevance:
the importance of power in clinical trials in dermatology. Arch Dermatol 140:1520-1523.
Borenstein M. (1997). Hypothesis testing and effect size estimation in clinical trials. Annals of
Allergy, Asthma, & Immunology 78:5-11.
Borenstein M, Rothstein H, Cohen J. (2001). SamplePower® 2.0. Chicago, SPSS Inc.
software can be purchased at http://www.spss.com
Bristol DR. (1989). Sample size for constructing confidence intervals and testing hypotheses.
Statist Med 8:803-811.
Browne RH. (2001). Using the sample range as a basis for calculating sample size in power
calculations. The American Statistician, 55(4):293-298.
Cahen DJ, Gouma DJ, Nio Y, et al. (2007). Endoscopic versus surgical drainage of the
pancreatic duct in chronic pancreatitis. N Engl J Med 356(7):676-684.
Chow S-C, Shao J, Wang H. (2008). Sample Size Calculations in Clinical Research. 2nd ed. New
York, Chapman & Hall/CRC.
Chinn S. (2001). Statistics for the European Respiratory Journal. Eur Respir J 18:393-401.
Cohen J. (1965). Some statistical issues in psychological research. In B.B. Wolman (Ed.),
Handbook of Clinical Psychology. New York, McGraw-Hill. pp. 95-121.
Cohen J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, New
Jersey.
Fleiss JL. (1973). Statistical Methods For Rates and Proportions. New York, John Wiley &
Sons. pp.21-22.
Harrell Jr FE, Lee KL, Mark DB. (1996). Tutorial in Biostatistics: Multivariable prognostic
models: issues in developing models, evaluating assumptions and adequacy, and
measuring and reducing errors. Statist Med 15:361-387
Harrell FE Jr. (2001). Regression Modeling Strategies With Applications to Linear Models,
Logistic Regression, and Survival Analysis. New York, Springer.
Hinkle DE, Wiersma W, Jurs SG. (1998). Applied Statistics for the Behavioral Sciences, 4th ed.
Boston, Houghton Mifflin Company.
Hoenig JM, Heisey DM. (2001). The abuse of power: the pervasive fallacy of power calculations
for data analysis. The American Statistician 55(1):19-24.
Chapter 2-5 (revision 16 May 2010)
p. 41
Horton NJ, Switzer SS. (2005). Statistical methods in the Journal. [letter] NEJM 353(18):1977-79.
Hovi P, Andersson S, Eriksson JG, et al. (2007). Glucose regulation in young adults with very
low birth weight. N Engl J Med 356(20):2053-63.
Knoke D, Bohrnstedt GW, Mee AP. (2002). Statistics for Social Data Analysis, 4th ed. Itasca, IL,
F.E. Peacock Publishers.
Lilienfeld DE, Stolley PD (1994). Foundations of Epidemiology, 3rd ed., New York, Oxford
University Press.
Miller RG, Moore DH, Gelinas DF, et al. (2001). Phase III randomized trial of gabapentin in
patients with amyotrophic lateral sclerosis. Neurology 56:843-848.
Papi A, Canonica GW, Maestrelli P, et al. (2007). Rescue use of beclomethasone and albuterol in
a single inhaler for mild asthma. N Engl J Med 356(20):2040-52.
Petrie A. (2006). Statistics in orthopaedic papers. Journal of Bone and Joint Surgery 88-B(9):
1121-1136.
Rosner B. (2006). Fundamentals of Biostatistics, 6th ed. Belmont CA, Thomson Brooks/Cole.
Ross S. (1998). A First Course in Probability, 5th ed. Upper Saddle River, NJ.
Smith D, Eggen M, St. Andre R. (1997). A Transition to Advanced Mathematics, 4th ed.,
Pacific Grove, California, Brooks/Cole Publishing Company.
Steyerberg EW. (2009). Clinical Prediction Models: A Practical Approach to Development,
Validation, and Updating. New York, Springer.
Stoddard GJ, Ring WH. (1993). How to evaluate study methodology in published clinical
research. J Intravenous Nursing 16(2):110-117.
Vittinghoff E, McCulloch CE. (2007). Relaxing the rule of ten events per variable in logistic and
Cox regression. Am J Epidemiol 165(6):710-718.
Whitehead J. (1993). Sample size calculations for ordered categorical data. Statistics in Medicine
12:2257-2271.
Witte JS, Elston RC, Cardon LR. (2000). On the relative sample size required for multiple
comparisons. Statist Med 19:369-372.