COURSE 6: SAMPLE SIZE
CALCULATION OF SAMPLE SIZE
    Preparing to Calculate Sample Size
    Sample Size Calculations for Dichotomous Response Variables
    Sample Size Calculations for Continuous Response Variables
    Sample Size for Time-to-failure Data (censored data case)
    Sample Size for Testing Equivalence of Treatments
    Pocock's Table 9.1
SAMPLE SIZE TABLES
POWER AND SAMPLE SIZE (PS) SOFTWARE
    How to Download PS
    How to Use PS: Examples
APPENDIX: PAGANO TABLE A.3
CALCULATION OF SAMPLE SIZE
Clinical trials should have sufficient statistical power to detect differences
between groups considered to be of clinical interest. Therefore, calculation of
sample size with provision for adequate levels of significance and power is an
essential part of trial planning.
Type I error, Type II error, p-value, and power of a test

                        H0 is true           H0 is not true
  Reject H0             α (type I error)     1 – β (power)
  Do not reject H0      1 – α                β (type II error)
Power       = the probability of REJECTING the null hypothesis if a specific alternative is true
Power       = fn(α, variation, clinically significant level, and SAMPLE SIZE)
⇒ Sample size = fn(power, variation, clinically significant level)
p-value     = the probability we would have observed this difference (or a greater difference) if the null hypothesis were true
Calculation of a proper sample size is necessary to ensure adequate levels of
significance and power to detect differences of clinical interest.
Biggest danger: Sample size too small ⇒ no significant difference found ⇒ a treatment that may be useful is discarded.
Sample size calculations are approximate:
•  Often based on roughly estimated parameter values.
•  Usually based on mathematical models that only approximate truth.
•  Changes may occur in the target population, the eligibility criteria, or the expected treatment effect before the study begins.
⇒ Be conservative when estimating sample size.
Preparing to Calculate Sample Size
1. What is the main purpose of the trial? (This is the question on which sample size is based.)
2. What is the principal measure of patient outcome (endpoint)? Is this measure continuous or discrete? Is there censoring?
3. What statistical test will be used to assess treatment difference (e.g., t-test, log-rank, chi-square)? With what α-level? One-tailed or two-tailed?
4. What result is anticipated with the standard treatment (e.g., average value or rate)?
5. How small a treatment difference is it important to detect (δ), and with what degree of certainty (power = 1 – β)?

α     = type I error
      = probability of rejecting H0: δ = 0 when H0: δ = 0 is true

β     = type II error
      = probability of not rejecting H0: δ = 0 when δ ≠ 0. β changes as a function of δ:
        δ near zero ⇒ β is large; δ far from zero ⇒ β is small

1 – β = power (as a function of δ)
      = probability of rejecting H0: δ = 0 when δ ≠ 0
        δ near zero ⇒ low power; δ far from zero ⇒ high power
Often we set α = 0.05 or 0.01, but want to check various values of n, β, and δ.
For fixed n, a plot of power = 1 – β vs. δ is a power curve.
For a two-sided test, it looks like this:
[Figure: power curves for a two-sided test, plotting power = 1 – β against δ. Power equals α at δ = 0 (H0 true), increases as δ moves away from 0 (H0 false), and is uniformly higher for large n than for small n.]
Alternatively, we could plot any two parameters for fixed values of the others. For example, for fixed α and β, we could plot n vs. |δ|:

[Figure: required sample size n plotted against |δ| for fixed α, with separate curves for power 1 – β = 0.9 and 1 – β = 0.8; n decreases as |δ| increases and is larger for the higher power.]
Because sample size planning often involves a trade-off between desired
sample size, cost, and patient resources, such curves are useful.
Alternatively, sample sizes may be based on lengths of confidence intervals instead of power. If this is done, it is still best to check that the resulting power is adequate. In either case, confidence intervals are useful for reporting results.
These sample size methods assume a single final analysis at the end of the trial. Interim analyses increase the chance of finding a significant difference ⇒ either adjust the sample size accordingly or use group sequential testing methods.
Sample size methods will next be given for dichotomous, continuous, and
continuous but censored data.
Sample Size Calculations for Dichotomous Response Variables
Compare drug A (standard) vs. drug B (new).
P_A = proportion of failures expected on drug A
P_B = proportion of failures on drug B that one would want to detect as being different
Note: δ = P_A – P_B
We want to test H0: P_A = P_B vs. Ha: P_A ≠ P_B (P = true value) with significance level α, and power 1 – β to detect a difference of δ = P_A – P_B.
The total sample size required (N in each group) is:

    2N = \frac{2\left[Z_{\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + Z_{\beta}\sqrt{p_A(1-p_A) + p_B(1-p_B)}\right]^2}{(p_A - p_B)^2},

where \bar{p} = (p_A + p_B)/2, and Z_{α/2} and Z_β are critical values of the standard normal distribution; for example, for α = 0.05 (two-sided test), Z_{0.05/2} = 1.96. The table below gives Z_{α/2} and Z_β for common values of α and β.

   α        Z_{α/2}        1 – β     Z_β
  0.10      1.645          0.80      0.84
  0.05      1.960          0.85      1.03
  0.025     2.240          0.90      1.282
  0.01      2.576          0.95      1.645
Example

P_A = 0.4
P_B = 0.3
(An "event" is a failure, so we want a reduced proportion on the new therapy.)
Let α = 0.05, 1 – β = 0.90, two-sided test.
Note: \bar{p} = (0.4 + 0.3)/2 = 0.35.
From the table provided, we have Z_{α/2} = 1.96 and Z_β = 1.282. Substituting those values into the formula gives:

    2N = \frac{2\left[1.96\sqrt{2(0.35)(0.65)} + 1.282\sqrt{(0.4)(0.6) + (0.3)(0.7)}\right]^2}{(0.4 - 0.3)^2} = 952.3

Rounding up to the nearest 10 yields 2N = 960, or N = 480 in each group.
The tables from Fleiss (handout) give sample sizes for several cases calculated using an adjusted version of the above formula.
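The calculation above is easy to check numerically. Below is a minimal Python sketch (not part of the original notes; the function name and defaults are my own) that implements the uncorrected formula and reproduces the worked example:

```python
# A minimal sketch (not from the course notes): per-group sample size for a
# two-sided comparison of two proportions, using the uncorrected formula above.
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group_dichotomous(p_a: float, p_b: float,
                            alpha: float = 0.05, power: float = 0.90) -> int:
    """Uncorrected per-group N for testing H0: pA = pB (two-sided)."""
    z = NormalDist()                      # standard normal
    z_alpha2 = z.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)             # 1.282 for power = 0.90
    p_bar = (p_a + p_b) / 2
    num = (z_alpha2 * sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))) ** 2
    return ceil(num / (p_a - p_b) ** 2)

# Worked example: about 477 per group (the notes compute 2N = 952.3 and round
# up to 960, i.e., 480 per group).
print(n_per_group_dichotomous(0.4, 0.3))
```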
Pocock's sample size formula for dichotomous response variables

The variance of p̂_A – p̂_B equals var(p̂_A) + var(p̂_B) if the two samples are independent. The binomial variance [i.e., the variance of p̂ = x/n, where x has the binomial(n, p) distribution] is a function of p(1 – p). The trouble is that we don't know the true values, p_A and p_B, needed to compute the true variance. (If we did, we wouldn't have to do the experiment in the first place!)

The variance of p̂_A – p̂_B under H0: p_A = p_B = p is a function of 2p̄(1 – p̄), and the variance of p̂_A – p̂_B under Ha: p_A ≠ p_B is a function of p_A(1 – p_A) + p_B(1 – p_B). (The sample size formula derivation, given later, shows why the former is multiplied by Z_{α/2} and the latter by Z_β.) Often these two values will be very similar.

Pocock uses p_A(1 – p_A) + p_B(1 – p_B) in both places in the sample size formula above, which simplifies the formula considerably:

    2N = \frac{2\left[p_A(1-p_A) + p_B(1-p_B)\right]\left(Z_{\alpha/2} + Z_{\beta}\right)^2}{(p_A - p_B)^2}

Pocock's formula uses proportions multiplied by 100% (e.g., 75% instead of 0.75), but this change in scale cancels in the numerator and denominator, and gives the same result as using proportions.

Pocock's Table 9.1 gives (Z_{α/2} + Z_β)² for several values of α and β.
Table 9.1 (Pocock). Values of f(α, β) to calculate the required number of patients for a trial

  α (type I            β (type II error)
    error)       0.05      0.1       0.2       0.5
    0.10         10.8      8.60      6.20      2.7
    0.05         13.0      10.5      7.90      3.8
    0.02         15.8      13.0      10.0      5.4
    0.01         17.8      14.9      11.7      6.6
N adjusted for continuity correction
(Fleiss, 1981; Casagrande et al., 1978)

Recall: the underlying distribution is binomial (discrete), which we approximate with a normal distribution (continuous).

Using the continuity correction leads to the following adjustment in sample size:

    N_{corrected} = \frac{N}{4}\left[1 + \sqrt{1 + \frac{4}{N\,|p_A - p_B|}}\right]^2

Using the previous example, with p_A = 0.4, p_B = 0.3, N = 480:

    N_{corrected} = \frac{480}{4}\left[1 + \sqrt{1 + \frac{4}{480\,|0.4 - 0.3|}}\right]^2 = 499.8 \approx 500

Using the uncorrected N, the sample size would be too small by 2 × (500 – 480) = 40 patients.

The corrected N is recommended, and the continuity-corrected test statistic also should be used. Corrected values are tabulated for extensive combinations of α, β, p_A, and p_B in the references.
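As an illustrative sketch (again, not from the notes), the continuity correction can be applied on top of the uncorrected per-group N computed earlier:

```python
# Continuity-corrected per-group N (Fleiss/Casagrande adjustment), applied to
# an uncorrected per-group N such as the one sketched earlier.
from math import ceil, sqrt

def n_continuity_corrected(n_uncorrected: float, p_a: float, p_b: float) -> int:
    """N/4 * [1 + sqrt(1 + 4/(N*|pA - pB|))]^2, rounded up."""
    n = n_uncorrected
    return ceil(n / 4 * (1 + sqrt(1 + 4 / (n * abs(p_a - p_b)))) ** 2)

# With N = 480, pA = 0.4, pB = 0.3 this gives 500, matching the worked example.
print(n_continuity_corrected(480, 0.4, 0.3))
```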
For example, for power = 0.80)  Z  = 1.96, Z  

PA
PB
N
0.05
0.10
0.20
0.30
0.40
0.45
0.50
0.60
0.70
0.80
0.85
0.15
0.20
0.30
0.40
0.50
0.55
0.60
0.70
0.80
0.90
0.95
140
199
293
356
387
391
387
356
293
199
140
References:
Fleiss JL. Statistical Methods for Rates and Proportions. 2nd ed. New
York, NY: Wiley; 1981.
Casagrande JT, Pike MC, Smith PG. An improved approximate
formula for calculating sample sizes for comparing two binomial
distributions. Biometrics 1978;34(3):483-486.
Effect of binomial variance on sample size

Recall that the variance of p̂ is a function of p(1 – p), which is graphed below:

[Figure: plot of p(1 – p) against p for p between 0 and 1; the curve rises from 0 to a maximum of 0.25 at p = ½ and falls back to 0 at p = 1.]

The variance of p̂ is largest when p = 0.5, and smallest when p is near 0 or 1.
⇒ Larger sample sizes are required to detect a change, p_A – p_B, when p_A and p_B are near 0.5. Smaller sample sizes are required for p_A and p_B near 0 or 1.

If one has no idea about the true value of p, then one can conservatively use p = 0.5 in the variance formula for sample size calculations.

In general, dichotomous outcomes require substantial sample sizes to detect moderate differences. Continuous outcomes usually require smaller sample sizes.
Derivation of the (uncorrected) sample size formula

Let p̂_A and p̂_B be the sample proportions, and let N be the sample size in each group.

To test H0: p_A = p_B vs. Ha: p_A > p_B (a one-tailed test is used for simpler calculations), we use the test statistic:

    Z = \frac{\hat{p}_A - \hat{p}_B}{\sqrt{2\bar{p}\bar{q}/N}},

where \bar{p} = \tfrac{1}{2}(\hat{p}_A + \hat{p}_B) and \bar{q} = 1 - \bar{p}.

Testing at level α means:

    P(reject H0 | H0 true) = P(Z > Z_α | H0 true) = α

We can perform an α-level test for any sample size (recall the power curve). To determine N, we need to specify β and δ.

For a given β and δ = p_A – p_B, we have:

    P(reject H0 | Ha true) = P(Z > Z_α | p_A – p_B = δ) = 1 – β

This probability is a function of N (because Z is a function of N), so we can solve the equation for N.

However, the Z statistic does not have a standard normal distribution if Ha is true. p̂_A – p̂_B was standardized assuming H0 is true, so we must un-standardize and then re-standardize.

Recall that under Ha:

    \hat{p}_A - \hat{p}_B \sim N\!\left(p_A - p_B,\ \frac{p_A q_A + p_B q_B}{N}\right),

where q = 1 – p.

So:

    1 - \beta = P(Z > Z_\alpha \mid p_A - p_B = \delta)
              = P\!\left(\frac{\hat{p}_A - \hat{p}_B}{\sqrt{2\bar{p}\bar{q}/N}} > Z_\alpha \,\Big|\, p_A - p_B = \delta\right)

Un-standardizing:

    = P\!\left(\hat{p}_A - \hat{p}_B > Z_\alpha\sqrt{2\bar{p}\bar{q}/N} \,\Big|\, p_A - p_B = \delta\right)

And re-standardizing:

    = P\!\left(\frac{(\hat{p}_A - \hat{p}_B) - (p_A - p_B)}{\sqrt{(p_A q_A + p_B q_B)/N}} > \frac{Z_\alpha\sqrt{2\bar{p}\bar{q}/N} - (p_A - p_B)}{\sqrt{(p_A q_A + p_B q_B)/N}} \,\Big|\, p_A - p_B = \delta\right),

where the left-hand quantity is a standard normal random variable. Since 1 – β = P(standard normal > –Z_β), where Z_β is a critical value from the normal distribution, we must also have:

    -Z_\beta = \frac{Z_\alpha\sqrt{2\bar{p}\bar{q}/N} - (p_A - p_B)}{\sqrt{(p_A q_A + p_B q_B)/N}}
We can now solve this equation for N. First multiply by \sqrt{N}/\sqrt{N} = 1:

    -Z_\beta = \frac{Z_\alpha\sqrt{2\bar{p}\bar{q}} - (p_A - p_B)\sqrt{N}}{\sqrt{p_A q_A + p_B q_B}}

    \Rightarrow\ -Z_\beta\sqrt{p_A q_A + p_B q_B} = Z_\alpha\sqrt{2\bar{p}\bar{q}} - (p_A - p_B)\sqrt{N}

    \Rightarrow\ (p_A - p_B)\sqrt{N} = Z_\alpha\sqrt{2\bar{p}\bar{q}} + Z_\beta\sqrt{p_A q_A + p_B q_B}

    \Rightarrow\ N = \left[\frac{Z_\alpha\sqrt{2\bar{p}\bar{q}} + Z_\beta\sqrt{p_A q_A + p_B q_B}}{p_A - p_B}\right]^2,

which is the sample size formula given earlier.

N is the number required in each group. The formula given earlier for 2N multiplies the above result by 2.

(Note that \bar{p}\bar{q} is still an unknown quantity, but we approximate \bar{p} by (p_A + p_B)/2.)
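The derivation can be sanity-checked numerically: compute N from the one-tailed formula and confirm that the normal-approximation power at that N comes back as roughly 1 – β. The sketch below is illustrative only; the function names are my own.

```python
# Sanity check of the derivation: compute N from the one-tailed formula, then
# verify that the normal-approximation power at that N is about 1 - beta.
from math import sqrt
from statistics import NormalDist

z = NormalDist()

def n_one_tailed(p_a, p_b, alpha=0.05, power=0.90):
    z_a, z_b = z.inv_cdf(1 - alpha), z.inv_cdf(power)
    p_bar = (p_a + p_b) / 2
    num = z_a * sqrt(2 * p_bar * (1 - p_bar)) + z_b * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    return (num / (p_a - p_b)) ** 2

def approx_power(p_a, p_b, n, alpha=0.05):
    z_a = z.inv_cdf(1 - alpha)
    p_bar = (p_a + p_b) / 2
    se0 = sqrt(2 * p_bar * (1 - p_bar) / n)              # SE under H0
    se1 = sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / n)  # SE under Ha
    return 1 - z.cdf((z_a * se0 - (p_a - p_b)) / se1)

n = n_one_tailed(0.4, 0.3)                   # about 388 per group (one-tailed)
print(round(approx_power(0.4, 0.3, n), 3))   # ~0.9, as planned
```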
Sample size based on width of confidence intervals
(McHugh & Le, 1984)

If we want a C.I. of width 2d (i.e., \hat{\delta} \pm d), then solve for N:

    d = Z_{\alpha/2}\sqrt{(p_A q_A + p_B q_B)/N}

    \Rightarrow\ \sqrt{N} = \frac{1}{d}\,Z_{\alpha/2}\sqrt{p_A q_A + p_B q_B}

    \Rightarrow\ N = \left[\frac{Z_{\alpha/2}}{d}\right]^2 (p_A q_A + p_B q_B)

For the previous example, p_A = 0.4, p_B = 0.3, N = 480, and Z_{α/2} = 1.96:

    d = 1.96\sqrt{\left[(0.4)(0.6) + (0.3)(0.7)\right]/480} = 0.06

If we wanted a C.I. of width 2(0.05) instead of 2(0.06), the required sample size would be:

    N = \left[\frac{1.96}{0.05}\right]^2 \left[(0.4)(0.6) + (0.3)(0.7)\right] = 691.5
Reference:
McHugh RB, Le CT. Confidence estimation and the size of a clinical trial.
Control Clin Trials 1984;5(2):157-163.
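A minimal sketch of this confidence-interval-based calculation (illustrative; the function name and the 95% default are my own choices):

```python
# Per-group N so that a two-sided 95% CI for pA - pB has half-width d
# (an illustrative sketch of the McHugh & Le style calculation above).
from math import ceil
from statistics import NormalDist

def n_for_ci_halfwidth(p_a: float, p_b: float, d: float, conf: float = 0.95) -> int:
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # 1.96 for 95%
    return ceil((z / d) ** 2 * (p_a * (1 - p_a) + p_b * (1 - p_b)))

# Matches the worked example: half-width 0.05 needs about 692 per group (691.5 above).
print(n_for_ci_halfwidth(0.4, 0.3, 0.05))
```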
Adjustment for noncompliance (crossovers)
Assume a new treatment is being compared with a standard treatment.
Dropouts:
those who refuse the new treatment some time after
randomization and revert to the standard treatment
Drop-ins:
those who receive the new treatment some time after
initial randomization to the standard treatment
These generally dilute the treatment effect.
Example: drug A vs. placebo
Suppose the true values are:

    p_A = 0.6
    p_placebo = 0.4
    δ = 0.6 – 0.4 = 0.2

Enroll N = 100 patients in each treatment group.
•  25% of the drug A group drops out.
•  10% of the placebo group drops in.

So, instead of observing E(p̂_A) = 0.6 and E(p̂_B) = 0.4, we observe:

    E(p̂_A) = (75/100)(0.6) + (25/100)(0.4) = 0.55
    E(p̂_B) = (90/100)(0.4) + (10/100)(0.6) = 0.42

    ⇒ δ = 0.55 – 0.42 = 0.13 (instead of 0.20)

⇒ The power of the study will be less than intended, or else the sample size must be increased to compensate for the dilution effect.
For a dropout or drop-in rate of R (crossovers in one direction only), the adjusted sample size is:

    N_{adjusted} = N \cdot \frac{1}{(1-R)^2}

For example, if R = 0.25 in the previous example with N = 480:

    N_{adjusted} = 480 \cdot \frac{1}{(1-0.25)^2} = 480(1.78) = 853.3

For a dropout rate of R_1 (A → placebo) and a drop-in rate of R_2 (placebo → A), the adjusted sample size is:

    N_{adjusted} = N \cdot \frac{1}{(1-R_1-R_2)^2}

For example, if R_1 = 0.25 and R_2 = 0.10:

    N_{adjusted} = 480 \cdot \frac{1}{(1-0.25-0.10)^2} = 1{,}136

The large increase in sample size shows the considerable impact of noncompliance on the ability to detect treatment differences.
⇒ Keep noncompliance to a minimum during trials.
Justification for sample size adjustment formula for noncompliance
Expected difference between treatments: p_A – p_B = δ.
R_1 = dropout rate on treatment A.

    p_A^* = E(\hat{p}_A) = p_A(1-R_1) + p_B R_1 = p_A - R_1(p_A - p_B)

Recall the (uncorrected) sample size formula:

    N = \left[\frac{Z_\alpha\sqrt{2\bar{p}\bar{q}} + Z_\beta\sqrt{p_A q_A + p_B q_B}}{p_A - p_B}\right]^2

A small change in p_A or p_B will have little effect on the numerator. The denominator, however, will become:

    (p_A^* - p_B)^2 = \left[p_A - R_1(p_A - p_B) - p_B\right]^2 = \left[(p_A - p_B) - R_1(p_A - p_B)\right]^2 = (p_A - p_B)^2(1-R_1)^2

Thus, the adjustment to N for a dropout rate of R_1 is:

    \frac{1}{(1-R_1)^2}

Similarly, if there is also a drop-in rate of R_2 (treatment B → A):

    p_B^* = E(\hat{p}_B) = p_B(1-R_2) + p_A R_2 = p_B + R_2(p_A - p_B)

The denominator of the sample size formula becomes:

    (p_A^* - p_B^*)^2 = \left[p_A - R_1(p_A - p_B) - p_B - R_2(p_A - p_B)\right]^2 = (p_A - p_B)^2(1-R_1-R_2)^2

So the adjustment to N is:

    \frac{1}{(1-R_1-R_2)^2}
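The noncompliance adjustment above is simple enough to wrap in a small helper; the sketch below is illustrative (the function name is mine) and reproduces the numbers from the earlier example:

```python
# Inflate a per-group N for anticipated noncompliance: dropout rate r1 from the
# new treatment and drop-in rate r2 from the control arm (illustrative sketch).
from math import ceil

def n_adjusted_for_noncompliance(n: float, r1: float, r2: float = 0.0) -> int:
    """Divide N by (1 - r1 - r2)^2, per the adjustment above."""
    return ceil(n / (1 - r1 - r2) ** 2)

print(n_adjusted_for_noncompliance(480, 0.25))        # 854 (853.3 in the example)
print(n_adjusted_for_noncompliance(480, 0.25, 0.10))  # 1137 (about 1,136 in the example)
```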
Sample Size Calculations for Continuous Response Variables

Examples of continuous response variables:
•  Blood pressure
•  Time to tumor clearance
•  Length of hospital stay

Assume all observations are known completely (no censoring). Data are assumed to be approximately normally distributed. A transformation (e.g., log or square root) may be required to normalize skewed data.

To test H0: δ = μ_A – μ_B = 0 vs. Ha: δ = μ_A – μ_B ≠ 0, use the test statistic:

    Z = \frac{\bar{x}_A - \bar{x}_B}{\sigma\sqrt{\frac{1}{N_A} + \frac{1}{N_B}}}

Using the same technique as in the dichotomous response case to derive the sample size formula, we obtain (for given α, β, δ, and σ):

    2N = \frac{4\left(Z_{\alpha/2} + Z_\beta\right)^2 \sigma^2}{\delta^2}

Note: This formula is based on a normal (not a t) distribution ⇒ either σ is known or N is large enough (N > 30 in both groups) to make this assumption valid. If σ² is not known, compute N for a range of σ² values to determine its effect on sample size. If N < 30, this formula will underestimate the correct sample size when σ is not known. If the variances in the two groups are not equal, base N on the larger value.
Example
In a study of a new diet to reduce cholesterol, a 10 mg/dl difference would be clinically significant. From other data, σ is estimated to be 50 mg/dl. We want a two-sided test with α = 0.05 and power = 1 – β = 0.9 to detect a 10 mg/dl difference. Z_{α/2} = 1.96 and Z_β = 1.282. So:

    2N = \frac{4(1.96 + 1.282)^2 (50)^2}{(10)^2} = 1{,}051

How different would the required sample size be if σ were actually 60?

    2N = \frac{4(1.96 + 1.282)^2 (60)^2}{(10)^2} = 1{,}513.5

⇒ A big difference in N, considering the relatively small increase in σ.
⇒ Be conservative in estimates of σ!!
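A hedged sketch of the continuous-response calculation (illustrative only; the function name and defaults are assumptions, not from the notes):

```python
# Total sample size 2N for comparing two means with common standard deviation
# sigma (an illustrative sketch of the continuous-response formula above).
from math import ceil
from statistics import NormalDist

def total_n_two_means(delta: float, sigma: float,
                      alpha: float = 0.05, power: float = 0.90) -> int:
    z = NormalDist()
    z_alpha2, z_beta = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    return ceil(4 * (z_alpha2 + z_beta) ** 2 * sigma ** 2 / delta ** 2)

print(total_n_two_means(10, 50))  # 1051 total, as in the example
print(total_n_two_means(10, 60))  # 1514 total (1,513.5 above) if sigma is really 60
```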
Sample size based on width of confidence intervals

    d = Z_{\alpha/2}\sqrt{(\sigma_A^2 + \sigma_B^2)/N}

Here, the relationship with power is the same as in the dichotomous response case.
Sample size for change-from-baseline response variables

For example, Δ = final – baseline cholesterol level.

We test:
    H0: Δ_A – Δ_B = 0   vs.   Ha: Δ_A – Δ_B ≠ 0

The variance of Δ may be much smaller than the variance of the original values (person-to-person variability is removed).
⇒ Smaller sample sizes result.

Example

If, in the example above, we used the change in cholesterol level, we may have found σ_Δ = 20 (compared with σ = 50 above), so 2N is now:

    2N = \frac{4(1.96 + 1.282)^2 (20)^2}{(10)^2} \approx 170

(This is much smaller than 1,051!)
Sample Size for Time-to-failure Data (censored data case)
Generally we want to compare the survival curves s(t) from two groups, where s(t) = P(T > t) = P(surviving beyond time t).

[Figure: a survival curve s(t), decreasing from 1 toward 0 as time t increases.]

Generally, the log-rank or Wilcoxon (nonparametric) tests are used to test differences between survival functions for two groups. However, sample size calculations are often based on assuming that time to failure has an exponential distribution (a parametric assumption):

    s(t) = e^{-\lambda t},

where λ is the hazard rate (force of mortality):

    \lambda = \frac{1}{\text{mean survival time}}

If T is the length of the study, and λ_A and λ_B are the hazard rates for patients under treatments A and B, respectively:

    2N = \frac{2\left(Z_{\alpha/2} + Z_\beta\right)^2\left[\phi(\lambda_A) + \phi(\lambda_B)\right]}{(\lambda_A - \lambda_B)^2},

where:

    \phi(\lambda) = \frac{\lambda^2}{1 - e^{-\lambda T}}

This assumes all patients enter at the beginning of the study.
Example

We plan a 5-year study (T = 5) with λ_A = 0.20 and λ_B = 0.30, α = 0.05, 1 – β = 0.90, so Z_{α/2} = 1.96, Z_β = 1.282. Assume all patients will enter at the beginning of the 1st year. Then:

    \phi(\lambda_A) = \frac{(0.2)^2}{1 - e^{-0.2(5)}} = 0.0633,

    \phi(\lambda_B) = \frac{(0.3)^2}{1 - e^{-0.3(5)}} = 0.1158,

and

    2N = \frac{2(1.96 + 1.282)^2}{(0.2 - 0.3)^2}\left[0.0633 + 0.1158\right] = 376.5

For patients recruited continually during the study period, use instead:

    \phi(\lambda) = \frac{\lambda^3 T}{\lambda T - 1 + e^{-\lambda T}}

For the same parameters used above, this would give:

    \phi(\lambda_A) = \frac{(0.2)^3(5)}{(0.2)(5) - 1 + e^{-(0.2)(5)}} = 0.1087

    \phi(\lambda_B) = \frac{(0.3)^3(5)}{(0.3)(5) - 1 + e^{-(0.3)(5)}} = 0.1867

⇒ 2N = 620.9

⇒ Accrual throughout the period requires more patients than if all patients start at the beginning of the study.
For the situation where accrual occurs over a fixed time period, T_0, followed by a fixed interval of follow-up, T, use:

    \phi(\lambda) = \lambda^2\left[1 - \frac{e^{-\lambda T} - e^{-\lambda(T + T_0)}}{\lambda T_0}\right]^{-1}

Early accrual builds information faster, and can lead to reduced sample sizes.
See also:
Freedman LS. Tables of the number of patients required in clinical trials
using the logrank test. Stat Med 1982;1:121-129.
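For the censored-data case, the φ(λ) functions and the 2N formula can be coded directly. The sketch below is illustrative (the names are mine) and reproduces the two worked examples above:

```python
# Total 2N for comparing two exponential hazards over a study of length T
# (an illustrative sketch of the censored-data formulas above).
from math import ceil, exp
from statistics import NormalDist

def phi_entry_at_start(lam: float, t: float) -> float:
    """phi(lambda) when all patients enter at the beginning of the study."""
    return lam ** 2 / (1 - exp(-lam * t))

def phi_uniform_accrual(lam: float, t: float) -> float:
    """phi(lambda) when patients are recruited continually over (0, T)."""
    return lam ** 3 * t / (lam * t - 1 + exp(-lam * t))

def total_n_exponential(lam_a, lam_b, t, phi=phi_entry_at_start,
                        alpha=0.05, power=0.90):
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * z_sum ** 2 * (phi(lam_a, t) + phi(lam_b, t))
                / (lam_a - lam_b) ** 2)

print(total_n_exponential(0.2, 0.3, 5))                        # 377 (376.5 above)
print(total_n_exponential(0.2, 0.3, 5, phi_uniform_accrual))   # 621 (620.9 above)
```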
Sample Size for Testing Equivalence of Treatments
We may be testing a less expensive, less toxic, or less invasive procedure, and
want to make sure that it is “as good” as the standard treatment in terms of
efficacy.
If we do not reject H0: μ_A = μ_B, that does not mean that we can conclude the treatments are equivalent.

We want high power to detect differences of clinical importance, and low power to detect differences that are clinically unimportant. Often this will mean switching the emphasis of α and β (e.g., using α = 0.10 and power ≥ 0.90).

[Figure: power curve, 1 – β plotted against δ; power is low near δ ≈ 0 and increases for δ < 0 and δ > 0.]
References:
Based on a C.I. approach:
Makuch R, Simon R. Sample size requirements for evaluating a
conservative therapy. Cancer Treat Rep 1978;62(7):1037-1040
Based on hypothesis testing, but switching H 0 and H a :
Blackwelder WC. ‘Proving the null hypothesis’ in clinical trials. Control
Clin Trials 1982;3(4):345-353.
Blackwelder WC. Sample size graphs for ‘proving the null hypothesis.’
Control Clin Trials 1984;5(2):97-105.
Sample size for testing equality of several normal means
(i.e., continuous response variables)
Procedure is straightforward, but requires tables.
References:
Mace AE. Sample Size Determination. Malabar, FL: Krieger; 1974.
Neter J, Kutner MH, Wasserman W, Nachtsheim CJ. Applied Linear
Statistical Models. 4th ed. New York, NY: McGraw-Hill/Irwin; 1996.
Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed.
Philadelphia, PA: Lawrence Erlbaum; 1988.
Sample size for testing equality of several proportions
N based on chi-square test for homogeneity
Reference:
Lachin JM. Sample size determinations for r × c comparative trials. Biometrics 1977;33(2):315-324.
Pocock's Table 9.1

Pocock's Table 9.1 gives values of f(α, β) to calculate the required number of patients for a trial.

Table 9.1 (Pocock)

  α (type I            β (type II error)
    error)       0.05      0.1       0.2       0.5
    0.10         10.8      8.60      6.20      2.7
    0.05         13.0      10.5      7.90      3.8
    0.02         15.8      13.0      10.0      5.4
    0.01         17.8      14.9      11.7      6.6

α = the level of the χ² significance test used for detecting a treatment difference (often set α = 0.05).

1 – β = the degree of certainty that the difference p_1 – p_2, if present, would be detected (often set 1 – β = 0.90).

α (commonly called the type I error) is the probability of detecting a significant difference when the treatments are really equally effective (i.e., it represents the risk of a false-positive result).

β (commonly called the type II error) is the probability of not detecting a significant difference when there really is a difference of magnitude p_1 – p_2 (i.e., it represents the risk of a false-negative result).

1 – β (commonly called the power) is the probability of detecting a difference of magnitude p_1 – p_2.
Here, p 1 and p 2 are the hypothetical percentage successes on the two
treatments that might be achieved if each were given to a large population of
patients. They merely reflect the realistic expectations or goals that one aims
for when planning the trial and do not relate directly to the eventual results.
Example

In a trial of anturan, the investigators chose:

    p_1 = 90% (percentage on placebo expected to survive one year)
    p_2 = 95%
    α = 0.05
    β = 0.1

The required number of patients on each treatment is:

    n = \frac{p_1(100 - p_1) + p_2(100 - p_2)}{(p_2 - p_1)^2} \times f(\alpha, \beta),

where f(α, β) is a function of α and β, the values of which are given in Pocock's Table 9.1 (reproduced above).

In fact,

    f(\alpha, \beta) = \left[\Phi^{-1}(\alpha/2) + \Phi^{-1}(\beta)\right]^2,

where Φ is the cumulative distribution function of a standardized normal deviate. Numerical values for Φ⁻¹ may be obtained from statistical tables such as Geigy (1970, p. 28).

Hence, for the anturan trial:

    n = \frac{90 \times 10 + 95 \times 5}{(95 - 90)^2} \times 10.5 = 578

Thus, 578 patients are required on each treatment.
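Pocock's calculation can be reproduced with a few lines of code; the sketch below is illustrative (the function names are mine), with f(α, β) computed from normal quantiles rather than read from Table 9.1:

```python
# Pocock-style calculation: patients per treatment arm from percentage successes
# p1, p2 and the multiplier f(alpha, beta). Illustrative sketch, not course code.
from math import ceil
from statistics import NormalDist

def f_alpha_beta(alpha: float, beta: float) -> float:
    inv = NormalDist().inv_cdf
    return (inv(alpha / 2) + inv(beta)) ** 2          # 10.5 for alpha=0.05, beta=0.1

def n_per_arm_pocock(p1: float, p2: float,
                     alpha: float = 0.05, beta: float = 0.10) -> int:
    """p1 and p2 are percentages (e.g., 90 and 95)."""
    return ceil((p1 * (100 - p1) + p2 * (100 - p2)) / (p2 - p1) ** 2
                * f_alpha_beta(alpha, beta))

print(n_per_arm_pocock(90, 95))   # 578, matching the anturan example
```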
SAMPLE SIZE TABLES
[The sample size tables from the course handout are not reproduced in this transcript.]
POWER AND SAMPLE SIZE (PS) SOFTWARE
PS is a free resource available for download on the Department of
Biostatistics web site.
How to Download PS
1. In your favorite browser, type the following URL:
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize
2. When the following page appears, click on the “Get PS” link:
3. A screen similar to the following one should appear. Click OK.
4. The “save as” dialog box will appear, and you can choose the location to
save your file. You may want to save to C:\Temp, so that you can easily
remove the setup files after you have installed the software. When you
have chosen your location, click Save.
5. Go to your C:\Temp folder and double-click on the PS icon.
6. The PS software will be automatically installed on your machine.
How to Use PS: Examples
[The worked PS examples (screenshots in the original handout) are not reproduced in this transcript.]
APPENDIX: PAGANO TABLE A.3
[Pagano Table A.3 is not reproduced in this transcript.]