An Introduction to Two-Stage
Adaptive Designs
Tatsuki Koyama, Ph.D.
Department of Biostatistics
Vanderbilt University School of Medicine
615-936-1232
[email protected]
Significance Testing and Hypothesis Testing
Significance Testing (R.A. Fisher, circa 1920)
• Null Hypothesis
• p-value
Fisher did not give the p-value the interpretation we are familiar with, i.e.,
the probability of observing the data we have observed, or something more
extreme, assuming the null hypothesis is true.
To him, it was not a probability; it was used to reflect on the credibility of
the null hypothesis in light of the data.
The p-value was meant to be combined with other sources of information.
Hypothesis Testing (J. Neyman and E. Pearson, 1928)
• Null Hypothesis
• Alternative Hypothesis
• type I and type II errors
• critical values
There was no measure of evidence.
None was necessary, because hypothesis testing was not meant to provide
information about how believable each hypothesis was; rather, it was meant
to tell us how to act.
Hypothesis Testing
In statistical hypothesis testing, one needs to specify
before collecting the data:
• the null and alternative hypotheses
• type I and II error rates (α and β)
• an analysis plan including
– sample size
– decision rule
(exactly how the null hypothesis will be rejected)
If these are allowed to be changed after looking at the
data, we may be able to “cheat” so that we can reject
H0.
Example
Suppose that we want to evaluate the therapeutic efficacy of a new treatment regimen. Whether the treatment is a success or a failure will be recorded for each
patient. The competitor’s success rate is 0.25.
H0 : π = 0.25 and H1 : π = 0.40.
14 out of 40 patients had success. (π̂ = 14/40 = 0.35)
mmm... How can we reject H0?
Exact binomial test ... p-value = 0.1032.
Oh no...
Z test ... p-value = 0.072.
“Yes! Let’s make α = 0.10.”
Let’s go back in time and suppose that we had agreed
on “Exact binomial test” and α = 0.10.
We barely missed.
So let’s try 5 more patients!
16 out of 45 patients had success. (π̂ = 16/45 ≈ 0.36)
Exact binomial test ... p-value = 0.0753.
H0 is rejected.
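The quoted p-values are easy to reproduce; a minimal sketch, assuming Python with scipy (the code is not part of the original slides):

```python
# One-sided tests of H0: pi = 0.25 against H1: pi > 0.25.
from scipy.stats import binom, norm

# Exact binomial test: P(X >= x) under H0.
print(binom.sf(13, 40, 0.25))   # 14/40 successes, about 0.1032
print(binom.sf(15, 45, 0.25))   # 16/45 successes, about 0.0753

# Normal-approximation (Z) test with 14/40 successes.
z = (14 / 40 - 0.25) / (0.25 * 0.75 / 40) ** 0.5
print(norm.sf(z))               # about 0.072
```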
Changing the design
When you allow the study design to be changed or updated, you have to do it very, very carefully.
Frankly, if the change is not preplanned, there is almost
no way to do it validly.
Two-stage (adaptive) designs
Examples include:
1. Two-stage group sequential design
2. Simon’s design and its variations
3. Acceptance sampling
4. Phase II/III combined “accelerated” designs
5. General two-stage designs in Phase III trials
1, 2, and 3 are not truly adaptive because what you are going
to do is completely specified at the beginning:
if this happens in Stage I, I’m going to do this; if
that happens in Stage I, I’m going to do that...
4 and 5 (and 2) can be quite flexible. You do not need
to specify what to do until you see the data.
Why use two-stage designs?
Example - Simon’s Design
Dichotomous outcome (usually in Phase II trial)
Suppose that we want to test
H0 : π = 0.25
H1 : π = 0.40
with α = 0.10 and β = 0.10 (power = 0.90).
A conventional single stage design:
N = 64 and reject H0 if R > 20.
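As a quick check (a sketch assuming Python with scipy; not part of the original slides), the single-stage operating characteristics follow from the binomial tail:

```python
# Single-stage design: N = 64, reject H0 if the number of responses R > 20.
from scipy.stats import binom

print(binom.sf(20, 64, 0.25))   # type I error under H0: pi = 0.25, about 0.099
print(binom.sf(20, 64, 0.40))   # power under H1: pi = 0.40, about 0.905
```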
Simon’s two-stage design:
In stage I, n1 = 39 is accrued.
If 9 or fewer responses are observed during stage I, then
the trial is stopped for futility.
Otherwise, an additional n2 = 25 patients are accrued.
If 20 or fewer responses are observed by the end of stage
II, then no further investigation (i.e., a Phase III trial) is
warranted.
Stage I (n1 = 39):
  If x1 ≤ 9, stop for futility.
  If x1 > 9, continue to Stage II with n2 = 25.
Stage II (n2 = 25):
  If xt ≤ 20, no further investigation.
  If xt > 20, proceed to a Phase III trial.
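The operating characteristics of this two-stage design follow from binomial probabilities. A minimal sketch, assuming Python with scipy (the helper function is mine, not from the talk):

```python
from scipy.stats import binom

def reject_and_expected_n(n1, r1, nt, rt, p):
    """Rejection probability and expected sample size at true response rate p."""
    n2 = nt - n1
    # Reject H0: pass stage I (x1 > r1) and observe more than rt responses in total.
    reject = sum(binom.pmf(x1, n1, p) * binom.sf(rt - x1, n2, p)
                 for x1 in range(r1 + 1, n1 + 1))
    # Stage II (n2 more patients) is run only if the trial is not stopped for futility.
    expected_n = n1 + n2 * binom.sf(r1, n1, p)
    return reject, expected_n

alpha, en_h0 = reject_and_expected_n(39, 9, 64, 20, p=0.25)   # under H0
power, _     = reject_and_expected_n(39, 9, 64, 20, p=0.40)   # under H1
print(alpha, power, en_h0)   # roughly 0.097, 0.90, 52.1 (the Minimax row of the table below)
```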
Two-Stage Designs (Simon-like)
Input:
π0 ... the placebo / competitor response rate
π1 ... the response rate at which the power is computed
α and β
And what type of design ...
• Minimax: minimize the maximum sample size
• Optimal: minimize the expected sample size when π0 is the truth
• Admissible: somewhere in between Minimax and Optimal (a nice compromise)
• Balanced: Stage I and II sample sizes are equal.
Then one can compute the sample size and design-specific
characteristics such as α, power and expected sample
size.
Design         n1   x1   nt   xt   α        power   E[Nt|H0]   MAX[Nt]
Single Stage   64   20    −    −   0.0993   0.905      64.0       64
Minimax        39    9   64   20   0.0972   0.901      52.1       64
Optimal        29    7   72   22   0.0977   0.901      48.1       72
Admissible     33    8   68   21   0.0968   0.901      48.6       68
Balanced       34    8   68   21   0.0990   0.908      50.5       68
How to get sample sizes
The sample size calculation is based on a trial-and-error
search over candidate designs (sketched below).
Jung SH, et al. provide nice free software.
Fei Ye provides accessible software at
http://www.vicc.org/biostatistics/freqapp.php
NCSS (not free!) is also capable.
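For illustration only, the trial-and-error search can be sketched as follows (assuming Python with numpy and scipy; this is not one of the packages above, and the grid of candidate designs is deliberately small and purely illustrative):

```python
import numpy as np
from scipy.stats import binom

def characteristics(n1, r1, nt, rt, p):
    """Rejection probability and expected sample size at true response rate p."""
    n2 = nt - n1
    x1 = np.arange(r1 + 1, n1 + 1)               # stage I outcomes that continue
    reject = float((binom.pmf(x1, n1, p) * binom.sf(rt - x1, n2, p)).sum())
    expected_n = n1 + n2 * float(binom.sf(r1, n1, p))
    return reject, expected_n

p0, p1, alpha, beta = 0.25, 0.40, 0.10, 0.10
feasible = []
for nt in range(60, 76):
    for n1 in range(25, 46):
        for r1 in range(6, 13):
            for rt in range(18, 25):
                a, en0 = characteristics(n1, r1, nt, rt, p0)
                power, _ = characteristics(n1, r1, nt, rt, p1)
                if a <= alpha and power >= 1 - beta:
                    feasible.append((nt, en0, n1, r1, rt))

# Minimax: smallest maximum sample size (ties broken by E[N | H0]).
print(min(feasible))
# Optimal: smallest expected sample size under H0.
print(min(feasible, key=lambda d: d[1]))
# Takes several seconds; should recover the Minimax and Optimal rows of the table above.
```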
Simon’s two-stage designs have obvious advantages over
the conventional single stage design.
Disadvantages?
Computing the p-value and confidence interval for π is
not simple.
More to come on the inference procedure.
Phase III two-stage designs
Mathematically similar to Phase II two-stage designs,
but the research field is relatively new.
Phase III placebo-controlled two-arm studies.
The outcome variable of interest is often a continuous variable.
H0 : µt = µc
H1 : µt > µc
At the end of Stage I, we compute a z-score or t-score.
Variation 1: What to do at the end of Stage I is completely determined beforehand and clearly stated in the
study protocol.
e.g.,
if z1 < 0 then we stop the trial for futility,
if z1 > 2.8 then we stop the trial with overwhelming
evidence in favor of H1,
if 0 < z1 < 2.8 then we continue to stage II with the
following sample size scheme.
[Figure: the prespecified total sample size as a function of z1; the x-axis (Z1) runs from 0.0 to 3.0 and the y-axis (Total Sample Size) from 0 to 400.]
Variation 2: What to do at the end of Stage I is unspecified.
if z1 < 0 then we stop the trial for futility,
if z1 > 2.8 then we stop the trial with overwhelming
evidence in favor of H1,
if 0 < z1 < 2.8 then ... well, we will think about it and
come up with a reasonable sample size.
It is possible to control the type I error rate (α) using either Variation 1 or 2.
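Here is one way to see why, as a minimal simulation sketch (assuming Python with numpy and scipy). The Stage I boundaries 0 and 2.8 are the ones above; the equal inverse-normal combination weights, the final critical value c = 1.96, and the particular n2 rule are illustrative assumptions, not taken from the talk. As long as z2 is computed from Stage II data only and is combined with z1 by a prespecified rule, the null distribution of the combined statistic does not depend on how n2 was chosen.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rng = np.random.default_rng(0)
c, reps = 1.96, 100_000
rejections = 0
for _ in range(reps):
    z1 = rng.standard_normal()                   # Stage I z-statistic under H0
    if z1 < 0:                                   # stop for futility
        continue
    if z1 > 2.8:                                 # stop early for efficacy
        rejections += 1
        continue
    # Data-dependent Stage II sample size (an arbitrary, hypothetical rule).
    n2 = int(50 + 100 * (2.8 - z1))
    z2 = rng.standard_normal(n2).mean() * np.sqrt(n2)   # Stage II z under H0
    if (z1 + z2) / np.sqrt(2) > c:               # prespecified inverse-normal combination
        rejections += 1

simulated_alpha = rejections / reps
# The same probability computed without any reference to n2 at all:
analytic_alpha = norm.sf(2.8) + quad(
    lambda z: norm.pdf(z) * norm.sf(c * np.sqrt(2) - z), 0, 2.8)[0]
print(simulated_alpha, analytic_alpha)           # agree up to Monte Carlo error
```

The simulated rejection rate does not change if the n2 rule is replaced by any other rule, even one invented after seeing z1; that is the sense in which α can still be controlled under Variation 2.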
Advantage of Variation 1:
It allows specification of the power.
p-value is controversial, but it may be computed.
Confidence interval is controversial, but it may be computed.
Advantage of Variation 2:
It is truly adaptive; e.g., the study design can be adjusted according to the observed variance.
Disadvantage of Variation 2:
p-value is more controversial.
Confidence interval is more controversial.
Why is computing a p-value difficult in a two-stage design?
A p-value is the probability of observing what is observed, or something more extreme, assuming H0 is true.
Thus, to compute a p-value, we need to be able to order
all the possible outcomes; e.g., z = 2.5 is more extreme
than z = 2.0 under H0.
[Figure: the standard normal density under H0 with the observed Z = 2 marked; the p-value is the area in the upper tail beyond 2.]
In a two-stage design, it is not simple to order all the
possible outcomes.
Which of the following gives more evidence against H0 :
π = 0.25?
Recall the design: n1 = 39, stage I cutoff x1 = 9, nt = 64 (n2 = 25), and final cutoff xt = 20.
1. In stage I, observed x1 = 9 and stop for futility.
2. In stage I, observed x1 = 10 and continue to stage
II. In stage II, observed x2 = 0 out of n2 = 25.
Which of the following gives more evidence against H0 :
π = 0.25?
1. In stage I, observed x1 = 15 and
in stage II, observed x2 = 7 out of n2 = 25.
2. In stage I, observed x1 = 10 and
in stage II, observed x2 = 12 out of n2 = 25.
If you allow “stop in Stage I to conclude efficacy,” the
situation is more complicated.
If you allow the Stage II sample size to be different based
on Stage I observations, the situation is more complicated.
If you allow the Stage II sample size to be determined
after Stage I, the situation is more complicated.
For a Phase II Simon-like design, the most popular ordering is the “Stage-wise Ordering,” which ranks outcomes from least to most extreme:
1. stop in stage I for futility
2. continue to stage II
3. stop in stage I for efficacy
With this ordering specified, a confidence interval and
p-value can be computed.
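One common way to implement the stage-wise ordering for a futility-only design like Simon’s is sketched below (assuming Python with scipy; the formula is my reading of the stage-wise ordering, not something spelled out in the talk): an outcome stopped at Stage I is compared by x1 alone, and any outcome that reaches Stage II is treated as more extreme than every Stage I stop and is compared by the total xt.

```python
from scipy.stats import binom

def stagewise_p(n1, r1, n2, p0, x1, x2=None):
    """p-value under the stage-wise ordering for a futility-only two-stage design."""
    if x2 is None:
        # Stopped at stage I with x1 responses: everything with X1 >= x1,
        # including every outcome that would have continued, is at least as extreme.
        return float(binom.sf(x1 - 1, n1, p0))
    # Reached stage II: outcomes at least as extreme are those that also continue
    # (k > r1) and have at least x1 + x2 responses in total.
    xt = x1 + x2
    return float(sum(binom.pmf(k, n1, p0) * binom.sf(xt - k - 1, n2, p0)
                     for k in range(r1 + 1, n1 + 1)))

# The talk's design: n1 = 39, r1 = 9, n2 = 25, p0 = 0.25.
print(stagewise_p(39, 9, 25, 0.25, x1=9))           # stopped for futility with 9/39
print(stagewise_p(39, 9, 25, 0.25, x1=10, x2=0))    # continued, 10 responses in total
print(stagewise_p(39, 9, 25, 0.25, x1=15, x2=7))    # continued, 22 in total
print(stagewise_p(39, 9, 25, 0.25, x1=10, x2=12))   # continued, also 22 in total
```

Under this ordering, the trial that continued with x1 = 10 and then saw x2 = 0 gives a smaller p-value than the futility stop at x1 = 9, and the two Stage II outcomes with the same total of 22 responses give identical p-values.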
p values and hypothesis tests
Goodman SN, p values, hypothesis tests, and likelihood: implications for
epidemiology of a neglected historical debate, American Journal of
Epidemiology, 1993, 137, 485 - 496.
Blume J, Peipert JF, What your statistician never told you about p-values,
the Journal of the American Association of Gynecologic Laparoscopists,
2003, 10, 439 - 445.
Simon’s Design and Extensions
Simon R, Optimal two-stage designs for phase II clinical trials, Controlled
Clinical Trials, 1989, 10, 1 - 10.
Jung SH, Lee T, Kim KM, George SL, Admissible two-stage designs for
phase II cancer clinical trials, Statistics in Medicine, 2004, 23, 561 - 569.
Phase III two-stage adaptive designs
Proschan MA, Hunsberger SA, Designed extension of studies based on
conditional power, Biometrics, 1995, 51, 1315 - 1324.
Posch M, Bauer P, Adaptive two stage designs and the conditional error
function, Biometrical Journal, 1999, 41, 689 - 696.
Liu Q, Chi GYH, On sample size and inference for two-stage adaptive
designs, Biometrics, 2001, 57, 172 - 177.
Koyama T, Sampson AR, Gleser LJ, A calculus of two-stage adaptive procedures, the Journal of the American Statistical Association, 2005, 100.
Tatsuki Koyama
[email protected]