An Introduction to Two-Stage Adaptive Designs

Tatsuki Koyama, Ph.D.
Department of Biostatistics, Vanderbilt University School of Medicine
615-936-1232
[email protected]

Significance Testing and Hypothesis Testing

Significance Testing (R.A. Fisher, circa 1920)
• Null Hypothesis
• p-value

Fisher did not give the p-value the interpretation we are familiar with, i.e., the probability of observing the data we have observed, assuming the null hypothesis is true. To him, it was not a probability; it was used to reflect on the credibility of the null hypothesis in light of the data. The p-value was meant to be combined with other sources of information.

Hypothesis Testing (J. Neyman and E. Pearson, 1928)
• Null Hypothesis
• Alternative Hypothesis
• Type I and type II errors
• Critical values

There was no measure of evidence. None was necessary, because hypothesis testing was not meant to provide information about how believable each hypothesis was; rather, it was meant to tell how to act.

Hypothesis Testing

In statistical hypothesis testing, one needs to specify, before collecting the data:
• the null and alternative hypotheses
• the type I and II error rates (α and β)
• an analysis plan, including
  – the sample size
  – the decision rule (exactly how the null hypothesis will be rejected)

If these are allowed to change after looking at the data, we may be able to “cheat” so that we can reject H0.

Example

Suppose that we want to evaluate the therapeutic efficacy of a new treatment regimen. Whether the treatment is a success or a failure will be recorded for each patient. The competitor’s success rate is 0.25.

H0: π = 0.25 and H1: π = 0.40.

14 out of 40 patients had a success (π̂ = 14/40 = 0.35). Hmm... how can we reject H0?
Exact binomial test ... p-value = 0.1032. Oh no...
Z test ... p-value = 0.072. “Yes! Let’s make α = 0.10.”

Let’s go back in time and suppose that we had agreed on the exact binomial test and α = 0.10. We barely missed, so let’s try 5 more patients!
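Before continuing, the two p-values quoted above can be reproduced directly. A minimal sketch using only the Python standard library (the function names are illustrative, not from any package); the exact test sums the upper binomial tail, and the Z test uses the normal approximation without continuity correction:

```python
from math import comb, erf, sqrt

def exact_binom_p(x, n, p0):
    """One-sided exact binomial p-value: P(X >= x) when X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(x, n + 1))

def z_test_p(x, n, p0):
    """One-sided Z-test p-value (normal approximation, no continuity correction)."""
    z = (x / n - p0) / sqrt(p0 * (1 - p0) / n)
    return 0.5 * (1 - erf(z / sqrt(2)))  # upper-tail standard normal probability

# 14 successes out of 40 under H0: pi = 0.25
print(round(exact_binom_p(14, 40, 0.25), 4))  # the slides report 0.1032
print(round(z_test_p(14, 40, 0.25), 3))       # the slides report 0.072
```

Applying the same exact test after 5 more patients, `exact_binom_p(16, 45, 0.25)`, gives the ≈ 0.0753 that appears next.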
16 out of 45 patients had a success (π̂ = 16/45 ≈ 0.36).
Exact binomial test ... p-value = 0.0753. H0 is rejected.

Changing the design

When you allow the study design to be changed or updated, you have to do it very, very carefully. Frankly, if the change is not preplanned, there is almost no valid way to do it.

Two-stage (adaptive) designs

Examples include:
1. Two-stage group sequential design
2. Simon’s design and its variations
3. Acceptance sampling
4. Phase II/III combined “accelerated” designs
5. General two-stage designs in Phase III trials

Designs 1, 2, and 3 are not truly adaptive because what you are going to do is completely specified at the beginning, i.e., “If this happens in Stage I, I’m going to do this; if that happens in Stage I, I’m going to do that...” Designs 4 and 5 (and 2) can be quite flexible: you do not need to specify what to do until you see the data.

Why use two-stage designs? Example - Simon’s Design

Dichotomous outcome (usually in a Phase II trial). Suppose that we want to test
H0: π = 0.25 vs. H1: π = 0.40
with α = 0.10 and β = 0.10 (power = 0.90).

A conventional single-stage design: N = 64, and reject H0 if R > 20.

Simon’s two-stage design: In Stage I, n1 = 39 patients are accrued. If 9 or fewer responses are observed during Stage I, the trial is stopped for futility. Otherwise an additional n2 = 25 patients are accrued. If 20 or fewer responses are observed by the end of Stage II, then no further investigation (i.e., a Phase III trial) is warranted. The decision rule:

• Stage I (n1 = 39): if x1 ≤ 9, stop for futility; if x1 > 9, continue to Stage II with n2 = 25.
• Stage II (n2 = 25): if xt ≤ 20, no further investigation; if xt > 20, proceed to a Phase III trial.

Two-Stage Designs (Simon-like)

Input:
π0 ... placebo / competitor
π1 ... where the power is computed
α and β
And what type of design:
• Minimax: minimize the maximum sample size
• Optimal: minimize the expected sample size when π0 is the truth
• Admissible: somewhere in between Minimax and Optimal (a nice design)
• Balanced: Stage I and II sample sizes are equal
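The operating characteristics of any candidate design can be checked by direct enumeration over Stage I outcomes. A minimal sketch for the Simon design above (n1 = 39, stop if x1 ≤ 9, nt = 64, xt = 20), using only the Python standard library; the function name is illustrative:

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def simon_characteristics(n1, r1, n, r, p):
    """Rejection probability and expected sample size of a Simon design.

    Stop for futility after Stage I if X1 <= r1; otherwise accrue n - n1
    more patients and declare the treatment promising if the total number
    of responses exceeds r.
    """
    n2 = n - n1
    pet = sum(binom_pmf(x1, n1, p) for x1 in range(r1 + 1))  # P(early termination)
    prob_reject = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        stage2_tail = sum(binom_pmf(x2, n2, p)
                          for x2 in range(max(0, r + 1 - x1), n2 + 1))
        prob_reject += binom_pmf(x1, n1, p) * stage2_tail
    expected_n = n1 + (1 - pet) * n2
    return prob_reject, expected_n

# The design above: n1 = 39, stop if x1 <= 9, nt = 64, promising if xt > 20
alpha, en0 = simon_characteristics(39, 9, 64, 20, 0.25)  # under H0: pi = 0.25
power, _   = simon_characteristics(39, 9, 64, 20, 0.40)  # under H1: pi = 0.40
print(alpha, power, en0)  # close to the alpha = 0.10 and power = 0.90 targets
```

Swapping in another design’s (n1, x1, nt, xt) reproduces its column of characteristics in the same way.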
Then one can compute the sample size and design-specific characteristics such as α, power, and the expected sample size.

Design      Single Stage   Minimax   Optimal   Admissible   Balanced
n1                    64        39        29           33         34
x1                    20         9         7            8          8
nt                     -        64        72           68         68
xt                     -        20        22           21         21
α                 0.0993    0.0972    0.0977       0.0968     0.0990
power              0.905     0.901     0.901        0.901      0.908
E[Nt|H0]            64.0      52.1      48.1         48.6       50.5
MAX[Nt]               64        64        72           68         68

How to get sample sizes

The sample size calculation is based on a trial-and-error search. Jung SH et al. provide nice free software. Fei Ye has an accessible application at http://www.vicc.org/biostatistics/freqapp.php. NCSS (not free!) is also capable.

Simon’s two-stage designs have obvious advantages over the conventional single-stage design. Disadvantages? Computing the p-value and a confidence interval for π is not simple. More to come on the inference procedure.

Phase III two-stage designs

These are mathematically similar to Phase II two-stage designs, but the research field is relatively new. Consider a Phase III placebo-controlled two-arm study. The outcome variable of interest is often continuous.
H0: µt = µc vs. H1: µt > µc
At the end of Stage I, we compute a z-score or t-score.

Variation 1: What to do at the end of Stage I is completely determined beforehand and clearly stated in the study protocol. E.g., if z1 < 0, we stop the trial for futility; if z1 > 2.8, we stop the trial with overwhelming evidence in favor of H1; if 0 < z1 < 2.8, we continue to Stage II with a prespecified sample size scheme.

[Figure: total sample size (0 to 400) as a function of z1 (0.0 to 3.0) under the preplanned Stage II sample size scheme]

Variation 2: What to do at the end of Stage I is unspecified. If z1 < 0, we stop the trial for futility; if z1 > 2.8, we stop the trial with overwhelming evidence in favor of H1; if 0 < z1 < 2.8, then ... well, we will think about it and come up with a reasonable sample size.

It is possible to control the type I error rate (α) using either Variation 1 or Variation 2.

Advantage of Variation 1: It allows specification of the power.
The p-value is controversial, but it may be computed. The confidence interval is controversial, but it may be computed.

Advantage of Variation 2: It is truly adaptive; e.g., the study design can be adjusted according to the observed variance.

Disadvantage of Variation 2: The p-value is more controversial. The confidence interval is more controversial.

Why is computing a p-value difficult in a two-stage design?

A p-value is the probability of observing what is observed, or something more extreme, assuming H0 is true. Thus, to compute a p-value, we need to be able to order all the possible outcomes; i.e., z = 2.5 is more extreme than z = 2.0 under H0.

[Figure: standard normal density under H0; the p-value is the area above the observed Z = 2]

In a two-stage design, it is not simple to order all the possible outcomes.

Which of the following gives more evidence against H0: π = 0.25? Recall n1 = 39, x1 = 9, nt = 64 (n2 = 25), and xt = 20.
1. In Stage I, x1 = 9 is observed and the trial stops for futility.
2. In Stage I, x1 = 10 is observed and the trial continues to Stage II. In Stage II, x2 = 0 out of n2 = 25 is observed.

Which of the following gives more evidence against H0: π = 0.25?
1. In Stage I, x1 = 15; in Stage II, x2 = 7 out of n2 = 25.
2. In Stage I, x1 = 10; in Stage II, x2 = 12 out of n2 = 25.

If you allow “stop in Stage I to conclude efficacy,” the situation is more complicated. If you allow the Stage II sample size to differ based on Stage I observations, the situation is more complicated. If you allow the Stage II sample size to be determined after Stage I, the situation is more complicated.

For a Phase II Simon-like design, the most popular ordering is “stage-wise ordering”:
1. stop in Stage I for futility (least extreme)
2. continue to Stage II
3. stop in Stage I for efficacy (most extreme)

With this ordering specified, a confidence interval and a p-value can be computed.
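A sketch of a stage-wise-ordering p-value for the futility-only design above, under one common convention (futility stops ordered by x1, every continuation outcome more extreme than every futility stop, continuation outcomes ordered by the total xt). The function is illustrative, not taken from any particular package:

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def stagewise_p(p0, n1, r1, n2, x1, xt=None):
    """Stage-wise-ordering p-value for a futility-only Simon design.

    A Stage I futility stop with x1 = s is at least as extreme as any
    futility stop with a smaller x1; every continuation outcome is more
    extreme than every futility stop; continuation outcomes are ordered
    by the total number of responses xt.
    """
    if xt is None:  # stopped for futility (x1 <= r1)
        return sum(binom_pmf(k, n1, p0) for k in range(x1, n1 + 1))
    # continued to Stage II: sum over all Stage I values that continue
    total = 0.0
    for k in range(r1 + 1, n1 + 1):
        stage2_tail = sum(binom_pmf(j, n2, p0)
                          for j in range(max(0, xt - k), n2 + 1))
        total += binom_pmf(k, n1, p0) * stage2_tail
    return total

# First comparison from the slides (minimax design, p0 = 0.25):
# stopping with x1 = 9 vs. continuing with x1 = 10 and x2 = 0 (total 10).
p_stop = stagewise_p(0.25, 39, 9, 25, 9)
p_cont = stagewise_p(0.25, 39, 9, 25, 10, xt=10)
print(p_stop > p_cont)  # continuing is the more extreme outcome here
```

Note that the slides’ second comparison (x1 = 15, x2 = 7 versus x1 = 10, x2 = 12) receives one and the same stage-wise p-value, because both outcomes continue and have total xt = 22; stage-wise ordering does not distinguish them.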
References

p-values and hypothesis tests
Goodman SN. P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. American Journal of Epidemiology, 1993, 137, 485-496.
Blume J, Peipert JF. What your statistician never told you about p-values. The Journal of the American Association of Gynecologic Laparoscopists, 2003, 10, 439-445.

Simon’s design and extensions
Simon R. Optimal two-stage designs for phase II clinical trials. Controlled Clinical Trials, 1989, 10, 1-10.
Jung SH, Lee T, Kim KM, George SL. Admissible two-stage designs for phase II cancer clinical trials. Statistics in Medicine, 2004, 23, 561-569.

Phase III two-stage adaptive designs
Proschan MA, Hunsberger SA. Designed extension of studies based on conditional power. Biometrics, 1995, 51, 1315-1324.
Posch M, Bauer P. Adaptive two stage designs and the conditional error function. Biometrical Journal, 1999, 6, 689-696.
Liu Q, Chi GYH. On sample size and inference for two-stage adaptive designs. Biometrics, 2001, 57, 172-177.
Koyama T, Sampson AR, Gleser LJ. A calculus of two-stage adaptive procedures. Journal of the American Statistical Association, 2005, 100.

Tatsuki Koyama
[email protected]