Hypotheses and Sample Size for Showing Equivalence

Jim Ashton, SAS Institute Inc., Cary, NC

ABSTRACT

Most clinical trials comparing two treatments are conducted to determine if one treatment is significantly different from the other. The traditional approach to this problem tests the null hypothesis that the success rates for the two treatments are equal against a two-sided alternative that they are not equal. However, equivalence trials are conducted with the intent of showing that two treatments are equally effective, that is, showing that an experimental treatment is as good as, but not necessarily better than, a standard treatment. If you are conducting an equivalence trial, there is an approach within the Neyman-Pearson theory of hypothesis testing that reformulates the hypotheses, with equivalence of treatments as the alternative rather than the null hypothesis. This approach swaps the roles of the type I and type II errors. It also means you can explicitly control the probability of making the more serious error of finding no difference in treatments when in fact the standard is superior. This paper investigates the properties of this approach. Sample sizes are calculated for both forms of tests. The relative efficiency of one form to the other depends on the specific assumptions made. In the appropriate setting, the sample size requirements for the equivalence approach can be substantially smaller. All sample size calculations and graphics were done using SAS/IML® software.

KEYWORDS

equivalence tests, hypothesis tests, proportions, sample size, SAS/IML software, type I error, type II error

INTRODUCTION

Suppose that both a standard treatment and an experimental treatment are available for treating a disease. A clinical trial designed to determine whether the experimental treatment is as effective as, but not necessarily better than, the standard is referred to as an equivalence trial. A typical setting for this would have the standard be a severe treatment (for example, radiation treatments) and the experimental be a therapy with fewer side effects. It is hoped that the experimental is as effective as the standard. In this setting, the standard treatment should be a highly effective therapy, or else there is small value in finding another treatment equally effective.

THE CONVENTIONAL APPROACHES

One approach you can take is to state the problem as a two-sided test of hypotheses, with the null hypothesis being that the two treatments are equally effective. For example, let πs and πe be the true success rates for the standard and experimental treatments, respectively, and ps and pe be the corresponding sample proportions. You can then state the hypotheses as

   H: πs = πe   versus   A: πs ≠ πe .

Table 1 classifies the possible decisions and the corresponding types of errors you can make with this test. Let δ represent the difference between the true treatment success rates, that is, δ = πs − πe. You make a type I error, rejecting the null hypothesis when it is true, if you erroneously declare the standard to be superior. You make a type II error, not rejecting the null hypothesis when it is false, if you find no difference in treatments when the standard is in fact superior.

Table 1  Classification of Possible Decisions

                        δ = 0           δ > 0
   fail to reject H     correct         type II error
   reject H             type I error    correct

You collect your data and observe the sample proportions for each treatment. If the experimental treatment's success rate differs sufficiently from the standard's in either direction, you reject the null hypothesis and conclude that the treatments are not equally effective. This approach is inconsistent with the intent of an equivalence trial. In an equivalence trial, you are interested in a difference in a single direction only, that is, when the experimental treatment happens to be inferior.

A second approach that is consistent with the intent of an equivalence trial is to state the problem as a one-sided test. In this case, you can state the hypotheses as

   H: πs ≤ πe   versus   A: πs > πe .

The usual test statistic is the normal approximation to the binomial given by

   z = (ps − pe) / se ,   where   se = sqrt( ps(1−ps)/ns + pe(1−pe)/ne ) .

You reject the null hypothesis of equality of treatments when the test statistic gets too large, that is, larger than your reference value (usually z_{1−α}). See Blackwelder (1982), Donner (1984), and Makuch and Simon (1978). Figure 1 shows how the sample space is divided into two parts, above and below the line πs = πe. For points (πs, πe) above the line, the standard treatment is superior, while for points below the line, the experimental treatment is superior.

[Figure 1  Sample Space for Conventional Hypotheses]
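As a concrete illustration, here is a minimal SAS/IML sketch of this one-sided test. The trial counts are hypothetical values chosen for the example, not data from the paper.

   proc iml;
      /* hypothetical results: 90/100 successes on the standard arm,   */
      /* 85/100 successes on the experimental arm                      */
      ns = 100;   ne = 100;
      ps = 90/ns;   pe = 85/ne;                /* sample proportions    */
      se = sqrt( ps#(1-ps)/ns + pe#(1-pe)/ne );
      z  = (ps - pe) / se;                     /* normal approximation  */
      alpha = 0.05;
      reject = (z > probit(1-alpha));          /* 1 = reject H: pis<=pie */
      print z reject;
   quit;

Here z is about 1.07, which is less than z_{.95} = 1.645, so these hypothetical data would not let you reject H.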
The conventional approach has two major problems. First, a nonsignificant test can be difficult to interpret. For example, you may have concerns over sufficient sample size. An inherent problem with this approach, pointed out by Blackwelder (1982), is that it is easier to fail to find a significant difference with a small study than with a large one. Statisticians agree that the null hypothesis cannot be proved and that failure to reject the null hypothesis cannot be interpreted as permission to accept it. Again, as Blackwelder puts it, the p-value is a measure of evidence against H, not for it. Insufficient evidence to reject H does not imply sufficient evidence to accept it. Second, the more serious error in this setting is the type II error. The conventional approach does not, per se, take the type II error rate into account.

Now consider a situation where you have a standard therapy with a success rate of 80% and you want to determine if the success rate for an experimental therapy is within .10 of the standard. The conventional approach formulates the null hypothesis of equality of success rates, πs = πe. You determine the power of the test, 1 − β, for each specific alternative, that is, for each value of πs − πe = δ, where δ is considered to be a clinically significant observable difference. Figure 2 shows the probability of making a type II error. For a fixed level of α, the power of this test depends on the sample size n, πs, and πe and must be calculated for each possible value of δ. Note that the probability of making a type II error decreases as δ/se increases.

[Figure 2  Probability of a Type II Error]
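The per-alternative power calculation can be sketched in SAS/IML as follows. The normal-approximation power formula Φ(δ/se − z_{1−α}) and the design values used here are illustrative assumptions, not figures from the paper.

   proc iml;
      /* power of the conventional one-sided test at one alternative  */
      n     = 100;                 /* assumed sample size per group    */
      alpha = 0.05;
      pis   = 0.80;                /* true standard success rate       */
      pie   = 0.70;                /* one alternative of interest      */
      delta = pis - pie;
      se    = sqrt( (pis#(1-pis) + pie#(1-pie)) / n );
      power = probnorm( delta/se - probit(1-alpha) );
      beta  = 1 - power;           /* type II error at this delta      */
      print delta power beta;
   quit;

With these assumed values the type II error is about .50, and repeating the calculation for other values of πe traces out a curve like the one in Figure 2; each δ of interest requires its own calculation.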
The basic problem is that the type II error rate must be calculated for each possible difference in treatments that is of interest. You cannot make a global statement about the type II error rate as you can for the type I error rate.

NULL HYPOTHESIS OF A SPECIFIED DIFFERENCE

In an equivalence trial, you select δ as the minimal difference so that, if the treatments truly differed by at least δ, you would consider them to be different. The appropriate null hypothesis is that the treatments differ by at least δ. The alternative, then, is that the success rates do not differ by δ; that is, the treatments are what may be called δ-equivalent. This trick of making the alternative hypothesis the one of equivalence gives you control over the error rate you want to control. The null and alternative hypotheses are

   H′: πs ≥ πe + δ   versus   A′: πs < πe + δ .

The statistic for testing this hypothesis is

   z′ = (ps − pe − δ) / se .

For this test, you reject the null hypothesis if the test statistic gets too small, that is, less than your reference value (usually −z_{1−α}). Dunnett and Gent (1977) give alternative test statistics for testing H′ and discuss their properties.

Figure 3 shows how the sample space is divided when you use a test of a specified difference. The sample space is divided along the line πs = πe + δ. The treatments differ by an amount equal to or greater than δ for points in H′, and they differ by an amount less than δ for points in A′.

[Figure 3  Sample Space under the Hypothesis of a Specified Difference]

Table 2 gives the possible decisions and the corresponding errors you can make with the test of a specified difference.

Table 2  Classification of Possible Decisions

                         πs − πe < δ      πs − πe ≥ δ
   reject H′             correct          type I error
   fail to reject H′     type II error    correct

Here the roles of the type I and type II errors are reversed from the previous setting. A type I error occurs when you reject H′ and conclude that the treatments are equivalent when in fact the standard is superior. A type II error occurs when you do not reject H′ and erroneously conclude that the standard is superior. As stated before, in an equivalence trial, the more serious error is claiming the treatments to be equivalent when the experimental treatment is inferior. Within this setting, you can control the type I error rate explicitly. When you select α, you know that the probability of making a type I error is less than α. The point here is that even with the conventional approach, you must sooner or later specify a value for δ.

SAMPLE SIZES

You are probably familiar with the formula for estimating the required sample size for the conventional hypothesis. The simplest version is

   n = (z_{1−α} + z_{1−β})^2 [πs(1−πs) + πe(1−πe)] / (πs − πe)^2 .

The corresponding formula for the test of a specified difference was presented by Makuch and Simon (1978). It differs from the conventional formula only by a term δ in the denominator:

   n′ = (z_{1−α} + z_{1−β})^2 [πs(1−πs) + πe(1−πe)] / (πs − πe − δ)^2 .
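The paper notes that all sample size calculations were done with SAS/IML software, although the program itself is not shown. A minimal sketch of the computation, which reproduces the πs = .90 entries n = 215 and n′ = 155 that appear in Table 3, is:

   proc iml;
      alpha = 0.05;   beta = 0.10;
      za = probit(1-alpha);   zb = probit(1-beta);
      /* conventional test: detect the difference pis - pie           */
      pis = 0.90;   pie = 0.80;
      n  = ceil( (za+zb)##2 # (pis#(1-pis) + pie#(1-pie))
                 / (pis - pie)##2 );                    /* n  = 215   */
      /* specified difference: rule out delta when pis = pie          */
      pis = 0.90;   pie = 0.90;   delta = 0.10;
      np = ceil( (za+zb)##2 # (pis#(1-pis) + pie#(1-pie))
                 / (pis - pie - delta)##2 );            /* n' = 155   */
      print n np;
   quit;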
Although there are better formulas for determining sample sizes when comparing two proportions, for simplicity of calculation and ease of comparability the normal approximations are used here. See the excellent treatment of this subject by Casagrande and Pike (1978). You usually calculate n′ by setting πs = πe and treating δ as the difference in treatment efficacy that you want to rule out with probability 1 − β, as suggested by Donner (1984). Makuch and Simon recommend setting β = .10.

Table 3 presents sample size calculations for varying values of πs and πe, determined with α = .05 and β = .10. The data are a subset taken from the data used in the graphs in Figures 4 through 7. The relationship between the null and alternative hypotheses for the two tests creates a symmetry: the hypothesis of equivalence is the null for the conventional test and the alternative for the test of a specified difference. Because of this symmetry, the α for one test is the β for the other and vice versa. The first three columns give the values for πs, πe, and n under the conventional hypothesis. The next four columns give the values for πs, πe, δ, and n′ under the hypothesis of a specified difference δ. The last column gives the ratio of n to n′, a measure of the relative efficiency of the two methods. When the ratio is greater than unity, the equivalence approach is more efficient. When the ratio is less than unity, the conventional approach is more efficient.

Table 3  Sample Size Calculations for α = .05, β = .10

   Conventional hypothesis      Specified difference
   πs     πe     n              πs     πe     δ      n′      n/n′
   0.90   0.80   215            0.90   0.90   0.10   155     1.39
   0.90   0.75   106            0.90   0.90   0.15    69     1.54
   0.90   0.70    65            0.90   0.90   0.20    39     1.67
   0.90   0.65    44            0.90   0.90   0.25    25     1.76
   0.80   0.70   317            0.80   0.80   0.10   275     1.15
   0.80   0.65   148            0.80   0.80   0.15   122     1.21
   0.80   0.60    86            0.80   0.80   0.20    69     1.25
   0.80   0.55    56            0.80   0.80   0.25    44     1.27
   0.60   0.50   420            0.60   0.60   0.10   412     1.02
   0.60   0.45   186            0.60   0.60   0.15   183     1.02
   0.60   0.40   103            0.60   0.60   0.20   103     1.00
   0.60   0.35    65            0.60   0.60   0.25    66     0.98
   0.40   0.30   386            0.40   0.40   0.10   412     0.94
   0.40   0.25   163            0.40   0.40   0.15   183     0.89
   0.40   0.20    86            0.40   0.40   0.20   103     0.83
   0.40   0.15    51            0.40   0.40   0.25    66     0.77

Figures 4 through 7 graphically present these sample size calculations. As you can see, the advantage n′ has over n diminishes as the success rate of the standard treatment decreases. When the standard treatment has a success rate of 90%, the advantage of n′ is apparent. For a success rate of 60% for the standard, the formulas give similar results. When the success rate for the standard drops to 40%, n′ is less efficient. Makuch and Simon found that the sample size for testing H′ is less than that for testing H when πs is greater than (1 + δ)/2. When πs is less than (1 + δ)/2, the sample size for testing H′ is greater than that for testing H. You can then determine which approach is more economical for a particular situation.

[Figures 4 through 7  Sample size versus difference to detect for standard treatments with πs = .90, .80, .60, and .40, each showing curves for n (hypothesis H) and n′ (hypothesis H′)]
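The rows of Table 3 can be generated by looping the two formulas above. This sketch (not the paper's actual program) reproduces the πs = .90 block of the table, including the efficiency ratios:

   proc iml;
      alpha = 0.05;   beta = 0.10;
      k   = (probit(1-alpha) + probit(1-beta))##2;
      pis = 0.90;
      results = j(4, 4, .);                /* delta, n, nprime, ratio  */
      do i = 1 to 4;
         delta = 0.05 + 0.05*i;            /* .10, .15, .20, .25       */
         pie   = pis - delta;              /* conventional alternative */
         n     = ceil( k # (pis#(1-pis) + pie#(1-pie)) / delta##2 );
         nprm  = ceil( k # 2 # pis # (1-pis) / delta##2 );
         results[i,] = delta || n || nprm || (n/nprm);
      end;
      print results[colname={"delta" "n" "nprime" "ratio"}];
   quit;

Changing pis to .80, .60, or .40 gives the remaining blocks of the table.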
Table 4 and Figure 8 show what happens to sample size requirements using n′ when you assume πs ≠ πe. The sample sizes in Table 4 are a subset of the data used to create Figure 8. Figure 8 shows the increases in sample sizes needed when the assumptions are that πs ≠ πe, for πs = .90 and πe ranging from .80 to .90. The customary method for calculating sample sizes is to assume that πs = πe, so the line corresponding to πe = .90 serves as the reference sample sizes.

Table 4  Sample Sizes when πs ≠ πe (α = .05, β = .10)

   πs     πe     δ      n′
   0.90   0.85   0.10   746
   0.90   0.85   0.15   187
   0.90   0.85   0.20    83
   0.90   0.80   0.15   857
   0.90   0.80   0.20   215

[Figure 8  Sample Sizes for πs ≠ πe (α = .05, β = .10)]

On a slightly different note, consider a case where you know the standard treatment is marginally superior but want to rule out a difference as large as δ with probability 1 − β. For example, suppose that πs = .90. You believe that πe = .85 is a good guess for the experimental treatment, and you want to rule out a difference of .10 between πs and πe with probability .90. The required sample size for this situation is 746. Compare this to n′ = 155 when you can assume that πs = πe.

DISCUSSION

Testing a hypothesis of a specified difference is not new. It fits nicely within the Neyman-Pearson theory for testing hypotheses, which should make statisticians feel comfortable with it. And there can be substantial savings in terms of sample size requirements compared to the conventional approach.

In gaining any advantage, you must sacrifice something. In this case, what you sacrifice is power; the test of a specified difference is a conservative test. You give up power to reject the null hypothesis when the treatments are really equivalent in order to minimize the chances of rejecting the null hypothesis when the standard is actually superior.

Table 5 shows what happens when πs = .9 and you set α = .05 and β = .10 to be the error rates for the hypothesis of a specified difference. The sample size required to detect a difference δ = .1, from Table 3, is n′ = 155. (By symmetry, this means that α = .1 for the conventional test. The corresponding sample size for the conventional hypothesis, for β = .05, also taken from Table 3, is n = 215.) Table 5 gives the error rates for each test for values of πe ranging over .9 (equivalent), .89-.81 (δ-equivalent), and .8-.75 (definitely not equivalent), calculated using a fixed sample size of n = n′ = 155. In the equivalent and δ-equivalent rows, the error is declaring the treatments not equivalent; in the not-equivalent rows, it is declaring them equivalent.

Table 5  Example of Error Rates of the Two Tests (πs = .90, n = n′ = 155)

   πe     Conventional test   Specified difference
   0.90   .10                 .10       equivalent
   0.89   .16                 .17       δ-equivalent
   0.87   .33                 .39       δ-equivalent
   0.85   .52                 .62       δ-equivalent
   0.83   .70                 .81       δ-equivalent
   0.81   .84                 .92       δ-equivalent
   0.80   .11                 .05       not equivalent
   0.75   .01                 .002      not equivalent

The test of a specified difference clearly minimizes the chances of making an error when the treatments differ by more than .1 (the definitely-not-equivalent area). It also performs well when πe is equal to or greater than πs (the equivalent area). In the δ-equivalent area, where the standard is better by a small margin, its conservative nature shows. The chance of not rejecting H′ is larger than you would like to see. For instance, when πe = .85, you have a 62% chance of not rejecting H′ and making the mistake of declaring the treatments not equivalent. With the conventional test, you have a 52% chance of declaring the treatments not equivalent.
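Readers who want to reproduce Table 5 can do so with the same normal approximations used throughout. The following SAS/IML sketch is an assumption about how such a table could be computed (the paper does not show its method), so the last digit may differ slightly from the published values:

   proc iml;
      n = 155;   pis = 0.90;   delta = 0.10;
      za_conv = probit(0.90);               /* conventional, alpha = .10   */
      za_spec = probit(0.95);               /* specified diff., alpha = .05 */
      pie = {0.90, 0.89, 0.87, 0.85, 0.83, 0.81, 0.80, 0.75};
      se  = sqrt( (pis#(1-pis) + pie#(1-pie)) / n );
      d   = pis - pie;                      /* true difference             */
      pRejH  = probnorm( d/se - za_conv );         /* declare not equivalent */
      pRejHp = probnorm( (delta-d)/se - za_spec ); /* declare equivalent     */
      /* the error is the wrong declaration for the true region            */
      err_conv = choose(d < delta, pRejH,      1 - pRejH);
      err_spec = choose(d < delta, 1 - pRejHp, pRejHp);
      print pie err_conv err_spec;
   quit;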
In appropriate situations and under appropriate assumptions, this approach to testing for equivalence protects you against erroneously finding that the treatments are equally effective. Specifically, it is very efficient when

   • you have a standard treatment with a high success rate
   • your intent is to demonstrate equivalence of an experimental treatment with the standard
   • you have reason to believe that the success rate for the experimental treatment is at least as large as the standard's
   • it is imperative that you not declare the treatments equivalent when the standard is truly the superior treatment.

REFERENCES

Blackwelder, W.C. (1982), "Proving the Null Hypothesis in Clinical Trials," Controlled Clinical Trials, 3, 345-353.

Casagrande, J.T. and Pike, M.C. (1978), "An Improved Formula for Calculating Sample Sizes for Comparing Two Binomial Distributions," Biometrics, 34, 483-486.

Donner, A. (1984), "Approaches to Sample Size Estimation in the Design of Clinical Trials: A Review," Statistics in Medicine, 3, 199-214.

Dunnett, C.W. and Gent, M. (1977), "Significance Testing to Establish Equivalence Between Treatments, with Special Reference to Data in the Form of 2x2 Tables," Biometrics, 33, 593-602.

Makuch, R.W. and Simon, R. (1978), "Sample Size Requirements for Evaluating a Conservative Therapy," Cancer Treatment Reports, 62, 1037-1040.

SAS/IML is a registered trademark of SAS Institute Inc., Cary, NC, USA.