Chapter 2-5. Basics of Power Analysis
Statistical Regularity and Signal-to-Noise Ratio
In Chapter 2-3, we saw how the significance test is designed to be consistent with statistical
regularity.
Using Stata to perform a statistical regularity simulation for two groups, we take random samples
of increasing size (ranging from 1 to 200) from two normal populations (1st, mean = 5, standard
deviation = 2.5) (2nd, mean = 4, std. dev. = 2.5), and then plot the sample mean.
The plot of the sample means is:
*--------------------------------------------------------------
* Demonstrate statistical regularity by plotting the mean from
* two normal distributions (1st: mean=5, std.dev=2.5;
* 2nd: mean=4, std.dev=2.5) for increasingly larger sample sizes
*---------------------------------------------------------------
[Figure. Statistical Regularity for N(5,2.5) and N(4,2.5) Variables: the sample mean of each group (y-axis, 2 to 10) plotted against Sample Size (x-axis, log scale, 1 to 200).]
Here we see the signal/noise ratio becoming more distinct, so we can more confidently claim a
difference between the two groups, as the sample size increases.
All test statistics are constructed as a ratio of the effect to the variability of the effect, or signal-to-noise ratio (Stoddard and Ring, 1993; Borenstein M, 1997), in perfect consistency with statistical regularity, as displayed in this graph.
_______________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript]. University of Utah School of Medicine, 2010.
For example, the formula for a two-sample t test, for comparing two means, is
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
where the numerator is the effect, the mean difference, and the denominator is the variability of
the mean difference, which is called the standard error of the mean difference.
The standard error is simply an estimate of the standard deviation of the mean difference that we would obtain had we taken a large number of samples, by repeating the clinical trial, and so had a collection of mean differences from which to compute a standard deviation.
Sample size determination is performed to ensure that we take a large enough sample to obtain
enough data to detect a difference if it exists. We can detect a difference in means provided we
are far enough to the right on the statistical regularity graph. By far enough to the right, we
mean a sufficiently large sample, since sample size is the x-axis on this graph.
Power analysis is performed to discover the probability of detecting a difference, if it exists, for
the sample size we have planned.
Definition of Power
Power = Prob(our sample will achieve statistical significance | $H_A: \mu_1 \neq \mu_2$)
for a given sample size and a given difference in means
Power Increases as Sample Size Increases
The power of a statistic increases monotonically (continues to go up) as sample size increases.
In fact, if you make your sample size large enough, you will eventually get a statistically
significant p value every time, regardless of how small the population difference is (as long as
the difference in means or proportions is not zero).
Let’s look again at the statistical regularity graph shown above.
[Figure (repeated from above). Statistical Regularity for N(5,2.5) and N(4,2.5) Variables: the sample mean of each group (y-axis, 2 to 10) plotted against Sample Size (x-axis, log scale, 1 to 200).]
We know that this simulation is sampling from two distributions where the means are different, 5
and 4, so the null hypothesis is actually false. Looking at the graph, we observe that with sample
sizes < 50 or so, given the standard deviations of 2.5, we do not seem to get a clear signal-to-noise ratio in this graph. That is, sufficient statistical regularity has not been achieved.
Regardless of the value of power, for this example, the t test does not achieve statistical
significance until n=50 in each group, as seen by the following Stata output.
Using the immediate form of the ttest command
Immediate form of two-sample mean comparison test:
ttesti #obs1 #mean1 #sd1 #obs2 #mean2 #sd2 [, options]
ttesti 49 5 2.5 49 4 2.5
ttesti 50 5 2.5 50 4 2.5
Two-sample t test with equal variances
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |      49           5    .3571429         2.5    4.281916    5.718084
       y |      49           4    .3571429         2.5    3.281916    4.718084
---------+--------------------------------------------------------------------
combined |      98         4.5     .256311     2.53735    3.991294    5.008706
---------+--------------------------------------------------------------------
    diff |                   1    .5050763              -.0025685    2.002568
------------------------------------------------------------------------------
    diff = mean(x) - mean(y)                                      t =   1.9799
Ho: diff = 0                                     degrees of freedom =       96

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9747         Pr(|T| > |t|) = 0.0506          Pr(T > t) = 0.0253
Two-sample t test with equal variances
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |      50           5    .3535534         2.5    4.289508    5.710492
       y |      50           4    .3535534         2.5    3.289508    4.710492
---------+--------------------------------------------------------------------
combined |     100         4.5    .2537596    2.537596    3.996486    5.003514
---------+--------------------------------------------------------------------
    diff |                   1          .5                .0077663    1.992234
------------------------------------------------------------------------------
    diff = mean(x) - mean(y)                                      t =   2.0000
Ho: diff = 0                                     degrees of freedom =       98

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9759         Pr(|T| > |t|) = 0.0483          Pr(T > t) = 0.0241
This property of a test statistic not being significant until a sufficient sample size is achieved, for
a given mean difference and standard deviations of groups, is “designed into” the p value
computation by the developers of the test statistic to be consistent with the statistical regularity
property. Actually, it just works out that way if the development of the test statistic is logically
consistent with statistical theory.
In this example, for n=49, insufficient statistical regularity is achieved. For n=49, the value of
the signal-to-noise ratio, the t test statistic, is still inside the inner 95% of values of the t test
statistic that can occur when the null hypothesis, $H_0: \mu_1 = \mu_2$, is true (p = 0.0506 > 0.05).
We can immediately see how sample size is designed into the size of the test statistic by looking
at the t test statistic formula.
To simplify the illustration, we will assume the two groups have the same sample size. Then, the
t test statistic (the unequal variances version) simplifies to:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
    = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2 + s_2^2}{n}}}
    = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2 + s_2^2}} \times \sqrt{n}$$
Thus t always increases as n increases.
Examining the formula, you can see that the same relationship holds even for unequal n’s, since
you are dividing by a smaller denominator as either or both of the two n’s increase.
Similarly, sample size directly affects the p value, since a larger test statistic always produces a
smaller p value (you go further into the tail of the distribution of t).
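For instance, the two-sided p values in the Stata output above can be reproduced from the t statistic and its degrees of freedom with Stata's ttail() function, and a larger t gives a smaller p:

display 2*ttail(96, 1.9799)   // approximately 0.0506, the n=49 result above
display 2*ttail(98, 2.0000)   // approximately 0.0483, the n=50 result above
display 2*ttail(98, 3.0000)   // a larger t statistic gives a smaller p value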
Using SamplePower® 2.0, we can quickly generate the power curve for our example population parameters of $\mu_1 = 5,\ \mu_2 = 4;\ \sigma_1 = \sigma_2 = 2.5$. With a little bit of programming, you can get such a curve in Stata, as well.
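For example, here is a minimal Stata sketch of such a power curve (an assumed illustration, not taken from this chapter; it presumes, as in older Stata releases, that the sampsi command stores the computed power in r(power)):

* Power for the example parameters at per-group sample sizes 10, 20, ..., 200
clear
set obs 20
gen n = 10*_n              // per-group sample sizes 10, 20, ..., 200
gen power = .
forvalues i = 1/20 {
    local nn = 10*`i'
    quietly sampsi 5 4, sd1(2.5) sd2(2.5) n1(`nn') n2(`nn')
    quietly replace power = r(power) in `i'
}
graph twoway line power n, ytitle(Power) xtitle(Sample size per group)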
We see that power goes up with increasing sample size, which is always the case.
A different curve could be drawn for any different combination of population parameters (the
means and standard deviations), but it will always have this monotonically increasing shape.
Decision Errors of Significance Tests
When we use the p value to make a decision about the null hypothesis of a test statistic, we
conceive of probabilities of making errors using a layout analogous to diagnostic test (see box).
Sensitivity/Specificity Layout
For a diagnostic test, we can construct the following table of agreements/disagreements between
the diagnostic test decision and the true state, or gold standard.
                                   Test “probable value”
Gold Standard “true value”         disease present ( + )     disease absent ( - )     Total
disease present ( + )              a (true positives)        b (false negatives)      a + b
disease absent ( - )               c (false positives)       d (true negatives)       c + d
Total                              a + c                     b + d
For the cells a through d, we recognize the familiar terminology (Lilienfeld, 1994, p. 118-124)
for agreements, expressed as percents:
sensitivity = (true positives)/(all those with the disease) = a / (a + b) × 100
specificity = (true negatives)/(all those without the disease) = d / (c + d) × 100
For a significance test conclusion, we have the same layout as shown in the sensitivity/specificity
box. Here, H0 denotes the hypothesis of no effect.
                                          Conclusion From Test
Reality                                   H0 is false ( + )            H0 is true ( - )
H0 is false                               no error                     beta
(population difference present, +)        1-beta                       (type II error)
                                                                       (false negatives)
H0 is true                                alpha                        no error
(population difference absent, -)         (type I error)               1-alpha
                                          (false positives)
We describe these probabilities of a decision error as:
alpha = Prob(reject H0 | H0 is true) = Prob(Type I error)
= Prob(false positive conclusion)
and
beta = 1 – Power = Prob(accept H0 | H0 is false) = Prob(Type II error)
= Prob(false negative conclusion)
The quantity, 1-alpha, is referred to as the confidence level. Chow et al. (2008, p.15) describe it,
“In practice, the maximum probability of committing a type I error that one can tolerate is
usually considered as the level of significance. The confidence level, 1 – α , then reflects
the probability or confidence of not rejecting the true null hypothesis.”
Type II Error and Sample Size Paragraph in Journal Article
There are two reasons to report a sample size paragraph in your research article.
First, it adds credibility to your paper. Since a sample size determination is a necessary part of
good study design, reporting one suggests that you understand study design principles.
Second, it helps the reader interpret nonsignificant p values. When the reader sees a
nonsignificant p value, the question comes up of whether this is due to an effect not being
present in the sampled population, or whether it was the result of an inadequately powered study.
In orthopaedics, research is frequently done on limited sample sizes, such as explanted artificial
joints. Probably largely motivated by the necessity of small sample sizes, although it is
appropriate for any sample size, the 2009 Instructions to Authors for The Journal of Bone and
Joint Surgery state the following,
“For hypothesis testing scenarios the statement ‘no significant difference was found
between two groups’ must be accompanied by a value describing the power of the study
to detect a Type II error (Designing Clinical Research, eds. Stephen Hulley, Steven
Cummings, 1988, Williams and Wilkins, Baltimore pp 128-49).”
That statement does not make sense, but they are trying to say that reporting the power helps the
reader to determine if the lack of significance is explainable simply by an insufficient sample
size.
(see the next section for the correct way to state “conclusions of no difference”)
Conclusions of Equivalence
Given the definition,
beta = 1 – Power = Prob(accept H0 | H0 is false) = Prob(Type II error)
= Prob(false negative conclusion)
It would seem that we could make a probability argument in support of the null hypothesis of
“no difference” or “no effect” using beta.
That is, suppose we have 95% power to detect a minimal difference of biological significance, say
of 1 to be consistent with our example, because our sample size is 165 in each group. Then, we
take our sample and get a nonsignificant p value. Shouldn’t we be able to say that we
demonstrated there is no effect at the beta=1-0.95 = 0.05 level of significance?
This seems reasonable enough, and it is advocated by Jacob Cohen in his classic textbook,
Statistical Power Analysis for the Behavioral Sciences. He first proposed it in a chapter of
another textbook published in 1965 (Cohen, 1965), so the idea has been around for a while. Until
2001, no one, including Cohen, had ever provided a proof that this approach is logically
consistent.
Cohen’s proposition was later shown to be fallacious, however, by a logic proof presented by
Hoenig and Heisey (2001). Hoenig and Heisey’s proof is described in Chapter 2-15 of the
course manual.
To demonstrate equivalence, or no effect, requires a statistical approach called equivalence
testing. Similarly, to demonstrate “no worse than” requires an analogous approach called
noninferiority testing.
Otherwise, when you get a nonsignificant result, all you can conclude is that there is not
sufficient evidence in your study to demonstrate a difference.
Power of a Significance Test
We saw above that a simplified definition of power for our example is:
power = Prob(draw a sample that achieves statistical significance | $H_1: \mu_1 = 5,\ \mu_2 = 4$)

Power is different for every specified effect ($H_1: \mu_1 = 5,\ \mu_2 = 4$, for example)
Power actually depends, or is conditional upon, five things:
1) minimum size of effect one wishes to detect
2) whether you are using a one- or two-sided comparison
3) the choice of alpha
4) the sample size
5) the test statistic you are using
We will now look at each of these five things.
Effect of One- or Two-Sided Comparison on Power
For a two-sided comparison, p < 0.05 when the test statistic falls in the shaded rejection region in
either tail of the sampling distribution for the test statistic. For a one-sided comparison, p < 0.05
when the test statistic falls in the shaded rejection region of one tail of sampling distribution for
the test statistic.
[Figure. Top panel, “two-tail test critical values”: a normal curve with 2.5% rejection regions in each tail, beyond -1.96 SE and +1.96 SE of 0. Bottom panel, “one-tail test critical values”: a normal curve with a single 5% rejection region beyond +1.645 SE of 0.]
Since the test statistic does not have to be as large to fall in the 5% rejection region, a one-sided
comparison is more powerful. That is, the one-sided comparison test statistic can exceed the
critical value using a smaller sample size than a two-sided comparison test for any fixed power.
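As a rough check under the same assumptions as the running example (means of 5 and 4, SDs of 2.5), and assuming sampsi's onesided option, the smaller required sample size of a one-sided comparison can be seen directly; this is an added illustration, not part of the original chapter:

* Required n per group for 80% power, two-sided versus one-sided comparison
sampsi 5 4, sd1(2.5) sd2(2.5) power(.80)            // two-sided: roughly 99 per group
sampsi 5 4, sd1(2.5) sd2(2.5) power(.80) onesided   // one-sided: roughly 78 per group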
The tail-area graphs just shown were created using the following Stata commands:
* -- tail area graphs --
* normal tail area graph -- p. 187 of Stata version 9 graphics manual
*
#delimit ;
graph twoway (function y=normden(x), range(-4 -1.96) bcolor(red)
recast(area) plotregion(style(none)))
(function y=normden(x), range(1.96 4) bcolor(red)
recast(area) plotregion(style(none)))
(function y=normden(x), range(-4 4) clstyle(foreground)
plotregion(style(none)))
,yscale(off) legend(off)
xlabel( -1.96 "-1.96 SE" 0 "0" 1.96 "+1.96 SE") xtitle("")
title(two-tail test critical values)
text(.06 -2.5 "2.5%") text(.06 2.5 "2.5%")
saving(twotail, replace)
;
#delimit cr
*
#delimit ;
graph twoway (function y=normden(x), range(1.645 4) bcolor(red)
recast(area) plotregion(style(none)))
(function y=normden(x), range(-4 4) clstyle(foreground)
plotregion(style(none)))
,yscale(off) legend(off)
xlabel( 0 "0" 1.645 "+1.645 SE") xtitle("")
title(one-tail test critical values)
text(.06 2.5 "5%")
saving(onetail, replace)
;
#delimit cr
graph combine twotail.gph onetail.gph , col(1) saving(bothtail.gph)
For manuscripts that will be published in the medical literature, you should always use a two-tailed (two-sided) comparison, because that is what readers expect to see in publications. Fleiss
(1973) gives a good explanation why:
“…a one-tailed test is called for only when the investigator is not interested in a
difference in the reverse direction from that hypothesized. For example, if he
hypothesizes that P2 > P1, [Fleiss’ example is comparing proportions.] then it will make
no difference to him if either P2 = P1 or P2 < P1. Such an instance is assuredly rare. One
example where a one-tailed test is called for is when an investigator is comparing the
response rate for a new treatment (p2) with the response rate for a standard treatment (p1),
and when he will substitute the new treatment for the standard in his own practice only if
p2 is significantly greater than p1. It will make no difference to him if the two treatments
are equally effective or if the new treatment is actually worse than the standard; in either
case, he will stick with the standard.
If however, the investigator intends to report his results to his professional
colleagues, he is ethically bound to perform a two-tailed test. For if his results indicate
that the new treatment is actually worse than the standard--an inference possible only
with a two-tailed test--he is obligated to report this as a warning to others who might plan
to study the new treatment.
In the vast majority of research undertakings, two-tailed tests are called for. Even
if a theory or a large accumulation of published data suggests that the difference being
studied should be in one direction and not the other, the investigator should nevertheless
guard against the unexpected by performing a two-tailed test. Especially in such cases,
the scientific importance of a difference in the unexpected direction may be greater than
yet another confirmation of the difference being in the expected direction.”
Petrie (2006) offers similar advice for researchers submitting manuscripts to J Bone Joint Surg,
“One-tailed alternatives are rarely used because we have to be absolutely certain for
biological or clinical reasons, in advance of collecting the data, that if H0 is not true that
direction of the difference is known (e.g., that the mean of treatment A is greater than that
of B), and this is rarely possible.”
Although very rarely done, researchers occasionally advance a justification for using a one-tailed
test. Miller et al (2001, p.844, 3rd paragraph) stated the following in their paper published in
Neurology:
“We used a one-tailed t-test (significance = p < 0.05) to examine the primary outcome
measure because the phase II study documented a decline in arm strength, but a two-tailed test was used with all secondary measures.”
This is a weak justification, however, even though it is taught in introductory statistics textbooks,
because they miss the point of why two-tailed tests are used in the medical literature. If the
outcome had come out in the unexpected direction, where arm strength increased, the
reader would be interested in knowing this. It would be hard to imagine that a result in the
opposite direction in this study would be clinically equivalent to “no difference”, which is
what a one-tailed test assumes.
Noninferiority Studies The one exception to the use of a two-tailed test for manuscripts
published in the medical literature is the noninferiority study. For that study, a one-tail test is
appropriate, although many still advocate a two-tailed test. This is discussed in Chapter 2-15.
Effect of Choice of Alpha on Power
If we let alpha = .01, instead of .05, we have to go further out in the tails, which requires a larger
t. If we let alpha = 0.10, we do not have to go as far into the tails. Therefore, the choice of alpha
affects our power and how large of a sample size we need to achieve power.
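A minimal sampsi sketch (an added illustration using the running example, not from the original chapter) of how the choice of alpha changes the required sample size for 80% power; the per-group numbers in the comments are approximate:

sampsi 5 4, sd1(2.5) sd2(2.5) power(.80)               // alpha = 0.05 (default): roughly 99 per group
sampsi 5 4, sd1(2.5) sd2(2.5) power(.80) alpha(0.01)   // stricter alpha: roughly 146 per group
sampsi 5 4, sd1(2.5) sd2(2.5) power(.80) alpha(0.10)   // looser alpha: roughly 78 per group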
Effect of Choice of Minimum Detectable Effect Size on Power
All test statistics are a sample size weighted signal-to-noise ratio. Examining the formula for a t
test, assuming equal sized groups,
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
    = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2 + s_2^2}{n}}}
    = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2 + s_2^2}} \times \sqrt{n}$$
we can see that the larger the effect, the difference, in the numerator, the larger the value of the t
test statistic. The larger the value of the test statistic, the further it lies in the tails of the sampling
distribution we use to compute the p value.
Thus, the larger the effect we are willing to limit significance to, the greater the power of the test.
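As an added illustration (not from the original chapter), holding n = 50 per group and SD = 2.5 fixed, a larger mean difference gives higher power; the power values in the comments are approximate:

sampsi 5 4, sd1(2.5) sd2(2.5) n1(50) n2(50)      // difference = 1.0: power roughly 0.52
sampsi 5.5 4, sd1(2.5) sd2(2.5) n1(50) n2(50)    // difference = 1.5: power roughly 0.85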
Effect of Standard Deviation Assumption on Power
Again, recall that all test statistics are a sample size weighted signal-to-noise ratio.
Examining the formula for a t test, assuming equal sized groups,
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
    = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2 + s_2^2}{n}}}
    = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2 + s_2^2}} \times \sqrt{n}$$
we can see that the smaller the standard deviations (SD, or s) in the denominator, the larger the
value of the t test statistic. The larger the value of the test statistic, the further it lies in the tails
of the sampling distribution we use to compute the p value.
Thus, the smaller the SD we can assume, the greater the power of the test.
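As an added illustration (not from the original chapter), holding the mean difference of 1 and n = 50 per group fixed, a smaller SD gives higher power; the power values in the comments are approximate:

sampsi 5 4, sd1(2.5) sd2(2.5) n1(50) n2(50)   // SD = 2.5: power roughly 0.52
sampsi 5 4, sd1(2) sd2(2) n1(50) n2(50)       // SD = 2.0: power roughly 0.71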
Effect of Sample Size on Power
Given that the other four things that affect power are in the equation, sample size is uniquely
determined by power, and power is uniquely determined by sample size (N).
power = function of (other 4 things) and N
N = function of (other 4 things) and power
For example, 1) provide sample size and get power
Statistics
Power and sample size
Tests of means and proportions
Main tab: Two-sample comparison of means
Mean one: 5
Mean two: 4
Std. deviation one: 2.5
Std. deviation two: 2.5
Options tab: Output: Compute power
Sample based calculations: Sample size one: 50
Sample size two: 50
OK
sampsi 5 4, sd1(2.5) sd2(2.5) n1(50) n2(50)
Estimated power for two-sample comparison of means
Test Ho: m1 = m2, where m1 is the mean in population 1
and m2 is the mean in population 2
Assumptions:

         alpha =   0.0500  (two-sided)     <- defaults to alpha = .05
            m1 =        5                     and a two-sided comparison
            m2 =        4
           sd1 =      2.5
           sd2 =      2.5
sample size n1 =       50
            n2 =       50
         n2/n1 =     1.00

Estimated power:

         power =   0.5160
For example, 2) provide power and get sample size
Statistics
Power and sample size
Tests of means and proportions
Main tab: Two-sample comparison of means
Mean one: 5
Mean two: 4
Std. deviation one: 2.5
Std. deviation two: 2.5
Options tab: Output: Compute sample size
Power of the test: 0.80
OK
sampsi 5 4, sd1(2.5) sd2(2.5) power(.80)
Estimated sample size for two-sample comparison of means
Test Ho: m1 = m2, where m1 is the mean in population 1
and m2 is the mean in population 2
Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.8000
            m1 =        5
            m2 =        4
           sd1 =      2.5
           sd2 =      2.5
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =       99
            n2 =       99
Sample Size and Power Calculations for an Interval Scaled Outcome Variable
As discussed above, for a given test statistic, power and required sample size are determined by
the following five things:
Five items required to compute power or required sample size
     Power                                        Required Sample Size
  1. effect size in the population                effect size in the population
  2. standard deviation in the population         standard deviation in the population
  3. choice of alpha                              choice of alpha
  4. choice of one- or two-sided comparison       choice of one- or two-sided comparison
  5. sample size used                             choice of power
Effect Size. For an independent samples t test, which is a comparison of two means, the effect
size is a choice of the two means (or similarly, the difference of the two means). We choose this
to be the minimal clinically relevant difference that we want sufficient power to be able to detect.
It is also a good idea to use a value that is congruent with the true effect in the
population, or it is unlikely we will be able to achieve it; previously published results are useful for
determining this.
Standard Deviation. This is the difficult part. We must have an estimate of the SD for each of
the two groups. With luck, we can get this from already available pilot data or from previously
published research. If we are unsure, we could estimate this on the high end (the largest
reasonable SD); this will require us to use a larger sample size to ensure our power, but we will
have adequate power for all values of SD that are smaller in the population. We refer to this as a
conservative estimate of the SD.
Alternatively, we might consider what the biologically meaningful minimum and maximum are, and then divide
this range by 6 (since we know that 99.7% of all values fall within 3 SDs of the mean, a total span of 6 SDs, for a
normal distribution). We might also consider the reference range and then divide by 4 (since the
reference range spans about 1.96 SDs on either side of the mean, or a total of about 4 SDs, for this distribution). Both of these strategies
require an approximately symmetrical distribution to apply these normal-distribution interpretations
of the SD. If you use either the range/6 or the range/4, you can cite Browne (2001).
If you know the interquartile range (25th , 75th percentiles), you can make use of the following.
The interquartile range bounds the inner 50% of the area under the curve of any distribution. For
a standard normal distribution, which has mean=0 and standard deviation (SD) = 1, the inner
50% of the area is bounded by [-0.6745 , +0.6745]. The interval [0 , 1] between the mean and 1
SD has length 1. The interval of IQR = [-0.6745 , +0.6745] has length 2(.6745) = 1.3490. Thus,
the relationship,
$$\frac{SD}{1} = \frac{IQR}{1.3490} \qquad\Rightarrow\qquad SD = \frac{IQR}{1.3490}$$
which provides a good estimate for SD if you can assume the variable has a normal distribution.
Example: Suppose the IQR = [3, 7] for a normally distributed variable.
Then SD = (7 − 3)/1.3490 = 2.97.
Example: Suppose the IQR = [3, 7], with median = 4.5 for a skewed distribution. If we plan to
take a log transformation of this variable to achieve normality, we would get
ln(3) = 1.10
ln(4.5) = 1.50
ln(7) = 1.95
using
disp ln(3)
disp ln(4.5)
disp ln(7)
For a normal distribution, the median is right in the center of the distribution, with a symmetrical
shape. Since the log transformation preserves order, we would expect the median to be near the
center of the distribution after the transformation. We notice that (1.50 – 1.10 ) = 0.40 on the left
of the median, which is approximately the same as (1.95 – 1.50) = 0.45 on the right of the
median, so the distribution is nearly symmetrical.
Now, applying the formula,
$$SD = \frac{IQR}{1.3490}$$

disp (1.95 - 1.10)/1.3490
we get 0.63 as the estimate of the SD for a log-transformed variable.
Choice of Alpha. We choose the alpha that is appropriate for our purposes, which is the
probability of false positive results, or type I error, that we are willing to accept. Usually we just
use the conventional =0.05 if the results are to be published in the medical literature.
Choice of 1- or 2-Sided Comparison. We can achieve our desired power with a smaller sample
size if we use a one-sided comparison (one-tailed test). However, as presented in the above
Fleiss (1973) quote, we ethically, and conventionally, choose a two-sided comparison if we are
going to publish in the medical literature.
Choice of Power. It is generally accepted that a power of 0.80, or 80%, is the smallest power a
study should have. If it is feasible to conduct the study with a larger sample size, a power of
90% should be used, or perhaps even 95%. Since this is our probability of obtaining a
statistically significant result with our study (conditional upon the effect size actually existing in
the population), it makes sense to use a larger power if the cost of the study relative to the benefit
of the significance demonstration is justified.
Illustration
For illustration, let’s assume we are planning a clinical trial of two therapies, Therapy A and
Therapy B. We have:
Effect Size: mean A = 5, mean B = 4
SD: SDA = 2.5, SDB = 2.0
Alpha: 0.05
Comparison: 2-sided
Desired Power: 0.90
What sample size is required for this study?
Sample Size Calculation
Using Stata,
sampsi 5 4 , sd1(2.5) sd2(2) power(.90)
we get n=108 required for each group.
Suppose we know we can only collect 75 in each group, due to budget and availability of
consenting study subjects. What power would that sample size provide for us?
Power Calculation
Using Stata,
sampsi 5 4 , sd1(2.5) sd2(2) n1(75) n2(75)
we get power = 77.19%.
Notice that when we omitted the power option and include the sample sizes in the sampsi
command, power was calculated. Up above, when we omitted the sample sizes and included the
power option, the sample size was calculated.
In this situation, you might consider how you might design your study to improve precision, so
that you can get smaller standard deviations. Restricting the sample to a more homogeneous
group is one possibility to achieve this.
What to do if you don’t know anything (no effect size or standard deviation estimates)
If you don’t know anything, you can still do a power analysis based on a standardized variable with
the effect size expressed in standard deviation units.
A standardized score, or z-score, is computed for each variable using,
$$z = \frac{Y - \bar{Y}}{SD_Y}$$
where the variable’s mean is subtracted from the observation, and then this difference is divided by
the variable’s standard deviation. A z-score, then, is in units of the number of standard
deviations an observation is from its mean. A standardized variable always has a mean of 0 and
standard deviation (SD) of 1.
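As a small added illustration (the variable name y is hypothetical, not from the chapter), a z-score can be computed by hand or with egen's std() function in Stata:

summarize y                          // stores the mean in r(mean) and SD in r(sd)
gen zy = (y - r(mean)) / r(sd)       // z = (Y - Ybar) / SD_Y
egen zy2 = std(y)                    // built-in standardization: mean 0, SD 1
summarize zy zy2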
It is best if you can say something about your choice of effect size, so that it sounds reasonable. Also,
this approach works best when you present it as a way to avoid a separate calculation for a
long list of outcome variables (see the example paragraph below).
For a 1/3 SD effect size, or standardized mean difference, you would use
sampsi 0 .33, sd1(1) sd2(1) power(.95)
Estimated sample size for two-sample comparison of means
Test Ho: m1 = m2, where m1 is the mean in population 1
and m2 is the mean in population 2
Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.9500
            m1 =        0
            m2 =      .33
           sd1 =        1
           sd2 =        1
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =      239
            n2 =      239
Expressing the effect size in SD units has a long history. Cohen (1988) popularized the idea in
his textbook on power analysis. The book contains look-up tables based on effect size expressed
this way. For a two group comparison of means, the effect size is (mean difference)/SD, where
SD is for either group, since the ordinary t test formula assumes the SDs are equal (Cohen, 1988,
p. 20). Cohen denoted such an effect size as d, and it has since become popularly known as
Cohen's d. Expressed in Cohen's notation, d = (mA – mB)/σ. To assist the investigator who does not
know what the means and standard deviations, or correlation coefficients, are in advance, but does
have a sense that the effect will be small, medium, or large, Cohen (1988, pp.25-27) proposed the
effect sizes of
d = 0.2 is a small effect size
d = 0.5 is a medium effect size
d = 0.8 is a large effect size
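Because a standardized variable has SD = 1, Cohen's benchmark d values can be entered directly into sampsi as the mean difference. The following is an added sketch, not from the chapter, and the per-group sample sizes in the comments (for 80% power, two-sided alpha = 0.05) are approximate:

sampsi 0 0.2, sd1(1) sd2(1) power(.80)   // small effect:  roughly 393 per group
sampsi 0 0.5, sd1(1) sd2(1) power(.80)   // medium effect: roughly 63 per group
sampsi 0 0.8, sd1(1) sd2(1) power(.80)   // large effect:  roughly 25 per group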
Explaining further, Cohen states (1988, pp.24-25),
“Take, for example, an especially constructed test of learning ability appropriate
for use with phenylpyruvic mental deficients. The investigator may well be satisfied with
the relevance of the test to his purpose, yet may have no idea of either what the σ is or
how many points of difference on Y between means of treated and untreated populations
he can expect. Thus, he has neither the numerator (mA – mB) nor the denominator (σ )
needed to compute d.
It is precisely at this point in the apparent dilemma that the utility of the d concept
comes to the fore. It is not necessary to compute d from a posited difference between
means and an estimated standard deviation; one can posit d directly. Thus, if the
investigator thinks that the effect of his treatment method on learning ability in
phenylpyruvia is small, he might posit a d value such as .2 or .3. If he anticipates it to
be large, he might posit d as .8 or 1.0. If he expects it to be medium (or simply seeks to
straddle the fence on the issue), he might select some such value as d = .5.”
Chow, Shao and Wang (2008, p.13) advocate expressing the effect size in standard
deviation units when no prior knowledge is available. Using δ to denote an effect of clinical
importance, they state:
“In clinical trials, the choice of δ may depend upon absolute change, percent change, or
effect size of the primary study endpoint. In practice, a standard effect size (i.e., effect
size adjusted for standard deviation) between 0.25 and 0.5 is usually chosen as δ if no
prior knowledge regarding clinical performance of the test drug is available. This
recommendation is made based on the fact that the standard effect size of clinical
importance observed from most clinical trials is within the range of 0.25 and 0.5.”
Study Protocol Suggestion
Here is an example power analysis for a study protocol when z scores are used in the power
analysis.
To compute a single power analysis that applies to each outcome, a z-score, or
standardized score, approach is used. Expressed as z-scores, every outcome has the same
distribution, which is a mean of 0 and standard deviation (SD) of 1. Significance test p
values and power calculations using z-scores are identical to those using original scales.
Thus, one power analysis based on z-scores applies to all outcomes, as long as the same
effect size is used. An effect size of 1/3 SD is assumed for all outcomes. For the PAM
scale, this is equivalent to 1 point on the original scale (Hibbard, 2005), so a 1/3 SD is not
excessively large. Assuming the control group outcomes will be unchanged from
baseline to post-intervention, while the intervention group outcomes will be increased by
1/3 SD, the sample size of N=239 per group provides 95% power to detect these
differences, using a two-sided, alpha = 0.05 comparison.
Example 1. In the sample size paragraph of their article, Cahen et al. (N Engl J Med, 2007)
state,
“…We determined that a study with 23 patients per group would have 90% power to
detect a difference of 1 SD with the use of a two-group t-test at a two-sided significance
level of 0.05….”
Example 2. In the sample size paragraph of their article, Hovi et al. (N Engl J Med, 2007) state,
“We calculated that we would need to enroll at least 140 subjects in each group to have a
statistical power of 90% to detect a between-group difference of 0.40 in the standard-deviation score with an alpha level of 0.05 in a two-sided analysis.”
They took advantage of the fact that p values from statistical comparisons are unchanged
by switching between original units and standard deviation units of a variable. Using
standard deviation units is particularly intuitive for defining cutpoints. For example, in
their Methods section they state, “For each very-low-birth-weight survivor, we selected
the next available singleton infant born at term (gestational age, ≥37 weeks) of the same
sex who was not small for gestational age (standard-deviation score for birth weight, ≥ −2).” They used the fact that for a normally distributed variable, about 95% of the values
are between -2SD and +2SD from the mean, applying the same logic as used with
laboratory reference ranges to define “normal” lab values. In their Results section, they
stated, “A post hoc analysis in which we used the 10th percentile (standard-deviation
score, -1.3) as a cutoff point to define small and appropriate for gestational age also
showed no significant differences (range of P values, 0.25 to 0.88).” This time, they used
the fact that for a normally distributed variable, 10% of the values are below -1.3 SDs
from the mean.
Example 3. In the sample size paragraph of their article, Papi et al. (N Engl J Med, 2007) state,
“Using a total of 480 patients, a two-sided test, and an alpha of 0.05, we estimated the
statistical power to be more than 80% to detect a significant difference between these
treatments and an effect size of 0.42. The effect size is a dimensionless variable
expressing the standardized difference (i.e., the mean difference divided by standard
deviation) between, in our study, the mean morning peak expiratory flow rate after as-needed combination therapy and after as-needed albuterol therapy.”
Sample Size Calculation When a Multiple Comparison Adjustment is Planned
When more than two groups are compared, a multiple comparison procedure is frequently
appropriate. Usually, we will not be interested in the simultaneous comparison of the means
(e.g., one-way ANOVA) but rather will be interested in the pairwise comparisons. Therefore, a
good approach to sample size calculation for the k sample comparison is to think of the smallest
pairwise difference you want to detect, and then compute your sample size calculation using a
two group comparison method. When p value adjustments will be used (or any other class of
multiple comparison procedure), the required sample size is larger than when p value
adjustments are not needed (since the actual alpha used in the pairwise comparisons is smaller).
Therefore, instead of using alpha = 0.05, use alpha = 0.05/k, where k is the number of
comparisons to be adjusted with a p value adjustment procedure (Witte et al, 2000). The
resulting sample size will provide the desired power for all of the adjusted comparisons. This
approach works for any of the multiple comparisons procedures, since the smallest p value is
compared to 0.05/k at a minimum.
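A minimal sampsi sketch of this adjustment (an added illustration using the running example, not from the original chapter), with k = 3 pairwise comparisons and therefore alpha = 0.05/3 ≈ 0.0167; the sample size in the comment is approximate:

sampsi 5 4, sd1(2.5) sd2(2.5) power(.80) alpha(0.0167)   // roughly 131 per group, versus 99 at alpha = 0.05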
Overfitting
If you only require a small sample size, but you want to use a large number of predictor
variables, you must increase the sample size to accommodate this. Otherwise, you can introduce
unreliable correlations into the model that will not hold up in larger or future datasets.
This problem of detecting unreliable correlation, due to having too many predictor variables for
the available sample size, is called “overfitting” (Harrell et al, 1996).
The following rule of thumb for interpreting the size of a correlation coefficient will be useful for
this discussion.
Rule of Thumb for Interpreting the Size of a Correlation Coefficient
The Pearson correlation coefficient has the range -1 to 1, with -1 or +1 being perfect correlation
and 0 being no linear association. A rule of thumb for interpreting the size of the
correlation coefficient is presented by Hinkle et al (1998, p.120):
Rule of Thumb for Interpreting the Size of a Correlation Coefficient
   Size of Correlation                 Interpretation
   0.90 to 1.00 (-0.90 to -1.00)       Very high correlation
   0.70 to 0.90 (-0.70 to -0.90)       High correlation
   0.50 to 0.70 (-0.50 to -0.70)       Moderate correlation
   0.30 to 0.50 (-0.30 to -0.50)       Low correlation
   0.00 to 0.30 ( 0.00 to -0.30)       Little if any correlation
Imagine taking two pairs of random numbers, which are clearly not correlated except perhaps by
chance. If you plot these two ordered pairs (x, y), you can fit a straight line through them
perfectly. If you do this for three ordered pairs, you will most likely have a straight line that fits
them quite well, even though they should be uncorrelated. It is not until you get up to over 10
such pairs that the pattern begins to look random upon successive repetitions of this experiment.
Keep in mind that when you have only one predictor variable (and one outcome variable), the
multiple R from linear regression is exactly the Pearson correlation coefficient (similarly,
R² = r²). We can therefore do our experiment using the Pearson correlation coefficient, and the result
is the same for linear regression.
*-- overfitting simulation (correlate random numbers
*   using sample sizes of 2 through 10)
set obs 54
set seed 999
capture drop rand1 rand2 frame
gen rand1=invnorm(uniform()) in 1/54
gen rand2=invnorm(uniform()) in 1/54
gen frame=2 in 1/2          // frame 1: n=2
replace frame=3 in 3/5      // frame 2: n=3
replace frame=4 in 6/9      // frame 3: n=4
replace frame=5 in 10/14    // frame 4: n=5
replace frame=6 in 15/20    // frame 5: n=6
replace frame=7 in 21/27    // frame 6: n=7
replace frame=8 in 28/35    // frame 7: n=8
replace frame=9 in 36/44    // frame 8: n=9
replace frame=10 in 45/54   // frame 9: n=10
bysort frame: pwcorr rand1 rand2, obs sig
*
capture label drop framelab
#delimit ;
label define framelab 2 "n=2 (r= 1.00)" 3 "n=3 (r = -0.40)" 4 "n=4 (r=0.29)"
5 "n=5 (r= -0.05)" 6 "n=6 (r= -0.66)"
7 "n=7 (r=0.87)" 8 "n=8 (r= -0.41)" 9 "n=9 (r= -0.16)" 10 "n=10 (r=-0.22)"
;
#delimit cr
label values frame framelab
label variable frame "sample size"
#delimit ;
twoway (scatter rand1 rand2)(lfit rand1 rand2)
, by(frame,legend(off) title(Spurious Correlation of Pairs of Random
Numbers))
xtitle("")
;
#delimit cr
The result is:
             Correlation of Normal Random Numbers
                  for Sample Sizes n=2 to 10

   sample size    Pearson r    p value    Interpretation of Strength
        2            1.00       1.000     very high
        3           -0.40        .735     low
        4            0.29        .708     little
        5           -0.05        .940     little
        6           -0.66        .158     moderate
        7            0.87        .011     high
        8           -0.41        .316     low
        9           -0.16        .681     little
       10           -0.22        .540     little
[Figure. Spurious Correlation of Pairs of Random Numbers: nine scatterplots with fitted lines, one panel per sample size: n=2 (r=1.00), n=3 (r=-0.40), n=4 (r=0.29), n=5 (r=-0.05), n=6 (r=-0.66), n=7 (r=0.87), n=8 (r=-0.41), n=9 (r=-0.16), n=10 (r=-0.22). Graphs by sample size.]
In this example, the term “spurious” correlation is used, rather than “unreliable” correlation, to
emphasize that we are correlating two variables that are just random numbers (thus, the
correlation is 0 in the population). In actual datasets, a correlation might exist but be
overestimated, thus “unreliable”, since it would not hold up in future larger datasets.
Now, let’s repeat this simulation for sample sizes 2 to 100, using the following commands
clear all
set seed 888
quietly set obs 100
gen n = _n                                 // numbers 1 to 100
gen r = .
gen v1 = .
gen v2 = .
forvalues i=2(1)100 {
   quietly replace v1=invnorm(uniform())   // use a new sample on each iteration
   quietly replace v2=invnorm(uniform())
   quietly corr v1 v2 in 1/`i'             // correlation for incrementing sample size
   * return list                           // correlation coefficient saved in r(rho)
   quietly replace r = r(rho) in `i'/`i'
}
gen xval = 0 in 1/1                        // data to fit reference line
replace xval = 100 in 2/2
gen yval = 0 in 1/1
replace yval = 0 in 2/2
#delimit ;
graph twoway (line r n ) (line yval xval)
, title(Spurious Correlation)
xtitle(Sample Size) ytitle(Pearson r) legend(off)
xlabel(0(10)100) ylabel(-1(0.1)1,format(%3.1f)angle(horizontal))
;
#delimit cr
[Figure. Spurious Correlation: line plot of Pearson r (y-axis, -1.0 to 1.0) against Sample Size (x-axis, 0 to 100), with a horizontal reference line at r = 0.]
We see that the spurious correlation starts to settle down just over n=10 (although spurious
correlations as high as r=0.30 are still common even with sample sizes up to n=70). That is why
it is suggested that you use n=10 subjects for every predictor variable.
This is not to say that correlations < 0.30 should be ignored. Some things simply cannot be
explained to a large degree by a single variable, because of their complexity. Knoke et al (2002,
p.132) state,
“Typically, a single independent variable in social research seldom accounts for more
than 25% to 30% of the variance in a dependent variable, and often for as little as 2% to
5%.”
Fortunately, the sample sizes at which correlations of these sizes achieve statistical significance are about what
you would hope for, after observing the above simulation.
    r²      r     frequency occurs        statistically significant
                  in social research      (p < 0.05) when
   0.25    0.50   seldom                  n ≥ 16
   0.30    0.55   seldom                  n ≥ 14
   0.02    0.14   often                   n ≥ 196
   0.05    0.22   often                   n ≥ 80
Rule of Thumb for Sample Size Required to Avoid Overfitting
Harrell (2001, pp. 60-61) states:
“When a model is fitted that is too complex, that is it has too many free parameters to
estimate for the amount of information in the data, the worth of the model (e.g., R2) will
be exaggerated and future observed values will not agree with predicted values. In this
situation, overfitting is said to be present, and some of the findings of the analysis come
from fitting noise or finding spurious associations between X and Y. In this section
general guidelines for preventing overfitting are given. Here we concern ourselves with
the reliability or calibration of a model, meaning the ability of the model to predict future
observations as well as it appeared to predict the responses at hand. For now we avoid
judging whether the model is adequate for the task, but restrict our attention to the
likelihood that the model has significantly overfitted the data.
Studies in which models are validated on independent datasets [184, 186, 391] have shown
that in many situations a fitted regression model is likely to be reliable when the number
of predictors (or candidate predictors if using variable selection) p is less than m/10 or
m/20, where m is the ‘limiting sample size’ given in Table 4.1. For example, Smith et al.
[391] found in one series of simulations that the expected error in Cox model predicted five-year
survival probabilities was below 0.05 when p < m/20 for “average” subjects and
below 0.10 when p < m/20 for “sick” subjects, where m is the number of deaths. For
“average” subjects, m/10 was adequate for preventing expected errors > 0.1. Narrowly
distributed predictor variables (e.g., if all subjects’ ages are between 30 and 45 or only
5% of subjects are female) will require even higher sample sizes. Note that the number
of candidate variables must include all variables screened for association with the
response, including nonlinear terms and interactions.
TABLE 4.1: Limiting Sample Sizes for Various Response Variables

Type of Response Variable      Limiting Sample Size m
Continuous                     n (total sample size)
Binary                         min(n1, n2) [c]
Ordinal (k categories)         $n - \frac{1}{n^{2}}\sum_{i=1}^{k} n_i^{3}$  [d]
Failure (survival time)        number of failures [e]
__________
[c] See [329]. If one considers the power of a two-sample binomial test compared with a
Wilcoxon test if the response could be made continuous and the proportional odds
assumption holds, the effective sample size for a binary response is 3n1n2/n ≈ 3 min(n1,
n2) if n1/n2 is near 0 or 1 [452, Eq. 10, 15]. Here n1 and n2 are the marginal frequencies of
the two response levels.
[d] Based on the power of a proportional odds model two-sample test when the marginal
cell sizes for the response are n1, ..., nk, compared with all cell sizes equal to unity
(response is continuous) [452, Eq. 3]. If all cell sizes are equal, the relative efficiency of
having k response categories compared to a continuous response is 1 - 1/k² [452, Eq. 14];
for example, a five-level response is almost as efficient as a continuous one if
proportional odds holds across category cutoffs.
[e] This is approximate, as the effective sample size may sometimes be boosted somewhat
by censored observations, especially for nonproportional hazards methods such as
Wilcoxon-type tests [34].
___________
34. J. K. Benedetti, P. Liu, H.N. Sather, J. Seinfeld, and M.A. Epton. Effective sample
size for tests of censored survival data. Biometrika, 69:343-349, 1982.
184. F.E. Harrell, K.L. Lee, R.M. Califf, D.B. Pryor, and R.A. Rosati. Regression
modeling strategies for improved prognostic prediction. Statistics in Medicine,
3:143-152, 1984.
186. F.E. Harrell, K.L. Lee, D.B. Matchar, and T.A. Reichert. Regression models for
prognostic prediction: Advantages, problems, and suggested solutions. Cancer
Treatment Reports, 69:1071-1077, 1985.
329. P. Peduzzi, J. Concato, E. Kemper, T.R. Holford, and A.R. Feinstein. A simulation
study of the number of events per variable in logistic regression analysis. Journal of
Clinical Epidemiology, 49:1373-1379, 1996.
391. L.R. Smith, F.E. Harrell, and L.H. Muhlbaier. Problems and potentials in modeling
survival. In M.L. Grady and H.A. Schwartz, editors, Medical Effectiveness
Research Data Methods (Summary Report), AHCPR Pub. No. 92-00576, pages
151-159. US Dept. of Health and Human Services, Agency for Health Care Policy
and Research, Rockville MD, 1992.
452. J. Whitehead. Sample size calculations for ordered categorical data. Statistics in
Medicine, 12:2257-2271, 1993.”
Harrell (1996) states the m/10 rule for linear regression, logistic, and Cox in a much more
abbreviated form.
Protocol Suggestion
Here is some suggested wording for a study protocol:
Sample Size Determination
A sample size of n=30 is required for 90% power. However, we intend to include 10
predictor terms in the model (counting the number of indicator terms needed for the
categorical variables). [It is a more impressive presentation to list these.] To avoid
“overfitting,” where unreliable correlation is introduced by violating the rule of n=10
subjects for every predictor term, we will collect a sample size of n=100 (Harrell, 2001).
Overfitting: 5 to 9 events per variable in models to assess an effect while adjusting for
confounding
If you can only achieve 5 to 9 events per variable (EPV) in a logistic regression or Cox
regression, however, you are still okay if you cite Vittinghoff and McCulloch (2007) to support
this relaxed rule. In a large simulation study to investigate this rule, Vittinghoff and McCulloch
concluded,
“Our simulation study shows that the rule of thumb of 10 or more EPV in logistic and
Cox models is not a well-defined bright line. If we (somewhat subjectively) regard
confidence interval coverage less than 93 percent, type I error greater than 7 percent, or
relative bias greater than 15 percent as problematic, our results indicate that problems are
fairly frequent with 2-4 EPV, uncommon with 5-9 EPV, and still observed with 10-16
EPV. Cox models appear to be slightly more susceptible than logistic. The worst
instances of each problem were not severe with 5-9 EPV and usually comparable to
those with 10-16 EPV.”
Note: This 5 to 9 events per variable rule only applies to the situation where the aim is to examine
the association of an exposure while adjusting for confounding (Vittinghoff and McCulloch,
2007;Steyerberg, 2009, p.51). Specifically, it should not be used for developing prediction, or
prognostic, models (Steyerberg, 2009, p.50-51).
The final touch to the discussion on reporting the sample size determination in your study
protocol (grant application) is when you determine that you need a small sample size for the
number of predictor variables you plan to model. In that case, you need to increase your sample
size to avoid overfitting.
Protocol Suggestion
Here is some suggested wording for using 5 to 9 events per predictor variable in a logistic
regression or Cox regression in an article.
Sample Size Determination
We included 8 predictor variables in our logistic regression model, with 40 observed
events, resulting in 40/8, or 5 events per predictor variable. A suggested rule-of-thumb is
to have 10 events per predictor variable in a logistic regression model to avoid overfitting
(Harrell, 1996). However, it has since been shown that as few as 5 events per predictor
variable in logistic and Cox regression models are sufficient to avoid overfitting when the
aim of the model is to adjust for confounding (Vittinghoff and McCulloch, 2007).
Switching the Dependent and Independent Variables
It is sometimes easier to reverse the roles of dependent and independent variable for sample size
determination.
Discussing the arbitrariness of which is considered the dependent and which is independent
variable, Chinn (2001, p.394, last paragraph) states,
“Before presentation and analysis can be discussed, the distinction between outcome, or
dependent, variables and explanatory variables, also called independent or exposure
variables, needs to be clarified. Usually there will be no confusion. In a randomized
controlled trial, survival or recovery of the patient may be the outcome of interest and the
treatment group is the explanatory variable. There may be additional explanatory
variables, such as age and sex, and these should include any variable used to stratify the
patients in the RCT. However, in some circumstances there is ambiguity. In a case-control study, subjects are selected as having the disease, the cases, or not having the
disease, the controls, and the measured potential risk factors are the outcomes of the
study. The data analysis proceeds by treating “caseness” as the outcome and the risk
factors as explanatory variables, but strictly speaking the opposite is true. In a study of
asthmatic patients presenting in Accident and Emergency it is possible to compare the
ages of patients that do or do not require admission or to analyse the risk of admission by
age. In the first analysis, age is treated as the outcome and admission the independent
variable, but more logically in the second, admission is the outcome. Although a
conclusion that increasing age is associated with lower risk of admission might be found
from either analysis, the second leads to results in a more useful form and also enables
adjustment for risk factors other than age.”
Protocol suggestion (for a case-control study design, for example)
The independent and dependent variables can be reversed for sample size determination,
just as in tests of significance, since an association is invariant to the choice (Chinn,
2001). As a simple illustration, it makes no difference which is the Y and which is the X
variable when computing a correlation coefficient, and this practice of switching
dependent and independent variables is commonplace in the patients characteristics table
of journal articles. To test the association between the cases and controls on the
continuous variable X, assuming mean±SDs of ….
Excessive Power (Sample Size Very Large)
If your sample size is too large, even trivial differences are statistically significant. When this
happens, p values are no longer useful. You should then not bother with p values, and instead
just show effects, such as mean differences, along with confidence intervals. State whether or
not the difference observed is thought to be clinically significant or not. A good discussion, and
citation, for this is section called “Studies With Excessive Power That Detect Differencs That
Are Not Clinically Meaningful” on page 1521 in Bhardwaj et al (2004).
Sample Size Based on Precision (Desired Width of Confidence Interval)
Sometimes the goal of a study is to demonstrate an effect with a desired precision. That is, the
investigator wants to report an effect with a desired width of the confidence interval around the
effect. A good example is computing a reliability coefficient, such as kappa, with “good”
precision, such as a 95% CI of Kappa±0.05. Studies designed to report test characteristics of a
diagnostic test, such as sensitivity and specificity, can also benefit from the precision approach to
sample size determination.
If the investigator desires a statistically significant p value, as well as desired precision around the
effect estimate, then sample size is determined both for adequate power to detect the effect, as
well as determined to achieve desired precision. The sample size is then selected as the larger of
the two, so that both goals are achievable. (Bristol, 1989).
The approach to computing sample size for a desired precision is based only on the formula for
the confidence interval, rather than the power of a significance test. At first, this seems counter-intuitive, since it seems we want to be 80% sure, for example, that the CI will be no wider than
the desired width. In significance testing, where we want to test if the mean difference is
different from zero, the CI can be used as the significance test, where we conclude a difference
only if the CI does not cover zero. We want to be 80% sure, or have 80% power, that the CI does not
cover 0. However, that concerns what happens on the outside of the CI. The precision is only based on
the inside of the CI, where precision for a CI of (a, b) is the width, or b-a. For precision, we
don’t actually care if the interval includes 0.
For sample size determination for a significance test, we assume the means and standard
deviations and then determine n for a desired power, which is only as reliable as our assumptions
about the means and standard deviations. For precision, given the means, standard deviations,
and sample size, the width of the interval is determined directly from the CI formula.
Given any three of the quantities (means, SDs, n, width), the fourth is solvable by algebra.
So, if our assumptions about the means and SDs are reliable, and the width is set by what we
desire, then n is determined simply by algebra. If our assumptions are off, then our n will be off.
However, we have the same problem in ordinary sample size determination for significance
testing, so we are in no worse a situation.
Precision for a single mean
Rosner (2006, p.258-59) gives the formula for precision of a single mean. Using the asymptotic,
or large sample, formula for a CI around a mean, the 100(1-α)% CI for the population mean, μ, is
given by
\[
\bar{X} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}
\]
The width of this interval (a, b) is b - a, or \(2(z_{1-\alpha/2}\,\sigma/\sqrt{n})\). If we wish the interval to be no wider
than L, then
\[
2(z_{1-\alpha/2}\,\sigma/\sqrt{n}) \le L
\]
Multiplying both sides by \(\sqrt{n}/L\), we obtain
\[
2 z_{1-\alpha/2}\,\sigma/L \le \sqrt{n}
\]
Then squaring both sides gives us the number of subjects needed to obtain a CI no wider than L,
\[
n \ge 4 z_{1-\alpha/2}^{2}\,\sigma^{2}/L^{2}
\]
For a two-sided 95% CI,
\[
z_{1-\alpha/2} = z_{1-0.05/2} = z_{0.975} = 1.96
\]
In Rosner’s example (Rosner, 2006, pp.258-59), where we want to estimate the mean change in
heart rate with a 95% CI no wider than 5 beats per minute, and with SD = 10 beats per minute, we
require
\(n = 4(1.96)^{2}(10)^{2}/(5)^{2} = 61.5\), or 62 patients.
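As a check, the same arithmetic can be done in Stata (a minimal sketch; invnormal(0.975) returns the 1.96 critical value used above):
* Sample size for a 95% CI no wider than L=5, with SD=10 (Rosner's example)
display 4*invnormal(0.975)^2*10^2/5^2    // about 61.5, round up to 62 patients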
Precision for a difference between two means
Bristol (1989) gives the formula for estimating the required sample size to obtain a desired
precision around the difference between two means.
Beginning with the z test for the difference between two means,
\[
z = \frac{\bar{X} - \bar{Y} - (\mu_x - \mu_y)}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}
  = \frac{\bar{X} - \bar{Y} - 0}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}
  = \frac{\bar{X} - \bar{Y}}{\sigma\sqrt{\dfrac{2}{n}}}
  \quad\text{when } n_1 = n_2 = n
\]
The CI around the mean difference, then, is given by
\[
(\bar{X} - \bar{Y}) - z_{\alpha/2}\,\sigma\sqrt{\tfrac{2}{n}}
\;\;\text{to}\;\;
(\bar{X} - \bar{Y}) + z_{\alpha/2}\,\sigma\sqrt{\tfrac{2}{n}}
\]
The width of this interval is
\[
2 z_{\alpha/2}\,\sigma\sqrt{\tfrac{2}{n}}
\]
Setting this width equal to the desired width L and solving for n gives
\[
n = \left(\frac{2\sqrt{2}\,\sigma\, z_{\alpha/2}}{L}\right)^{2}
  = 8\left(\frac{\sigma\, z_{\alpha/2}}{L}\right)^{2}
\qquad\text{(Bristol’s equation 3)}
\]
Assuming the standard deviation, σ, is 1, and we desire L = 1, the required n is
\[
n = 8\left(\frac{1 \times 1.96}{1}\right)^{2} = 8(1.96)^{2} = 30.73,\ \text{or 31 subjects,}
\]
which agrees with the value shown in Bristol’s Table I in the row \(L_0/\sigma = 1/1 = 1.0\) and column \(n_L\).
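The same check in Stata (a minimal sketch of the arithmetic above, with σ = 1 and L = 1):
* Sample size for a 95% CI of width L=1 around a mean difference, sigma=1
display 8*(1*invnormal(0.975)/1)^2       // about 30.7, round up to 31 subjects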
These two citations, Rosner and Bristol, both illustrate that the required sample size for a desired
precision comes from simply applying the CI formula. The same point is illustrated in the Chow
et al (2008, pp.15-16) textbook on sample size calculation.
In practice, you usually have a desired sample size and simply want to show that it is large enough
to achieve a desired precision for some effect. For your power analysis, then, you simply compute
the CI for your effect and point out that the width of the CI represents adequate precision. In fact,
there is no real standard for what a desired precision is, so just show what you have and state that
“it is narrow enough to be informative.”
The width is usually expressed as one-half of the interval, since it generally does not matter if the
lack of precision is above or below an estimate in a two-sided 95% CI. Chow et al (2008, p.15)
describes the approach,
“For a (1 – α)100% confidence interval, the precision of the interval depends on its width.
The narrower the interval is, the more precise the inference is. Therefore, the precision
analysis for sample size determination is to consider the maximum half width of the (1 –
α)100% confidence interval of the unknown parameter that one is willing to accept. Note
that the maximum half width of the confidence interval is usually referred to as the
maximum error of an estimate of the unknown parameter.”
Protocol Suggestion
Suppose you are designing a study where you wish to show that first year orthopaedic surgical
residents can reliably determine that a knee joint requires revision due to infection. To be
conservative, you will assume that the reliability coefficient, kappa, is 0.70. You would like to
keep the number of residents to 4 and the number of patients to 30. You might state something
like the following:
To demonstrate that first year orthopaedic surgical residents can reliably determine that a
knee joint requires revision due to infection, the sample size estimation will be based on
acceptable precision around the kappa interrater reliability coefficient (Bristol, 1989;
Chow et al, 2008). It is anticipated that the kappa will be no lower than 0.70. A 95%
confidence interval no wider than ±0.15 would be considered sufficiently informative.
The following table shows the expected precision of the 95% confidence interval around
an expected kappa of 0.70.
Sample Size Determination Based on Precision of Kappa Estimate
Two-sided 95% Confidence Interval Around Kappa=0.70 (shown as percent)

                         Number of Patient Charts Reviewed
Number of
Reviewers      10      15      20      25      30      35      40      45      50
    3        41-90   47-87   50-85   52-84   54-83   54-82   56-81   57-81   58-81
    4        44-88   49-86   52-84   54-83   55-82   57-82   57-80   58-80   59-79
    9        48-87   52-84   54-82   56-81   57-80   58-80   59-80   60-79   60-78
   15        49-86   52-84   54-82   56-81   58-80   59-79   60-79   60-79   61-78
   20        49-86   52-83   55-82   57-81   58-80   59-79   60-79   60-78   61-78
   30        49-85   53-84   56-82   57-81   58-80   59-79   59-78   60-78   61-78
Adequate precision is obtainable with n=4 residents and 30 patients. We see that
increasing the number of residents above 4 and the number of patients above 30 does not
provide a large enough gain in precision to justify the increase in effort and cost of the
project.
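After the data are collected, the observed precision could be reported along the same lines. A minimal sketch in Stata, assuming hypothetical variables res1-res4 holding each resident’s rating (1 = revision required, 0 = not required) for the 30 charts, and a hypothetical standard error of 0.075 used purely for illustration:
* Multi-rater (Fleiss) kappa for the four residents' ratings
kap res1 res2 res3 res4
* An approximate 95% CI can then be formed as kappa +/- 1.96*SE,
* using the standard error reported in the output; for example, with
* kappa = 0.70 and a hypothetical SE of 0.075:
display 0.70 - 1.96*0.075    // lower limit, about 0.55
display 0.70 + 1.96*0.075    // upper limit, about 0.85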
Two-Group Comparison of an Interval-Scale Outcome: Sample Size Presentation in a Study
Protocol
Although most articles are published without a sample size paragraph in the Statistical Methods
section, for a grant application it is mandatory. This is because a granting agency does not
want to invest money in your study if it has little chance of success (that is, if the sample size is
not sufficient to demonstrate a true effect).
In your sample size paragraph, you should provide the five elements required for sample size
calculation, as well as the test procedure it is based on, so the reviewer can check your
calculation. This also informs the reviewer that you knew what was needed for the calculation,
which suggests you actually did one rather than guessing and bluffing your
way through. You should also describe padding the sample size to allow for subject dropouts
and losses-to-follow-up.
Protocol Suggestion
Something to the effect of the following is appropriate:
Our sample size calculation is based on the primary study aim (Study Aim 1). We
consider the minimal detectable effect to be a mean difference of 1.0 (means: Group A, 5,
Group B, 4). A smaller effect than 1.0 would likely translate into a negligible clinical
outcome, whereas a difference of 1.0 would likely be a noticeable improvement in the
patient’s well-being. Previous research suggests the mean difference will be at least 1.0
in this patient population. references We estimated the population standard deviations from
previous research (standard deviations: Group A, 2.5, Group B, 2.0).references Based on an
independent samples t test, we require n=109 in each group (total n=218) to have 90%
power, with alpha of 0.05, using a two-sided comparison. To provide for a 10%
reduction in evaluative subjects, due to dropouts and losses-to-follow-up, we will collect
a sample of n=121 in each group (total n = 242).
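The calculation behind the n=109 figure can be checked with a command of the following form (a sketch; Stata’s sampsi uses a normal approximation, so its answer may differ by a subject or two from software that applies a t-distribution correction):
* Required n per group for means 5 vs 4, SDs 2.5 and 2.0, 90% power
sampsi 5 4, sd1(2.5) sd2(2.0) power(0.9)    // two-sided alpha=0.05 is the default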
An even more impressive presentation (and good research practice as well) is to provide a
sensitivity analysis of your sample size calculation. This is simply a table showing the power
provided by the selected sample size for a reasonable range of deviations from both the assumed
effect size and assumed standard deviations.
It is a good idea to add a sensitivity analysis paragraph after the above paragraph, something to
the effect of:
Our evaluative sample size of n=109 in each group appears to be sufficiently robust to
reasonable deviations from our assumed effect size and standard deviation estimates, as
the power does not drop below 80% for any reasonable deviation.
Power for n=109 in each group for reasonable combinations
of deviations in assumed effect size and standard deviation estimates

Effect Size                        Standard Deviations
(mean difference)      2.5 & 2.0      2.75 & 2.25      3.0 & 2.50
      1.0                 90*              83               76
      0.9                 83               75               67
      0.8                 74               65               57
*Estimate used for our sample size calculation
Note: The above sentence does not match the table, since power does drop below 80%. We need
to either change the deviations in the table, qualify our statement by saying that some of these
deviations are very unlikely, or adjust our sample size to allow for the reasonable deviations in
the table. If we really suspect that deviations as large as these are possible, we should use a
sample size that gives us 80% power to detect a mean difference of 0.8 with standard deviations
of 3.0 & 2.50 (the lower-right cell of the table).
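The power entries in such a table can be generated by holding the sample size fixed and varying the assumptions (a sketch; the reported powers may differ from the table by a point or so depending on the software’s approximation):
* Power at n=109 per group under deviations from the assumed effect and SDs
sampsi 5 4, sd1(2.5) sd2(2.0) n1(109) n2(109)      // assumed effect and SDs
sampsi 5 4, sd1(2.75) sd2(2.25) n1(109) n2(109)    // larger SDs
sampsi 5 4, sd1(3.0) sd2(2.5) n1(109) n2(109)
sampsi 4.9 4, sd1(2.5) sd2(2.0) n1(109) n2(109)    // effect size 0.9
* ...and so on for the remaining combinations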
Sample Size Presentation in Article
A condensed version of a sample size determination, or power analysis, paragraph is frequently
reported in a research article. Horton and Switzer (2005) surveyed what statistical methods are
used in research articles published in NEJM. They found that 39% of research articles published
in 2004-2005 provided a power analysis.
In an article, a much shorter presentation is appropriate. Here is an example for a survival
analysis (Calverley et al., 2007),
“Assuming a 17% mortality rate in the placebo group at 3 years,17 we estimated that 1510
patients would be needed for each study group to detect a reduction in mortality of 4.3
percentage points in the combination-therapy group, as compared with the placebo group
(hazard ratio for death, 0.728), at a two-sided alpha level of 0.05 with 90% power.”
Example of Working Through a Sample Size for Comparing Two Means (Smoking
Cessation Program Example)
Suppose you have the following study aim in your study protocol, or grant proposal:
Aim 1. Assess whether a smoking cessation program designed around physicians
presenting health benefits is less effective for young adults (ages
21 to 39) than for not-so-young adults (ages 40 and above).
When preparing our statistical methods section, we would appear naïve and fail to get our grant
funded if we proposed a t test as our statistical method for testing the Aim 1 hypothesis. The
reviewer would immediately recognize that the effect is probably confounded, possibly by
gender and certainly by habit (cigarettes/day) upon entering the cessation program.
We might use something like,
Statistical Methods
To test our Aim 1 hypothesis [a smoking cessation program designed around physicians
presenting health benefits works less well for young adults (ages 21 to 39) than for
not-so-young adults (ages 40 and above)], we will use multivariable linear regression.
The outcome will be days remaining smoke free within the first year. The study factor (age
group) will be adjusted for potential confounding variables, such as gender and daily
smoking habit (cigarettes/day) at the start of the smoking cessation program.
We next need to prepare the “sample size justification” section for the grant application, where a
multivariable linear regression model is planned.
Since we know that the t test and linear regression are equivalent, we can base our sample size
calculation on the same formula that we would use for a t test.
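That equivalence is easy to verify once data are in hand. A minimal sketch, using hypothetical variable names (days = days remaining smoke free, agegrp = 0/1 indicator for the not-so-young group):
* The equal-variance t test and simple linear regression on a 0/1 group
* indicator give the same two-sided p-value for the group difference
ttest days, by(agegrp)
regress days agegrp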
Suppose we have our own preliminary data, or a previously published study with the same study
aim. In those data, we have the following mean number of days before study subjects returned to
smoking:
group 1: mean = 85, SD = 137
group 2: mean = 71, SD = 119
Calculating the sample size for powers of 80, 85, 90, and 95:
sampsi 85 71, sd1(137) sd2(119) power(0.8)
sampsi 85 71, sd1(137) sd2(119) power(0.85)
sampsi 85 71, sd1(137) sd2(119) power(0.9)
sampsi 85 71, sd1(137) sd2(119) power(0.95)
. sampsi 85 71, sd1(137) sd2(119) power(0.8)

Estimated sample size for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2
Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.8000
            m1 =       85
            m2 =       71
           sd1 =      137
           sd2 =      119
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =     1319
            n2 =     1319
Realistically, we can probably stop here. It is not likely that we can get n=1509 subjects in each
group to enroll in our smoking cessation program and effectively follow them for one year. For
80% power, we need n=1319 in each group (as shown above); for 85% power, n=1509; for 90%
power, n=1766; and for 95% power, n=2184 in each group. Let’s keep going, however, for
illustration.
A clever sample size determination paragraph will include a sensitivity analysis, which is a table
that shows the power for various effect sizes for the sample size you are requesting in your grant.
sampsi 78 71, sd1(137) sd2(119) n1(1500) n2(1500)    // 1 week diff
sampsi 85 71, sd1(137) sd2(119) n1(1500) n2(1500)    // 2 week diff
sampsi 92 71, sd1(137) sd2(119) n1(1500) n2(1500)    // 3 week diff
sampsi 101 71, sd1(137) sd2(119) n1(1500) n2(1500)   // 4 week diff
Estimated power for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2
Assumptions:

         alpha =   0.0500  (two-sided)
            m1 =       78
            m2 =       71
           sd1 =      137
           sd2 =      119
sample size n1 =     1500
            n2 =     1500
         n2/n1 =     1.00

Estimated power:

         power =   0.3209
For 1 week, the power is 32%; for 2 weeks, the power is 85%; for 3 weeks, the power is 99%; for
4 weeks, the power is 100%.
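As a rough hand check of the first of these figures (a sketch using the large-sample normal approximation, which ignores the negligible opposite tail):
* Approximate power for a 1-week (7-day) difference with n=1500 per group
display normal(7/sqrt((137^2 + 119^2)/1500) - invnormal(0.975))   // about 0.32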
We are now ready to write our sample size justification paragraph.
Sample Size Justification
To test our Aim 1 hypothesis, we require n=1509 in each of two groups (total N=3018) to
detect a difference in days remaining smoke free of 2 weeks. This sample size gives us
85% power, for a two-sided comparison, using alpha=0.05 and standard deviation
estimates from our pilot study (control group SD=137, intervention group SD=119).
From our pilot study, we feel confident that no less than a two-week difference will be
observed, and this difference seems to us to be a meaningful effect given the difficulty
that smokers have in quitting smoking. As shown in the following sensitivity analysis, a
larger effect would be nearly certain to be detected if it exists.
Sensitivity of power for various effect sizes
(n=1509 for each of two groups, total N=3018)
Difference in days remaining smoke free     1 week    2 weeks    3 weeks    4 weeks
Power                                         32%       85%        99%       100%
To allow for a 10% dropout and losses to follow-up, we will collect n=1.1 x 1509, or
1660 in each group (total N=3,320).
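The padding arithmetic can be checked directly (a trivial sketch):
display ceil(1.1*1509)      // 1660 per group
display 2*ceil(1.1*1509)    // total N = 3320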
References
Bhardwaj SS, Camacho F, Derrow A, et al. (2004). Statistical significance and clinical relevance:
the importance of power in clinical trials in dermatology. Arch Dermatol 140:1520-1523.
Borenstein M. (1997). Hypothesis testing and effect size estimation in clinical trials. Annals of
Allergy, Asthma, & Immunology 78:5-11.
Borenstein M, Rothstein H, Cohen J. (2001). SamplePower® 2.0. Chicago, SPSS Inc.
software can be purchased at http://www.spss.com
Bristol DR. (1989). Sample size for constructing confidence intervals and testing hypotheses.
Statist Med 8:803-811.
Browne RH. (2001). Using the sample range as a basis for calculating sample size in power
calculations. The American Statistician, 55(4):293-298.
Cahen DJ, Gouma DJ, Nio Y, et al. (2007). Endoscopic versus surgical drainage of the
pancreatic duct in chronic pancreatitis. N Engl J Med 356(7):676-684.
Chow S-C, Shao J, Wang H. (2008). Sample Size Calculations in Clinical Research. 2nd ed. New
York, Chapman & Hall/CRC.
Chinn S. (2001). Statistics for the European Respiratory Journal. Eur Respir J 18:393-401.
Cohen J. (1965). Some statistical issues in psychological research. In B.B. Wolman (Ed.),
Handbook of Clinical Psychology. New York, McGraw-Hill. pp. 95-121.
Cohen J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, New
Jersey.
Fleiss JL. (1973). Statistical Methods For Rates and Proportions. New York, John Wiley &
Sons. pp.21-22.
Harrell Jr FE, Lee KL, Mark DB. (1996). Tutorial in Biostatistics: Multivariable prognostic
models: issues in developing models, evaluating assumptions and adequacy, and
measuring and reducing errors. Statist Med 15:361-387
Harrell FE Jr. (2001). Regression Modeling Strategies With Applications to Linear Models,
Logistic Regression, and Survival Analysis. New York, Springer.
Hinkle DE, Wiersma W, Jurs SG. (1998). Applied Statistics for the Behavioral Sciences, 4th ed.
Boston, Houghton Mifflin Company.
Hoenig JM, Heisey DM. (2001). The abuse of power: the pervasive fallacy of power calculations
for data analysis. The American Statistician 55(1):19-24.
Chapter 2-5 (revision 16 May 2010)
p. 41
Horton NJ, Switzer SS. (2005). Statistical methods in the Journal. [letter] NEJM 353(18):1977-79.
Hovi P, Andersson S, Eriksson JG, et al. (2007). Glucose regulation in young adults with very
low birth weight. N Engl J Med 356(20):2053-63.
Knoke D, Bohrnstedt GW, Mee AP. (2002). Statistics for Social Data Analysis, 4th ed. Itasca, IL,
F.E. Peacock Publishers.
Lilienfeld DE, Stolley PD (1994). Foundations of Epidemiology, 3rd ed., New York, Oxford
University Press.
Miller RG, Moore DH, Gelinas DF, et al. (2001). Phase III randomized trial of gabapentin in
patients with amyotrophic lateral sclerosis. Neurology 56:843-848.
Papi A, Canonica GW, Maestrelli P, et al. (2007). Rescue use of beclomethasone and albuterol in
a single inhaler for mild asthma. N Engl J Med 356(20):2040-52.
Petrie A. (2006). Statistics in orthopaedic papers. Journal of Bone and Joint Surgery 88-B(9):
1121-1136.
Rosner B. (2006). Fundamentals of Biostatistics, 6th ed. Belmont CA, Thomson Brooks/Cole.
Ross S. (1998). A First Course in Probability, 5th ed. Upper Saddle River, NJ.
Smith D, Eggen M, St. Andre R. (1997). A Transition to Advanced Mathematics, 4th ed.,
Pacific Grove, California, Brooks/Cole Publishing Company.
Steyerberg EW. (2009). Clinical Prediction Models: A Practical Approach to Development,
Validation, and Updating. New York, Springer.
Stoddard GJ, Ring WH. (1993). How to evaluate study methodology in published clinical
research. J Intravenous Nursing 16(2):110-117.
Vittinghoff E, McCulloch CE. (2007). Relaxing the rule of ten events per variable in logistic and
Cox regression. Am J Epidemiol 165(6):710-718.
Whitehead J. (1993). Sample size calculations for ordered categorical data. Statistics in Medicine
12:2257-2271.
Witte JS, Elston RC, Cardon LR. (2000). On the relative sample size required for multiple
comparisons. Statist Med 19:369-372.