Power Analysis

Power is the conditional probability that one will reject the null hypothesis given that the null hypothesis is really false. Beta, β, is the conditional probability that one will not reject the null hypothesis given that the null hypothesis is really false (that is, make a Type II error). Since rejecting the null and not rejecting the null are mutually exclusive and exhaustive events, Power + β = 1.

Imagine that we are evaluating the effect of a putative memory-enhancing drug. We have randomly sampled 25 people from a population known to have normally distributed IQ with a μ of 100 and a σ of 15. We administer the drug, wait a reasonable time for it to take effect, and then test our subjects' IQ. Assume that we were so confident in our belief that the drug would either increase IQ or have no effect that we entertained directional hypotheses. Our null hypothesis is that after administering the drug μ ≤ 100; our alternative hypothesis is μ > 100.

These hypotheses must first be converted to exact hypotheses. Converting the null is easy: it becomes μ = 100. The alternative is more troublesome. If we knew that the effect of the drug was to increase IQ by 15 points, our exact alternative hypothesis would be μ = 115, and we could compute power, the probability of correctly rejecting the false null hypothesis given that μ is really equal to 115 after drug treatment, not 100 (normal IQ). But if we already knew how large the effect of the drug was, we would not need to do inferential statistics.

One solution is to decide on a minimum nontrivial effect size. What is the smallest effect that you would consider to be nontrivial? Suppose that you decide that if the drug increases IQ by 2 or more points, then that is a nontrivial effect, but if the mean increase is less than 2 then the effect is trivial. Now we can test the null of μ = 100 versus the alternative of μ = 102. Refer to Figure 15.1 on page 385 of the 7th edition of Howell.
Let the left curve represent the distribution of sample means if the null hypothesis were true, μ = 100. This sampling distribution has a μ of 100 and a σx̄ = 15/√25 = 3. Let the right curve represent the sampling distribution if the exact alternative hypothesis is true, μ = 102. Its μ is 102 and, assuming the drug has no effect on the variance in IQ scores, its σx̄ = 15/√25 = 3. The dark shaded area in the upper tail of the null distribution is α. Assume we are using a one-tailed α of .05. How large would a sample mean need be for us to reject the null? Since the upper 5% of a normal distribution extends from 1.645 standard errors above the μ up to positive infinity, the sample mean IQ would need be 100 + 1.645(3) = 104.935 or more to reject the null.

What are the chances of getting a sample mean of 104.935 or more if the alternative hypothesis is correct, that is, if the drug increases IQ by 2 points? The area under the alternative curve from 104.935 up to positive infinity represents that probability, which is power. Assuming the alternative hypothesis is true, that μ = 102, the probability of rejecting the null hypothesis is the probability of getting a sample mean of 104.935 or more in a normal distribution with μ = 102, σ = 3. Z = (104.935 − 102)/3 = 0.98, and P(Z > 0.98) = .1635. That is, power is about 16%. If the drug really does increase IQ by an average of 2 points, we have a 16% chance of rejecting the null. If its effect is even larger, we have a greater than 16% chance.

Suppose we consider 5 the minimum nontrivial effect size. This will separate the null and alternative distributions more, decreasing their overlap and increasing power, as in Figure 15.2 on page 388. Now, Z = (104.935 − 105)/3 = −0.02, and P(Z > −0.02) = .5080, or about 51%. It is easier to detect large effects than small effects.

Suppose we conduct a two-tailed test, since the drug could actually decrease IQ; α is now split into both tails of the null distribution, .025 in each tail.
We shall reject the null if the sample mean is 1.96 or more standard errors away from the μ of the null distribution. That is, if the mean is 100 + 1.96(3) = 105.88 or more (or if it is 100 − 1.96(3) = 94.12 or less), we reject the null. The probability of that happening if the alternative is correct (μ = 105) is: Z = (105.88 − 105)/3 = 0.29, P(Z > 0.29) = .3859, power = about 39%. We can ignore P(Z < (94.12 − 105)/3) = P(Z < −3.63), which is very, very small. Note that our power is less than it was with a one-tailed test. If you can correctly predict the direction of effect, a one-tailed test is more powerful than a two-tailed test. What would power be if you incorrectly predicted the direction of effect?

Consider what would happen if you increased sample size to 100. Now σx̄ = 15/√100 = 1.5. With the null and alternative distributions less plump, they should overlap less, increasing power, as in Figure 15.3 on page 388. With σx̄ = 1.5, the sample mean will need be 100 + 1.96(1.5) = 102.94 or more to reject the null. If the drug increases IQ by 5 points, power is: Z = (102.94 − 105)/1.5 = −1.37, P(Z > −1.37) = .9147, or between 91 and 92%. Anything that decreases the standard error will increase power. This may be achieved by increasing the sample size or by reducing the σ of the dependent variable. The σ of the DV may be reduced by reducing the influence of extraneous variables upon the DV (eliminating "noise" in the DV makes it easier to detect the signal, the IV's effect on the DV).

Now consider what happens if you change α. Let us reduce α to .01. Now the sample mean must be 2.58 or more standard errors from the null μ before we reject the null. That is, 100 + 2.58(1.5) = 103.87. Under the alternative, Z = (103.87 − 105)/1.5 = −0.75, P(Z > −0.75) = .7734, or about 77%, less than it was with α at .05, ceteris paribus. Reducing α reduces power.
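As a check on the normal-curve arithmetic above, the one-tailed and two-tailed power calculations can be sketched in Python. The `power` helper below is my own illustration (not part of Howell's text) and uses only the standard library:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(mu_alt, sigma=15, n=25, z_crit=1.645, mu_null=100, two_tailed=False):
    """Power of the z test on a sample mean when the true mean is mu_alt."""
    se = sigma / sqrt(n)
    upper = mu_null + z_crit * se          # upper rejection boundary
    p = 1 - phi((upper - mu_alt) / se)
    if two_tailed:
        lower = mu_null - z_crit * se      # lower rejection boundary
        p += phi((lower - mu_alt) / se)
    return p

print(round(power(102), 3))                                   # one-tailed, 2-point effect, about .16
print(round(power(105), 3))                                   # one-tailed, 5-point effect, about .51
print(round(power(105, z_crit=1.96, two_tailed=True), 3))     # two-tailed, about .38-.39
print(round(power(105, n=100, z_crit=1.96, two_tailed=True), 3))  # N = 100, about .92
print(round(power(105, n=100, z_crit=2.58, two_tailed=True), 3))  # alpha = .01, about .77
```

Small discrepancies from the text's values (e.g., .385 versus .3859) arise only because the text rounds z to two decimals before consulting the table.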
Please note that all of the above analyses have assumed that we have used a normally distributed test statistic, as Z = (X̄ − μ)/σx̄ will be if the criterion variable is normally distributed in the population or if sample size is large enough to invoke the CLT. Remember that using Z also requires that you know the population σ rather than estimating it from the sample data. We more often estimate the population σ, using Student's t as the test statistic. If N is fairly large, Student's t is nearly normal, so this is no problem. For example, with a two-tailed α of .05 and N = 25, we went out ±1.96 standard errors to mark off the rejection region. With Student's t on N − 1 = 24 df, we should have gone out ±2.064 standard errors. But 1.96 versus 2.06 is relatively trivial, so we should feel comfortable with the normal approximation. If, however, we had N = 5, df = 4, critical t = ±2.776, the normal approximation would not do. A more complex analysis would be needed.

One Sample Power Analysis the Easy Way

Howell presents a simple method of doing power analyses for various designs. His method also assumes a normal sampling distribution, so it is most valid with large sample sizes. The first step in using Howell's method is to compute d, the effect size in σ units. For the one-sample test, d = (μ1 − μ0)/σ. For our IQ problem with minimum nontrivial effect size at 5 IQ points, d = (105 − 100)/15 = 1/3. We combine d with N to get δ (delta). For the one-sample test, δ = d√N. For our IQ problem with N = 25, δ = (1/3)√25 = 1.67. Once δ is obtained, power is obtained using Table E.5 on page 595 of Howell. For a .05 two-tailed test, power = 36% for a δ of 1.60 and 40% for a δ of 1.70. By linear interpolation, power for δ = 1.67 is 36% + .7(40% − 36%) = 38.8%, within rounding error of the result we obtained using the normal curve. If you look carefully at the table, you will see that the larger δ is, the greater power is. For a one-tailed test, use the column in the table with α twice its one-tailed value.
For α of .05 one-tailed, use the .10 two-tailed column. For δ = 1.67, power then = 48% + .7(52% − 48%) = 50.8%, the same answer we got with the normal curve.

If we were not able to reject the null hypothesis in our research on the putative IQ drug and our power analysis indicated about 39% power, we would be in an awkward position. Although we could not reject the null, we also could not accept it, given that we only had a relatively small (39%) chance of rejecting it even if it were false. We might decide to repeat the experiment using an N large enough to allow us to accept the null if we cannot reject it. In my opinion, if 5% is a reasonable risk for a Type One error (α), then 5% is also a reasonable risk for a Type Two error (β), so let us use power = 1 − β = 95%. From the power table, to have power = .95 with α = .05, δ is 3.60. N = (δ/d)². For a five-point minimum IQ effect, N = (3.6 / (1/3))² = 116.64. Thus, if we repeat the experiment with 117 subjects and still cannot reject the null, we can accept the null and conclude that the drug has no nontrivial (≥ 5 IQ points) effect upon IQ. The null hypothesis we are accepting here is a "loose null hypothesis" [95 < μ < 105] rather than a "sharp null hypothesis" [μ = exactly 100]. Sharp null hypotheses are probably very rarely ever true. Others could argue with your choice of the minimum nontrivial effect size. Cohen has defined a small effect as d = .20, a medium effect as d = .50, and a large effect as d = .80. If you defined minimum d at .20, you would need even more subjects for 95% power.

A third approach one can take is to find the smallest effect that one could have detected with high probability given N. This is known as a "sensitivity analysis." If that d is small and the null hypothesis is not rejected, it is accepted. For example, I used 225 subjects in the IQ enhancer study. For power = 95%, δ = 3.60 with α .05 two-tailed, and d = δ/√N = 3.60/√225 = 3.60/15 = 0.24.
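The δ-based calculations above (δ = d√N, the required N = (δ/d)², and the sensitivity d = δ/√N) can be sketched as follows; the variable names are my own, and the table values (36%, 40%, δ = 3.60) are those quoted from Howell's Table E.5:

```python
from math import sqrt

d = 1 / 3                    # minimum nontrivial effect: 5 IQ points / sigma of 15
delta = d * sqrt(25)         # delta = d * sqrt(N) = 1.67 for N = 25
# Linear interpolation between the tabled values for delta = 1.60 and 1.70:
power_pct = 36 + (round(delta, 2) - 1.60) / 0.10 * (40 - 36)   # 38.8%

# Required N for 95% power (delta = 3.60 at alpha = .05, two-tailed):
needed_N = (3.60 / d) ** 2   # 116.64, so run 117 subjects

# Sensitivity: smallest d detectable with 95% power given N = 225
sens_d = 3.60 / sqrt(225)    # 0.24
print(round(delta, 2), round(power_pct, 1), round(needed_N, 2), round(sens_d, 2))
```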
If I can't reject the null, I accept it, concluding that if the drug has any effect, it is a small effect, since I had a 95% chance of detecting an effect as small as .24σ. The loose null hypothesis accepted here would be that the population μ differs from 100 by less than .24σ.

Two Independent Samples

For the two-group independent sampling design, d = (μ1 − μ2)/σ, where σ is the standard deviation in either population, assuming σ is identical in both populations, and δ = d√(n/2), where n = the number of scores in one group and both groups have the same n. If n1 ≠ n2, use the harmonic mean sample size, ñ = 2 / (1/n1 + 1/n2). For a fixed total N, the harmonic mean (and thus power) is higher the more nearly equal n1 and n2 are. This is one good reason to use equal-n designs. Other good reasons are computational simplicity with equal n's and greater robustness to violation of assumptions. Try computing the effective (harmonic) sample size for 100 subjects evenly split into two groups of 50 each and compare that with the effective sample size obtained if you split them into 10 in one group and 90 in the other. You should be able to rearrange the above formula for δ to solve for d or for n as required.

Consider the following a priori (done before gathering the data) power analysis. We wish to compare the mean psychology achievement test score of students in Research Psychology graduate programs with that of those in Clinical Psychology graduate programs. We decide that we will be satisfied if we have enough data to have an 80% chance of detecting an effect of 1/3 of a standard deviation, employing a .05 criterion of significance. How many scores do we need in each group, if we have the same number of scores in each group? From the table in Howell, we obtain the value of 2.8 for δ. n = 2(δ/d)² = 2(2.8 / (1/3))² = 141 scores in each group, a total of 282 scores.

Consider the following a posteriori (done after gathering the data) power analysis.
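The harmonic-mean exercise suggested above and the a priori sample-size calculation can be sketched as follows (my own helper names):

```python
from math import sqrt

def harmonic_n(n1, n2):
    """Effective per-group sample size when the two group sizes are unequal."""
    return 2 / (1 / n1 + 1 / n2)

# Equal versus unequal splits of 100 subjects: 50/50 beats 10/90
print(round(harmonic_n(50, 50), 2))   # 50.0
print(round(harmonic_n(10, 90), 2))   # 18.0 -- a large loss of effective sample size

# A priori: delta = 2.8 for 80% power at alpha = .05, with d = 1/3
n_per_group = 2 * (2.8 / (1 / 3)) ** 2
print(round(n_per_group))             # 141 scores per group, 282 in total
```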
We have available only 36 scores from students in clinical programs and 48 scores from students in general programs. What are our chances of detecting a difference of 40 points (which is that actually observed at ECU way back in 1981) if we use a .05 criterion of significance and the standard deviation is 98? The standardized effect size, d, is 40/98 = .408. The harmonic mean sample size is ñ = 2 / (1/36 + 1/48) = 41.14. δ = .408√(41.14/2) = 1.85. From our power table, power is 46% (halfway between .44 and .48).

Correlated Samples

The correlated samples t test is mathematically equivalent to a one-sample t test conducted on the difference scores (for each subject, score under one condition less score under the other condition). The greater ρ12, the correlation between the scores in the one condition and those in the second condition, the smaller the standard deviation of the difference scores and the greater the power, ceteris paribus. By the variance sum law, the standard deviation of the difference scores is σDiff = √(σ1² + σ2² − 2ρ12σ1σ2). If we assume equal variances, this simplifies to σDiff = σ√(2(1 − ρ12)).

When conducting a power analysis for the correlated samples design, we can take into account the effect of ρ12 by computing dDiff, an adjusted value of d: dDiff = d / √(2(1 − ρ12)), where d is the effect size as computed above, with independent samples. We can then compute power via δ = dDiff√n, or the required sample size via n = (δ/dDiff)².

Please note that using the standard error of the difference scores, rather than the standard deviation of the criterion variable, as the denominator of dDiff is simply Howell's method of incorporating into the analysis the effect of the correlation produced by matching. If we were computing estimated d as an estimate of the standardized effect size given the obtained results, we would use the standard deviation of the criterion variable in the denominator, not the standard deviation of the difference scores.
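Howell's adjustment for correlated samples can be sketched as below. The d = 1/3, ρ values, and n = 25 pairs here are hypothetical illustrations of my own, chosen only to show that stronger matching (larger ρ12) raises δ and hence power:

```python
from math import sqrt

def d_diff(d, rho):
    """Adjust an independent-samples d for the correlation produced by matching,
    assuming equal variances: d_diff = d / sqrt(2 * (1 - rho))."""
    return d / sqrt(2 * (1 - rho))

def delta_corr(d, rho, n):
    """delta = d_diff * sqrt(n), where n is the number of pairs of scores."""
    return d_diff(d, rho) * sqrt(n)

# With d = 1/3 and 25 pairs, raising rho from 0 to .75 roughly doubles delta
print(round(delta_corr(1 / 3, 0.0, 25), 2))    # about 1.18
print(round(delta_corr(1 / 3, 0.75, 25), 2))   # about 2.36
```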
I should admit that on rare occasions I have argued that, in a particular research context, it made more sense to use the standard deviation of the difference scores in the denominator of d.

Consider the following a posteriori power analysis. One of my colleagues has data on level of cortisol in humans under two conditions: 1.) in a relaxed setting, and 2.) in a setting thought to increase anxiety. The standard deviations in her samples are 108 and 114. If the effect size in the population were 20 points (which she considers the smallest nontrivial effect), how much power would she have? The estimated correlation resulting from the within-subjects design is .75. The pooled standard deviation is √(.5(108²) + .5(114²)) = 111. d = 20/111 = .18. dDiff = .18 / √(2(1 − .75)) = .255. With her 16 pairs of scores, δ = .255√16 = 1.02. From the power table, power = 17%. My colleague is not very happy about this low power. I point out that maybe she is lucky and has such a big difference in means that it is significant despite the low power. Regretfully, that is not so. She starts planning to repeat the study with a larger sample size.

How much data will my colleague need? She wants to have enough power to be able to detect an effect as small as 20 points (.18 of a standard deviation) with 95% power – she considers Type I and Type II errors equally serious and is employing a .05 criterion of statistical significance, so she wants beta to be not more than .05. n = (δ/dDiff)² = (3.6/.255)² = 199.3, so she will need about 200 pairs of scores. Here is another example of power analysis with matched pairs.

Addendum

What would power be if you incorrectly predicted the direction of effect? For example, for the current design, suppose that we (in our alternative hypothesis) predicted that the drug would increase IQ but, in fact, it decreases IQ (µ is less than 100). The null hypothesis is that µ is 100 or less. The one-tailed critical value for mean IQ is still 100 + 1.645(3) = 104.935 or more to reject the null.
If the effect of the drug were 5 points, but 5 points below 100 instead of above 100, the probability of rejecting the null would be the probability of getting a z greater than or equal to (104.935 − 95)/3 = 3.31, a probability of only about .0005. This probability is not, however, power. Power is the probability of rejecting the null given that the null is false. If, however, the alternative hypothesis has incorrectly predicted the direction, then the null has correctly predicted the direction, and the null cannot be false; that is, you could never correctly reject the null.

Copyright 2013, Karl L. Wuensch - All rights reserved.