Download Power Analysis

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Statistics wikipedia , lookup

Transcript
2101
Power Analysis
Power is the conditional probability that one will reject the null hypothesis given
that the null hypothesis is really false. Beta, β, is the conditional probability that one will
not reject the null hypothesis given that the null hypothesis is really false (that is, make
a Type II error). Since rejecting the null and not rejecting the null are mutually exclusive
and exhaustive events, Power + Beta = 1.
Imagine that we are evaluating the effect of a putative memory enhancing drug.
We have randomly sampled 25 people from a population known to have normally
distributed IQ with a  of 100 and a  of 15. We administer the drug, wait a reasonable
time for it to take effect, and then test our subjects’ IQ. Assume that we were so
confident in our belief that the drug would either increase IQ or have no effect that we
entertained directional hypotheses. Our null hypothesis is that after administering the
drug   100; our alternative hypothesis is  > 100.
These hypotheses must first be converted to exact hypotheses. Converting the
null is easy: it becomes  = 100. The alternative is more troublesome. If we knew that
the effect of the drug was to increase IQ by 15 points, our exact alternative hypothesis
would be  = 115 and we could compute power, the probability of correctly rejecting the
false null hypothesis given that  is really equal to 115 after drug treatment, not 100
(normal IQ). But if we already knew how large the effect of the drug was we would not
need to do inferential statistics.
One solution is to decide on a minimum nontrivial effect size. What is the
smallest effect that you would consider to be nontrivial? Suppose that you decide that if
the drug increases iq by 2 or more points, then that is a nontrivial effect, but if the mean
increase is less than 2 then the effect is trivial.
Now we can test the null of  = 100 versus the alternative of  = 102. Refer to
Figure 15.1 on page 385 of the 7th edition of Howell. Let the left curve represent the
distribution of sample means if the null hypothesis were true,  = 100. This sampling
15
 3 . Let the right curve represent the
distribution has a  = 100 and a  x 
25
sampling distribution if the exact alternative hypothesis is true,  = 102. Its  is 102
15
 3.
and, assuming the drug has no effect on the variance in IQ scores,  x 
25
The dark shaded area in the upper tail of the null distribution is . Assume we
are using a one-tailed  of .05. How large would a sample mean need be for us to
reject the null? Since the upper 5% of a normal distribution extends from 1.645  above
the  up to positive infinity, the sample mean IQ would need be 100 + 1.645(3) =
104.935 or more to reject the null. What are the chances of getting a sample mean of
104.935 or more if the alternative hypothesis is correct, if the drug increases IQ by 2
points? The area under the alternative curve from 104.935 up to positive infinity

Copyright 2013, Karl L. Wuensch - All rights reserved.
Power01.docx
2
represents that probability, which is power. Assuming the alternative hypothesis is true,
that  = 102, the probability of rejecting the null hypothesis is the probability of getting a
sample mean of 104.935 or more in a normal distribution with  = 102,  = 3. Z =
(104.935  102)/3 = 0.98 and P(Z > 0.98) = .1635. That is, power is about 16%. If the
drug really does increase IQ by an average of 2 points, we have a 16% chance of
rejecting the null. If its effect is even larger, we have a greater than 16% chance.
Suppose we consider 5 the minimum nontrivial effect size. This will separate the
null and alternative distributions more, decreasing their overlap and increasing power,
as in Figure 15.2 on page 388. Now, Z = (104.935  105)/3 = 0.02, P(Z > 0.02) =
.5080 or about 51%. It is easier to detect large effects than small effects.
Suppose we conduct a 2-tailed test, since the drug could actually decrease IQ; 
is now split into both tails of the null distribution, .025 in each tail. We shall reject the
null if the sample mean is 1.96 or more standard errors away from the  of the null
distribution. That is, if the mean is 100 + 1.96(3) = 105.88 or more (or if it is 100 
1.96(3) = 94.12 or less) we reject the null. The probability of that happening if the
alternative is correct ( = 105) is: Z = (105.88  105)/3 = .29, P(Z > .29) = .3859, power
= about 39%. We can ignore P(Z < (94.12  105)/3) = P(Z < 3.63) = very, very small.
Note that our power is less than it was with a one-tailed test. If you can correctly
predict the direction of effect, a one-tailed test is more powerful than a two-tailed
test. What would power be if you incorrectly predicted the direction of effect?
Consider what would happen if you increased sample size to 100. Now the
15
x 
 1.5 . With the null and alternative distributions less plump they should
100
overlap less, increasing power, as in Figure 15.3 on page 388. With  x  1.5 the
sample mean will need be 100 + (1.96)(1.5) = 102.94 or more to reject the null. If the
drug increases IQ by 5 points, power is : Z = (102.94  105)/1.5 = 1.37, P(Z > 1.37) =
.9147 or between 91 and 92%. Anything that decreases the standard error will
increase power. This may be achieved by increasing the sample size or by
reducing the  of the dependent variable. The  of the DV may be reduced by
reducing the influence of extraneous variables upon the DV (eliminating “noise” in the
DV makes it easier to detect the signal, the IV’s effect on the DV).
Now consider what happens if you change . Let us reduce  to .01. Now the
sample mean must be 2.58 or more standard errors from the null  before we reject the
null. That is, 100 + 2.58(1.5) = 103.87. Under the alternative Z = (103.87  105)/1.5 =
0.75, P(Z > 0.75) = 0.7734 or about 77%, less than it was with  at .05, ceteris
paribus. Reducing  reduces power.
Please note that all of the above analyses have assumed that we have used a
X  
normally distributed test statistic, as Z 
will be if the criterion variable is
x
normally distributed in the population or if sample size is large enough to invoke the
CLT. Remember that using Z also requires that you know the population  rather than
estimating it from the sample data. We more often estimate the population , using
Student’s t as the test statistic. If N is fairly large, Student’s t is nearly normal, so this is
3
no problem. For example, with a two-tailed  of .05 and N = 25 we went out ± 1.96
standard errors to mark off the rejection region. With Student’s t on N 1 = 24 df we
should have gone out ± 2.064 standard errors. But 1.96 versus 2.06 is relatively trivial,
so we should feel comfortable with the normal approximation. If, however, we had N =
5, df = 4, critical t = ± 2.776, and the normal approximation would not do. A more
complex analysis would be needed.
One Sample Power Analysis the Easy Way
Howell presents a simple method of doing power analyses for various designs.
His method also assumes a normal sampling distribution, so it is most valid with large
sample sizes. The first step in using Howell’s method is to compute d, the effect size
1  
in  units. For the one sample test d 
. For our IQ problem with minimum

nontrivial effect size at 5 IQ points, d = (105  100)/15 = 1/3. We combine d with N to
get . For the one sample test,   d N .
For our IQ problem with N = 25,  = 1/3  5 = 1.67. Once  is obtained, power is
obtained using Table E.5 on page 595 of Howell. For a .05 two-tailed test, power = 36%
for a  of 1.60 and 40% for a  of 1.70. By linear interpolation, power for  = 1.67 is
36% + .7(40%  36%) = 38.8%, within rounding error of the result we obtained using the
normal curve. If you look carefully at the table, you will see that the larger  is, the
greater power is.
For a one-tailed test, use the column in the table with  twice its one-tailed value.
For  of .05 one-tailed, use the .10 two-tailed column. For  = 1.67 power then = 48% +
7(52%  48%) = 50.8%, the same answer we got with the normal curve.
If we were not able to reject the null hypothesis in our research on the putative IQ
drug and our power analysis indicated about 39% power, we would be in an awkward
position. Although we could not reject the null, we also could not accept it, given that
we only had a relatively small (39%) chance of rejecting it even if it were false. We
might decide to repeat the experiment using an N large enough to allow us to accept
the null if we cannot reject it. In my opinion, if 5% is a reasonable risk for a Type One
error (), then 5% is also a reasonable risk for a Type Two error ( ), so let us use
power = 1   = 95%. From the power table, to have power = .95 with  = .05,  is 3.60.
 
 3.6 
N    . For a five-point minimum IQ effect N  
  116.64 . Thus, if we repeat
d 
 1/ 3 
the experiment with 117 subjects and still cannot reject the null, we can accept the null
and conclude that the drug has no nontrivial ( 5 IQ points) effect upon IQ. The null
hypothesis we are accepting here is a “loose null hypothesis” [95 <  < 105] rather
than a “sharp null hypothesis” [ = exactly 100]. Sharp null hypotheses are probably
very rarely ever true.
Others could argue with your choice of the minimum nontrivial effect size. Cohen
has defined a small effect as d = .20, a medium effect as d = .50, and a large effect as d
2
2
4
= .80. If you defined minimum d at .20 you would need even more subjects for 95%
power.
A third approach one can take is to find the smallest effect that one could have
detected with high probability given N. This is known as a “sensitivity analysis.” If that d
is small and the null hypothesis is not rejected, it is accepted. For example, I used 225
subjects in the IQ enhancer study. For power = 95%,  = 3.60 with  .05 two-tailed, and

3.60
d

 0.24 . If I can’t reject the null, I accept it, concluding that if the drug
15
N
has any effect, it is a small effect, since I had a 95% chance of detecting an effect as
small as .24 . The loose null hypothesis accepted here would be that the population 
differs from 100 by less than .24 .
Two Independent Samples
n
2
where n = the number of scores in one group and both groups have the same n;
  2
, where  is the standard deviation in either population, assuming  is
d 1
For the two group independent sampling design,   d

identical in both populations.
If n1  n2 use the harmonic mean sample size, n~ 
2
1 1

n1 n2
.
For a fixed total N, the harmonic mean (and thus power) is higher the more
nearly equal n1 and n2 are. This is one good reason to use equal n designs. Other
good reasons are computational simplicity with equal n’s and greater robustness to
violation of assumptions. Try computing the effective (harmonic) sample size for 100
subjects evenly split into two groups of 50 each and compare that with the effective
sample size obtained if you split them into 10 in one group and 90 in the other.
You should be able to rearrange the above formula for  to solve for d or for n as
required.
Consider the following a priori (done before gathering the data) power analysis.
We wish to compare the mean psychology achievement test score of students in
Research Psychology graduate programs with that of those in Clinical Psychology
graduate programs. We decide that we will be satisfied if we have enough data to have
an 80% chance of detecting an effect of 1/3 of a standard deviation, employing a .05
criterion of significance. How many scores do we need in each group, if we have the
same number of scores in each group? From the table in Howell, we obtain the value of
 
 2.8 
2.8 for  . n  2   2
  141 scores in each group, a total of 282 scores.
d 
 1/ 3 
Consider the following a posteriori (done after gathering the data) power
analysis. We have available only 36 scores from students in clinical programs and 48
2
2
5
scores from students in general programs. What are our chances of detecting a
difference of 40 points (which is that actually observed at ECU way back in 1981) if we
use a .05 criterion of significance and the standard deviation is 98? The standardized
2
~
 41.14
effect size, d, is 40/98 = .408. The harmonic mean sample size is n
1
1

36 48
  .408
41.14
 1.85 . From our power table, power is 46% (halfway between .44 and
2
.48).
Correlated Samples
The correlated samples t test is mathematically equivalent to a one-sample t test
conducted on the difference scores (for each subject, score under one condition less
score under the other condition). The greater ρ12, the correlation between the scores in
the one condition and those in the second condition, the smaller the standard deviation
of the difference scores and the greater the power, ceteris paribus. By the variance
sum law, the standard deviation of the difference scores is  Diff   12   22  2 12 1 2 .
If we assume equal variances, this simplifies to  Diff   2(1   ) .
When conducting a power analysis for the correlated samples design, we can
take into account the effect of ρ12 by computing dDiff, an adjusted value of d:
  2
d
d Diff  1

, where d is the effect size as computed above, with
 Diff
2(1  12 )
independent samples. We can then compute power via   d Diff n or the required
 
sample size via n  
 d Diff
2

 .

Please note that using the standard error of the difference scores, rather than the
standard deviation of the criterion variable, as the denominator of dDiff, is simply
Howell’s method of incorporating into the analysis the effect of the correlation produced
by matching. If we were computing estimated d as an estimate of the standardized
effect size given the obtained results, we would use the standard deviation of the
criterion variable in the denominator, not the standard deviation of the difference scores.
I should admit that on rare occasions I have argued that, in a particular research
context, it made more sense to use the standard deviation of the difference scores in
the denominator of d.
Consider the following a posteriori power analysis. One of my colleagues has
data on level of cortisol in humans under two conditions: 1.) in a relaxed setting, and 2.)
in a setting thought to increase anxiety. The standard deviations in her samples are
108 and 114. If the effect size in the population were 20 points (which she considers
the smallest nontrivial effect), how much power would she have? The estimation
correlation resulting from the within-subjects design is .75.
6
The pooled standard deviation is
.5(1082 )  .5(1142 )  111. d = 20/111 = .18.
.18
 .255 .   .255 15  1.02. From the power table, power = 17%.
2(1  .75)
My colleague is not very happy about this low power. I point out that maybe she is
lucky and has such a big difference in means that it is significant despite the low power.
Regretfully, that is not so. She starts planning to repeat the study with a larger sample
size.
d Diff 
How much data will my colleague need? She wants to have enough power to be
able to detect an effect as small as 20 points of a standard deviation (d = .18) with 95%
power – she considers Type I and Type II errors equally serious and is employing a .05
criterion of statistical significance, so she wants beta to be not more than .05. .
 
n  
 d Diff
2

 3.6 
  
  199.3.
.
255



2
Here is another example of power analysis with matched pairs.
Addendum
What would power be if you incorrectly predicted the direction of effect? For
example, for the current design, suppose that we (in our alternative hypothesis)
predicted that the drug would increase IQ but, in fact, it decreases IQ (µ is less than
100). The null hypothesis is that µ is 100 or less. The one-tailed critical value for mean
IQ is still 100 + 1.645(3) = 104.935 or more to reject the null. If the effect of the drug
were 5 points, but 5 points below 100 instead of above 100, the probability of rejecting
the null would be the probability of getting a z greater or equal to (104.935 – 95)/3; z =
2.31; probability = 1%. This probability is not, however, power. Power is the probability
of rejecting the null given that the null is false. If, however, the alternative hypothesis
has incorrectly predicted the direction, then the null has correctly predicted the direction,
and the null cannot be false, that is, you could never correctly reject the null.
Copyright 2013, Karl L. Wuensch - All rights reserved.