Created on January 15, 2003 by A.R. Brazzale
FOR INTERNAL USE ONLY
APPENDIX

A The Logic of Power Calculations

A.1 What is “power”?

The concept of “power” originates from the theory of statistical tests. The purpose of hypothesis testing is to verify the plausibility of the null hypothesis in the light of experimental data. Depending on the outcome, the null hypothesis either will or will not be rejected. By power we mean the probability of correctly rejecting a false null hypothesis. The primary purpose of power calculations is to determine the sample size that allows us to achieve the desired level of power.

To make the above statements clearer, we have to go back to the basics of the theory of statistical tests.

A.2 Hypothesis testing

A.2.1 Significance tests

The common approach to significance testing is to start with a hypothesis about a population parameter, called the null hypothesis. Data are then collected, and the plausibility of the null hypothesis is determined in light of the data via a suitable test statistic. The measure used to assess how different the data are from what would be expected under the assumption that the null hypothesis is true is the p-value. It is defined as the probability of observing a value of the test statistic that is at least as unlikely under the null hypothesis as the observed one. The criterion used for rejecting the null hypothesis is the significance level, indicated by the Greek letter α. Traditionally, either the 0.05 level (called the 5% level) or the 0.01 level (the 1% level) is used, although the choice of level is largely subjective. The p-value is compared with the significance level: if it is less than or equal to α, the null hypothesis is rejected; otherwise the null hypothesis is not rejected.¹ When the null hypothesis is rejected, the outcome is said to be “statistically significant”; otherwise it is said to be “statistically not significant”.

The lower the significance level, the more the data must diverge from the null hypothesis to be significant. Therefore, the 1% level is more conservative than the 5% level.

Example 1.0: Suppose that a pharmaceutical company wants to assess the effect of a new drug it produces. The null hypothesis to be tested is that µ2 - µ1 = 0, where µ1 is the mean level of blood pressure of patients treated with a well-established drug and µ2 is the mean level for patients treated with the new drug. Thus, the null hypothesis concerns the parameter δ = µ2 - µ1, and states that this parameter equals zero. The significance level is fixed at 0.05.

A possible experimental design² for verifying the above null hypothesis is to administer the new drug, after a suitable wash-out period has passed, to a group of N patients who achieved good results with the traditional treatment. Denote by yi1 and yi2 the blood pressure measured respectively before and after the new treatment for the ith patient, and let di = yi2 - yi1 be the corresponding change in response. Suppose it is sensible to assume a Gaussian model for the distribution of blood pressure in both periods. Then, under the null hypothesis, di has a normal distribution with mean 0 and some unknown standard deviation σd. Let d̄ and sd be the sample mean and standard deviation of the differences di. Under the null hypothesis, and provided that the N subjects are independent, the statistic t = √N · d̄ / sd has a Student t distribution with N-1 degrees of freedom.³

Example 1.1: Suppose that for a group of N = 40 patients we obtained the values d̄ = 3.67 and sd = 12.42. The observed value of the test statistic is toss = 1.87. The probability of observing a result at least as extreme as the observed one is Pr( T ≥ toss ) = 0.035, where T has a Student t distribution with N-1 = 39 degrees of freedom. The null hypothesis is hence rejected at the specified 5% level.

Example 1.1 (cont.): Note that if the significance level in the above example were 0.01, we would not reject the null hypothesis. That is, at the 5% level the evidence provided by the data is strong enough to say that the value d̄ = 3.67 is not simply due to randomness, but originates from a distribution with mean larger than zero. At the 1% level, we cannot exclude that d̄ = 3.67 is a random observation from a central Student t distribution with 39 degrees of freedom.

A.2.2 Null hypothesis and alternative hypothesis

In order to compute the p-value we need to identify the values of the test statistic that are “at least as unlikely under the null hypothesis as the observed one”. Hence, in hypothesis testing we are not only asked to formulate the null hypothesis, but must also have an idea of which hypothesis, called the alternative hypothesis, the null hypothesis should be contrasted with. Traditionally, the symbols H0 and H1 are used to indicate respectively the null hypothesis and the alternative hypothesis.

Example 1.2 (cont.): For calculating the p-value in Example 1.1 we considered all values larger than toss. This is because the new drug is supposed to be as effective in lowering the blood pressure as the established one. Implicitly, we identified failure of the new treatment with its inability to lower the blood pressure. If this is the case, the mean δ of the differences di = yi2 - yi1 would not be zero, but larger.⁴ The notation is: H0: δ = 0 against H1: δ > 0. Under H1 the t statistic would still follow a Student t distribution with N-1 degrees of freedom, but centered at a value larger than zero. It follows that the values that are unlikely under the null hypothesis, but likely if it fails, lie in the right-hand tail of the distribution, as shown in Figure a). The p-value we computed is a one-tailed p-value and corresponds to the shaded area in the figure.

1. Note that not rejecting a null hypothesis does not automatically imply that we accept it. It only means that the data do not provide enough evidence to reject it. An experiment might yield evidence sufficient to reject the null hypothesis, but no experiment can demonstrate that the null hypothesis is true, as the true value of the population parameter remains unknown.
2. This is for illustrative purposes only; better designs are available.
3. This test statistic is known as the paired t test.
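The paired t statistic and one-tailed p-value of Example 1.1 can be reproduced in a few lines. The sketch below uses illustrative names of my own; the tail area is computed with the normal approximation available in the Python standard library, whereas the exact value 0.035 quoted in the text comes from the Student t distribution with 39 degrees of freedom (which would require, e.g., SciPy).

```python
# Paired t statistic of Example 1.1 (illustrative sketch; names are not
# from the text). The normal approximation is close but not identical to
# the Student t with 39 df used in the text.
from math import sqrt
from statistics import NormalDist

def paired_t(dbar, s_d, n):
    # t = sqrt(N) * dbar / s_d, as defined in the text
    return sqrt(n) * dbar / s_d

t_obs = paired_t(3.67, 12.42, 40)
p_one = 1.0 - NormalDist().cdf(t_obs)  # one-tailed p-value, H1: delta > 0

print(round(t_obs, 2))   # 1.87, matching the text
print(round(p_one, 3))   # about 0.031 (normal approximation to the text's 0.035)
```

The approximation error is entirely in the tail area: the statistic itself, 1.87, is exact.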
The null hypothesis is the one we are interested in. It is quite often a hypothesis of no difference, as in our guiding example. That is why the word "null" is used, although there are many occasions when the parameter is not hypothesized to be 0. The alternative hypothesis includes all possible distributions except the null. For the example just given, these are the distributions that are shifted to the right with respect to the curve given in Figure a). Some examples are given in Figure b), where the bold curve corresponds to the null distribution.

Although the interest usually focuses on the null hypothesis, the formulation of the alternative hypothesis is important to determine how to calculate the p-value.

Example 1.3 (cont.): Suppose that the new drug may not only fail to lower the blood pressure of the patients (δ > 0), but may also lower it too much (δ < 0). In this case H0: δ = 0, while H1: δ ≠ 0. Under the second scenario (δ < 0) we may observe values that fall into the left-hand tail of the reference distribution. As shown in Figure c), the p-value is now given by the sum of the two tail probabilities identified by the shaded area, i.e. Pr( |T| ≥ toss ) = 2 Pr( T ≥ toss ) = 0.070, where T still has a Student t distribution with 39 degrees of freedom. This time the null hypothesis cannot be rejected at the 5% level. Instead of a one-tailed p-value we have used a two-tailed p-value.

Example 1.4 (cont.): Suppose that a second experiment on a further N = 40 patients yielded the values d̄ = 4.37 and sd = 9.1. The observed value of the test statistic is toss = 3.04. The two-sided p-value is now 2 Pr( T ≥ toss ) = 0.0042. The data are highly significant, and the null hypothesis can be rejected at the 1% level. Which experiment should we trust?

It may happen that the null hypothesis is the reverse of what the experimenter actually believes: researchers very frequently put forward a null hypothesis in the hope that they can discredit it. To make this clearer, let us slightly change our guiding example.

Example 2.0: Suppose that a researcher is interested in whether the time to respond to a tone is affected by the consumption of alcohol. The null hypothesis again concerns the parameter δ = µ2 - µ1, where µ2 is the mean time to respond after consuming alcohol and µ1 is the mean time to respond otherwise, and the null hypothesis is that the parameter equals zero. However, the experimenter probably expects alcohol to have a harmful effect. If the experimental data show a sufficiently large effect of alcohol, then the null hypothesis that alcohol has no effect can be rejected. In this view, making a Type II error is more serious than making a Type I error.

A.2.3 Type I and Type II errors

Hypothesis testing is a method of inferential statistics and hence subject to randomness. There are two kinds of errors that may occur: (1) a true null hypothesis can be incorrectly rejected, and (2) a false null hypothesis can fail to be rejected.

If we reject the null hypothesis when it is true, we make a Type I error. The probability of making a Type I error is denoted by α. Note that it is the same as the significance level of the test. If we do not reject the null hypothesis when the alternative hypothesis is true, we make a Type II error. The probability of making a Type II error is denoted by β. The possible scenarios are summarized in the table below.

                            True state of null hypothesis
  Statistical decision      H0 true               H0 false
  Reject H0                 Type I error (α)      Correct
  Do not reject H0          Correct               Type II error (β)

A Type II error is only an error in the sense that an opportunity to reject the null hypothesis correctly was lost. It is not an error in the sense that an incorrect conclusion was drawn, since no conclusion is drawn when the null hypothesis is not rejected. A Type I error, on the other hand, is an error in every sense of the word: a conclusion is drawn that the null hypothesis is false when, in fact, it is true. This explains why Type I errors are generally considered more serious than Type II errors, and why the probability of a Type I error, the significance level, is set by the experimenter.

A.2.4 Power

As mentioned in the beginning, the power of a test is the probability of correctly accepting the alternative hypothesis when it is true. This probability equals 1-β, that is, the complement of the Type II error rate. If the power of an experiment is low, then there is a good chance that the experiment will be inconclusive. That is why it is so important to consider power in the design of experiments. Several methods are available for estimating the power of an experiment before the experiment is conducted.

Example 1.5 (cont.): Suppose that for the hypertensive drug example we know that the true mean and standard deviation of the di's are respectively δ = 2 and σd = 10. What is the probability of erroneously accepting the null hypothesis H0: δ = 0 against H1: δ > 0 if α = 0.05 and N = 40? By construction, the null hypothesis is not rejected at the 5% level for values of the t statistic smaller than 1.685, the 95% quantile of a central Student t distribution with 39 degrees of freedom. The Type II error probability is the probability of the acceptance region, but computed under H1. Under the alternative hypothesis that δ = 2, the distribution of the test statistic is shifted to the right, as shown in Figure d). That is, β = Pr( T∆ ≤ 1.685 ), where now T∆ is a Student t distribution with N-1 = 39 degrees of freedom and non-centrality parameter ∆ = √N δ/σd = 1.26. It follows that β = 0.658 and, hence, that the power is equal to 1 - β = 0.342. This corresponds to the shaded area in the figure.

Example 1.6 (cont.): In the above example, what is the probability of erroneously accepting the null hypothesis H0: δ = 0 against H1: δ ≠ 0? Now the acceptance region is the interval (-2.02, 2.02), which covers 95% of the probability of a central Student t distribution with 39 degrees of freedom. Under the alternative hypothesis, the distribution of the test statistic may be shifted either to the right (δ > 0) or to the left (δ < 0) of the reference distribution. The Type II error probability becomes β = Pr( |T∆| ≤ 2.02 ), where T∆ is again a Student t distribution with N-1 = 39 degrees of freedom centered at ∆ = 1.26. We have that β = 0.766, while the power is 1 - β = 0.234.

Example 1.7 (cont.): Let us suppose again that δ = 2, σd = 10 and N = 40, and that we want to test H0: δ = 0 against H1: δ > 0. The significance level α this time is 1%. By construction, the null hypothesis is not rejected for values of the t statistic smaller than 2.43, the 99% quantile of a central Student t distribution with 39 degrees of freedom. The Type II error probability becomes β = Pr( T∆ ≤ 2.43 ) = 0.867, where T∆ is a Student t distribution with N-1 = 39 degrees of freedom centered at ∆ = 1.26. The power is equal to 1 - β = 0.133.

In the above two examples (1.5 and 1.6) we computed the power of the test statistic for the particular choice δ = 2 and σd = 10. In fact, it is only possible to derive the power for alternative hypotheses that completely specify the distribution of the test statistic. As shown in Figure b), to each given value of ∆ there corresponds a different distribution. The farther away ∆ is from zero, the more the distribution is shifted to the right and, consequently, the larger the power becomes.

One-sided tests are more powerful than two-sided tests, as it is easier to reject the null hypothesis when it is not true.

A.3 Power calculations

The main motivation for power calculations is the trade-off between Type I and Type II errors. The more an experimenter wants to protect him or herself against Type I errors by choosing a low significance level α, the greater is the chance of committing a Type II error.

A.3.1 ROC curves

Requiring very strong evidence to reject the null hypothesis makes it very unlikely that a true null hypothesis will be rejected. However, it increases the chance that a false null hypothesis will not be rejected. A graphical way of summarizing the Type I and Type II error rates of a particular test statistic is a ROC (receiver operating characteristic) curve, as shown in Figure e). The x-axis reports the Type I error rate (or significance level) α, while on the y-axis we find 1-β, that is, the power of the test statistic. Note that one- and two-tailed tests have the same Type I error rate but differ in power. The farther away the curve is from the straight line (corresponding to a random guess), the better the test statistic is.⁵

If the power is too low, then the experiment can be re-designed by changing one of the factors that determine power.

A.3.2 Factors that affect power

The factors affecting power will be illustrated in terms of the hypertensive drug example. Notice that the same considerations apply to test statistics in general.

a) Difference between population means

The size of the difference between the population means, δ in our example, is an important factor in determining the power of a test statistic. Naturally, the more the means differ from each other, the easier it is to detect the difference. As shown in Figure f), the farther the value of δ is from zero, keeping the standard deviation σd fixed (hence the larger ∆ is), the smaller is the Type II error rate β and, therefore, the larger is the power for a fixed value of α. The experimenter usually has little control over the size of the effect, although sometimes he/she can choose the levels of the independent variable to increase the size of the effect.

4. Unless the new drug proves to be better than the established one, but this is another story.
5. This corresponds to finding the optimal test statistic, that is, the test statistic that minimizes the Type II error rate for a given significance level. However, as this problem is very hard to solve, optimal tests only exist for very special problems. One example is the paired t test statistic used as the guiding example in the text.
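The power computations of Examples 1.5 and 1.7 can be mimicked numerically. This is a sketch under a normal approximation (the Python standard library has no Student t distribution), so the results come out slightly larger than the exact values in the text; the function name is illustrative, not from the text.

```python
# Approximate power of the one-sided paired t test (Examples 1.5 and 1.7).
# Sketch only: the normal distribution replaces the Student t with 39 df,
# so results differ slightly from the text's exact values.
from math import sqrt
from statistics import NormalDist

def power_one_sided(delta, sigma_d, n, alpha):
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha)    # rejection cut-off under H0
    ncp = sqrt(n) * delta / sigma_d     # shift Delta = sqrt(N) * delta / sigma_d
    return 1.0 - nd.cdf(z_crit - ncp)   # Pr(test rejects | H1 true)

print(round(power_one_sided(2, 10, 40, 0.05), 3))  # about 0.352 (text: 0.342)
print(round(power_one_sided(2, 10, 40, 0.01), 3))  # about 0.144 (text: 0.133)
```

As in the text, lowering α from 5% to 1% roughly halves the power for this alternative.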
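The ROC idea of Figure e), pairing each significance level α with the resulting power 1-β, can also be sketched for a fixed alternative. Again this is a normal-approximation sketch with illustrative names and the hypertensive drug parameters.

```python
# Points on a ROC curve in the sense of Figure e): (alpha, power) for a
# fixed alternative (delta = 2, sigma_d = 10, N = 40). Normal-approximation
# sketch; names are illustrative, not from the text.
from math import sqrt
from statistics import NormalDist

def roc_point(alpha, delta=2.0, sigma_d=10.0, n=40):
    nd = NormalDist()
    shift = sqrt(n) * delta / sigma_d
    return alpha, 1.0 - nd.cdf(nd.inv_cdf(1.0 - alpha) - shift)

for a in (0.01, 0.05, 0.10, 0.25):
    alpha, pw = roc_point(a)
    print(f"alpha = {alpha:.2f}   power = {pw:.3f}")
# The curve lies above the diagonal power = alpha (a random guess), and
# power grows with alpha: fewer Type II errors at the cost of more Type I
# errors, which is exactly the trade-off discussed in the text.
```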
b) Standard deviation

The larger the standard deviation σd, the lower the power. Suppose we keep δ fixed at 2 and let σd assume the values 10, 15, 20 and 30. The corresponding ROC curves are shown in Figure g). There are ways that an experimenter can reduce variance and so increase power. One is to define a relatively homogeneous population. A second way to reduce variance is by using a within-subjects design. In these designs, the overall level of performance of each person is subtracted out.

c) Significance level

A further important factor affecting power is the significance level α. The more conservative (lower) the chosen α, the lower the power. As we saw in Example 1.7, using the 1% level will result in lower power than using the 5% level. The cost of stronger protection against Type I errors is more Type II errors.

d) Sample size

Increasing N, the sample size, increases power. In fact, the cut-off point and the variance of the distribution of the test statistic change as N increases, while the mean of the distribution does not. The effect of sample size on power for the hypertensive drug example is seen in more detail in Figure h). Choosing a sample size is a difficult decision. There is a trade-off between the gain in power and the time and cost of testing a large number of subjects.

e) Other factors

Several other factors affect power. One factor is whether the population distribution is normal. Deviations from the assumption of normality usually lower power. A second factor affecting power is the type of statistical procedure used. Some of the distribution-free tests are less powerful than other tests when the distribution is normal, but more powerful when the distribution is highly skewed. One-tailed tests are more powerful than two-tailed tests as long as the effect is in the expected direction; otherwise, their power is practically zero. The choice of an experimental design can have a profound effect on power. Within-subject designs are usually much more powerful than between-subject designs.

A.3.3 Sample size calculations

Although there are several factors that may affect the power of a test statistic, most of them cannot be easily influenced by the experimenter. The difference of sample means δ is usually the quantity he/she is interested in detecting, rather than a means to regulate the power of the test statistic. The significance level α is usually kept fixed at one of the two common values, .05 or .01. Also, it may not always be possible to reduce the standard deviation σd by changing the experimental design. The only quantity that can easily be adjusted is the sample size N. Once the power one wishes to achieve has been specified, the sample size needed for that level of power can be estimated. This is exactly what power calculations do.

Example 1.8 (cont.): Given the significance level α, let us determine the sample size N that gives a power of 1-β for testing H0: δ = 0 against H1: δ > 0 in the hypertensive drug example. As mentioned several times, the null hypothesis is not rejected at the level α if the observed value of the t statistic is smaller than z1-α, the 1-α quantile of the central Student t distribution with N-1 degrees of freedom. Power is defined as 1-β = Pr( T∆ ≥ z1-α ), where T∆ is a Student t distribution with N-1 degrees of freedom centered at ∆ = √N δ/σd. This yields an equation in N whose solution is approximated by

N = (zα + zβ)² σd² / δ² = (zα + zβ)² / CV²,        (*)

where CV = δ/σd is the coefficient of variation of the di's.⁶ If instead of performing a one-sided test we performed a two-sided test (H1: δ ≠ 0), formula (*) would change into

N = (zα/2 + zβ)² σd² / δ² = (zα/2 + zβ)² / CV².        (**)

As mentioned at the end of Section A.2.4, two-sided tests are less powerful than one-sided ones, and a larger sample size is consequently needed to achieve the same level of power.

6. The details are omitted.

As illustrated by the above example, the determination of the sample size depends on the specified significance level α and the desired power 1-β through the two quantiles zα and zβ, but also on the effect size, measured in terms of the mean δ and the standard deviation σd of the response. Hence, the first and most difficult step in choosing the sample size is to estimate the size of the effect. Estimating the effect size includes estimating the population variance as well as the population means. Information may be retrieved from different sources.

1. If there are published experiments similar to the one to be conducted, then the effects obtained in these published studies can be used as a guide. There is, however, a need for caution, since there is a tendency for published studies to contain overestimates of effect sizes. Often previous studies are not sufficiently similar to a new study to provide a valid basis for estimating the effect size. In this case, it is possible to specify the minimum effect size that is considered important.
2. Frequently, it is easiest to specify the effect size in terms of the number of standard deviations separating the population means. Thus, one might find it easier to estimate that the population mean for the experimental group is 0.5 standard deviations above the population mean for the control group, that is, to estimate the coefficient of variation CV, than to estimate the two population means and the population variance.

As shown by formulae (*) and (**), the power remains the same for any experiment in which the coefficient of variation CV is constant, that is, where the difference between the population means is the same number of population standard deviations apart. Power curves are used to report the power of the test statistic as a function of the different values of the parameter under the alternative hypothesis. Figure i) reports the power curves for testing H0: δ = 0 against H1: δ > 0 in the hypertensive drug example. On the x-axis we put the effect size defined through the coefficient of variation CV, while the y-axis gives the sample size N. The curves correspond to the powers 0.5, 0.8, 0.85, 0.9, 0.95 and 0.99; the significance level α is 5%.

A.4 Final remarks

It is important to keep in mind that power is not about whether or not the null hypothesis is true; in fact, it is assumed to be false. The focus is on the probability that the data gathered in an experiment will be sufficient to reject the null hypothesis. Power calculations try to answer the question: if the null hypothesis is false, with specified population means and standard deviation, what is the probability that the data from the experiment will be sufficient to reject the null hypothesis? If the experimenter discovers that the probability of rejecting the null hypothesis is low (low power) even if the null hypothesis is false to the degree expected, then it is likely that the experiment should be redesigned. Otherwise, considerable time and expense will go into a project that has a small chance of being conclusive even if the theoretical ideas behind it are correct.

A question that arises spontaneously is: how much power is enough power? Obviously, the more power the better. However, in some experiments it is very time-consuming and expensive to run each subject. In these experiments, the experimenter usually must accept less power than is typically found in experiments in which subjects can be run cheaply and easily. In any case, power below 0.25 would almost always be considered too low, and power above 0.80 would be considered satisfactory. Keep in mind that a Type II error is not necessarily so bad, since a failure to reject the null hypothesis does not mean that the research hypothesis should be abandoned. If the results are suggestive, further experiments should be conducted to see if the existence of the effect can be confirmed. The power of the two experiments taken together will be greater than the power of either one.

Last but not least, one should keep in mind that power calculations are often based upon approximations. These may be approximations to the solution of the equation defining the sample size, as was the case in Example 1.8. Or we may need to approximate, and more generally simplify, the reference model. This is for instance the case if power calculations are to be performed for a regression model. As an example, take a between-subjects design where the interest focuses on a particular covariate, while the remaining ones are included to account for subject-specific response patterns. In these situations it is quite common to resort to a simpler model that allows one to do the calculations necessary to identify the sample size.
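As a closing illustration, formulae (*) and (**) are easy to put to work. The sketch below uses standard normal quantiles, in line with the approximation of Example 1.8, and illustrative names of my own; it computes the required sample size and then tabulates it over a grid of effect sizes CV and target powers, in the spirit of Figure i).

```python
# Sample size from formulae (*) and (**): N = (z_alpha + z_beta)^2 / CV^2,
# where z_a denotes the upper-a standard normal quantile, as in the text's
# approximation. Function and variable names are illustrative.
from math import ceil
from statistics import NormalDist

def sample_size(cv, alpha, power, two_sided=False):
    nd = NormalDist()
    z_a = nd.inv_cdf(1.0 - (alpha / 2.0 if two_sided else alpha))
    z_b = nd.inv_cdf(power)            # z_beta, since power = 1 - beta
    return ceil((z_a + z_b) ** 2 / cv ** 2)

# CV = delta/sigma_d = 0.2, 5% level, target power 0.80:
print(sample_size(0.2, 0.05, 0.80))                   # 155
print(sample_size(0.2, 0.05, 0.80, two_sided=True))   # 197: two-sided needs more

# A small table in the spirit of Figure i): rows are target powers,
# columns are effect sizes CV; entries are the required N (alpha = 5%).
for target in (0.5, 0.8, 0.9, 0.95):
    row = [sample_size(cv, 0.05, target) for cv in (0.2, 0.4, 0.6, 0.8)]
    print(target, row)
```

The table confirms the qualitative points of Section A.3.2: larger effects need fewer subjects, and higher target power needs more.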