Created on January 15, 2003 by A.R. Brazzale. FOR INTERNAL USE ONLY.

APPENDIX A
The Logic of Power Calculations

A.1 What is "power"?

The concept of "power" originates from the theory of statistical tests. The purpose of hypothesis testing is to verify the plausibility of the null hypothesis in the light of experimental data. Depending on the outcome, the null hypothesis either will or will not be rejected. By power we mean the probability of correctly rejecting a false null hypothesis. The primary purpose of power calculations is to determine the sample size that allows us to achieve the desired level of power.

To make these statements clearer we have to go back to the basics of the theory of statistical tests.

A.2 Hypothesis testing

A.2.1 Significance tests

The common approach to significance testing is to start with a hypothesis about a population parameter, called the null hypothesis. Data are then collected, and the plausibility of the null hypothesis is assessed in the light of the data via a suitable test statistic. The measure used to quantify how different the data are from what would be expected if the null hypothesis were true is the p-value. It is defined as the probability of observing a value of the test statistic that is at least as unlikely under the null hypothesis as the observed one. The criterion used for rejecting the null hypothesis is the significance level, indicated by the Greek letter α. Traditionally, either the 0.05 level (called the 5% level) or the 0.01 level (the 1% level) is used, although the choice of level is largely subjective. The p-value is compared with the significance level: if it is less than or equal to α, the null hypothesis is rejected; otherwise it is not rejected. (Note that not rejecting a null hypothesis does not automatically imply that we accept it. It only means that the data do not provide enough evidence to reject it. An experiment might yield evidence sufficient to reject the null hypothesis, but no experiment can demonstrate that the null hypothesis is true, as the true value of the population parameter remains unknown.) When the null hypothesis is rejected, the outcome is said to be "statistically significant"; otherwise it is said to be "statistically not significant".

Example 1.0: Suppose that a pharmaceutical company wants to assess the effect of a new hypertensive drug it produces. The null hypothesis to be tested is that µ2 - µ1 = 0, where µ1 is the mean blood pressure of patients treated with a well-established drug and µ2 is the mean level for patients treated with the new drug. Thus the null hypothesis concerns the parameter δ = µ2 - µ1, and states that this parameter equals zero. The significance level is fixed at 0.05.

Example 1.1: A possible experimental design for verifying the above null hypothesis is to administer the new drug to a group of N patients who achieved good results with the traditional treatment, after a suitable wash-out period has passed. (This design is for illustrative purposes only; better designs are available.) Denote by yi1 and yi2 the blood pressure measured respectively before and after the new treatment for the i-th patient, and let di = yi2 - yi1 be the corresponding change in response. Suppose it is sensible to assume a Gaussian model for the distribution of blood pressure in both periods. Then, under the null hypothesis, di has a normal distribution with mean 0 and some unknown standard deviation σd. Let d̄ and sd be the sample mean and standard deviation of the differences di. Under the null hypothesis, and provided the N subjects are independent, the statistic t = √N d̄ / sd has a Student t distribution with N-1 degrees of freedom. (This test statistic is known as the paired t test.) Suppose that for a group of N = 40 patients we obtained the values d̄ = 3.67 and sd = 12.42. The observed value of the test statistic is t_obs = 1.87. The probability of observing a result at least as extreme as the observed one is Pr(T ≥ t_obs) = 0.035, where T has a Student t distribution with N-1 = 39 degrees of freedom. The null hypothesis is hence rejected at the specified 5% level.

The lower the significance level, the more the data must diverge from the null hypothesis to be significant. The 1% level is therefore more conservative than the 5% level.

Example 1.1 (cont.): Note that if the significance level in the above example were 0.01, we would not reject the null hypothesis. That is, at the 5% level the evidence provided by the data is strong enough to say that the value d̄ = 3.67 is not simply due to randomness, but originates from a distribution with mean larger than zero. At the 1% level, we cannot exclude that d̄ = 3.67 is a random observation from a central Student t distribution with 39 degrees of freedom.

A.2.2 Null hypothesis and alternative hypothesis

In order to compute the p-value we need to identify the values of the test statistic that are "at least as unlikely under the null hypothesis as the observed one". Hence, in hypothesis testing we are not only asked to formulate the null hypothesis, but must also have an idea of which hypothesis, called the alternative hypothesis, the null hypothesis is to be contrasted with. Traditionally the symbols H0 and H1 are used to indicate, respectively, the null hypothesis and the alternative hypothesis.

Example 1.2 (cont.): For calculating the p-value in Example 1.1 we considered all values larger than t_obs. This is because the new drug is supposed to be at least as effective in lowering the blood pressure as the established one. Implicitly, we identified failure of the new treatment with its inability to lower the blood pressure. If this is the case, the mean δ of the differences di = yi2 - yi1 would not be zero, but larger (unless the new drug proves to be better than the established one, but this is another story). The notation is H0: δ = 0 against H1: δ > 0. Under H1 the t statistic still follows a Student t distribution with N-1 degrees of freedom, but centered at a value larger than zero. It follows that the values that are unlikely under the null hypothesis, but likely if it fails, lie in the right-hand tail of the distribution, as shown in Figure a). The p-value we computed is a one-tailed p-value and corresponds to the shaded area in the figure.

The null hypothesis is the one we are interested in. It is quite often a hypothesis of no difference, as in our guiding example; that is why the word "null" is used, although there are many occasions when the parameter is not hypothesized to be 0. The alternative hypothesis includes all possible distributions except the null. For the example just given, these are the distributions that are shifted to the right with respect to the curve in Figure a). Some examples are given in Figure b), where the bold curve corresponds to the null distribution.

Although interest usually focuses on the null hypothesis, the formulation of the alternative hypothesis is important because it determines how the p-value is calculated.

Example 1.3 (cont.): Suppose that the new drug may not only fail to lower the blood pressure of the patients (δ > 0) but may also lower it too much (δ < 0). In this case H0: δ = 0, while H1: δ ≠ 0. Under the second scenario (δ < 0) we may observe values that fall into the left-hand tail of the reference distribution. As shown in Figure c), the p-value is now given by the sum of the two tail probabilities identified by the shaded areas, i.e. 2 Pr(|T| ≥ t_obs) = 0.070, where T again has a Student t distribution with 39 degrees of freedom. This time the null hypothesis cannot be rejected at the 5% level. Instead of a one-tailed p-value we have used a two-tailed p-value.

A.2.3 Type I and Type II errors

Hypothesis testing is a method of inferential statistics and hence subject to randomness. Two kinds of errors may occur: (1) a true null hypothesis can be incorrectly rejected, and (2) a false null hypothesis can fail to be rejected.

Example 1.4 (cont.): Suppose that a second experiment on a further N = 40 patients yielded the values d̄ = 4.37 and sd = 9.1. The observed value of the test statistic is t_obs = 3.04, with Pr(T ≥ t_obs) = 0.0021. The data are highly significant, and the null hypothesis can be rejected at the 1% level. Which experiment should we trust?

If we reject the null hypothesis when it is true, we make a Type I error. The probability of making a Type I error is denoted by α; note that it is the same as the significance level of the test. If we do not reject the null hypothesis when the alternative hypothesis is true, we make a Type II error. The probability of making a Type II error is denoted by β. The possible scenarios are summarized in the table below.

                          True state of the null hypothesis
  Statistical decision    H0 true             H0 false
  Reject H0               Type I error (α)    Correct
  Do not reject H0        Correct             Type II error (β)

A Type II error is only an error in the sense that an opportunity to reject the null hypothesis correctly was lost. It is not an error in the sense that an incorrect conclusion was drawn, since no conclusion is drawn when the null hypothesis is not rejected. A Type I error, on the other hand, is an error in every sense of the word: a conclusion is drawn that the null hypothesis is false when, in fact, it is true. This explains why Type I errors are generally considered more serious than Type II errors, and why the probability of a Type I error, the significance level, is set by the experimenter.

It may happen that the null hypothesis is the reverse of what the experimenter actually believes: researchers very frequently put forward a null hypothesis in the hope of discrediting it. To make this clearer, let us slightly change our guiding example.

Example 2.0: Suppose that a researcher is interested in whether the time to respond to a tone is affected by the consumption of alcohol. The null hypothesis again concerns the parameter δ = µ2 - µ1, where µ2 is the mean time to respond after consuming alcohol and µ1 is the mean time to respond otherwise, and it states that the parameter equals zero. However, the experimenter probably expects alcohol to have a harmful effect. If the experimental data show a sufficiently large effect of alcohol, then the null hypothesis that alcohol has no effect can be rejected. In this view, making a Type II error is more serious than making a Type I error.

A.2.4 Power

As mentioned at the beginning, the power of a test is the probability of correctly accepting the alternative hypothesis when it is true. This probability equals 1-β, that is, the complement of the Type II error rate. If the power of an experiment is low, there is a good chance that the experiment will be inconclusive. That is why it is so important to consider power in the design of experiments. Several methods are available for estimating the power of an experiment before it is conducted.

Example 1.5 (cont.): Suppose that for the hypertensive drug example we know that the true mean and standard deviation of the di's are respectively δ = 2 and σd = 10. What is the probability of erroneously accepting the null hypothesis H0: δ = 0 against H1: δ > 0 if α = 0.05 and N = 40? By construction, the null hypothesis is not rejected at the 5% level for values of the t statistic smaller than 1.685, the 95% quantile of a central Student t distribution with 39 degrees of freedom. The Type II error probability is the probability of the acceptance region, but computed under H1. Under the alternative hypothesis that δ = 2, the distribution of the test statistic is shifted to the right, as shown in Figure d). That is, β = Pr(T∆ ≤ 1.685), where now T∆ has a Student t distribution with N-1 = 39 degrees of freedom and non-centrality parameter ∆ = √N δ/σd = 1.26. It follows that β = 0.658 and, hence, that the power is 1 - β = 0.342. This corresponds to the shaded area in the figure.

Example 1.6 (cont.): In the above example, what is the probability of erroneously accepting the null hypothesis H0: δ = 0 against H1: δ ≠ 0? Now the acceptance region is the interval (-2.02, 2.02), which covers 95% of the probability of a central Student t distribution with 39 degrees of freedom. Under the alternative hypothesis, the distribution of the test statistic may be shifted either to the right (δ > 0) or to the left (δ < 0) of the reference distribution. The Type II error probability becomes β = Pr(|T∆| ≤ 2.02), where T∆ again has a Student t distribution with N-1 = 39 degrees of freedom and non-centrality parameter ∆ = 1.26. We obtain β = 0.766, while the power is 1 - β = 0.234.

One-sided tests are more powerful than two-sided tests, as it is easier to reject the null hypothesis when it is not true.

In the above two examples (1.5 and 1.6) we computed the power of the test statistic for the particular choice δ = 2 and σd = 10. In fact, it is only possible to derive the power for alternative hypotheses that completely specify the distribution of the test statistic. As shown in Figure b), to each value of ∆ there corresponds a different distribution. The farther ∆ is from zero, the more the distribution is shifted to the right and, consequently, the larger the power becomes.

A.3 Power calculations

A.3.1 ROC curves

The main motivation for power calculations is the trade-off between Type I and Type II errors. The more an experimenter wants to protect him- or herself against Type I errors by choosing a low significance level α, the greater is the chance of committing a Type II error.

Example 1.7 (cont.): Let us suppose again that δ = 2, σd = 10 and N = 40, and that we want to test H0: δ = 0 against H1: δ > 0. The significance level α this time is 1%. By construction, the null hypothesis is not rejected for values of the t statistic smaller than 2.43, the 99% quantile of a central Student t distribution with 39 degrees of freedom. The Type II error probability becomes β = Pr(T∆ ≤ 2.43) = 0.867, where T∆ has a Student t distribution with N-1 = 39 degrees of freedom and non-centrality parameter ∆ = 1.26. The power is 1 - β = 0.133.

Requiring very strong evidence to reject the null hypothesis makes it very unlikely that a true null hypothesis will be rejected. However, it increases the chance that a false null hypothesis will not be rejected.

A graphical way of summarizing the Type I and Type II error rates of a particular test statistic is a ROC (receiver operating characteristic) curve, as shown in Figure e). The x-axis reports the Type I error rate (or significance level) α, while the y-axis reports 1-β, the power of the test statistic. The farther away the curve is from the straight line (which corresponds to random guessing), the better the test statistic. (This corresponds to finding the optimal test statistic, that is, the test statistic that minimizes the Type II error rate for a given significance level. As this problem is very hard to solve, optimal tests only exist for very special problems; one example is the paired t test used as the guiding example in this text.) Note that one- and two-tailed tests have the same Type I error rate but differ in power. If the power is too low, the experiment can be re-designed by changing one of the factors that determine power.

A.3.2 Factors that affect power

The factors affecting power will be illustrated in terms of the hypertensive drug example. Note that the same considerations apply to test statistics in general.

a) Difference between population means

The size of the difference between the population means, δ in our example, is an important factor in determining the power of a test statistic. Naturally, the more the means differ from each other, the easier it is to detect the difference. As shown in Figure f), the farther the value of δ is from zero, keeping the standard deviation σd fixed (hence the larger ∆ is), the smaller the Type II error rate β and, therefore, the larger the power for a fixed value of α. The experimenter usually has little control over the size of the effect, although sometimes he or she can choose the levels of the independent variable so as to increase it.

b) Standard deviation

The larger the standard deviation σd, the lower the power. Suppose we keep δ fixed at 2 and let σd assume the values 10, 15, 20 and 30; the corresponding ROC curves are shown in Figure g). There are ways in which an experimenter can reduce variance to increase power. One is to define a relatively homogeneous population. A second is to use a within-subjects design, in which the overall level of performance of each person is subtracted out.

c) Significance level

A further important factor affecting power is the significance level α. The more conservative (lower) the chosen α, the lower the power. As we saw in Example 1.7, using the 1% level results in lower power than using the 5% level. The cost of stronger protection against Type I errors is more Type II errors.

d) Sample size

Increasing the sample size N increases power. In fact, the cut-off point and the variance of the distribution of the test statistic change as N increases, while the mean of the distribution does not. The effect of sample size on power for the hypertensive drug example is seen in more detail in Figure h). Choosing a sample size is a difficult decision: there is a trade-off between the gain in power and the time and cost of testing a large number of subjects.

e) Other factors

Several other factors affect power. One is whether the population distribution is normal; deviations from the assumption of normality usually lower power. A second factor is the type of statistical procedure used: some distribution-free tests are less powerful than other tests when the distribution is normal, but more powerful when the distribution is highly skewed. One-tailed tests are more powerful than two-tailed tests as long as the effect is in the expected direction; otherwise, their power is zero. Finally, the choice of an experimental design can have a profound effect on power: within-subject designs are usually much more powerful than between-subject designs.

A.3.3 Sample size calculations

Although several factors may affect the power of a test statistic, most of them cannot easily be influenced by the experimenter. The difference of the population means, δ, is usually the quantity he or she is interested in detecting, rather than a means to regulate the power of the test statistic. The significance level α is usually kept fixed at one of the two common values 0.05 or 0.01. Also, it may not always be possible to reduce the standard deviation σd by changing the experimental design. The only quantity that can easily be controlled is the sample size N. Once the power one wishes to achieve is specified, the sample size needed to reach that level of power can be estimated. This is exactly what power calculations do.

Example 1.8 (cont.): Given the significance level α, let us determine the sample size N that gives a power of 1-β for testing H0: δ = 0 against H1: δ > 0 in the hypertensive drug example. As mentioned several times, the null hypothesis is not rejected at level α if the observed value of the t statistic is smaller than z1-α, the 1-α quantile of the central Student t distribution with N-1 degrees of freedom. Power is defined as 1-β = Pr(T∆ ≥ z1-α), where T∆ has a Student t distribution with N-1 degrees of freedom and non-centrality parameter ∆ = √N δ/σd. This yields an equation in N whose solution is approximated by

  N = (zα + zβ)² σd² / δ² = (zα + zβ)² / CV²,   (*)

where CV = δ/σd is the coefficient of variation of the di's. (The details are omitted.) If instead of a one-sided test we performed a two-sided test (H1: δ ≠ 0), formula (*) would change into

  N = (zα/2 + zβ)² σd² / δ² = (zα/2 + zβ)² / CV².   (**)

As mentioned at the end of Section A.2.4, two-sided tests are less powerful than one-sided ones, and a larger sample size is consequently needed to achieve the same level of power.

As illustrated by the above example, the determination of the sample size depends on the specified significance level α and the desired power through the two quantiles zα and zβ, but also on the effect size, measured in terms of the mean δ and the standard deviation σd of the response. Hence, the first and most difficult step in choosing the sample size is to estimate the size of the effect, which includes estimating the population variance as well as the population means. Information may be retrieved from different sources.

1. If there are published experiments similar to the one to be conducted, the effects obtained in these published studies can be used as a guide. There is, however, a need for caution, since published studies tend to contain overestimates of effect sizes. Often previous studies are not sufficiently similar to a new study to provide a valid basis for estimating the effect size. In this case, it is possible to specify the minimum effect size that is considered important.

2. Frequently, it is easiest to specify the effect size in terms of the number of standard deviations separating the population means. Thus, one might find it easier to estimate that the population mean for the experimental group is 0.5 standard deviations above the population mean for the control group, that is, to estimate the coefficient of variation CV, than to estimate the two population means and the population variance.

As shown by formulae (*) and (**), the power remains the same for any experiment in which the coefficient of variation CV is constant, that is, in which the difference between the population means corresponds to the same number of population standard deviations. Power curves report the power of the test statistic as a function of the different values of the parameter under the alternative hypothesis. Figure i) reports the power curves for testing H0: δ = 0 against H1: δ > 0 in the hypertensive drug example. The x-axis shows the effect size defined through the coefficient of variation CV, while the y-axis gives the sample size N. The six curves correspond to the powers 0.5, 0.8, 0.85, 0.9, 0.95 and 0.99; the significance level α is 5%.

A.4 Final remarks

It is important to keep in mind that power is not about whether or not the null hypothesis is true; in fact, it is assumed to be false. The focus is on the probability that the data gathered in an experiment will be sufficient to reject the null hypothesis. Power calculations try to answer the question: if the null hypothesis is false, with specified population means and standard deviation, what is the probability that the data from the experiment will be sufficient to reject it? If the experimenter discovers that this probability is low (low power) even when the null hypothesis is false to the degree expected, then the experiment should likely be redesigned. Otherwise, considerable time and expense will go into a project that has a small chance of being conclusive even if the theoretical ideas behind it are correct.

A question that arises spontaneously is: how much power is enough? Obviously, the more power the better. However, in some experiments it is very time consuming and expensive to run each subject.
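The numbers in Example 1.1 are easy to reproduce. The following sketch (assuming SciPy is available; the variable names are ours, not part of the original example) recovers the observed paired t statistic and its one-tailed p-value:

```python
import math
from scipy import stats

# Summary statistics from Example 1.1
N = 40
d_bar = 3.67   # sample mean of the differences d_i
s_d = 12.42    # sample standard deviation of the differences

# Paired t statistic: t = sqrt(N) * d_bar / s_d
t_obs = math.sqrt(N) * d_bar / s_d

# One-tailed p-value: Pr(T >= t_obs) for T ~ Student t with N-1 df
p_one_tailed = stats.t.sf(t_obs, df=N - 1)

print(f"t_obs   = {t_obs:.2f}")        # about 1.87
print(f"p-value = {p_one_tailed:.3f}") # about 0.035, significant at the 5% level
```

With raw data instead of summary statistics, `scipy.stats.ttest_rel` would give the same result directly.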
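The two-tailed p-value of Example 1.3 can be checked the same way; the sketch below (again assuming SciPy) simply doubles the upper-tail probability of the observed statistic:

```python
from scipy import stats

# Observed statistic and degrees of freedom from Example 1.1
t_obs, df = 1.87, 39

# Two-tailed p-value: 2 * Pr(|T| >= t_obs), T ~ Student t with 39 df
p_two_tailed = 2 * stats.t.sf(abs(t_obs), df)

print(f"two-tailed p-value = {p_two_tailed:.3f}")  # about 0.070: not significant at 5%
```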
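The power computations of Examples 1.5 to 1.7 rely on the non-central Student t distribution, which SciPy exposes as `scipy.stats.nct`. A possible sketch (the function name is ours):

```python
import math
from scipy import stats

def power_paired_t(delta, sigma_d, N, alpha, two_sided=False):
    """Power of the paired t test of H0: delta = 0 for a given true delta."""
    df = N - 1
    nc = math.sqrt(N) * delta / sigma_d       # non-centrality parameter Delta
    if two_sided:
        t_crit = stats.t.ppf(1 - alpha / 2, df)
        # Type II error: probability of the acceptance region (-t_crit, t_crit) under H1
        beta = stats.nct.cdf(t_crit, df, nc) - stats.nct.cdf(-t_crit, df, nc)
    else:
        t_crit = stats.t.ppf(1 - alpha, df)   # rejection cut-off under H0
        beta = stats.nct.cdf(t_crit, df, nc)  # acceptance region under H1
    return 1 - beta

print(round(power_paired_t(2, 10, 40, 0.05), 3))                  # ~0.342 (Example 1.5)
print(round(power_paired_t(2, 10, 40, 0.05, two_sided=True), 3))  # ~0.234 (Example 1.6)
print(round(power_paired_t(2, 10, 40, 0.01), 3))                  # ~0.133 (Example 1.7)
```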
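Formula (*) is straightforward to evaluate numerically. The sketch below (assuming SciPy; the function name is ours) computes the approximate sample size for the hypertensive drug example, where δ = 2 and σd = 10 give CV = 0.2:

```python
import math
from scipy import stats

def sample_size_one_sided(cv, alpha, power):
    """Approximate N from formula (*): N = (z_alpha + z_beta)^2 / CV^2.

    z_alpha and z_beta are upper-tail standard normal quantiles,
    and cv = delta / sigma_d is the coefficient of variation of the d_i's.
    """
    z_alpha = stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(power)   # power = 1 - beta
    return math.ceil((z_alpha + z_beta) ** 2 / cv ** 2)

print(sample_size_one_sided(0.2, alpha=0.05, power=0.80))  # 155 patients
```

For the two-sided formula (**), replace `1 - alpha` with `1 - alpha / 2` in the first quantile.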
In these experiments, the experimenter usually must accept less power than is typically found in experiments in which subjects can be run cheaply and easily. In any case, power below 0.25 would almost always be considered too low, and power above 0.80 would be considered satisfactory. Keep in mind that a Type II error is not necessarily so bad, since a failure to reject the null hypothesis does not mean that the research hypothesis should be abandoned. If the results are suggestive, further experiments should be conducted to see whether the existence of the effect can be confirmed. The power of the two experiments taken together will be greater than the power of either one.

Last but not least, one should keep in mind that power calculations are often based upon approximations. These may be approximations to the solution of the equation defining the sample size, as was the case in Example 1.8. Or we may need to approximate, and more generally simplify, the reference model. This is for instance the case if power calculations are to be performed for a regression model. As an example, take a between-subjects design where the interest focuses on a particular covariate, while the remaining covariates are included to account for subject-specific response patterns. In such situations it is quite common to resort to a simpler model that allows one to do the calculations necessary to identify the sample size.
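To illustrate the remark about approximations, one can compare the sample size suggested by the normal approximation of formula (*) with the exact power computed from the non-central t distribution. The sketch below (assuming SciPy; function names are ours) checks that for CV = 0.2, α = 0.05 and a target power of 0.80 the approximate answer, N = 155, indeed achieves close to the desired power:

```python
import math
from scipy import stats

def exact_power(cv, N, alpha):
    """Exact one-sided power of the paired t test via the non-central t distribution."""
    df = N - 1
    nc = math.sqrt(N) * cv                  # Delta = sqrt(N) * delta / sigma_d = sqrt(N) * CV
    t_crit = stats.t.ppf(1 - alpha, df)
    return 1 - stats.nct.cdf(t_crit, df, nc)

def approx_N(cv, alpha, power):
    """Formula (*): N = (z_alpha + z_beta)^2 / CV^2 (normal approximation)."""
    z = stats.norm.ppf
    return math.ceil((z(1 - alpha) + z(power)) ** 2 / cv ** 2)

N = approx_N(0.2, 0.05, 0.80)
print(N, round(exact_power(0.2, N, 0.05), 3))  # achieved power should be close to 0.80
```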