Biostatistics in Practice
Session 3: Testing Hypotheses

Peter D. Christenson, Biostatistician
http://gcrc.humc.edu/Biostat

Readings for Session 3 from StatisticalPractice.com
• Significance test / hypothesis testing
• Significance tests simplified

Example

Consider a parallel study:
1. Randomize an equal number of subjects to treatment A or treatment B.
2. Follow all subjects for a specified period of time.
3. Measure X = post-pre change in an outcome, such as cholesterol.

Primary aim: Do treatments A and B differ in mean effectiveness?

Restated aim: If μA and μB are the true, unknown mean post-pre changes that would occur if all potential subjects received treatment A or treatment B, do we have evidence from our limited sample whether μA ≠ μB?

Extreme Outcome #1

Suppose results from the study are plotted as:
[Figure: dot plot of X for treatment groups A and B; each point is a separate subject.]
Obviously, B is more effective than A.

Extreme Outcome #2

Suppose results from the study are plotted as:
[Figure: dot plot of X for treatment groups A and B; each point is a separate subject.]
Obviously, A and B are equally effective.

More Realistic Possible Outcome I

Suppose results from the study are plotted as:
[Figure: dot plot of X for treatment groups A and B; each point is a separate subject.]
Is the overlap small enough to claim that B is more effective?

More Realistic Possible Outcome II

Suppose the ranges are narrower, with the same group mean difference:
[Figure: dot plot of X for treatment groups A and B; each point is a separate subject.]
Now, is this minor overlap sufficient to come to a conclusion?

More Realistic Possible Outcome III

Suppose the ranges are wider, but so is the group difference:
[Figure: dot plot of X for treatment groups A and B; each point is a separate subject.]
Is the overlap small enough to claim that B is more effective?

More Realistic Possible Outcome IV

Here, the ranges for X are the same as on the last slide, but there are many more subjects:
[Figure: dot plot of X for treatment groups A and B; each point is a separate subject.]
So, just examining the overlap is not sufficient to come to a conclusion, since intuitively the larger N should affect the results.

Our Goal

Goal: We need a rule that can be consistently applied to most studies to make the decision whether or not μA ≠ μB. From the previous four slides, the relevant measures that will go into our decision rule are (see the sketch after this list):
1. Number of subjects, N; this could differ between the groups.
2. Difference between groups in observed means (x-bar for the A and for the B subjects).
3. Variability among subjects (SD for the A and B subjects).
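To make these three ingredients concrete, here is a minimal Python sketch (not part of the original slides) that computes them from made-up post-pre changes and then combines them with a standard two-sample t-test, one common form of such a decision rule. All numbers are purely illustrative.

    # A sketch of the three ingredients of the decision rule, using
    # hypothetical post-pre changes X for each subject (not real data).
    import numpy as np
    from scipy import stats

    x_a = np.array([-2.1, 0.5, -1.3, 1.8, -0.7, 0.2, -1.9, 0.9])  # group A
    x_b = np.array([ 2.4, 4.1,  1.7, 3.3,  2.9, 0.8,  3.6, 2.2])  # group B

    # Ingredient 1: number of subjects in each group (may differ).
    n_a, n_b = len(x_a), len(x_b)

    # Ingredient 2: difference between the observed group means.
    diff = x_b.mean() - x_a.mean()

    # Ingredient 3: variability among subjects within each group.
    sd_a, sd_b = x_a.std(ddof=1), x_b.std(ddof=1)

    print(f"N: {n_a}, {n_b}; mean difference: {diff:.2f}; SDs: {sd_a:.2f}, {sd_b:.2f}")

    # A two-sample t-test folds all three ingredients into one decision.
    t_stat, p_value = stats.ttest_ind(x_b, x_a)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")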
Goal, Continued

Goal: We need a rule that can be consistently applied to most studies to make the decision whether or not μA ≠ μB. Other relevant issues:
1. Our conclusion could be wrong. We need to incorporate a mechanism for minimizing that possibility.
2. Small differences are probably unimportant. Can we incorporate that as well?

A Graphical Look at All of the Issues

The figure on the following slide shows most of the issues that are involved in testing hypotheses. It is complicated, but we will walk through each of the factors that it addresses on the slides after the figure:
1. Null hypothesis H0 vs. alternative hypothesis HA.
2. Decision rule: Choose HA if … [involves Ns, means, and SDs].
3. α = Probability(Type I error) = Prob(choosing HA when H0 is true).
4. β = Probability(Type II error) = Prob(choosing H0 when HA is true).
5. What changes if N were larger?

Graphical Representation of Hypothesis Tests

[Figure: distributions of x-bar under H0 (red curve, peak at 0) and under an illustrative HA (blue curve, peak at 3). The \\\ hatching marks the critical region outside ±2.8; the /// hatching marks the area under the blue curve where H0 is chosen; a green line marks the observed x-bar = 1.128.]

1: Null Hypothesis H0 vs. Alternative Hypothesis HA

All statistical tests have two hypotheses to choose from. The null hypothesis states a negative conclusion, that there is "no effect", which could mean various specific outcomes in different studies. It always includes at least one mathematical expression that equals 0. Here, the null hypothesis is H0: μA − μB = 0. This states that the post-pre changes are, on average, the same for A as for B. The left (red) curve has its peak at this 0.

The alternative hypothesis includes every possibility other than 0, i.e., HA: μA − μB ≠ 0. In the figure, we chose just one alternative for illustration, namely μA − μB = 3. The right (blue) curve has its peak at this 3. For each curve, the height represents the relative frequency of subjects, so more subjects have X's near the peak.

2: Decision Rule for Choosing H0 or HA — A Poor, but Reasonable, Rule

First suppose that we only consider choosing between H0 and the particular HA: μA − μB = 3, as in the figure. Common sense might say that we calculate x-bar (the mean of changes for B subjects, minus the mean of changes for A subjects), and then choose H0 if x-bar is closer to 0, the hypothesized value under H0, or choose HA if it is closer to 3, the hypothesized value under HA. The green line in the figure is at the x-bar from the sample, which is 1.128, so H0 would be chosen with this rule, since 1.128 is closer to 0 than to 3.

A problem with this rule is that we cannot state how certain we are about our decision. It seems like the reasonable choice between the two possibilities, but if we used the rule in many studies, we could not say that most (90%? 95%?) were correct.

2: Decision Rule for Choosing H0 or HA — The Correct Rule

To start to quantify the certainty of the conclusions we will make, recall the reasoning for confidence intervals. If H0 is true, we expect that x-bar will not only be close to 0, but that with 95% probability it will be within about* ±2SE of 0, i.e., between about −2.8 and +2.8. This is the unshaded (non-\\\) region under the H0 (red) curve. Thus, the decision rule is: Choose HA if x-bar is outside 0 ± 2SE, the critical region.

The reason for using this rule is that if H0 is really true, then there is only a 5% chance we would get an x-bar in the critical region. Thus, if we decide on HA, there is only a 5% chance we are wrong for any particular test. Roughly, if the rule is applied consistently, then only 5% of statistical tests will be false positive conclusions, although which ones are wrong is unknown.

*See a textbook for exact calculations. The multiplier is slightly larger than 2.

3: Probabilities of False Positive Conclusions

A false positive conclusion, i.e., choosing HA (a positive conclusion) when H0 is really true (so the conclusion is false), is considered the more serious error, denoted "Type I". We have guaranteed (previous slide) that the rate of this error, denoted α = level of significance, is 0.05, i.e., there is a 5% chance of it occurring. The 0.05 or 5% value is just the conventional level of risk for positive conclusions that scientists have decided is acceptable; the FDA also requires this level in most clinical studies. The concept carries over to other levels of risk, though, and statistical tables can determine the critical region for other levels, e.g., approximately 0 ± 1.65SE for α = 0.10, where we would choose HA more often, and make twice as many mistakes in the long run in so doing.
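A minimal Python sketch of this decision rule (not from the original slides), assuming SE ≈ 1.4 so that the α = 0.05 critical region is roughly ±2.8 as in the figure; the normal multipliers (about 1.96 and 1.645) stand in for the slightly larger t-based values the slides mention.

    # Decision rule: choose HA if x-bar lies outside 0 ± c*SE.
    from scipy import stats

    x_bar = 1.128   # observed mean difference from the figure example
    se = 1.4        # assumed standard error, so that 2*SE is about 2.8

    for alpha in (0.05, 0.10):
        # Two-sided critical multiplier from the normal approximation.
        c = stats.norm.ppf(1 - alpha / 2)
        cutoff = c * se
        decision = "HA" if abs(x_bar) > cutoff else "H0"
        print(f"alpha={alpha}: critical region |x-bar| > {cutoff:.2f} -> choose {decision}")

With x-bar = 1.128, the rule chooses H0 at both levels, matching the figure example.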
4: Probabilities of False Negative Conclusions

In our figure example, we choose H0, i.e., no treatment difference, i.e., a negative conclusion, since x-bar = 1.128 is between −2.8 and +2.8. If we had chosen HA, we would know there was only a 5% chance we were wrong. Can we also quantify the chances of a false negative conclusion, which we might be making here? Yes, but it will depend on what really constitutes a "false negative". That is, we conclude μA − μB = 0, but if really μA − μB = 0.0001, are we wrong in a practical sense? Often, a value for a clinically relevant effect is specified, such as 3 in the figure example. Then, if HA: μA − μB = 3 is really true, but we choose H0, we have made a Type II error. Its probability, β, is the area under the correct (now HA, blue) curve in the region where H0 is chosen (///). A computer is needed to calculate this; it is 0.41 here.

3 and 4: Tradeoffs Between the Risks of the Two Errors

In our figure example, if μA − μB = 3 is the smallest difference that we care about (smaller differences are 0 in a practical sense), then we have an α = 0.05 chance of wrongly declaring that the treatments differ when in fact they are identical, and a β = 0.41 chance of declaring them the same when they really differ by 3. If we try to decrease the risk of one of the errors, the risk of the other error increases, i.e., α↑ as β↓. [This is the same tradeoff as between the sensitivity and specificity of diagnostic tests.] To visualize it in our figure, imagine shifting the ///\\\ demarcation at 2.8 to the left, to say 2.7. That increases α; the /// area, i.e., β, then decreases.

Practical application: If A is a current treatment and B is a potential new one, then smaller αs mean that we are more concerned with marketing a non-superior new drug; smaller βs mean we are more concerned with missing a superior new drug.

5: Effect of Study Size on the Risks of Error

From the previous slide, the FDA may want a small α, and the drug company may want a small β. To achieve both, a larger study could be performed. We can verify this with our graph. In our figure example, suppose we had had a larger study, say twice as many subjects in each group. Then both curves would be narrowed, since their widths depend on SE, which has N in the denominator. If we maintain α = 0.05, the ///\\\ demarcation shifts to the left due to the narrowed left curve, and β becomes much smaller, due to both the narrower right curve and the demarcation shift. The demarcation could then be shifted to the right to lower α, which increases the current β but still keeps it small. There are algorithms to choose the right N to achieve any desired α and β.

Power of a Study

Statistical power = 1 − β. Power is thus the probability of correctly detecting an effect in a study. In our example, the drug company is really thinking not in terms of β, but in terms of the ability of the study to detect that the new drug is actually more effective, if in fact it is. Since the FDA requires α = 0.05, a major component of designing a study is the determination of its size so that it has sufficient power. This is the topic of the next session, Session 4.
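The β and power calculations above can be sketched in Python under the normal approximation, again assuming SE ≈ 1.4 and the clinically relevant difference of 3. The slides' β = 0.41 comes from an exact computer calculation, so this rougher approximation gives a slightly different value; the point is the logic, including how doubling N shrinks SE by a factor of √2 and with it β.

    # Beta = area under the HA (blue) curve over the "choose H0" region;
    # power = 1 - beta. Normal approximation with assumed SE values.
    from scipy import stats

    def beta_and_power(delta, se, alpha=0.05):
        # Critical region under H0: |x-bar| > c * SE.
        c = stats.norm.ppf(1 - alpha / 2)
        # Probability that x-bar lands in (-c*SE, c*SE) when the true
        # mean difference is delta, i.e., a Type II error.
        beta = (stats.norm.cdf(c * se, loc=delta, scale=se)
                - stats.norm.cdf(-c * se, loc=delta, scale=se))
        return beta, 1 - beta

    beta, power = beta_and_power(delta=3, se=1.4)
    print(f"Original study: beta = {beta:.2f}, power = {power:.2f}")

    # Doubling N in each group shrinks SE by sqrt(2), narrowing both
    # curves and making beta much smaller at the same alpha.
    beta2, power2 = beta_and_power(delta=3, se=1.4 / 2 ** 0.5)
    print(f"Doubled study:  beta = {beta2:.2f}, power = {power2:.2f}")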