A Brief Tutorial on Sample Size and Statistical Power Calculations for Animal Experiments
Phillip Chapman, Department of Statistics, CSU
March 8, 2007

Objectives:
1. Overview of the basic ideas of power and sample size calculations in the context of simple examples.
2. Introduction to easily available tools for power and sample size:
   • Software available on the CSU campus for a reasonable cost (SAS, $48/yr.; MINITAB, $127 purchase).
   • Interactive software available for free on the web (Russell Lenth, U. of Iowa, and the UCLA Online Power Calculator).
3. Practical advice about how to proceed in some common situations.
4. Overall objective: help you plan better experiments (and in the process help you fill out the ACUC A-100 form).

Outline:
1. Two-sample t-test for two independent samples:
   • SAS Analyst (interactive menu)
   • MINITAB
   • Russell Lenth's online power programs
   • UCLA Online Power Calculator (http://calculators.stat.ucla.edu/ - no demo today)
2. Selecting inputs to power programs (two-sample t-test).
3. Confidence interval width for two independent samples.
4. Hypothesis tests for comparing two proportions (z-test, and Fisher's Exact Test for small samples) - SAS Proc Power.
5. Confidence interval width for comparing two proportions.
6. Additional examples (fewer details, primarily for reference):
   A. t-test for paired samples
   B. One-way ANOVA, and contrasts in a one-way ANOVA - SAS Proc Power

1. Two-sample t-test for two independent samples

Mice will be randomly assigned to two groups (n mice per group):
1. Treated group (T)
2. Control group (C) (untreated or reference treatment)

We plan to record Y = log(CFU) of bacteria in the lungs, and to compare the two groups using a two-sample t-test (α = 0.05):

   t = (ȳC − ȳT) / ( s_pooled √(2/n) )

Compare t to t_crit (from a t-table). The difference is significant if |t| > t_crit (or, equivalently, if the p-value is less than α, usually 0.05). A significant difference is the "success" of the experiment.
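As a quick check on the formula above, the t statistic can be computed directly. The data below are hypothetical, chosen so the arithmetic comes out cleanly (n = 3 per group is far too small for a real experiment):

```python
import math
from statistics import mean

# Hypothetical log(CFU) values (illustrative only, not from the tutorial)
control = [7.0, 5.0, 6.0]   # ybar_C = 6.0
treated = [4.0, 5.0, 3.0]   # ybar_T = 4.0

n = len(control)  # equal group sizes assumed
ss_c = sum((y - mean(control)) ** 2 for y in control)
ss_t = sum((y - mean(treated)) ** 2 for y in treated)

# Pooled standard deviation with 2(n - 1) degrees of freedom
s_pooled = math.sqrt((ss_c + ss_t) / (2 * (n - 1)))

# The two-sample t statistic from the slide
t_stat = (mean(control) - mean(treated)) / (s_pooled * math.sqrt(2 / n))

t_crit = 2.776  # t-table value for df = 4, two-sided alpha = 0.05
significant = abs(t_stat) > t_crit
```

Here ȳC − ȳT = 2.0 and s_pooled = 1.0, so t = 2/√(2/3) = √6 ≈ 2.45, which falls short of t_crit = 2.776; with so few mice, even a two-unit difference is not significant.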
Probability of significance = "POWER". How do we assure a high probability of success, i.e., how do we assure high power?

Notation:
µC = population mean in the control group; µT = population mean in the treated group.
σC = population std. dev. within the control group; σT = population std. dev. within the treated group; assume σC = σT = σ (a common std. dev.).

Hypothesis test: Are the means the same?
   H0: µT = µC (or ∆ = µC − µT = 0)
   HA: µT ≠ µC (or ∆ = µC − µT ≠ 0)

α = "Type I error rate" = the probability of rejecting H0 when H0 is true.
Power = the probability of rejecting H0 when it is false (i.e., when HA is true).

[Figure: the distribution of the calculated t if H0 is true is centered at zero, with the cutoffs ±t_crit each cutting off tail area α/2 = 0.025; the distribution of t if HA is true is centered at λ, and the area under that curve beyond t_crit is the power.]

Power depends on:
1. σ = estimate of the within-group std. dev. (= 1.0 in the example below).
2. The "true values" of µC and µT (µC = 6.0 and µT = 4.5 below).
3. n = the sample size per group (= 9 below).

The power (the area under the HA curve) is easily calculated in SAS. For the values above,

   λ = (µC − µT) / ( σ √(2/n) ) = (6.0 − 4.5) / ( 1.0 √(2/9) ) = 3.18

Result: Power = 0.847.

To do this in SAS Analyst:
1. Double-click the SAS icon to start the SAS program.
2. From the "Solutions" menu, select "Analysis", then "Analyst".
3. From the "Statistics" menu, select "Sample Size", then "Two-Sample t-test".
4. Put appropriate values into the boxes (give a range of n values where requested), then select "OK".
5. Read the power results from the "Power Analysis" window.
6. Look at the graph in the "Plot of Power vs. N per group" window.

Notes on the two-sample t-test dialog box:
1. It doesn't matter which mean is labeled "1" and which is labeled "2".
2. You can specify n and have it calculate power, or vice versa.
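The λ = 3.18 and Power = 0.847 values above can also be reproduced without SAS. The sketch below (standard-library Python; the function names are mine, and t_crit = 2.120 is taken from a t-table for 16 df) integrates the noncentral t tail numerically:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def two_sample_t_power(delta, sigma, n, t_crit):
    """Two-sided power of the two-sample t-test with n per group.

    Under HA, t has a noncentral t distribution with df = 2(n - 1) and
    noncentrality lam = delta / (sigma * sqrt(2/n)).  The rejection
    probability is averaged over the chi-square distribution of the
    pooled variance (midpoint rule)."""
    df = 2 * (n - 1)
    lam = delta / (sigma * math.sqrt(2.0 / n))
    log_c = -(df / 2) * math.log(2) - math.lgamma(df / 2)

    def chi2_pdf(x):  # chi-square density with df degrees of freedom
        return math.exp(log_c + (df / 2 - 1) * math.log(x) - x / 2)

    steps = 4000
    hi = df + 12 * math.sqrt(2.0 * df)  # cutoff far in the chi-square tail
    h = hi / steps
    power = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        s = math.sqrt(x / df)
        # P(t > t_crit) + P(t < -t_crit), given this variance draw
        tail = norm_cdf(lam - t_crit * s) + norm_cdf(-t_crit * s - lam)
        power += chi2_pdf(x) * tail * h
    return lam, power

lam, power = two_sample_t_power(delta=1.5, sigma=1.0, n=9, t_crit=2.120)
```

With delta = 6.0 − 4.5, σ = 1.0, and n = 9, this returns λ ≈ 3.18 and power ≈ 0.847, matching the SAS result quoted above.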
3. Select a range of n values, and an increment ("by").
4. Use the "Help" button for explanations, if needed.
5. Select "OK" when done.

Two-sample t-test results - interpretation: We have a 0.847 probability of observing a statistically significant two-sided (α = 0.05) t-test if the true group means are 4.5 and 6.0, respectively, the within-group standard deviation is 1.0 for both groups, and the sample size is n = 9 per group. (SAS Analyst will also produce an optional plot of power vs. n.)

Russell Lenth's online power calculator (Java applets)
Location: http://www.stat.uiowa.edu/~rlenth/Power
1. Very complete with respect to t-tests and ANOVAs; less complete for proportions.
2. See an immediate response to the "sliders", or enter numbers by clicking on the little grey boxes in the corner.
3. Requires Java for the interactive applets.
4. The programs can be downloaded to run on your own computer.
Suggested (by Lenth) citation format: Lenth, R. V. (2006). Java Applets for Power and Sample Size [Computer software]. Retrieved month day, year, from http://www.stat.uiowa.edu/~rlenth/Power.

In the two-sample t-test calculator on Lenth's page, note that the menu allows the two groups to have different variances (not an option in SAS Analyst). Click on the button to enter values exactly (rather than use the slider). More than one sample size (per group), difference between means, and power value may be specified.

In MINITAB 13, select "Power and Sample Size", then "2-Sample t..." from the "Stat" menu. (Version 15 was just released.) Select "Options" to set α or to change to a one-sided test; change what you like in the options sub-menu, and then select OK. MINITAB two-sample t-test power result: power for n = 9 (per group) = 0.8476.

2. Selecting inputs to power programs

Recall that the difference is significant if |t| > t_crit, where t_crit comes from a table, and that

   λ = (µC − µT) / ( σ √(2/n) )

1. Power increases with |λ| (it moves the HA curve away from zero).
2. Power increases with α (it moves the t_crit value toward zero).
3. Power increases with larger true differences (∆ = µC − µT).
4. Power decreases with increased variability (σ).
5. Power increases with √n.

Since power depends on these values, how do you decide which numbers to use?

1. Selecting the Type I error rate α: There is a lot of tradition behind using α = 0.05. Occasionally, α = 0.10 is used when the results are thought to be preliminary.

2. Estimating σ usually causes the most difficulty:
   A. Use estimates from previous or published work (be careful to note whether it is the standard deviation or the standard error of the mean that is being reported: s.e.(mean) = sd/√n).
   B. Use estimates from similar experiments, even though they may not be completely comparable (but add appropriate caveats).
   C. Use a range of values that might be reasonable. (Larger values of σ yield pessimistic, i.e. large, sample sizes.)
   D. Use relative values: if the mean is believed to be about 100, you might try to estimate the relative variability (use the coefficient of variation = σ/µ) and work backward from there.

3. Desired power: 0.80 is usually a minimum; 0.90 is more common. Power above 0.99 would generally be thought to be a waste of resources. Since power is the probability of success of the experiment, you have some leeway in selecting it. You might allow more leeway if you are less sure of your variance estimates. Also, you may have multiple objectives, and you may have more power for some than for others.

4. One-sided or two-sided? Two-sided is the conservative choice (less power and a larger sample size). One-sided tests are suspect in some fields. ("Home run in Coors Field" analogy.)

5. Magnitude of the difference, ∆ = µC − µT: select the smallest difference that it is necessary to detect. Since the power depends only on the difference, the individual means are not important. Even the difference is only important relative to σ.
Sometimes ∆/σ is called the "effect size." Some references classify effect sizes as small, medium, and large when they are 0.2, 0.4, and 0.8, respectively. My opinion is that you should decide the appropriate effect size for your scientific purpose; ignore what is identified as small or large.

3. Confidence interval width for independent samples

When the purpose of the experiment is to estimate a difference between treatment means, rather than to test whether the difference is zero (or any other value), a more appropriate sample size tool is confidence interval width, rather than power. A 95% C.I. is of the form:

   ȳC − ȳT ± E, where E = t_0.025 · s_pooled · √(2/n)

E is called the "margin of error". Solving the above for n, letting σ approximate s_pooled, and taking t_0.025 ≈ 2.0:

   n = 2 (t_0.025)² σ² / E²

The formula can only be used for large experiments, due to two problems:
1. For small n, t_0.025 is somewhat larger than 2.0.
2. For small n, s_pooled may be larger than σ, resulting in a wider C.I. than planned.
(These problems are avoided if you use SAS Analyst.)

In the two-sample problem considered earlier, mice will be randomly assigned to a treated group (T) and a control group (C) with n mice per group, and we plan to record Y = log(CFU) of bacteria in the lungs, with σ = estimate of the within-group std. dev. (assumed = 1.0). New objective: estimate the difference between the mean log(CFU)'s for the two groups with margin of error (E) approximately 0.75 log(CFU), and be 90% sure that the final 95% confidence interval will be no wider than planned.

To do the above in SAS Analyst: after opening Analyst, select Two-Sample Confidence Interval; enter the desired precision E = 0.75 and the within-group standard deviation σ = 1.0; select the "N per group" button to get the sample size. "Power" in this context is the probability that the interval will be no wider than the specified E (this is random, because s_pooled is random and may be larger than σ); here it is specified as 0.90.

Interpretation (approximate): If the true within-group standard deviation is 1.0, and the sample size is 20 per group, then there is a 90% probability that the resulting confidence interval will be no wider than 1.5 (i.e., diff ± 0.75).

Some comments about the preceding C.I. width procedure:

1. Selecting the confidence level (%): analogous to selecting α in the testing situation. The most common level is 95% confidence, but other levels are sometimes used (80%, 90%, 99%).

2. Estimating σ causes the most difficulty (as before).

3. Selecting E (the margin of error) depends on the objectives of the experiment: with how much precision do you need to estimate the mean difference? (Be careful what you choose, because n becomes dramatically larger as E becomes smaller.)

4. "Power" is the probability that the final interval will be no wider (i.e., not worse) than planned. It is not really power in the testing sense, but it is still a probability of success, if success is measured by whether the final interval is no wider than planned.

5. My opinion: C.I. width criteria should be used more often, and power of hypothesis tests should be used less often. Many researchers feel forced into power calculations when they are really trying to estimate a difference with a desired precision, rather than test whether the difference is zero.

6. Post hoc (retrospective) power: Often researchers are asked, "Did your experiment have enough power to detect a difference?" This question is circular: if your effect is significant, you had enough power; if it is not significant, you didn't.
   A. If you calculate post hoc power, use the estimated σ from the experiment, but identify the magnitude of the difference to detect based on scientific reasons; do not use the estimated differences from the experiment.
   B. C.I. widths can be used in place of post hoc (retrospective) power analyses. [Figure: the confidence intervals from Experiments #1 and #2, plotted against reference lines at 0.0 and at the difference of interest.]
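The large-sample approximation n = 2 t² σ² / E² is easy to apply directly; the sketch below (Python, with t_0.025 ≈ 2.0 as in the text, and a function name of my own choosing) runs it on the example with σ = 1.0 and E = 0.75:

```python
import math

def n_per_group_ci(sigma, E, t=2.0):
    """Approximate n per group so that a 95% C.I. for the difference of
    two independent means has margin of error E (large-sample formula
    n = 2 t^2 sigma^2 / E^2, with t_0.025 approximated by 2.0)."""
    return 2 * (t ** 2) * sigma ** 2 / E ** 2

n_approx = n_per_group_ci(sigma=1.0, E=0.75)  # 14.22..., round up to 15
```

Rounding up gives 15 per group, noticeably below the 20 per group that SAS Analyst reports. The gap is the price of the 90% assurance that the final interval is no wider than planned, which is exactly the two small-n problems listed above.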
Most people pick 0.80 or 0.90 for this "no wider than planned" probability, depending on how lucky they feel.

4. Hypothesis tests for comparing two proportions

Birds will be sampled from two populations, and the rate of West Nile Virus will be compared between the two populations:
1. Population 1 (y1 positives out of n1 sampled)
2. Population 2 (y2 positives out of n2 sampled)

We will compare the two groups using an α = 0.05 two-sample z-test of proportions. Let:

   p̂1 = y1/n1,  p̂2 = y2/n2,  p̄ = (y1 + y2)/(n1 + n2)

p̂1 and p̂2 estimate the true proportions π1 and π2; p̄ estimates the common true proportion, assuming that H0: π1 = π2 holds. H0 is tested against a two-sided alternative using a z-test (with n1 = n2 = n):

   z = (p̂1 − p̂2) / √( p̄(1 − p̄) · (2/n) )

Compare z to z_crit (from a normal table). The difference is significant if |z| > z_crit (or compare the p-value to α, usually 0.05).

To estimate power:
1. Choose the Type I error rate α: we assume α = 0.05.
2. Choose values for π1 and π2: we assume 0.20 and 0.30.
3. Choose the sample sizes: we assume both equal 100 (or choose the desired power instead).

Programs:
1. Lenth's power program (moderate to large samples).
2. MINITAB (moderate to large samples).
3. SAS Proc Power (this calculation is not available in SAS Analyst):
   A. z-test program (moderate to large samples).
   B. Fisher's Exact Test program (for small samples).
4. UCLA Power Calculator for Fisher's Exact Test (no demo today).

From Lenth's web page and from MINITAB: the power is estimated to be about 0.31 (too low!). You need to increase n, or be more realistic about the size of difference you can detect. In MINITAB, the Type I error rate (α) and the one-sided versus two-sided test options are chosen from the "Options" menu. The power from MINITAB is slightly different from the power from Lenth's web site, because Lenth uses an optional continuity correction.
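For moderate to large samples, these power figures can be reproduced from the normal approximation. The sketch below (standard-library Python; the function names, the hard-coded z quantiles, and the particular continuity-correction form are my choices, not taken from the tutorial) also solves for the n per group needed for 0.90 power:

```python
import math

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

Z_ALPHA = 1.959964  # z_{0.025}, two-sided alpha = 0.05
Z_BETA = 1.281552   # z_{0.10}, for power = 0.90

def two_prop_power(p1, p2, n, continuity=False):
    """Normal-approximation power of the two-sided z-test comparing two
    proportions, n per group, with an optional continuity correction."""
    pbar = (p1 + p2) / 2
    se0 = math.sqrt(2 * pbar * (1 - pbar) / n)               # SE under H0 (pooled)
    se1 = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)   # SE under HA
    diff = abs(p1 - p2) - (1.0 / n if continuity else 0.0)
    return (norm_cdf((diff - Z_ALPHA * se0) / se1)
            + norm_cdf((-diff - Z_ALPHA * se0) / se1))

def two_prop_n(p1, p2):
    """n per group for 0.90 power, two-sided alpha = 0.05 (uncorrected)."""
    pbar = (p1 + p2) / 2
    num = (Z_ALPHA * math.sqrt(2 * pbar * (1 - pbar))
           + Z_BETA * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

power_plain = two_prop_power(0.20, 0.30, 100)                 # about 0.37
power_cc = two_prop_power(0.20, 0.30, 100, continuity=True)   # about 0.31
n_needed = two_prop_n(0.20, 0.30)                             # 392
```

The corrected and uncorrected values (≈0.31 vs. ≈0.37) illustrate the Lenth-versus-MINITAB discrepancy attributed above to the continuity correction, and two_prop_n(0.20, 0.30) returns 392, matching the SAS Proc Power / MINITAB result reported for 0.90 power.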
Run the corresponding program in SAS: the SAS Proc Power results match the MINITAB results. Re-run the program with the power set to 0.90 and SAS computes n: 392 per group are required to achieve the desired 0.90 power.

Fisher's Exact Test for small-sample problems: consider an extreme case, π1 = 0.10 and π2 = 0.90 (very different). Because the proportions are so different, the sample size required to achieve power of 0.90 should not be very large. For small samples, Fisher's Exact Test is generally preferred to the z-test or chi-square test. A sample size of 10 per group is required to achieve 0.90 power. Note: the UCLA online Power Calculator will also do power for a Fisher's Exact Test.

5. Confidence interval width for comparing two proportions

A 95% C.I. for π1 − π2 is of the form p̂1 − p̂2 ± E, where

   E = z_0.025 × SE(p̂1 − p̂2) = z_0.025 √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )

Assume n1 and n2 are the same (= n), put in estimates for p̂1 and p̂2, and solve for n:

   n = (z_0.025)² ( p̂1(1 − p̂1) + p̂2(1 − p̂2) ) / E²

What should be used for p̂1 and p̂2 in this formula? Guess the ranges of values for p1 and p2, or take the worst case: p1 = 0.5 and p2 = 0.5.

Example: How large must n (per group) be to get a 95% C.I. for π1 − π2 of width 0.20 (i.e., E = 0.10)? Say p1 is thought to be between 0.4 and 0.6 (width = 2E), and p2 is thought to be between 0.2 and 0.4. Taking the worst cases within those ranges (p1 = 0.5, p2 = 0.4):

   n = (1.96)² [ (0.5)(1 − 0.5) + (0.4)(1 − 0.4) ] / (0.10)² ≈ 189

Notes:
1. This sample size estimate can be inaccurate when p1 or p2 is near (or thought to be near) 0 or 1.
2. The margin of error (E) obtained using this sample size will be smaller than the desired E about half the time.

6. Additional Examples: A. t-test for paired samples

Testing for a mean difference in paired samples is testing whether the mean difference is zero; the σ in the analysis is the std. dev. of the differences. In SAS Analyst, select the Paired t-test and fill in its dialog box; for interval estimation, select the Paired Confidence Interval and fill in the corresponding dialog box.
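Returning to the Fisher's Exact Test claim above (0.90 power at n = 10 per group when π1 = 0.10 and π2 = 0.90): because n is so small, the exact power can be checked by brute force, summing the binomial probabilities of every (y1, y2) outcome whose two-sided Fisher p-value falls below α. A sketch with the standard library only (function names are mine):

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability P(Y = k) for Y ~ Bin(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def fisher_two_sided_p(a, b, c, d):
    """Two-sided Fisher's Exact Test p-value for the 2x2 table
    [[a, b], [c, d]]: sum the probabilities of all tables with the same
    margins that are no more probable than the observed table."""
    r1, r2, k = a + b, c + d, a + c          # row sums, first column sum
    denom = math.comb(r1 + r2, k)
    def table_prob(x):                        # hypergeometric P(first cell = x)
        return math.comb(r1, x) * math.comb(r2, k - x) / denom
    p_obs = table_prob(a)
    lo, hi = max(0, k - r2), min(k, r1)
    return sum(table_prob(x) for x in range(lo, hi + 1)
               if table_prob(x) <= p_obs * (1 + 1e-9))  # tolerance for float ties

def fisher_power(p1, p2, n, alpha=0.05):
    """Exact power: probability of a significant Fisher test over all
    outcomes y1 ~ Bin(n, p1), y2 ~ Bin(n, p2)."""
    power = 0.0
    for y1 in range(n + 1):
        for y2 in range(n + 1):
            if fisher_two_sided_p(y1, n - y1, y2, n - y2) < alpha:
                power += binom_pmf(y1, n, p1) * binom_pmf(y2, n, p2)
    return power

power = fisher_power(0.10, 0.90, 10)
```

With 10 per group the enumeration confirms that the exact power comfortably exceeds 0.90, in line with the SAS result quoted above.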
6. Additional Examples: B. One-way ANOVA

In SAS Analyst, select "Balanced ANOVA" to get any of many ANOVA models. In Lenth's "Balanced ANOVA" dialog box, use the arrow to see a long list of possible models; first choose "F-tests" to compare all treatments.

Excel can be used to compute the standard deviation of the alternative-hypothesis treatment means for use in Lenth's One-way ANOVA dialog box, and to compute the corrected sum of squares (CSS) of the same alternative-hypothesis treatment means for later use in SAS Analyst. With t = the number of treatments and µi = the treatment means:

   CSS = Σ (µi − µ̄)²
   SD[treatment] = √( Σ (µi − µ̄)² / (t − 1) )

Entering SD[treatment] = 2.137 (from the Excel computation) into Lenth's One-way ANOVA dialog box gives power = 0.8024 for the one-way ANOVA F-test. Lenth's dialog box for the power of contrasts (select "Differences/Contrasts" at the bottom of the One-way ANOVA dialog) gives power = 0.377 for detecting a difference of 3.0 between two treatments.

In SAS Analyst, select One-way ANOVA and enter the corrected sum of squares of the alternative-hypothesis treatment means (CSS = 22.83, previously computed in Excel) to obtain the corresponding power calculation. Power for a contrast in a one-way ANOVA can also be computed using SAS Proc Power: for example, the contrast of treatment #1 with treatment #4, where the true treatment means under the alternative hypothesis are 2.0 and 5.0, respectively.

A general comment about power for ANOVA models: although it is possible to calculate power for a variety of multi-factor ANOVA models, significance in the F-tests for main effects and interactions is not the end of the analysis. It is generally necessary to do follow-up comparisons of individual treatments, or contrasts of treatment groups.
Often that returns the problem to something that can be addressed with the two-sample comparisons, or the contrast comparisons, discussed in this presentation.
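As a closing aside, the Excel computations described in Section 6B reduce to a few lines of code. The treatment means below are hypothetical stand-ins (the actual means behind the tutorial's CSS = 22.83 are not listed in the text), so the numbers differ from the slides:

```python
import math
from statistics import mean

# Hypothetical alternative-hypothesis treatment means (t = 6 treatments);
# illustrative values only, not those behind the tutorial's CSS = 22.83.
mu = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

t = len(mu)
css = sum((m - mean(mu)) ** 2 for m in mu)   # corrected sum of squares
sd_treatment = math.sqrt(css / (t - 1))      # SD of the treatment means
```

SD[treatment] is the quantity Lenth's One-way ANOVA dialog asks for; CSS is the quantity SAS Analyst asks for.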