Sample Size and Power
Jörg Aßmus
Haugesund, 14.6.2011

"I only believe in statistics that I doctored myself." (said to be said by Winston Churchill)

Outline
1. Introduction - Why do we talk about sample size?
2. A short introduction to hypothesis testing
3. Power and sample size
4. What can we do?

Introduction
What is the target of a study?
RCT (randomized controlled trial):
  Sample → Treatment group → Output 1
         → Control group   → Output 2
Research question: Is there a difference between the outputs?
Methodical question: What do we need to be able to see a difference (if there is one)?

Introduction - A simple example
Body height of 20-year-old men in Germany and the Netherlands:

No.   1     2     3     4     5     6     7     8     9     10
GER   1.79  1.89  1.77  1.95  1.80  1.85  1.80  1.76  1.82  1.77
NED   1.77  1.89  1.73  1.87  2.00  1.80  1.89  1.87  1.88  1.89

Testing the difference (t-test):

Sample size   ∆       p-value   Difference?
5             0.014   0.8145    no
10            0.040   0.1993    no
50            0.046   0.0058    yes
100           0.060   0.0000    yes
1000          0.054   0.0000    yes

Conclusion: The test result depends on the sample size. We can see an existing difference if we use a sufficiently large sample size.

Introduction: What do we want to do?
Question: Why don't we take the sample size as large as possible?
1. Economy:
   - Why should we include more subjects than we have to?
   - Every trial costs!
2. Ethics:
   - We should never burden more test persons or animals than necessary.
3. Statistics:
   - We can prove almost any effect if we only have a sufficiently large sample size.
   - Field of tension: statistical significance vs. clinical relevance.
We have to find the correct sample size to detect the desired effect: not too small, not too large.

What do we need on the way?
- How does a test work?
- What does "power of a test" mean?
- What determines the sample size?
- How do we handle this in practical tasks?

A short introduction to hypothesis testing
Strategy:
1. Formulate a hypothesis:
   Null hypothesis H0: Eh1 = Eh2 (expected heights equal)
   vs. alternative H1: Eh1 ≠ Eh2 (expected heights different)
2. Find an appropriate test statistic:
   T = √n · |Eh2 − Eh1| / σ
3. Compute the observed test statistic:
   T_obs = √n · |ĥ2 − ĥ1| / s_pooled
4. Reject the null hypothesis H0 if T_obs is too large.
But what does this mean: "too large"?

Possible results of a single test
                  Reality: H0 true   Reality: H0 false
Decision: accept  RIGHT              type II error
Decision: reject  type I error       RIGHT
(The test decision is obvious; the reality is not. How can we know?)

Wrong decisions:
- Rejection even though H0 is true (type I error)
- No rejection even though H0 is false (type II error)
What do we want? A reduction of the wrong decisions ⇒ a minimal probability for both types of error.

Dilemma: For a given data set and a given test method it is impossible to reduce both error types at once.
⇒ We try to deal only with the type I error.
⇒ We assume that H0 is true.
Statistical approach: What is the probability that everything happened by accident?
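To make steps 1-4 concrete: a minimal sketch (in Python with scipy, which stands in here for whatever software was actually used) running the t-test on the ten height measurements from the example table.

```python
# Two-sample t-test on the height example (n = 10 per country).
import numpy as np
from scipy import stats

ger = np.array([1.79, 1.89, 1.77, 1.95, 1.80, 1.85, 1.80, 1.76, 1.82, 1.77])
ned = np.array([1.77, 1.89, 1.73, 1.87, 2.00, 1.80, 1.89, 1.87, 1.88, 1.89])

t_obs, p_value = stats.ttest_ind(ger, ned)  # pooled-variance two-sample t-test
print(f"observed difference: {abs(ned.mean() - ger.mean()):.3f}")  # ~0.04
print(f"T_obs = {abs(t_obs):.3f}, p-value = {p_value:.4f}")
# The p-value is well above 0.05: with only 10 observations per group the
# existing difference is not detected, matching the table above.
```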
A short introduction to hypothesis testing - The p-value
For a given data set, the p-value is the probability of getting the observed test statistic or worse, assuming that the null hypothesis is true:

p := P(T > T_obs | H0)

Remarks:
- For given data, the p-value is a fixed number → characteristic for the data set.
- Theoretically, the p-value is a random variable ⇒ the p-value has a distribution.
[Figure: density of the test statistic with T_obs and T_krit marked on the axis of test-statistic values.]
Solution: We reject the null hypothesis if the p-value is below a given significance level α (usual convention: α = 0.05). The probability of a type I error (an incorrect rejection) is then no larger than the significance level used.

A simulation experiment
We generate data for two populations with the following properties:
- All data are Gaussian.
- Equal for both populations:
  · Mean: µ1 = µ2 = 0
  · Standard deviation: σ1 = σ2 = 1
  · Sample size: M = 20
- Different for the populations: nothing.
- Test problem: H0: µ1 = µ2 vs. H1: µ1 ≠ µ2
Approach:
1. Generate a data set for both populations.
2. Compute the p-value for a t-test (H0: no difference).
3. Plot the p-value.
4. Repeat 1.-3.

Result of the experiment
[Figure: two samples under H0 and the distribution of the p-values under H0; REJECT: N = 4963 (4.963%), ACCEPT: N = 95037 (95.037%).]
- The p-values are uniformly distributed under a true null hypothesis.
- 5% of the p-values fall below 0.05.
- This holds independently of the sample size.

What did we learn about tests?
The test decision is made so as to control the probability that a type I error occurs.
⇒ Interpretation: We control the probability of incorrectly detecting an effect.
Question: What about the probability of not detecting an existing effect?
We know: with given data, a given test method, and a given significance level, we cannot influence the probability of the type II error anymore (recall the test dilemma!).
But what does this probability look like? Let us do one more simulation experiment.

A simulation experiment
We generate data for two populations with the following properties:
- All data are Gaussian.
- Equal for both populations:
  · Standard deviation: σ1 = σ2 = 1
  · Sample size: M = 20
- Different for the populations:
  · Means: µ1 = 0, µ2 = 0.8
- Test problem: H0: µ1 = µ2 vs. H1: µ1 ≠ µ2
Approach: as before (generate the data, compute the t-test p-value, plot it, repeat).

Result of the experiment
[Figure: two samples under H1 and the distribution of the p-values under H1, split into REJECT and ACCEPT regions.]
- Difference detected: ≈ 69% of the trials ← correct
- No difference detected: ≈ 31% of the trials ← type II error
The ability of the test to detect the difference is called the power of the test.

Power and sample size
Recall: α is the probability of a wrong rejection (type I error).
Definition: β is the probability of a wrong acceptance (type II error).
The power of a test is the probability of detecting a false null hypothesis:

Power = 1 − β
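Both simulation experiments are easy to reproduce. Here is a minimal sketch (Python/scipy, my code, using the H1 parameters from above) that estimates the power as the fraction of rejections; setting mu2 = 0.0 reproduces the H0 experiment, where the rejection rate drops to the significance level of about 5%.

```python
# Monte Carlo estimate of the rejection rate of the two-sample t-test.
# Under H1 (mu2 = 0.8) this is the power; under H0 (mu2 = 0.0) it is the
# type I error rate (expect ~0.05).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_sim, M, alpha = 100_000, 20, 0.05
mu1, mu2, sigma = 0.0, 0.8, 1.0

rejections = 0
for _ in range(n_sim):
    sample1 = rng.normal(mu1, sigma, M)
    sample2 = rng.normal(mu2, sigma, M)
    _, p = stats.ttest_ind(sample1, sample2)
    rejections += p < alpha

print(f"rejection rate: {rejections / n_sim:.3f}")  # ~0.69 under H1
```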
[Figure: means of simulated data for the two samples, repeated 100,000 times, for standard deviations σ = 0.04 to 0.2, sample sizes M = 5 to 1000, and mean differences ∆µ = 0 to 0.2.]

Question: What does the power of a test depend on?
- the sample size
- the standard deviation
- the effect (mean difference)
- the significance level
⇒ Power and sample size are a "complementary" pair of values.
Rule of thumb: If you know one of them, you know the other.

What did we learn?
• Power: the ability of the test to detect a false null hypothesis (e.g. t-test: the ability to detect a difference).
• Criterion: the type II error: Power = 1 − β.
• Power and sample size: corresponding values.
• The needed sample size depends on:
  - the desired effect (effect size ↓ ⇒ needed sample size ↑)
  - the sample variance (variance ↑ ⇒ needed sample size ↑)
  - the significance level (α ↓ ⇒ needed sample size ↑)
  - the desired test power (power ↑ ⇒ needed sample size ↑)
  - the type of test

Let us turn it around: What are the ingredients needed for computing the needed sample size?
1. The desired detectable effect:
   - What effect (mean difference, risk, coefficient size) is clinically relevant?
   - Which effect must be detectable to make the study meaningful?
   - This is a choice made by the researcher (not statistically determined!).
2. The variance in the data:
   - e.g. the standard deviation of both samples for a t-test
   - taken from experience, former studies, or pilot studies
3. The significance level α:
   - usually set to α = 0.05
   - adjustments must be taken into account (e.g. for multiple testing)
4. The desired power of the test:
   - 1 − β = 0.8 is often used
   - This is a choice made by the researcher (not statistically determined!).
5. The type of test:
   - Different tests for the same problem often have different power.

Computation of the sample size
Problem: There is no general formula for the power or the sample size.
Computation possibilities:
1. The old-fashioned way: Pocock's formula
2. The modern way: statistical packages
3. If nothing else helps: simulations, bootstrap
4. Ask somebody

Computation of the sample size - Pocock's formula
Continuous outcome (t-test):

N = 2σ² · f(α, β) / (µ2 − µ1)²

- µ1, µ2 ... population means
- σ ... population standard deviation
- α ... significance level
- β ... type II error probability (β = 1 − power)

Dichotomous outcome (χ²-test):

N = [p1(1 − p1) + p2(1 − p2)] · f(α, β) / (p2 − p1)²

- p1, p2 ... proportions (risks); they determine the effect and the variance
- f(α, β) ... factor taken from Pocock's table

Computation of the sample size - Pocock's table

f(α, β)      α = 0.10   α = 0.05   α = 0.02   α = 0.01
β = 0.05     10.8       13.0       15.8       17.8
β = 0.10     8.6        10.5       13.0       14.9
β = 0.20     6.2        7.9        10.0       11.7
β = 0.50     2.7        3.8        5.4        6.6

- A larger f(α, β) ⇒ a larger sample size
- A smaller α ⇒ a larger f(α, β) ⇒ a larger sample size
- A larger power ⇒ a smaller β ⇒ a larger f(α, β) ⇒ a larger sample size

Problem: How do we deal with different test types?
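Pocock's formulas are easy to evaluate by hand or in a few lines of code. A minimal sketch (Python; the function names are my own) applied to the running t-test example, with f(0.05, 0.20) = 7.9 read from the table above:

```python
# Pocock's sample-size formulas; N is the per-group sample size.
def pocock_continuous(sigma, mu1, mu2, f):
    """Continuous outcome (t-test): N = 2*sigma^2*f(alpha, beta)/(mu2 - mu1)^2."""
    return 2 * sigma**2 * f / (mu2 - mu1) ** 2

def pocock_dichotomous(p1, p2, f):
    """Dichotomous outcome (chi-square test)."""
    return (p1 * (1 - p1) + p2 * (1 - p2)) * f / (p2 - p1) ** 2

# Running example: sigma = 1, mu2 - mu1 = 0.8, alpha = 0.05, power = 0.8,
# so f(0.05, 0.20) = 7.9 from Pocock's table.
print(pocock_continuous(sigma=1.0, mu1=0.0, mu2=0.8, f=7.9))  # ~24.7 -> 25 per group
```

This agrees with the needed sample size N = 25 found by simulation below.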
Computation of the sample size - Program packages
- SPSS SamplePower
  · former "Power and Precision"
  · standalone, but included in the SPSS license
  · http://www.spss.com/software/statistics/samplepower/
- Included in different program packages:
  · R (package pwr)
  · Stata (sampsi, powerreg)
  · SAS (power)
  · Matlab (sampsizepwr)
- Interactive online calculators:
  · http://statpages.org/#Power (overview)
[Screenshots: SamplePower 2.0 and interactive online tools.]

Computation of the sample size - Simulation
- requires programming
- should usually be done by a statistician
- used if there is no adequate program or formula
Idea:
1. Define a power, e.g. 0.8.
2. Generate artificial data with given parameters:
   · means µ1, µ2
   · variance σ²
   · significance level α
   · predefined sample size N
3. Compute the test result.
4. Repeat 2.-3. and count the number of rejections.
5. power = (number of rejections) / (number of simulations)
6. Repeat 2.-5. for different sample sizes.
7. Select the lowest sample size with a power above the one predefined in step 1.
Distinction between simulation and bootstrap:
· Bootstrap: use a random subsample of real data.
· Simulation: generate new data.

Computation of the sample size - Simulation example
[Figure: power simulation for the t-test with effect ∆µ = 0.8, σ = 1, α = 0.05; power plotted against sample sizes 0 to 100. Needed sample size: N = 25.]
[Figure: empirical power estimation (1000 repetitions); power plotted against the effect ∆σ from 0 to 5 for sample sizes M = 12 and M = 250.]

SamplePower 2.0
• http://www.spss.com/software/statistics/samplepower/
• former "Power and Precision"
• standalone program included in the SPSS license ⇒ available in Helse Vest (→ email Helse Vest IKT)!
• Different groups of methods:
  - mean comparison (only the t-test)
  - proportions (risks, cross tables)
  - correlations
  - ANOVA
  - regression (linear, logistic)
  - survival analysis
  - some noncentral tests
• Help:
  - did not find a proper book
  - the textbook "Power and Precision" (Borenstein, M.)
  - embedded help system (not always easy to understand)
  - tutorials on the web

Textbook: "Power and Precision"
Authors: Michael Borenstein, Hannah Rothstein, Jacob Cohen, David Schoenfeld, Jesse Berlin, Edward Lakatos.
Compatible with SamplePower 2.0.

Starting with a simple example: comparison of the means of 2 independent samples (t-test)
- All data are Gaussian.
- Equal for both populations:
  · Standard deviation: σ1 = σ2 = 1
  · Sample size: M = 20
- Different for the populations:
  · Means: µ1 = 0, µ2 = 0.8
- Test problem: H0: µ1 = µ2 vs. H1: µ1 ≠ µ2
What do we need for the calculation? The test design, the means, the standard deviation - and the sample size? The power?
Recall: the experimentally computed power was 1 − β = 0.69 (69%).

What can we do in SamplePower 2.0 with a 2-independent-samples t-test?
- Compute the power for a given effect, standard deviation, and sample size.
- Compute the sample size for a given effect, standard deviation, and power.
- Adjust the significance level, the confidence intervals, and the precision of the numbers (µ, SD, N).
- Create power tables and plots for different significance levels, sample sizes, effects, and standard deviations.
[Screenshots: computing the power for a given sample size; computing the sample size for a given power (0.9).]
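The two computations shown in the screenshots can also be done for free. A minimal sketch using Python's statsmodels (my substitution; the deck itself lists R's pwr, Stata, SAS, and Matlab instead), with the standardized effect d = (µ2 − µ1)/σ = 0.8:

```python
# Analytic power and sample size for the two-independent-samples t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power for the given per-group sample size M = 20:
power = analysis.solve_power(effect_size=0.8, nobs1=20, alpha=0.05)
print(f"power at M = 20: {power:.3f}")  # ~0.69, matching the simulation

# Per-group sample size for a desired power of 0.9:
nobs = analysis.solve_power(effect_size=0.8, power=0.9, alpha=0.05)
print(f"needed N for power 0.9: {nobs:.1f}")  # ~33.8 -> 34 per group
```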
Cross tables (2x2)
Question: Is the appearance of side effects of the treatment associated with the sex of the patient?

Side effects   Male          Female        Total
No             238 (74.8%)   226 (65.5%)   464 (70.0%)
Yes            80 (25.2%)    119 (34.5%)   199 (30.0%)
Total          318 (48.0%)   345 (52.0%)   663 (100%)

From the LTBI study (Ann Iren Olsen, Helse Fonna, not published yet).

Cross tables (RxC)
Question: Is the appearance of side effects of the treatment associated with the health region of the patient?

Health Region   No            Yes           Don't know    Total
Helse SørØst    224 (64.9%)   107 (31.0%)   14 (4.1%)     345 (48.8%)
Helse Vest      85 (65.9%)    41 (31.8%)    3 (2.3%)      129 (18.1%)
Helse Midt      60 (74.1%)    18 (22.2%)    3 (3.7%)      81 (11.4%)
Helse Nord      99 (63.1%)    33 (21.0%)    25 (15.9%)    157 (22.1%)
Total           468 (65.7%)   199 (27.9%)   45 (6.3%)     712 (100.0%)

p < 0.0001
From the LTBI study (Ann Iren Olsen, Helse Fonna, not published yet).

Cross tables (RxC) - Power
[Screenshots: the power with a sample size of 100; the power for different sample sizes and significance levels; the sample size for a power of 0.9.]
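The deck computes these cross-table powers in SamplePower; they can also be approximated by simulation, in the spirit of the recipe above. A minimal sketch (Python/scipy, my code) for the 2x2 example, assuming that "a sample size of 100" means 100 patients per group (the slide does not say whether it is per group or in total):

```python
# Monte Carlo power of the chi-square test for the 2x2 side-effects table.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
n_sim, N, alpha = 10_000, 100, 0.05
p_male, p_female = 0.252, 0.345  # observed side-effect rates from the table

rejections = 0
for _ in range(n_sim):
    men = rng.binomial(N, p_male)      # side-effect count among N men
    women = rng.binomial(N, p_female)  # side-effect count among N women
    table = [[men, N - men], [women, N - women]]
    chi2, p, dof, expected = stats.chi2_contingency(table)
    rejections += p < alpha

print(f"empirical power at N = {N} per group: {rejections / n_sim:.2f}")
```

With these proportions the estimated power stays well below the conventional 0.8, so a study of this size would be clearly underpowered for the sex effect.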