Sample Size and Statistical Power
John Floretta, J-PAL South Asia

Overview of power
• What can power analysis do?
• Statistical background to power
• Understanding what determines power
• Power in a clustered randomized trial study

What can power analysis do?
• How large does the sample need to be to "credibly" detect a given treatment effect?
• How many different treatment arms can I include in my project?
• Should I sample more communities or more people in each community?
• Is the experiment worth running? Will I get precise enough estimates of impact to make it worthwhile?

Lecture Outline
1. Statistical Background to Power
2. Deriving the Components of Power
3. Common Trade-offs and Rules of Thumb

Statistical Background to Power

Concepts covered
• Sampling variation
• Standard deviation
• Standard error
• Central limit theorem
• Confidence interval and confidence level
• Hypothesis testing
• Statistical inference

Sampling variation: example
• We want to know the average test score of grade 3 children in Springfield.
• How many children would we need to sample to get an accurate picture of the average test score?

[Charts: the population of test scores of all 3rd graders, with a true population mean of 26; a sample of 20 students is drawn and its frequency plotted, then a different sample of 20, then another – each sample of 20 gives a somewhat different sample mean relative to the population mean.]

Sampling variation: definition
• Sampling variation is the variation we get between different estimates (e.g. the mean of test scores) due to the fact that we do not test everyone but only a sample.
• There may be other reasons our estimates vary (e.g. the quality of those administering the test may vary), but this is not sampling variation.
• Sampling variation depends on:
– The number of people we sample
– The variation in test scores in the underlying population

[Charts: three different samples of 50 students, then three different samples of 100 students – as the sample size grows, the sample means cluster more tightly around the population mean.]
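The following is a small simulation sketch (not from the slides) of the sampling-variation idea above: it draws repeated samples of 20, 50, and 100 students from a hypothetical population of test scores and shows the spread of the sample means shrinking as the sample size grows. The population parameters and sample counts are assumptions chosen only to echo the Springfield example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of third-grade test scores (an assumption for
# illustration only, with a true mean near 26 as in the Springfield example).
population = rng.normal(loc=26, scale=8, size=100_000)

for n in (20, 50, 100):
    # Draw many samples of size n and record each sample's mean.
    sample_means = [rng.choice(population, size=n, replace=False).mean()
                    for _ in range(2_000)]
    print(f"n = {n:>3}: average sample mean = {np.mean(sample_means):5.2f}, "
          f"sd of sample means = {np.std(sample_means):4.2f}")
```

The printed spread of sample means falls roughly in proportion to the square root of the sample size, which is the pattern the charts above illustrate.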
What if our population, instead of looking like this…
…looked like this?
[Charts: a second population with much less spread around the same population mean – now different samples of 20 give similar estimates of the mean.]

Standard deviation: population I
[Chart: the first, more dispersed population, with one standard deviation marked on either side of the population mean.]

Standard deviation: population II
[Chart: the second, less dispersed population, with one standard deviation marked on either side of the population mean.]

Standard deviation
• Measures dispersion in the underlying population.
• A weighted average distance to the mean – it gives more weight to the points furthest from the mean:

$sd = \sqrt{\dfrac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n}}$

Standard error
• Measures dispersion in our estimate – in this case, in our estimate of the mean.
• The SE depends on the distribution of the underlying population and on the size of the sample.
• We don't normally observe the sd of the entire population, so we usually estimate it using the sd of the sample.
• The formula for the SE when sampling a small proportion of the population is:

$se = \dfrac{sd}{\sqrt{n}}$

Central limit theorem
• If we take many samples and estimate the mean many times, the frequency plot of our estimates will resemble the normal distribution.
• This is true even if the underlying population distribution is not normal.
• In a normal distribution, 68% of all estimates will be within 1 SD of the true value.
– 95% of all estimates will be within 2 SD of the true value.
– We use this fact to derive confidence intervals around our estimates.

[Charts: the population of test scores is not normal; take the mean of one sample and plot it, take another sample and plot that mean, and repeat many times – the distribution of sample means is approximately normal and centred on the population mean.]

Confidence around estimates
• The point estimate is the value of a single estimate.
• A confidence interval is a range around a single point estimate which gives a sense of the precision of the estimate:
– A narrow confidence interval suggests the estimate is quite precise.
– A confidence interval is associated with a particular confidence level.
• The confidence level is reported for a given confidence interval.
– A 95% confidence level means that if we constructed 100 confidence intervals, we would expect 95 of them to contain the true value.

Hypothesis testing
• Until now we have been estimating the mean of one sample.
• When we do impact evaluations we compare means from two different groups (the treatment and comparison groups).
• Null hypothesis: the two means are the same and any observed difference is due to chance.
– H0: treatment effect = 0
• Research hypothesis: the true means are different from each other.
– H1: treatment effect ≠ 0
• Other possible tests:
– H2: treatment effect > 0

Hypothesis testing: 4 possible outcomes
• If the program is effective (H0 is false):
– Significant result (reject H0): true positive, probability = 1 − κ
– Not significant (fail to reject H0): false zero, Type II error (the risk when power is low), probability = κ
• If the program has no effect (H0 is true):
– Significant result (reject H0): false positive, Type I error, probability = α
– Not significant (fail to reject H0): true zero, probability = 1 − α
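A minimal sketch, not part of the slides, tying the last few concepts together: it computes the sample standard deviation, the standard error sd/√n, a 95% confidence interval, and a two-sided test of H0: treatment effect = 0. The two hypothetical samples and their sizes are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical test-score samples for a comparison and a treatment group
# (the sizes and score distributions are assumptions for illustration).
comparison = rng.normal(26, 8, size=200)
treatment = rng.normal(28, 8, size=200)

# Standard deviation of the comparison sample and the standard error of its mean.
sd = comparison.std(ddof=1)
se = sd / np.sqrt(len(comparison))

# A 95% confidence interval: roughly the point estimate +/- 2 standard errors.
mean = comparison.mean()
print(f"mean = {mean:.2f}, 95% CI = ({mean - 1.96 * se:.2f}, {mean + 1.96 * se:.2f})")

# Two-sided test of H0: treatment effect = 0.
t_stat, p_value = stats.ttest_ind(treatment, comparison)
print(f"estimated effect = {treatment.mean() - comparison.mean():.2f}, p-value = {p_value:.3f}")
```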
Deriving the components of power

Review: one population, one sample
[Chart: the population with its population mean, and one sample with its sample mean.]

Review: distribution of sample means
[Chart: frequency plot of sample means from many samples.]

Review: samples vs populations
• We have two different populations: treatment and comparison.
• We only see the samples: a sample from the treatment population and a sample from the comparison population.
• We want to know whether the populations are different from each other.
• We will compare the sample means of treatment and comparison.
• We must take into account that different samples will give us different means (sampling variation).

One experiment, 2 samples, 2 means
[Chart: the comparison mean and the treatment mean from one experiment.]

Difference between the sample means
[Chart: the estimated effect is the gap between the comparison mean and the treatment mean in one experiment.]

What if we ran a second experiment?
[Chart: a second experiment gives a different comparison mean, treatment mean, and estimated effect.]

Many experiments give a distribution of estimates
[Charts: a frequency plot of the estimated difference across many repeated experiments, built up one experiment at a time into a roughly bell-shaped distribution.]

Normal distribution of estimates around the true effect
[Chart]

Distribution of estimates if the true effect is zero
[Chart]

Distributions under two alternatives
• We don't see these distributions, just our estimate β̂.
• Is our estimate β̂ consistent with the true effect being β*? If the true effect is β*, we would get β̂ with frequency A.
• Is it also consistent with the true effect being 0? If the true effect is 0, we would get β̂ with frequency A′.
• Q: which is more likely, true effect = β* or true effect = 0? A is bigger than A′, so true effect = β* is more likely than true effect = 0.
• But can we rule out that the true effect = 0? Is A′ so small that true effect = 0 is unlikely?
• The probability of getting an estimate at least as large as β̂ when the true effect is 0 is the area to the right of A′ as a share of the total area under the H0 curve.

Critical value
• There is always a chance the true effect is zero, however large our estimated effect.
• Recall that, traditionally, if the probability that we would get β̂ when the true effect is 0 is less than 5%, we say we can reject that the true effect is zero.
• Definition: the critical value is the value of the estimated effect which exactly corresponds to the significance level.
• If testing whether the effect is bigger than 0 at the 95% level, it is the value of the estimate where exactly 95% of the area under the H0 curve lies to the left.
• β̂ is significant at the 95% level if it is further out in the tail than the critical value.

95% critical value for a true effect > 0
[Chart: in this case β̂ is greater than the critical value, so we can reject that the true effect = 0 with 95% confidence.]

What if the true effect = β*?
• How often would we get estimates that we could not distinguish from 0 (if the true effect = β*)?
• How often would we get estimates that we could distinguish from 0 (if the true effect = β*)?
• Q: which is more likely, that we can distinguish an effect from 0 or that we can't?
• The chance of getting estimates we can distinguish from 0 is the area under the Hβ* curve that lies above the critical value for H0.
• The proportion of the area under the Hβ* curve that lies above the critical value is the power.

Definition of power
• Statistical power is the probability that, if the true effect is of a given size, our proposed experiment will be able to distinguish the estimated effect from zero.
• Traditionally, we aim for 80% power. Some people aim for 90% power.

Recap hypothesis testing: power
• Power is the probability of a true positive: in the table of outcomes above, power = 1 − κ, the chance of a significant result when the program is truly effective.
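The power definition above can be simulated directly. The sketch below, which is an illustration rather than part of the slides, repeats a hypothetical experiment many times with a true effect of β* and counts how often the estimated effect can be distinguished from zero at the 5% level; that share is the simulated power. The effect size, outcome sd, and sample sizes are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def simulated_power(true_effect, sd, n_per_arm, alpha=0.05, n_experiments=5_000):
    """Share of simulated experiments in which the estimated effect can be
    distinguished from zero at the given significance level."""
    rejections = 0
    for _ in range(n_experiments):
        comparison = rng.normal(0.0, sd, size=n_per_arm)
        treatment = rng.normal(true_effect, sd, size=n_per_arm)
        _, p_value = stats.ttest_ind(treatment, comparison)
        rejections += p_value < alpha
    return rejections / n_experiments

# Illustrative numbers: a 0.2 sd true effect with 400 units per arm
# comes out close to the conventional 80% power target.
print(simulated_power(true_effect=0.2, sd=1.0, n_per_arm=400))
```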
What if the true effect could be −β* or +β*? What is our power to distinguish it from 0?
[Chart: distributions of estimates when the true effect is −β* or +β*, compared with the distribution under a true effect of 0.]

• The more the H0 curve and the Hβ* curve overlap, the lower the power. Q: what affects the overlap?
• The larger the hypothesized effect, the further apart the curves, and the higher the power.
• The greater the variance in the population, the greater the spread of possible estimates, and the lower the power.

Residual variance
• Some variation in outcomes can be explained by factors we can observe – for example, older children have higher test scores on average, and boys tend to be taller than girls.
• Because we can explain this variation, it does not contribute to our uncertainty about what the true effect is, so it does not reduce power.
• Unexplained, or residual, variance is what determines power.
• Residual variance is the variance in the outcome that is left after we have included all the control variables we will include in our final analysis equation.
• Stratifying the randomization by key control variables ensures balance on those variables.

• The bigger the sample size, the more power.
• Power also depends on the critical value, i.e. the level of significance we are looking for: 10% significance gives higher power than 5% significance.

Why does significance change power?
• Q: what trade-off are we making when we change the significance level and increase power?
• Remember: 10% significance means we'll make Type I (false positive) errors 10% of the time.
• So moving from 5% to 10% significance means we get more power, but at the cost of more false positives.
• It's like widening the gap between the goalposts and saying "now we have a higher chance of scoring a goal".

Allocation ratio and power
• Definition: the allocation ratio is the fraction of the total sample that is allocated to the treatment group.
• Usually, for a given sample size, power is maximized when half the sample is allocated to treatment and half to control.

Why does equal allocation maximize power?
• The treatment effect is the difference between two means (the mean of treatment and the mean of control).
• Adding sample to the treatment group increases the accuracy of the treatment mean; the same holds for control.
• But there are diminishing returns to adding sample size.
• If the treatment group is much bigger than the control group, the marginal person adds little to the accuracy of the treatment-group mean, but would add more to the accuracy of the control-group mean.
• Thus we get the most accurate estimate of the difference when we have equal numbers in the treatment and control groups (a small numerical sketch of this follows the summary below).

Summary of power factors
• Hypothesized effect size
– Q: does a larger effect size make power increase or decrease?
• Variance
– Q: does greater residual variance make power increase or decrease?
• Sample size
– Q: does a larger sample size make power increase or decrease?
• Critical value
– Q: does a looser critical value make power increase or decrease?
• Unequal allocation ratio
– Q: does an unequal allocation ratio make power increase or decrease?
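The numerical sketch promised above, assuming a fixed total sample N and a common residual sd σ across groups: the standard error of the estimated effect grows as the allocation share P moves away from 0.5, which is why equal allocation maximizes power. The specific σ and N are illustrative only.

```python
import numpy as np

# Standard error of the estimated effect when a share P of a fixed total sample N
# goes to treatment: se = sigma * sqrt(1 / (P * (1 - P)) / N).
# The 1 / (P * (1 - P)) term is smallest at P = 0.5, so equal allocation
# gives the most precise estimate (and hence the most power).
sigma, N = 1.0, 1_000   # illustrative residual sd and total sample size
for P in (0.5, 0.6, 0.7, 0.8, 0.9):
    se = sigma * np.sqrt(1 / (P * (1 - P)) / N)
    print(f"share treated P = {P:.1f}: se of estimated effect = {se:.4f}")
```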
Power equation: MDE
The minimum detectable effect (MDE) combines the significance level, the desired power, the variance, the allocation ratio, and the sample size:

$\text{MDE} = \left(t_{1-\kappa} + t_{\alpha}\right)\,\sqrt{\dfrac{1}{P(1-P)}}\,\sqrt{\dfrac{\sigma^2}{N}}$

where $t_{1-\kappa}$ reflects the desired power, $t_{\alpha}$ the significance level, $P$ the proportion of the sample in treatment, $\sigma^2$ the (residual) variance of the outcome, and $N$ the sample size.

Clustered RCT experiments
• Cluster randomized trials are experiments in which social units, or clusters, rather than individuals are randomly allocated to intervention groups.
• The unit of randomization (e.g. the village) is broader than the unit of analysis (e.g. farmers).
• That is: we randomize at the village level, but use farmer-level surveys as our unit of analysis.

Clustered design: intuition
• We want to know how much rice the average farmer in Sierra Leone grew last year.
• Method 1: randomly select 9,000 farmers from around the country.
• Method 2: randomly select 9,000 farmers from one district.

Clustered design: intuition II
• Some parts of the country may grow more rice than others in general; and what if one district had a drought? Or a flood?
– i.e. we worry both about long-term correlations and about correlations of shocks within groups.
• Method 1 gives the most accurate estimate.
• Method 2 is much cheaper, so for a given budget we could sample more farmers.
• What combination of 1 and 2 gives the highest power for a given budget constraint?
• It depends on the level of intracluster correlation, ρ (rho).

Low intracluster correlation
[Chart: variation in the population, the clusters, and the sampled clusters when each cluster looks much like the population.]

High intracluster correlation
[Chart: variation in the population when most of the variation is between clusters rather than within them.]

Intracluster correlation
• Total variance can be divided into between-cluster variance (τ²) and within-cluster variance (σ²).
• When the variance within clusters is small and the variance between clusters is large, the intracluster correlation is high (previous slide).
• Definition of the intracluster correlation (ICC): the proportion of total variation explained by between-cluster variance.
– Note: when within-cluster variance is high relative to between-cluster variance, the ICC is low.

$icc = \rho = \dfrac{\tau^2}{\sigma^2 + \tau^2}$

How does the ICC affect power?
• Randomizing at the group level reduces power for a given sample size because outcomes tend to be correlated within a group.
• Power depends on:
1. The number of clusters
2. How similar people in each cluster are to each other compared to the population in general

Power with clustering
With clustering, the MDE picks up a design-effect term that depends on the average cluster size m and the ICC ρ:

$\text{MDE} = \left(t_{1-\kappa} + t_{\alpha}\right)\,\sqrt{\dfrac{1}{P(1-P)}}\,\sqrt{\dfrac{\sigma^2}{N}}\,\sqrt{1 + (m-1)\rho}$

Clustered RCTs vs. clustered sampling
• You must cluster at the level at which you randomize.
– There are many reasons to randomize at the group level.
• You could randomize by farmer group, village, or district.
• If you randomize one district to treatment and one to comparison, you have too little power however many farmers you interview.
– You can never distinguish the treatment effect from possible district-wide shocks.
• If you randomize at the individual level, you don't need to worry about within-village correlation or village-level shocks, as they affect both T and C.

Bottom line for clustering
• If the experimental design is clustered, we need to consider ρ when choosing a sample size (as well as the other factors).
• You must cluster at the level of randomization.
• It is extremely important to randomize an adequate number of groups.
• Often the number of individuals within groups matters less than the total number of groups.
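A minimal sketch of the two MDE formulas above, not from the slides: the `mde` helper and all its parameter values are illustrative assumptions, and the two critical values are taken from the normal distribution as an approximation. It compares the MDE for 4,000 individually randomized units with the MDE for the same 4,000 units grouped into clusters of 40 at ρ = 0.05.

```python
import numpy as np
from scipy import stats

def mde(N, P=0.5, sigma=1.0, alpha=0.05, power=0.80, m=1, icc=0.0):
    """Minimum detectable effect from the formulas above:
    (t_{1-kappa} + t_alpha) * sqrt(1/(P(1-P))) * sqrt(sigma^2/N) * sqrt(1 + (m-1)*rho),
    using normal approximations for the two critical values."""
    t_alpha = stats.norm.ppf(1 - alpha / 2)   # two-sided significance level
    t_power = stats.norm.ppf(power)           # desired power
    design_effect = np.sqrt(1 + (m - 1) * icc)
    return (t_power + t_alpha) * np.sqrt(1 / (P * (1 - P))) * np.sqrt(sigma**2 / N) * design_effect

# Illustrative comparison: 4,000 individuals randomized individually vs. the same
# 4,000 grouped into clusters of 40 with an intracluster correlation of 0.05.
print(f"individual randomization:   MDE = {mde(N=4_000):.3f} sd")
print(f"clustered (m=40, rho=0.05): MDE = {mde(N=4_000, m=40, icc=0.05):.3f} sd")
```

The design-effect factor √(1 + (m − 1)ρ) is why, in the bottom line above, the number of clusters usually matters more than the number of individuals within each cluster.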
Common trade-offs and rules of thumb

Common trade-offs
– Answer one question really well? Or many questions with less accuracy?
– A large sample size with possible attrition? Or a small sample size that we track very closely?
– Few clusters with many observations? Or many clusters with few observations?
– How do we allocate our sample to each group?

Rules of thumb (1/2)
1. A larger sample is needed to detect differences between two variants of a program than between the program and the comparison group (illustrated in the sketch at the end of this section).
2. For a given sample size, the highest power is achieved when half the sample is allocated to treatment and half to comparison.

Rules of thumb (2/2)
3. The more measurements are taken, the higher the power. In particular, if there is a baseline and an endline rather than just an endline, you have more power.
4. The lower the compliance, the lower the power. The higher the attrition, the lower the power.
5. For a given sample size, we have less power if randomization is at the group level than at the individual level.
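The sketch referenced in rule of thumb 1, again an illustration rather than part of the slides: inverting the individual-level MDE formula for N shows why detecting a small difference between two program variants takes a much larger sample than detecting the program's effect against the comparison group. The `required_n` helper, the effect sizes, and the normal approximation are all assumptions.

```python
import numpy as np
from scipy import stats

def required_n(target_mde, P=0.5, sigma=1.0, alpha=0.05, power=0.80):
    """Total sample size implied by inverting the MDE formula above
    (individual-level randomization, normal approximations)."""
    t_alpha = stats.norm.ppf(1 - alpha / 2)
    t_power = stats.norm.ppf(power)
    return (t_power + t_alpha) ** 2 * (1 / (P * (1 - P))) * sigma**2 / target_mde**2

# If the program vs. comparison difference is 0.20 sd but the two program
# variants differ by only 0.05 sd, detecting the variant difference
# needs roughly 16 times the sample (sample size scales with 1 / MDE^2).
print(f"program vs. comparison (0.20 sd): N ≈ {required_n(0.20):,.0f}")
print(f"variant A vs. variant B (0.05 sd): N ≈ {required_n(0.05):,.0f}")
```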