Sample Size and Statistical Power
John Floretta
J-PAL South Asia
Overview of power
• What can power analysis do?
• Statistical background to power
• Understanding what determines power
• Power in a clustered randomized trial
What can power analysis do?
• How large does the sample need to be to
“credibly” detect a given treatment effect?
• How many different treatment arms can I
include in my project?
• Should I sample more communities or more
people in each community?
• Is the experiment worth running? Will I get
precise enough estimates of impact to make it
worthwhile?
Lecture Outline
1. Statistical Background to Power
2. Deriving the Components of Power
3. Common Trade-offs and Rules of Thumb
Statistical Background to Power
Concepts covered
• Sampling variation
• Standard deviation
• Standard error
• Central limit theorem
• Confidence interval and confidence level
• Hypothesis testing
• Statistical inference
Sampling variation: example
• We want to know the average test score of
grade 3 children in Springfield
• How many children would we need to sample
to get an accurate picture of the average test
score?
Population: test scores of all 3rd graders
[Figure: distribution of the population; the true population mean is 26]
Pick a sample of 20 students: plot frequency
[Figure: a sample of 20 overlaid on the population; the sample mean differs from the population mean]
Zooming in on the sample of 20 students
[Figure: the sample distribution and its sample mean]
Pick a different sample of 20 students
[Figure: a second sample of 20 gives a different sample mean]
Another sample of 20 students
[Figure: a third sample of 20 gives yet another sample mean]
Sampling variation: definition
• Sampling variation is the variation between different
estimates (e.g. the mean of test scores) that arises because
we test only a sample, not everyone
• Our estimates may vary for other reasons too (e.g. the
quality of those administering the test may vary), but that
is not sampling variation
• Sampling variation depends on:
– the number of people we sample
– the variation in test scores in the underlying population
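The two drivers above can be seen in a quick simulation. This is a minimal sketch using only Python's standard library; the population of 10,000 test scores with mean 26 is a hypothetical stand-in for the Springfield example, and the seed is arbitrary.

```python
import random
import statistics

random.seed(42)
# Hypothetical population of 10,000 test scores, true mean ~26, sd ~8
population = [random.gauss(26, 8) for _ in range(10_000)]

def spread_of_sample_means(n, draws=1_000):
    """Std. dev. of the sample mean across many repeated samples of size n."""
    means = [statistics.mean(random.sample(population, n)) for _ in range(draws)]
    return statistics.stdev(means)

# Larger samples give less sampling variation in the estimated mean
assert spread_of_sample_means(100) < spread_of_sample_means(20)
```

Rerunning the same sketch with a more dispersed population (a larger sd in `random.gauss`) widens both spreads, matching the second bullet above.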
Let's pick a sample of 50 students
[Figure: a sample of 50; the sample mean is closer to the population mean]
A different sample of 50 students
[Figure: a second sample of 50 and its sample mean]
A third sample of 50 students
[Figure: a third sample of 50 and its sample mean]
Let's pick a sample of 100 students
[Figure: a sample of 100; the sample mean is closer still to the population mean]
Let's pick a different 100 students
[Figure: a second sample of 100 and its sample mean]
[Figure: a third sample of 100 and its sample mean]
What if our population, instead of looking like
this…
[Figure: a widely dispersed population around its mean]
…looked like this?
[Figure: a tightly concentrated population around the same mean]
Different samples of 20 give similar estimates
[Figure: repeated samples of 20 from the concentrated population all give sample means close to the population mean]
Standard deviation: population I
• A measure of dispersion in the population
[Figure: population distribution marked one standard deviation either side of the mean]
Standard deviation: population II
[Figure: a more dispersed population; one standard deviation spans a wider range]
Standard deviation
• Measures dispersion in the underlying population
• A weighted average distance to the mean
– gives more weight to the points furthest from the mean

sd = √( Σᵢ₌₁ⁿ (Xᵢ − X̄)² / n )
Standard error
• Measures dispersion in our estimate
– in this case, in our estimate of the mean
• The SE depends on the distribution of the underlying
population and on the size of the sample
• We don't normally observe the sd of the entire population,
so we usually estimate it using the sd of the sample
• Formula for the SE when sampling a small proportion of
the population:

se = sd / √n
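The formula can be checked against brute-force resampling. This sketch assumes a hypothetical population (mean 26, sd 8, as in the earlier example) and compares the formula's SE with the spread of actual sample means:

```python
import random
import statistics
import math

random.seed(0)
# Hypothetical population, mean ~26, sd ~8 (assumed values)
population = [random.gauss(26, 8) for _ in range(50_000)]
n = 25

# Empirical SE: std. dev. of sample means over many repeated samples
means = [statistics.mean(random.sample(population, n)) for _ in range(2_000)]
empirical_se = statistics.stdev(means)

# Formula: se = sd / sqrt(n), here using the population sd
formula_se = statistics.pstdev(population) / math.sqrt(n)

# The two agree closely
assert abs(empirical_se - formula_se) / formula_se < 0.10
```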
Central limit theorem
• If we take many samples and estimate the mean
many times, the frequency plot of our estimates
will resemble the normal distribution
• This is true even if the underlying population
distribution is not normal
• In a normal distribution, 68% of all estimates will
be within 1 standard error of the true value
– 95% of all estimates will be within 2 standard errors of the
true value
– We use this fact to derive confidence intervals around
our estimates
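Both claims can be illustrated by sampling from a deliberately non-normal population. This sketch assumes a hypothetical right-skewed population of "scores" drawn from an exponential distribution:

```python
import random
import statistics

random.seed(1)
# A very non-normal (right-skewed) hypothetical population
population = [random.expovariate(1 / 20) for _ in range(50_000)]
true_mean = statistics.mean(population)

n = 50
means = [statistics.mean(random.sample(population, n)) for _ in range(3_000)]
se = statistics.stdev(means)

# CLT: roughly 68% of sample means fall within 1 SE of the true mean,
# even though the population itself is far from normal
within_1se = sum(abs(m - true_mean) <= se for m in means) / len(means)
assert 0.63 < within_1se < 0.74
```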
Population of test scores is not normal
[Figure: a skewed, non-normal population distribution]
Take the mean of one sample
[Figure: one sample and its sample mean]
Plot that one mean
[Figure: the first sample mean plotted on a new frequency chart]
Take another sample and plot that mean
[Figure: a second sample mean added to the chart]
Repeat many times
[Figure: the plot of sample means fills in with each repetition]
Distribution of sample means
[Figure: after many repetitions, the sample means form a bell-shaped (normal) distribution]
Confidence around estimates
• A point estimate is the value of a single estimate
• A confidence interval is a range around a single point
estimate which gives a sense of the precision of the
estimate:
– a narrow confidence interval suggests the estimate is quite
precise
– a confidence interval is associated with a particular confidence
level
• The confidence level is reported for a given confidence interval
– A 95% confidence level means that if we constructed 100
confidence intervals, we would expect 95 of them to contain the
true value
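The "95 out of 100" claim can be verified by simulation. This is a sketch under assumed values (a hypothetical normal population; 1.96 is the usual two-sided 5% critical value):

```python
import random
import statistics
import math

random.seed(7)
# Hypothetical population, mean ~26, sd ~8 (assumed values)
population = [random.gauss(26, 8) for _ in range(50_000)]
true_mean = statistics.mean(population)
n = 100

covered = 0
trials = 1_000
for _ in range(trials):
    sample = random.sample(population, n)
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    # 95% CI: point estimate +/- 1.96 standard errors
    if m - 1.96 * se <= true_mean <= m + 1.96 * se:
        covered += 1

coverage = covered / trials
# Close to 95% of the intervals contain the true mean
assert 0.92 < coverage < 0.98
```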
Hypothesis testing
• Until now we have been estimating the mean of one sample
• When we do impact evaluations we compare means from
two different groups (the treatment and comparison groups)
• Null hypothesis: the two means are the same and any
observed difference is due to chance
– H0: treatment effect = 0
• Research hypothesis: the true means are different from each
other
– H1: treatment effect ≠ 0
• Other (one-sided) tests are possible
– H2: treatment effect > 0
Hypothesis testing: 4 possible outcomes

                                         Underlying truth
                            Effective (H0 false)     No effect (H0 true)
Statistical  Significant    True positive            False positive
test         (reject H0)    Probability = (1 − κ)    Type I error
                                                     Probability = α
             Not            False zero               True zero
             significant    Type II error            Probability = (1 − α)
             (fail to       (low power)
             reject H0)     Probability = κ
Deriving the components of power
Review: one population, one sample
[Figure: population distribution, population mean, one sample, and its sample mean]
Review: distribution of sample means
[Figure: the bell-shaped distribution of sample means from repeated sampling]
Review: samples vs populations
• We have two different populations: treatment
and comparison
• We only see the samples: a sample from the
treatment population and a sample from the
comparison population
• We want to know if the populations are
different from each other
• We compare the sample means of treatment and
comparison
• We must take into account that different samples
will give us different means (sampling variation)
One experiment: 2 samples, 2 means
[Figure: comparison and treatment sample distributions with their two sample means]
One experiment: difference between the sample means
[Figure: the gap between the treatment mean and the comparison mean is the estimated effect]
What if we ran a second experiment?
[Figure: a second experiment gives a different estimated effect]
Many experiments give a distribution of estimates
[Figure: histogram of estimated effects (differences) across many repeated experiments, gradually filling in around the true effect]
Normal distribution of estimates around the true effect
Distribution of estimates if the true effect is zero
Distributions under the two alternatives
We don't see these distributions, just our
estimate β̂
Is our estimate β̂ consistent with the true effect
being β*?
If the true effect is β*, we would get β̂ with
frequency A
Is it also consistent with the true effect being 0?
If the true effect is 0, we would get β̂ with
frequency A′
Q: which is more likely, true effect = β* or true
effect = 0?
A is bigger than A′, so true effect = β* is more
likely than true effect = 0
But can we rule out that true effect = 0?
Is A′ so small that true effect = 0 is unlikely?
The probability of an estimate at least as large as β̂
when the true effect is 0 is the area to the right of A′
over the total area under the curve
Critical value
• There is always a chance the true effect is zero,
however large our estimated effect
• Recall that, traditionally, if the probability that we
would get β̂ when the true effect is 0 is less than 5%, we
say we can reject that the true effect is zero
• Definition: the critical value is the value of the
estimated effect which exactly corresponds to the
significance level
• When testing whether the effect is bigger than 0 at the
95% level, it is the value of the estimate where exactly 95%
of the area under the H0 curve lies to the left
• β̂ is significant at the 95% level if it is further out in the tail
than the critical value
95% critical value when testing true effect > 0
In this case β̂ is greater than the critical value, so…
…we can reject that the true effect = 0 with 95%
confidence
What if the true effect = β*?
How often would we get estimates that we could
not distinguish from 0? (if the true effect = β*)
How often would we get estimates that we could
distinguish from 0? (if the true effect = β*)
Q: which is more likely, that we can distinguish
an effect from 0 or that we can't?
The chance of getting estimates we can distinguish
from 0 is the area under the Hβ* curve that lies above the
critical value for H0
The proportion of the area under Hβ* that lies above the
critical value is the power
Definition of Power
• Statistical power is the probability that, if the
true effect is of a given size, our proposed
experiment will be able to distinguish the
estimated effect from zero
• Traditionally, we aim for 80% power. Some
people aim for 90% power
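The definition above translates directly into a simulation: fix a true effect, rerun the experiment many times, and count how often the estimate clears the 5% significance bar. This is a sketch under assumed values (normal outcomes, sd of 1, 1.96 as the two-sided critical value):

```python
import random
import statistics
import math

random.seed(3)

def simulated_power(effect, sd, n_per_arm, trials=1_000):
    """Share of simulated experiments whose estimate is significant at 5%."""
    rejections = 0
    for _ in range(trials):
        treat = [random.gauss(effect, sd) for _ in range(n_per_arm)]
        control = [random.gauss(0.0, sd) for _ in range(n_per_arm)]
        diff = statistics.mean(treat) - statistics.mean(control)
        se = math.sqrt(statistics.variance(treat) / n_per_arm
                       + statistics.variance(control) / n_per_arm)
        if abs(diff / se) > 1.96:
            rejections += 1
    return rejections / trials

# With a true effect of 0.4 sd and 100 per arm, power is close to 80%
assert 0.72 < simulated_power(0.4, 1.0, 100) < 0.88
```

If the true effect is set to 0, `simulated_power` returns roughly 0.05: by construction we make Type I errors 5% of the time.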
Recap hypothesis testing: power

Power = probability of a true positive

                                         Underlying truth
                            Effective (H0 false)     No effect (H0 true)
Statistical  Significant    True positive            False positive
test         (reject H0)    Probability = (1 − κ)    Type I error
                                                     Probability = α
             Not            False zero               True zero
             significant    Type II error            Probability = (1 − α)
             (fail to       (low power)
             reject H0)     Probability = κ
What if the true effect could be −β* or +β*?
What is our power to distinguish it from 0?
The more overlap between the H0 curve and the Hβ* curve,
the lower the power. Q: what affects overlap?
The larger the hypothesized effect, the further apart the
curves, and the higher the power
Greater variance in the population increases the spread
of possible estimates and reduces power
Residual variance
• Some variation in outcomes can be explained by
factors we can observe
– for example, older children have higher test scores on
average, and boys tend to be taller than girls
• Because we can explain this variation, it does not
contribute to our uncertainty about the true
effect, so it does not reduce power
• Unexplained, or residual, variance determines power
• Residual variance is the variance in the outcome that is
left after we have included all the control variables we
will include in our final analysis equation
• Stratifying by key control variables ensures balance
The bigger the sample size, the more power
Power also depends on the critical value, i.e. the level
of significance we are looking for…
10% significance gives higher power than 5%
significance
Why does significance change power?
• Q: what trade-off are we making when we change the
significance level and increase power?
• Remember: 10% significance means we'll make
Type I (false positive) errors 10% of the time
• So moving from 5% to 10% significance means we get
more power, but at the cost of more false
positives
• It's like widening the gap between the goal posts
and saying "now we have a higher chance of
scoring a goal"
Allocation ratio and power
• Definition of allocation ratio: the fraction of
the total sample that is allocated to the
treatment group
• Usually, for a given sample size, power is
maximized when half the sample is allocated to
treatment and half to control
Why does equal allocation maximize power?
• The treatment effect is the difference between two means
(the mean of treatment and the mean of control)
• Adding sample to the treatment group increases the accuracy
of the treatment mean; the same holds for control
• But there are diminishing returns to adding sample size
• If the treatment group is much bigger than the control group,
the marginal person adds little to the accuracy of the treatment
group mean, but would add more to the control group mean
• Thus we improve the accuracy of the estimated difference
when we have equal numbers in the treatment and control
groups
Summary of power factors
• Hypothesized effect size
– Q: does a larger effect size make power increase or decrease?
• Variance
– Q: does greater residual variance make power
increase or decrease?
• Sample size
– Q: does a larger sample size make power increase or decrease?
• Critical value
– Q: does a looser critical value make power increase or decrease?
• Unequal allocation ratio
– Q: does an unequal allocation ratio make power
increase or decrease?
Power equation: MDE

MDE = (t₁₋κ + tα) · √( 1 / (P(1 − P)) ) · √( σ² / N )

where:
• t₁₋κ — power
• tα — significance level
• P — proportion allocated to treatment
• σ² — variance
• N — sample size
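The MDE equation above can be put to work directly. This is a sketch with assumed conventional values (z ≈ 0.84 for 80% power, z ≈ 1.96 for two-sided 5% significance); the sample size of 400 is illustrative:

```python
import math

# Assumed conventional critical values
t_power = 0.84   # z-value for 80% power (1 - kappa)
t_alpha = 1.96   # z-value for 5% significance, two-sided

def mde(sigma2, N, P=0.5):
    """Minimum detectable effect for a simple individually-randomized trial."""
    return (t_power + t_alpha) * math.sqrt(1 / (P * (1 - P))) * math.sqrt(sigma2 / N)

# With variance 1 and N = 400, equal allocation gives an MDE of 0.28 sd
assert abs(mde(1.0, 400) - 0.28) < 1e-6
# Equal allocation (P = 0.5) gives a smaller MDE than a 90/10 split
assert mde(1.0, 400, P=0.5) < mde(1.0, 400, P=0.9)
```

The second assertion restates the allocation-ratio rule of thumb: the √(1/(P(1−P))) term is smallest, and power highest, at P = 0.5.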
Clustered RCT experiments
• Cluster randomized trials are experiments in
which social units or clusters rather than
individuals are randomly allocated to
intervention groups
• The unit of randomization (e.g. the village) is
broader than the unit of analysis (e.g. farmers)
• That is: randomize at the village level, but use
farmer-level surveys as our unit of analysis
Clustered design: intuition
• We want to know how much rice the average
farmer in Sierra Leone grew last year
• Method 1: Randomly select 9,000 farmers
from around the country
• Method 2: Randomly select 9,000 farmers
from one district
Clustered design: intuition II
• Some parts of the country may grow more rice than
others in general; what if one district had a drought?
Or a flood?
– i.e. we worry both about long-term correlations and
correlations of shocks within groups
• Method 1 gives most accurate estimate
• Method 2 much cheaper so for given budget could
sample more farmers
• What combination of 1 and 2 gives the highest power
for given budget constraint?
• Depends on the level of intracluster correlation, ρ (rho)
Low intracluster correlation
[Figure: clusters drawn from across the full variation in the population; sampled clusters span the whole range]
HIGH intracluster correlation
[Figure: each cluster covers only a narrow slice of the variation in the population]
Intracluster correlation
• Total variance can be divided into between-cluster
variance (τ²) and within-cluster variance (σ²)
• When the variance within clusters is small and the variance
between clusters is large, the intracluster correlation is
high (previous slide)
• Definition of intracluster correlation (ICC): the
proportion of total variation explained by between-cluster
variance
– Note: when within-cluster variance is high, the intracluster
correlation is low

icc = ρ = τ² / (τ² + σ²)
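The ICC formula is easy to compute from raw cluster data. This sketch uses a tiny hypothetical data set of three clusters of four farmers whose outcomes are very similar within each cluster:

```python
import statistics

# Hypothetical outcomes for 3 clusters of 4 farmers each
clusters = [[10, 11, 9, 10], [20, 21, 19, 20], [30, 29, 31, 30]]

cluster_means = [statistics.mean(c) for c in clusters]
# Between-cluster variance (tau^2): variance of the cluster means
between = statistics.pvariance(cluster_means)
# Within-cluster variance (sigma^2): average variance inside each cluster
within = statistics.mean(statistics.pvariance(c) for c in clusters)

icc = between / (between + within)
# Clusters are very different from each other, so the ICC is close to 1
assert icc > 0.9
```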
How does ICC affect power?
• Randomizing at the group level reduces power
for a given sample size because outcomes
tend to be correlated within a group.
• Power depends on
1. The number of clusters
2. How similar people in each cluster are to each
other compared to the population in general
Power with clustering

MDE = (t₁₋κ + tα) · √( 1 / (P(1 − P)) ) · √( σ² / N ) · √( 1 + ρ(m − 1) )

where:
• t₁₋κ — power
• tα — significance level
• P — proportion in treatment
• σ² — variance
• N — sample size
• ρ — intracluster correlation (ICC)
• m — average cluster size
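The clustering penalty in the equation above, 1 + ρ(m − 1), is often called the design effect. This sketch computes it for the earlier Sierra Leone example; ρ = 0.2 and 50 farmers per cluster are assumed illustrative values, not estimates from data:

```python
def design_effect(rho, m):
    """Variance inflation from clustering: 1 + rho * (m - 1)."""
    return 1 + rho * (m - 1)

def effective_sample_size(N, rho, m):
    """A clustered sample of N behaves like this many independent observations."""
    return N / design_effect(rho, m)

# With rho = 0.2 and 50 farmers per cluster, the design effect is 10.8,
# so 9,000 clustered farmers carry the information of fewer than
# 850 independent observations
assert abs(design_effect(0.2, 50) - 10.8) < 1e-9
assert effective_sample_size(9_000, 0.2, 50) < 850
```

This is why the number of clusters usually matters more than the number of people per cluster: adding clusters adds independent information, while adding people within a cluster mostly duplicates it.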
Clustered RCTs vs. clustered sampling
• Must cluster at the level at which you randomize
– there are many reasons to randomize at the group level
• Could randomize by farmer group, village, or district
• If we randomize one district to treatment and one to
comparison, we have too little power however many farmers
we interview
– we can never distinguish the treatment effect from possible
district-wide shocks
• If we randomize at the individual level, we don't need to
worry about within-village correlation or village-level
shocks, as they affect both treatment and comparison
Bottom line for clustering
• If the experimental design is clustered, we now
need to consider ρ when choosing a sample
size (as well as the other factors)
• Must cluster at the level of randomization
• It is extremely important to randomize an
adequate number of groups
• Often the number of individuals within
groups matters less than the total number of
groups
Common tradeoffs and
rules of thumb
Common tradeoffs
– Answer one question really well? Or many
questions with less accuracy?
– Large sample size with possible attrition?
Or small sample size that we track very
closely?
– Few clusters with many observations? Or
many clusters with few observations?
– How do we allocate our sample to each
group?
Rules of thumb (1/2)
1. A larger sample is needed to detect differences
between two variants of a program than
between the program and the comparison group.
2. For a given sample size, the highest power is
achieved when half the sample is allocated to
treatment and half to comparison.
Rules of thumb (2/2)
3. The more measurements are taken, the higher
the power. In particular, if there is a baseline and
endline rather than just an endline, you have
more power.
4. The lower the compliance, the lower the power. The
higher the attrition, the lower the power.
5. For a given sample size, we have less power if
randomization is at the group level than at the
individual level.