Sampling, Statistics, Sample Size, Power
Natalia Rigol (MIT)
Executive Education Course in Evaluating Social Programs
23rd July, 2014
Course Overview
1. What is evaluation?
2. Measuring impacts (outcomes, indicators)
3. Why randomize?
4. How to randomize?
5. Sampling and sample size
6. Threats and Analysis
7. Cost-Effectiveness Analysis
8. Project from Start to Finish
Our Goal in This Lecture: From Sample to Population
1. To understand how samples and populations are related
   1. Population: all people who meet certain criteria. Ex: the population of all 3rd graders in India who take a certain exam
   2. Sample: a subset of the population. Ex: 1000 3rd graders in India who take a certain exam
   ▪ We want the sample to tell us something about the overall population
   ▪ Specifically, we want a sample from the treatment and a sample from the control to tell us something about the true effect size of an intervention in a population
2. To build intuition for setting the optimal sample size for your study
   ▪ This will help us confidently detect a difference between treatment and control
Lecture Outline
1. Basic Statistics Terms
2. Sampling variation
3. Law of large numbers
4. Central limit theorem
5. Hypothesis testing
6. Statistical inference
7. Power
Lesson 1: Basic Statistics
To understand how to interpret data, we need to understand three basic concepts:
▪ What is a distribution?
▪ What’s an average result?
▪ What is a standard deviation?
What is a distribution?
▪ A distribution graph or table shows each possible outcome and the frequency with which we observe that outcome
▪ A probability distribution is the same as a distribution, but converts frequency to probability
[Figure: Baseline Test Scores. Histogram of frequency (0 to 500) against test scores (0 to 98).]
What’s the Average Result?
▪ What is the “expected result” (i.e. the average)?
▪ Expected result = the sum of all possible values, each multiplied by the probability of its occurrence
[Figure: histogram of frequency against test scores (0 to 100), with the population mean marked. Mean = 26.]
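As a minimal sketch of the expected-result rule, the snippet below computes a mean from a frequency table. The scores and counts in `frequencies` are made up for illustration and are not the baseline data behind the slide.

```python
# Hypothetical frequency table: test score -> number of students with that score.
# These numbers are illustrative only, not the data behind the slide.
frequencies = {10: 300, 20: 400, 30: 200, 60: 100}


def expected_value(freq):
    """Sum of every possible value times its probability of occurring."""
    total = sum(freq.values())
    # Dividing a count by the total turns a frequency into a probability.
    return sum(value * (count / total) for value, count in freq.items())


mean_score = expected_value(frequencies)
print(f"expected test score: {mean_score:.1f}")
```

Dividing each count by the total is the same step that turns a distribution into a probability distribution.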
What’s a standard deviation?
▪ Standard deviation: measure of dispersion in the population
▪ It is a weighted average distance to the mean that gives more weight to the points furthest from the mean
[Figure: histogram of frequency against test scores (0 to 100), with the mean (26) and 1 standard deviation marked. Standard Deviation = 20.]
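One way to see why points far from the mean count for more is to compute a standard deviation by hand. This is a sketch on a made-up frequency table (the name `frequencies` and its values are assumptions): the distances are squared before averaging, and the square root is taken at the end.

```python
import math

# Hypothetical frequency table (score -> number of students); illustrative only.
frequencies = {10: 300, 20: 400, 30: 200, 60: 100}


def mean_and_sd(freq):
    """Return the mean and the population standard deviation of a frequency table."""
    total = sum(freq.values())
    mean = sum(value * count for value, count in freq.items()) / total
    # Squaring each distance to the mean makes far-away points count more.
    variance = sum(count * (value - mean) ** 2 for value, count in freq.items()) / total
    return mean, math.sqrt(variance)


mean_score, sd_score = mean_and_sd(frequencies)
print(f"mean = {mean_score:.1f}, standard deviation = {sd_score:.1f}")
```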
Sampling variation: example
▪ We want to know the average test score of grade 3 children in Springfield
▪ How many children would we need to sample to get an accurate picture of the average test score?
Population: test scores of all 3rd graders
Mean of population is 26 (true mean)
[Figure: population histogram with the population mean marked.]
Pick sample of 20 students: plot frequency
Zooming in on sample of 20 students
Pick a different sample of 20 students
Another sample of 20 students
[Figures: for each sample of 20, the sample and its sample mean are shown against the population mean.]
Sampling variation: definition
▪ Sampling variation is the variation we get between different estimates (e.g. mean of test scores) because we test only a sample rather than everyone
▪ Sampling variation depends on:
• The variation in test scores in the underlying population
• The number of people we sample
What if our population instead of looking like this…
…looked like this
[Figures: the two population histograms, each with its population mean marked.]
Standard deviation: population I
▪ Measure of dispersion in the population
Standard deviation: population II
[Figures: 1 standard deviation marked around the population mean for each population.]
Different samples of 20 give similar estimates
[Figures: three different samples of 20, each with its sample mean shown against the population mean.]
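The dependence of sampling variation on the spread of the underlying population can be checked with a small simulation. This is a sketch under assumed numbers: both populations are simulated as normal with mean 26, with standard deviations of 20 and 5 chosen only for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Two hypothetical populations with the same mean (26) but different spread.
wide_population = [rng.gauss(26, 20) for _ in range(10_000)]
narrow_population = [rng.gauss(26, 5) for _ in range(10_000)]


def spread_of_sample_means(population, sample_size, draws=500):
    """Draw many random samples and return the standard deviation of their means."""
    means = [statistics.fmean(rng.sample(population, sample_size)) for _ in range(draws)]
    return statistics.stdev(means)


spread_wide = spread_of_sample_means(wide_population, 20)
spread_narrow = spread_of_sample_means(narrow_population, 20)
print(f"samples of 20, wide population:   {spread_wide:.2f}")
print(f"samples of 20, narrow population: {spread_narrow:.2f}")
```

With the same sample size of 20, the estimates from the narrow population sit much closer together.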
Pick sample of 20 students: plot frequency
Zooming in on sample of 20 students
Pick a different sample of 20 students
Another sample of 20 students
Let’s pick a sample of 50 students
A different sample of 50 students
A third sample of 50 students
Let’s pick a sample of 100 students
Let’s pick a different 100 students
Let’s pick a different 100 students. What do we notice?
[Figures: for each sample, the sample and its sample mean are shown against the population mean.]
Law of Large Numbers
▪ The more students you sample (so long as they are selected at random), the closer most sample averages are to the true average (the distribution gets “tighter”)
▪ When we conduct an experiment, we can feel confident that, on average, our treatment and control groups would have the same average outcomes in the absence of the intervention
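The tightening can be illustrated by repeating the sampling exercise at sizes 20, 50 and 100. This is a sketch only: the population below is simulated (a right-skewed distribution with a mean of about 26 is an assumption, not the course data).

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical skewed population of scores with a mean of roughly 26.
population = [rng.expovariate(1 / 26) for _ in range(20_000)]
true_mean = statistics.fmean(population)

spreads = {}
for sample_size in (20, 50, 100):
    means = [statistics.fmean(rng.sample(population, sample_size)) for _ in range(1_000)]
    # Typical distance between a sample mean and the true mean.
    spreads[sample_size] = statistics.fmean([abs(m - true_mean) for m in means])
    print(f"n = {sample_size:3d}: typical error of the sample mean = {spreads[sample_size]:.2f}")
```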
Central Limit Theorem
▪ If we take many samples and estimate the mean many times, the frequency plot of our estimates (the sampling distribution) will resemble the normal distribution
▪ This is true even if the underlying population distribution is not normal
Population of test scores is not normal
Take the mean of one sample
Plot that one mean
Take another sample and plot that mean
Repeat many times
Distribution of Sample Means
Normal Distribution
[Figures: the mean of each new sample is plotted against the population mean; repeated many times, the plotted sample means form the distribution of sample means.]
Central Limit Theorem
▪ The more samples you take, the more the distribution of possible averages (the sampling distribution) looks like a bell curve (a normal distribution)
▪ This result is INDEPENDENT of the underlying distribution
▪ The mean of the distribution of the means will be the same as the mean of the population
▪ The standard deviation of the sampling distribution will be the standard error (SE)
$se = \frac{sd}{\sqrt{n}}$
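These three properties can be checked numerically. The sketch below assumes a simulated right-skewed population (not the course data) and samples of n = 100. It compares the spread of the sample means with sd divided by the square root of n, and counts how many sample means fall within 1.96 standard errors of the population mean, which would be about 95% under a normal distribution.

```python
import math
import random
import statistics

rng = random.Random(2)  # fixed seed so the sketch is reproducible

# Hypothetical, clearly non-normal (right-skewed) population.
population = [rng.expovariate(1 / 26) for _ in range(20_000)]
population_mean = statistics.fmean(population)
population_sd = statistics.pstdev(population)

n = 100
sample_means = [statistics.fmean(rng.sample(population, n)) for _ in range(2_000)]

predicted_se = population_sd / math.sqrt(n)    # se = sd / sqrt(n)
observed_se = statistics.stdev(sample_means)   # spread of the sampling distribution
mean_of_means = statistics.fmean(sample_means)
share_within = sum(
    abs(m - population_mean) <= 1.96 * predicted_se for m in sample_means
) / len(sample_means)

print(f"population mean = {population_mean:.2f}, mean of sample means = {mean_of_means:.2f}")
print(f"predicted se = {predicted_se:.2f}, observed se = {observed_se:.2f}")
print(f"share of sample means within 1.96 se: {share_within:.3f}")
```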
Central Limit Theorem
▪ The central limit theorem is crucial for statistical inference
▪ Even if the underlying distribution is not normal, IF THE SAMPLE SIZE IS LARGE ENOUGH, we can treat the sampling distribution of the mean as being normally distributed
THE Basic Questions in Statistics
▪ How big does your sample need to be?
▪ Why is this the ultimate question?
• How confident can you be in your results?
▪ We need it to be large enough that both the law of large numbers and the central limit theorem can be applied
▪ We need it to be large enough that we could detect a difference in the outcome of interest between the treatment and control samples
Samples vs Populations
▪ We have two different populations: treatment and comparison
▪ We only see the samples: a sample from the treatment population and a sample from the comparison population
▪ We will want to know if the populations are different from each other
▪ We will compare sample means of treatment and comparison
▪ We must take into account that different samples will give us different means (sampling variation)
One Experiment, 2 Samples, 2 Means
Difference between the sample means
What if we ran a second experiment?
[Figures: comparison and treatment samples with the comparison mean and treatment mean marked; the estimated effect is the difference between the two means.]
Many experiments give distribution of estimates
What does this remind you of?
[Figures: histogram of the estimated difference from many experiments, built up step by step; frequency (0 to 100) against difference (-3 to 10).]
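The idea of running the same experiment many times can be sketched as a simulation. Everything in it is assumed for illustration: a true effect of 3 points, an outcome standard deviation of 20, and 200 students per arm.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed so the sketch is reproducible

# Hypothetical setup: outcome sd of 20, true effect of 3 points, 200 students per arm.
TRUE_EFFECT = 3.0
OUTCOME_SD = 20.0
N_PER_ARM = 200


def run_experiment():
    """Simulate one experiment and return the difference in sample means."""
    comparison = [rng.gauss(26, OUTCOME_SD) for _ in range(N_PER_ARM)]
    treatment = [rng.gauss(26 + TRUE_EFFECT, OUTCOME_SD) for _ in range(N_PER_ARM)]
    return statistics.fmean(treatment) - statistics.fmean(comparison)


estimates = [run_experiment() for _ in range(1_000)]
print(f"mean of estimates = {statistics.fmean(estimates):.2f}")
print(f"sd of estimates   = {statistics.stdev(estimates):.2f}")
```

The estimates pile up around the true effect, and their spread is the standard error of the estimated effect.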
Hypothesis testing
▪ When we do impact evaluations we compare means from two different groups (the treatment and comparison groups)
▪ Null hypothesis: the two means are the same and any observed difference is due to chance
• H0: treatment effect = 0
▪ Research hypothesis: the true means are different from each other
• H1: treatment effect ≠ 0
▪ Other possible tests
• H2: treatment effect > 0
Distribution of estimates if true effect is zero
Distributions under two alternatives
We don’t see these distributions, just our estimate β̂
Is our estimate β̂ consistent with the true effect being β*?
If true effect is β*, we would get β̂ with frequency A
Is it also consistent with the true effect being 0?
If true effect is 0, we would get β̂ with frequency A’
Q: which is more likely, true effect=β* or true effect=0?
A is bigger than A’ so true effect=β* is more likely than true effect=0
But can we rule out that true effect=0?
Is A’ so small that true effect=0 is unlikely?
If true effect=0, the probability of getting an estimate at least as large as ours is the area to the right of A’ over the total area under the curve
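The area to the right of the estimate under the zero-effect curve can be computed directly if that curve is taken to be normal. In this sketch the estimate (4 points) and its standard error (2 points) are hypothetical, and the function name is an assumption.

```python
from statistics import NormalDist


def share_of_null_beyond(estimate, standard_error):
    """Area under the true-effect-is-zero curve that lies to the right of the estimate."""
    null_distribution = NormalDist(mu=0.0, sigma=standard_error)
    return 1.0 - null_distribution.cdf(estimate)


# Hypothetical numbers: an estimated effect of 4 points with a standard error of 2.
p_value = share_of_null_beyond(4.0, 2.0)
print(f"share of the zero-effect curve to the right of the estimate: {p_value:.3f}")
```

The total area under a probability curve is 1, so the share is already a probability.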
Critical value
▪ There is always a chance the true effect is zero, however large our estimated effect
▪ Recall that, traditionally, if the probability that we would get β̂ if the true effect were 0 is less than 5%, we say we can reject that the true effect is zero
▪ Definition: the critical value is the value of the estimated effect which exactly corresponds to the significance level
▪ If we are testing whether the effect is bigger than 0 at the 95% level, it is the value of the estimate where exactly 95% of the area under the curve lies to the left
▪ β̂ is significant at 95% if it is further out in the tail than the critical value
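A minimal sketch of the one-sided critical value, assuming the zero-effect curve is normal with a known standard error. The standard error of 2 points and the function names are assumptions made for illustration.

```python
from statistics import NormalDist


def critical_value(standard_error, significance=0.05):
    """Estimate size with exactly (1 - significance) of the zero-effect curve to its left."""
    return NormalDist(mu=0.0, sigma=standard_error).inv_cdf(1.0 - significance)


def is_significant(estimate, standard_error, significance=0.05):
    """True if the estimate lies further out in the tail than the critical value."""
    return estimate > critical_value(standard_error, significance)


# Hypothetical standard error of 2 points.
cutoff = critical_value(2.0)
print(f"95% critical value: {cutoff:.2f}")
print(is_significant(4.0, 2.0), is_significant(3.0, 2.0))
```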
95% critical value for true effect>0
In this case β̂ is > critical value so…
…we can reject that true effect=0 with 95% confidence
What if the true effect=β*?
How often would we get estimates that we could not distinguish from 0? (if true effect=β*)
How often would we get estimates that we could distinguish from 0? (if true effect=β*)
Chance of getting estimates we can distinguish from 0 is the area under Hβ* that is above the critical value for H0
Proportion of area under Hβ* that is above the critical value is power
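Power as an area can be written out in a few lines. This sketch assumes both curves are normal with the same standard error and uses a one-sided 5% test. The standard error of 2 points and the effects tried are hypothetical.

```python
from statistics import NormalDist


def power(true_effect, standard_error, significance=0.05):
    """Share of the H_beta* curve that lies above the critical value for H0."""
    critical = NormalDist(0.0, standard_error).inv_cdf(1.0 - significance)
    return 1.0 - NormalDist(true_effect, standard_error).cdf(critical)


# Hypothetical standard error of 2 points and a range of true effects.
for effect in (0.0, 2.0, 5.0, 8.0):
    print(f"true effect = {effect}: power = {power(effect, 2.0):.2f}")
```

When the true effect is 0 the two curves coincide, so the function returns the significance level itself.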
Recap hypothesis testing: power
Statistical test                     | Underlying truth: Effective (H0 false)                  | Underlying truth: No effect (H0 true)
Significant (reject H0)              | True positive. Probability = (1 - κ)                    | False positive, Type I Error. Probability = α
Not significant (fail to reject H0)  | False zero, Type II Error (low power). Probability = κ  | True zero. Probability = (1 - α)
Definition of Power
▪ Power: if there is a measurable effect of our intervention (the null hypothesis is false), the probability that we will detect an effect (reject the null hypothesis)
▪ Higher power means a lower chance of a Type II Error: failing to reject the null hypothesis (concluding there is no difference) when the null hypothesis is in fact false
▪ Traditionally, we aim for 80% power. Some people aim for 90% power
More overlap between H0 curve and Hβ* curve, the lower the power. Q: what affects overlap?
Larger hypothesized effect, further apart the curves, higher the power
Greater variance in population increases spread of possible estimates, reduces power
Power also depends on the critical value, i.e. the level of significance we are looking for…
10% significance gives higher power than 5% significance
Why does significance change power?
▪ Q: what trade-off are we making when we change the significance level and increase power?
▪ Remember: 10% significance means we’ll make Type I (false positive) errors 10% of the time when there is no true effect
▪ So moving from 5% to 10% significance means we get more power, but at the cost of more false positives
▪ It’s like widening the gap between the goal posts and saying “now we have a higher chance of getting a goal”
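The trade-off can be put in numbers with a short sketch. It assumes normal curves, a one-sided test, a hypothetical true effect of 4 points and a standard error of 2 points.

```python
from statistics import NormalDist


def power(true_effect, standard_error, significance):
    """Share of the curve centred on the true effect that lies above the critical value."""
    critical = NormalDist(0.0, standard_error).inv_cdf(1.0 - significance)
    return 1.0 - NormalDist(true_effect, standard_error).cdf(critical)


# Hypothetical true effect of 4 points and standard error of 2 points.
results = {}
for significance in (0.05, 0.10):
    results[significance] = power(4.0, 2.0, significance)
    # With no true effect, the rejection rate is the significance level itself.
    false_positive_rate = power(0.0, 2.0, significance)
    print(f"significance {significance:.0%}: power = {results[significance]:.2f}, "
          f"false positives = {false_positive_rate:.2f}")
```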
Allocation ratio and power
▪ Definition of allocation ratio: the fraction of the total sample that is allocated to the treatment group
▪ Usually, for a given sample size, power is maximized when half the sample is allocated to treatment and half to control
Why does equal allocation maximize power?
▪ Treatment effect is the difference between two means (mean of treatment and mean of control)
▪ Adding sample to the treatment group increases the accuracy of the treatment mean; the same holds for control
▪ But there are diminishing returns to adding sample size
▪ If the treatment group is much bigger than the control group, the marginal person adds little to the accuracy of the treatment group mean, but would add more to the accuracy of the control group mean
▪ Thus we get the most accurate estimate of the difference when we have equal numbers in treatment and control groups
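The diminishing-returns argument can be checked by computing the standard error of the difference in means for several allocations of a fixed total sample. This sketch assumes the same outcome standard deviation in both groups. The total of 1,000 people and the standard deviation of 20 are hypothetical.

```python
import math


def se_of_difference(outcome_sd, total_n, share_treated):
    """Standard error of (treatment mean - control mean) for a given allocation."""
    n_treatment = total_n * share_treated
    n_control = total_n * (1.0 - share_treated)
    return outcome_sd * math.sqrt(1.0 / n_treatment + 1.0 / n_control)


# Hypothetical study: 1,000 people in total, outcome standard deviation of 20.
shares = [0.1, 0.3, 0.5, 0.7, 0.9]
errors = {share: se_of_difference(20.0, 1_000, share) for share in shares}
for share, error in errors.items():
    print(f"{share:.0%} in treatment: se of the difference = {error:.2f}")
best_share = min(errors, key=errors.get)
print(f"smallest standard error at {best_share:.0%} in treatment")
```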
Summary of power factors
▪ Hypothesized effect size
• Q: A larger effect size makes power increase/decrease?
▪ Variance
• Q: Greater residual variance makes power increase/decrease?
▪ Sample size
• Q: A larger sample size makes power increase/decrease?
▪ Critical value
• Q: A looser critical value makes power increase/decrease?
▪ Unequal allocation ratio
• Q: An unequal allocation ratio makes power increase/decrease?
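Power equation: MDE
$EffectSize = (t_{1-\kappa} + t_{\alpha}) \cdot \sqrt{\frac{1}{P(1-P)}} \cdot \sqrt{\frac{\sigma^2}{N}}$
where $t_{1-\kappa}$ corresponds to power, $t_{\alpha}$ to the significance level, $\sigma^2$ is the variance, $P$ the proportion in treatment, and $N$ the sample size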
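A rough sketch of the minimum detectable effect (MDE) calculation, under two stated assumptions: the t values are approximated by standard normal quantiles, and the test is two-sided, so the significance term uses half the significance level in each tail. The function name, its defaults, and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist


def minimum_detectable_effect(outcome_sd, total_n, share_treated=0.5,
                              power=0.80, significance=0.05):
    """MDE = (t_power + t_significance) * sqrt(1 / (P(1-P))) * sqrt(variance / N)."""
    standard_normal = NormalDist()
    t_power = standard_normal.inv_cdf(power)
    # Assumption: two-sided test, so the critical value uses significance / 2.
    t_significance = standard_normal.inv_cdf(1.0 - significance / 2.0)
    allocation_term = math.sqrt(1.0 / (share_treated * (1.0 - share_treated)))
    return (t_power + t_significance) * allocation_term * outcome_sd / math.sqrt(total_n)


# Hypothetical study: outcome sd of 20 and 1,000 people, split equally.
mde = minimum_detectable_effect(20.0, 1_000)
mde_four_times_n = minimum_detectable_effect(20.0, 4_000)
print(f"MDE with N = 1,000: {mde:.2f}")
print(f"MDE with N = 4,000: {mde_four_times_n:.2f}")
```

Because N enters under a square root, quadrupling the sample only halves the detectable effect.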
Clustered RCT experiments
▪ Cluster randomized trials are experiments in which social units or clusters rather than individuals are randomly allocated to intervention groups
▪ The unit of randomization (e.g. the village) is broader than the unit of analysis (e.g. farmers)
▪ That is: randomize at the village level, but use farmer-level surveys as our unit of analysis
Clustered design: intuition
▪ We want to know how much rice the average farmer in Sierra Leone grew last year
▪ Method 1: Randomly select 9,000 farmers from around the country
▪ Method 2: Randomly select 9,000 farmers from one district
Clustered design: intuition II
▪ Some parts of the country may grow more rice than others in general; what if one district had a drought? Or a flood?
• i.e. we worry both about long-term correlations and correlations of shocks within groups
▪ Method 1 gives the most accurate estimate
▪ Method 2 is much cheaper, so for a given budget we could sample more farmers
▪ What combination of 1 and 2 gives the highest power for a given budget constraint?
▪ Depends on the level of intracluster correlation, ρ (rho)
Low intracluster correlation
HIGH intracluster correlation
[Figures: variation in the population, the clusters, and the sample, shown for low and for high intracluster correlation.]
Intracluster correlation
▪ Total variance can be divided into between-cluster variance ($\tau^2$) and within-cluster variance ($\sigma^2$)
▪ When variance within clusters is small and the variance between clusters is large, the intracluster correlation is high (previous slide)
▪ Definition of intracluster correlation (ICC): the proportion of total variation explained by between-cluster variance
• Note: when within-cluster variance is high relative to between-cluster variance, the intracluster correlation is low
▪ $icc = \rho = \frac{\tau^2}{\sigma^2 + \tau^2}$
[Figures: HIGH intracluster correlation; Low intracluster correlation.]
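As a small worked example of the ICC formula, the sketch below assumes that τ² denotes the between-cluster variance and σ² the within-cluster variance, so that ρ is the share of total variance lying between clusters. The variance components are made-up numbers.

```python
def intracluster_correlation(between_cluster_variance, within_cluster_variance):
    """rho = tau^2 / (sigma^2 + tau^2), with tau^2 between and sigma^2 within clusters."""
    total_variance = between_cluster_variance + within_cluster_variance
    return between_cluster_variance / total_variance


# Hypothetical variance components, chosen only for illustration.
high_icc = intracluster_correlation(between_cluster_variance=9.0, within_cluster_variance=1.0)
low_icc = intracluster_correlation(between_cluster_variance=1.0, within_cluster_variance=9.0)
print(f"high ICC example: {high_icc:.1f}, low ICC example: {low_icc:.1f}")
```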
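Power with clustering
$EffectSize = (t_{1-\kappa} + t_{\alpha}) \cdot \sqrt{\frac{1}{P(1-P)}} \cdot \sqrt{1 + \rho(m-1)} \cdot \sqrt{\frac{\sigma^2}{N}}$
where $t_{1-\kappa}$ corresponds to power, $t_{\alpha}$ to the significance level, $\sigma^2$ is the variance, $P$ the proportion in treatment, $N$ the sample size, $\rho$ the ICC, and $m$ the average cluster size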
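The clustering term can be explored with a sketch that compares adding people within clusters against adding clusters. It assumes normal quantiles in place of the t values, a two-sided test, and equal cluster sizes. The designs and the ICC of 0.1 are hypothetical.

```python
import math
from statistics import NormalDist


def clustered_mde(outcome_sd, n_clusters, cluster_size, icc,
                  share_treated=0.5, power=0.80, significance=0.05):
    """MDE with the clustering term sqrt(1 + rho * (m - 1))."""
    standard_normal = NormalDist()
    # Assumption: normal quantiles and a two-sided test (significance / 2 per tail).
    t_sum = standard_normal.inv_cdf(power) + standard_normal.inv_cdf(1.0 - significance / 2.0)
    total_n = n_clusters * cluster_size
    allocation_term = math.sqrt(1.0 / (share_treated * (1.0 - share_treated)))
    clustering_term = math.sqrt(1.0 + icc * (cluster_size - 1))
    return t_sum * allocation_term * clustering_term * outcome_sd / math.sqrt(total_n)


# Hypothetical designs with an outcome sd of 20 and an ICC of 0.1.
baseline = clustered_mde(20.0, n_clusters=50, cluster_size=40, icc=0.1)
more_people = clustered_mde(20.0, n_clusters=50, cluster_size=80, icc=0.1)
more_clusters = clustered_mde(20.0, n_clusters=100, cluster_size=40, icc=0.1)
print(f"50 clusters of 40:  MDE = {baseline:.2f}")
print(f"50 clusters of 80:  MDE = {more_people:.2f}")
print(f"100 clusters of 40: MDE = {more_clusters:.2f}")
```

Both alternatives double the total sample, but under these assumptions doubling the number of clusters shrinks the detectable effect far more than doubling the people per cluster.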
Clustered RCTs vs. clustered sampling
▪ Must cluster at the level at which you randomize
• Many reasons to randomize at group level
▪ Could randomize by farmer group, village, district
▪ If we randomize one district to T and one to C, we have too little power however many farmers we interview
• Can never distinguish the treatment effect from possible district-wide shocks
▪ If we randomize at the individual level, we don’t need to worry about within-village correlation or village-level shocks, as they affect both T and C
Bottom line for clustering
▪ If the experimental design is clustered, we now need to consider ρ when choosing a sample size (as well as the other factors)
▪ Must cluster at the level of randomization
▪ It is extremely important to randomize an adequate number of groups
▪ Often the number of individuals within groups matters less than the total number of groups
Common tradeoffs and rules of thumb
Common tradeoffs
• Answer one question really well? Or many questions with less accuracy?
• Large sample size with possible attrition? Or small sample size that we track very closely?
• Few clusters with many observations? Or many clusters with few observations?
• How do we allocate our sample to each group?
Rules of thumb (1/2)
1. A larger sample is needed to detect differences between two variants of a program than between the program and the comparison group.
2. For a given sample size, the highest power is achieved when half the sample is allocated to treatment and half to comparison.
Rules of thumb (2/2)
3. The more measurements are taken, the higher the power. In particular, if there is a baseline and endline rather than just an endline, you have more power.
4. The lower the compliance, the lower the power. The higher the attrition, the lower the power.
5. For a given sample size, we have less power if randomization is at the group level than at the individual level.