Frequentist Statistics
Core principles for reproducible science
Katherine S. Button
[email protected]
@ButtonKate
Outline
1. Statistical inference is not intuitive
2. Basic principles of the frequentist approach
3. Null hypothesis significance testing
4. From NHST to effect estimation
5. Designing studies for reproducible results
6. Interpreting/reporting results for cumulative science
1. STATISTICAL INFERENCE IS NOT INTUITIVE
• Poor appreciation of the role of chance
• [un]conscious biases
Fooled by chance…
Did Derren lie?
No, she was selected at random, he did have a system to guarantee
winning…
…just not the system she thought.
Race 1: 7776 people, randomly allocated a horse
Race 2: 1296 race 1 winners, randomly allocated a horse
Race 3: 216 race 2 winners, randomly allocated a horse
Race 4: 36 race 3 winners, randomly allocated a horse
Race 5: 6 race 4 winners, randomly allocated a horse
She was the 1 / 7776 who by chance had 5 consecutive wins
Fooled by chance
http://blogs.discovermagazine.com/neuroskeptic/2013/10/16/the-f-problem;
Gelman & Loken 2014; Borges 1941.
Fanelli (2010). PLOS ONE, 5, e10068.
Fooled by chance
Bem, D. J. J. Pers. Soc. Psych. 100, 407–425 (2011).
Students were more likely to remember words in the
test if they had later practiced them. Effect preceded
cause.
Saul Perlmutter
Protecting yourself from being fooled by chance
• Study design
• Statistical plan
• Collect data
• Pre-register / registered reports
Fanelli (2010). PLOS ONE, 5, e10068.
2. BASIC PRINCIPLES OF THE FREQUENTIST APPROACH
• Why use statistical inference
• Sampling distribution, standard error, confidence intervals
Why use statistical inference?
Population: the universe to which we wish to generalise
• In the population we have parameters, the true values which
we wish to estimate
Sample: the finite study that we perform
• From the sample we obtain estimates of the population
parameters
Sampling variation: Samples and thus their estimates vary
• Statistics aims to quantify this variation
Frequentist approach
Use limited amounts of data to make general conclusions.
Assume that an infinitely large population of values exists and
that our data (our 'sample') was randomly selected from this
population. Analyze our sample and use the rules of
probability to make inferences about the overall
population.
The sampling distribution
• The population has parameters: mean µ and standard deviation σ
• Each sample gives estimates: x̄ and SD
• The sampling distribution is the distribution of a statistic across an
infinite number of samples; the mean of the sampling distribution of x̄ equals µ
• The SD of the sampling distribution is the Standard Error of the Mean:
SE = σ / √n
The sampling distribution: proportions
• The population has a true proportion p; each sample gives an estimate p̂
• Provided that the sample size is not too small and the sample proportion is
not too close to 0 or 1, the sampling distribution will approximate to normal
• SE(p) = √( p(1 − p) / n )
Estimating a population parameter from a single study
• Sample statistic unlikely to be exactly equal to population
parameter
• We seldom know the population standard deviation, σ, so we use the sample
standard deviation, SD, instead: SE = SD / √n
• SE measures variability in the sample statistic and how
precisely it estimates the true population parameter
• Precision depends on variation in underlying population
AND sample size
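A minimal way to see this in practice is a short sketch in R (the tool suggested later in the practical); the numbers below are invented for illustration, not taken from the slides:

```r
# Hedged sketch, illustrative numbers only: SE = SD / sqrt(n) shrinks with n,
# while the SD of the data does not.
set.seed(1)
for (n in c(25, 100, 400)) {
  x <- rnorm(n, mean = 10, sd = 4)   # simulated sample from a population
  cat(sprintf("n = %3d  SD = %.2f  SE = %.2f\n", n, sd(x), sd(x) / sqrt(n)))
}
```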
The normal distribution
The normal distribution is important in statistics
• not because many variables (height, IQ, etc.) are normally
distributed
• But because the sampling distribution of many estimates
is normal (Central Limit Theorem)
The standard normal distribution
z = (x − µ) / σ
• 95% of values fall within ± 1.96 SD of the mean
• If this were the sampling distribution (and thus the SD = SE), then 95%
of study means would fall within ± 1.96 SD of the mean
• This is the basis of 95% confidence intervals
95% Confidence Interval = mean ± 1.96 SE
Confidence intervals and reference ranges
• Sample means: 95% Confidence Interval = mean ± 1.96 SEM (inferential)
• Individuals: 95% Reference Range = mean ± 1.96 SD (descriptive)
Interpreting confidence intervals
• 95% of sample means are within 1.96 SE of the true
population mean
• The true population mean is within 1.96 SE of the sample
mean 95% of the time
• We can be 95% confident that the true population mean lies
in our 95% CI
Confidence intervals
Consider these 25 simulated
results of studies where µ = 10.
Individual 95% CIs vary from study to study, but 23/25 (92%) of them contain
the true population mean of 10.
We will return to this example
later…
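The simulation described above can be reproduced with a short R sketch (µ, σ and n here are illustrative assumptions, not values from the slides):

```r
# 25 simulated studies from a population with mu = 10; count how many 95% CIs
# contain the true mean (expect roughly 23-24 of 25).
set.seed(42)
mu <- 10; sigma <- 4; n <- 30
covered <- replicate(25, {
  x  <- rnorm(n, mu, sigma)
  se <- sd(x) / sqrt(n)
  ci <- mean(x) + c(-1.96, 1.96) * se
  ci[1] <= mu && mu <= ci[2]
})
sum(covered)
```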
Descriptive vs inferential statistics
• We want to know (estimate) the population parameters µ and σ
• The sample mean (x̄) and SD describe what we observe in our sample…
• …we use these descriptive/sample statistics to estimate the population
parameters, and to estimate our uncertainty
• Our conclusions about the true population are based on inferential statistics.
This all starts with the SE, which is used to calculate the 95% CI and is
related to p-values
3. NULL HYPOTHESIS SIGNIFICANCE TESTING
• Fisher vs Neyman-Pearson
• P-values, power, effect size, sample size
• Positive Predictive Value
A brief history of significance testing (according to Sterne and Davey Smith 2001)
Introduced by RA Fisher: p-values as an index of the strength of evidence
against the null hypothesis
Advocated p < 0.05 as a standard level for concluding there was evidence
against H0, but not as an absolute rule
A brief history of significance testing
Neyman and Pearson thought Fisher’s approach to p-values was subjective and
introduced “hypothesis tests” as an alternative, objective, decision-based approach
In addition to type I error (the focus of Fisher’s approach) they argued that
type II error (false negatives) warrants consideration
Argued that by fixing in advance the rates of type I and II error, mistakes
over many different experiments would be limited, i.e. the basis of power
calculations aimed at minimising error (considered across a number of studies…)
The null distribution
Data are compared to a hypothetical distribution, the null
distribution
The null distribution is the probability distribution of the test
statistic when the null hypothesis is true
The null distribution is the sampling distribution of a statistic
under the null hypothesis (this is where SE comes in)
Null Hypothesis Significance Testing
[Figure: overlapping null and alternative distributions, with regions marked for
power, false negatives, and false positives; n.b. shown for a one-tailed test]
Statistical power (1 − β)
The probability that a test will correctly reject the null hypothesis when the
null hypothesis is false. It depends on:
• Effect size
• Sample size (N)
• Alpha significance criterion (α = 0.05)
Type I and Type II Error
• Type I error is falsely rejecting the null hypothesis (false positive)
• Type II error is failing to reject a false null hypothesis (false negative)
Statistical Power
• The probability of committing a Type I error given that H0 is true is
labelled α (alpha)
• The probability of committing a Type II error is labelled β (beta)
• Power is the probability of correctly rejecting H0 (1 − β)
Statistical Power, effect size, sample size, alpha
The power of any test of statistical significance will be affected
by four main parameters:
the effect size
the sample size (N)
the alpha significance criterion (α)
statistical power (1 - β), or the chosen or implied beta (β)
All four parameters are mathematically related. If you know any
three of them you can figure out the fourth.
THIS IS THE BEDROCK OF META-SCIENCE!
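A concrete illustration of this relationship, sketched with base R's power.t.test() for a two-sample t-test (the effect size of 0.5 and n of 20 below are arbitrary examples, not values from the slides): leave out any one of n, delta (effect size), or power and the function solves for it given the other three plus α.

```r
# Each call fixes three of {n, effect size, alpha, power} and solves the fourth.
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)  # solve for n per group
power.t.test(n = 20, sd = 1, sig.level = 0.05, power = 0.80)       # solve for detectable delta
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)        # solve for power
```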
Statistical Power: Effect Size
Assume we are testing for a difference between two means
Power increases as we increase the size of the difference µ1 − µ2
Effect size = (µ1 − µ2) / pooled SD
Statistical Power: Effect Size
Or we can reduce the variability (SD):
SMD = (µ1 − µ2) / pooled SD
SE = SD / √n
This is often hard to do, but experimental design can help, e.g., within-subject
/ repeated measures designs
Statistical Power: Increases with sample size
As sample sizes increase, the means of those samples tend more towards the true
population mean, so their variability decreases:
SE = SD / √n
Note: SD remains unchanged
Statistical Power and false negatives
Probability of committing a
Type I error given that H0 is
true is labelled α (alpha)
Probability of committing a
Type II error is labelled β
(beta)
Power is the probability of
correctly rejecting H0 (1 - β)
Power and false positives
Suppose:
In 90% of cases the null hypothesis is true
The significance level is set at 5%
The average power of studies is 80%
If, in 1000 studies, 100 true associations exist, we will detect 80 (80%)
Of the remaining 900 non-associations, we will falsely
declare 45 (5%) as significant
Sterne & Davey Smith (2001). British Medical Journal, 322, 226-231.
Power and false positives
Suppose:
In 90% of cases the null hypothesis is true
The significance level is set at 5%
The average power of studies is 20%
If, in 1000 studies, 100 true associations exist, we will detect 20 (20%)
Of the remaining 900 non-associations, we will falsely
declare 45 (5%) as significant
That’s approx. a 1/3 chance a statistically significant finding is true! This is
known as the Positive Predictive Value (PPV).
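The PPV arithmetic in the two scenarios above can be written out directly; this R sketch simply restates the counts used in the Sterne & Davey Smith example:

```r
# PPV = true positives / (true positives + false positives)
ppv <- function(prior_h1, power, alpha) {
  true_pos  <- power * prior_h1        # real effects that are detected
  false_pos <- alpha * (1 - prior_h1)  # nulls falsely declared significant
  true_pos / (true_pos + false_pos)
}
ppv(prior_h1 = 0.10, power = 0.80, alpha = 0.05)  # ~0.64 (80 of 125 significant results are true)
ppv(prior_h1 = 0.10, power = 0.20, alpha = 0.05)  # ~0.31 (20 of 65 significant results are true)
```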
Power, false positives, and PPV
Low power means a higher proportion of “significant” findings are false
positives, particularly for exploratory research
Krzywinski & Altman (2013). Nat Methods, 10, 1139-1140.
Power and effect inflation
Small studies reaching p < 0.05 tend to overestimate the effect size
[Figure: null and alternative distributions with α = 0.05, illustrating how low
power inflates statistically significant effect estimates]
Implications of low power across studies
1. More false negatives (but often unpublished…)
2. More false positives
3. Bigger (but biased) effects
Button et al (2013). Nature Reviews Neuroscience, 14, 365–376.
Publication bias
Rosenthal (1979):
“The most extreme view of this
problem is that journals are filled
with the 5% of studies that show
Type I errors, while the file drawers
back at the lab are filled with the 95%
of the studies that show
nonsignificant (i.e., p > 0.05) results”
Note that this would only be the case
if H0 were always true (which it
hopefully isn’t)
Adapted from Marcus Munafo
What’s the prior probability of our H1 being true?
H1: men can be pregnant
PPV = (0.8 × 0) / (0.8 × 0 + 0.05) = 0
H1: students will remember more words in a test if they later practiced them
4. FROM NHST TO EFFECT ESTIMATION
• The problem with p-values
• The new statistics: effect size estimation and confidence intervals
Limitations with NHST
• The nature of null hypothesis testing means ‘non-significant’
results are harder to interpret
• Failure to reject H0 (i.e., p > 0.05) is not the same as
accepting H0
• ‘Significant’ results are seen as more interesting and
publishable = publication bias
• Focus on a p < 0.05 yes/no decision is not particularly informative – p = 0.06
is not so very different from p = 0.05 (lazy conclusions!)
• ‘Statistically significant’ often confused with ‘clinically /
theoretically / biologically significant’
Interpreting null findings (Chalder et al 2014)
We considered whether our result was sufficiently precise to rule out the possibility of
a beneficial effect. The most statistically powerful analysis was using the Beck
depression inventory [BDI] as a continuous outcome measure. …our primary analysis
indicated an adjusted between group difference in mean BDI scores of −0.54 (95%
confidence interval −3.06 to 1.99).
It is difficult to define precisely what would constitute a clinically important treatment
effect, but the National Institute for Health and Clinical Excellence guideline panel
have suggested that this could correspond to around 3 points or 0.35 standard
deviations at baseline on the Hamilton depression rating scale and close to the 0.33
standard deviations used in our power calculation. The equivalent difference in terms
of Beck depression inventory score would be between 4.1 and 3.9 points,
respectively, based on our observed standard deviation of 11.8 points at four months
post-randomisation. This suggests that we have excluded the possibility, at least with
95% confidence, that the intervention added to usual care is clinically effective in
improving symptoms of depression compared with usual care alone.”
Far more informative than “there was no effect, p > 0.05”!
The ‘New Statistics’ according to Cumming 2013
Estimation based on effect sizes, confidence intervals, and
meta-analysis, rather than p<0.05
Confidence intervals are more informative than p-values
5. DESIGNING STUDIES FOR REPRODUCIBLE RESULTS
• Sample size calculations
• Realistic effect sizes
My favourite Fisher [mis?]-quote
To call in the statistician after
the experiment is done
may be no more than
asking him to perform a
post-mortem examination:
[s]he may be able to say
what the experiment died
of.
Ronald Fisher
Planning a study
• Clear working research question, which specifies: the population of interest;
the primary outcome / dependent variable (e.g., depression); the main exposure /
intervention / explanatory variable; the nature of the effect under investigation
(causal, correlational?); the setting (lab, clinic?); and important confounders /
covariates (e.g., gender, handedness)
• Design the study to test the primary hypothesis / research question
• Pre-specify the statistical plan to test the primary hypothesis / RQ
• Calculate the sample size required to yield sufficient power for the primary
analysis to be informative
• Pre-register the study (RR, OSF, etc.)
Sample size calculation
Decide how many participants / animals / samples to be studied
• Unethical to study too few – waste of time and resources and
potential risk to participants unjustified as study unable to
answer posed question
• Unethical to study too many – withholding potentially
beneficial treatments from patients, waste of time and
money
We use sample size calculations when we want to estimate
some statistic with a particular precision, demonstrate the
equivalence between two groups, or detect a difference of a
given size between two groups
Adapted from Kate Tilling
Sample size calculations
• Effect size – the true population effect size is unknown (that’s why we are
doing the study!) BUT we must work hard to estimate a realistic effect size OR
decide what constitutes the minimum clinically/theoretically relevant effect
• Can design studies to reduce SD – within-subjects / repeated measures
• Sample size (N) – the MAIN thing we can change in our design
• Alpha significance criterion (α) – fix, usually 5%
• Statistical power (1 − β) – fix, hopefully 80–90% (β = 10–20%)
Sample size calculation
What size sample do we require to be able to detect our target
difference between two groups with pre-specified power and
significance level, given the variation in our sample?
MUST BE PERFORMED PRIOR TO STUDY!
(Or as we will do later, using an effect size
estimate that is INDEPENDENT of the
individual study)
Example adapted from Kate Tilling
Sample size calculation
What size sample do we require to be able to detect our target difference between
two groups with pre-specified power and significance level, given the variation
in our sample?
Set power and significance level to minimise error:

Conclusion    | Reality: null true  | Reality: null false
Reject null   | Type I error (α)    | Correct (1 − β)
Accept null   | Correct (1 − α)     | Type II error (β)

alpha significance criterion (α) = Pr(Type I error) = 5%
statistical power = 1 − Pr(Type II error) = 80%
Minimising type I and type II to maximise profit
Lindborg et al 2014
Sample size calculation
What size sample do we require to be able to detect our target
difference between two groups with pre-specified power
and significance level, given the variation in our sample?
Target difference must:
1. be determined from previous experience / literature
2. Be sufficiently large to be clinically / theoretically / biologically important
3. Be sufficiently small to be realistic
So ask yourself, what is this minimum clinically/theoretically/biologically
important difference?
Admittedly, this is not trivial…
Anchor / Gold standard:
Asked patients – How do you feel: better, the same, or worse?
[Figure: distributions of change in BDI score for “better” vs “not better”
patients, with the MCID cut-point marked]
ROC analysis / signal detection (JND) to find the change in BDI score which
optimally classified patients as better / not better:
Youden = max(Sensitivity + Specificity − 1)
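A sketch of the Youden calculation in R, run on hypothetical BDI-change data (not the trial data analysed here): scan candidate cut-points and keep the one maximising sensitivity + specificity − 1.

```r
set.seed(3)
change <- c(rnorm(50, -8, 5), rnorm(50, -2, 5))  # hypothetical change in BDI score
better <- rep(c(1, 0), each = 50)                # 1 = rated "better", 0 = "not better"
cuts   <- sort(unique(change))
youden <- sapply(cuts, function(k) {
  sens <- mean(change[better == 1] <= k)         # "better" patients at or below the cut
  spec <- mean(change[better == 0] >  k)         # "not better" patients above the cut
  sens + spec - 1
})
cuts[which.max(youden)]                          # cut-point estimating the MCID
```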
Baseline dependency: GLM
[Figure: difference in change of BDI score (absolute scale) and difference in %
change (ratio scale) plotted against baseline BDI score, pooled across all
studies, for the better–same, better–worse, and same–worse contrasts]
Interaction term    Mean (*)   SD     2.5%    97.5%
Absolute scale      −0.48      0.19   −0.85   −0.09
Ratio scale         −0.21      0.22   −0.65    0.23
(*): the difference in mean change from baseline (on the absolute or ratio scale,
respectively) for those feeling better relative to those feeling worse, per unit
change of baseline
Fixed Effects GLM
Models for “observed” change              AIC
Absolute, no interaction                  6649
Absolute, with interaction                6596
Ratio, no interaction (i.e., log-link)    6469
Minimal clinically important difference
MCID ~ 20% reduction in scores
Implications
NICE guidance of ≥3 BDI points for significant treatment effect
does not account for baseline dependency
A between-group difference (i.e., treatment effect) of 3 BDI
points would be trivial in a sample with an average BDI score
of 60, but more relevant in a sample averaging BDI scores of
14.
Use MCID to inform sample size to assess clinically meaningful
treatment effects
Sample size calculation
What size sample do we require to be able to detect our target
difference between two groups with pre-specified power and
significance level, given the variation in our sample?
Variation (SD):
• Depends on the individuals we recruit to the study
• But we can estimate it in advance from a similar study in a similar population
or from a pilot study
• If you have no idea about variation – then arguably you are not ready for
NHST!
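As a worked example, here is a hedged R sketch using numbers borrowed from the Chalder et al. example earlier (a target difference of about 4 BDI points, an observed SD of 11.8, α = 5%, power = 90%); the resulting n is illustrative, not a recommendation:

```r
# Two-sample comparison of means; solves for n per group.
power.t.test(delta = 4, sd = 11.8, sig.level = 0.05, power = 0.90)
# Roughly n = 184 per group under these assumptions.
```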
Realistic effect sizes
The median experimental study in social psychology has an N of
20 per group
What can you reliably detect with this sample size?
• Men taller than women (N = 6)
• People above median age closer to retirement (N = 10)
N = 20 per group is not enough to detect that
• People who like spicy food are more likely to like Indian food
(N = 27 per cell)
• Men weigh more than women (N = 47 per cell)
Is the effect we are studying likely to be bigger than “Men
weigh more than women?”
Credit: Slide adapted from Simmons, J., Nelson, L., & Simonsohn, U. (2011). Life after p-hacking. Presentation at the Meeting of the Society for
Personality and Social Psychology, New Orleans, LA, 17-19 January 2013
Realistic effect sizes: Ethical implications
Principle of the ‘three Rs’ (reduce, refine, replace)
80% power = 20% chance of false negatives
20% power = 80% chance of false negatives
Low powered studies are inefficient and wasteful
Need to consider too many and too few animals
Button et al (2013). Nature Reviews Neuroscience, 14, 365–376.
Powering replication
To achieve adequate power, the replication study will need a sample size twice
that of the original (where original p ~ 0.05)

Replication sample   Discovery p-value
                     0.05    0.01    0.005
n                    50%     74%     81%
2n                   80%     96%     98%
4n                   98%     100%    100%

Button et al (2013). Nature Reviews Neuroscience, 14, 365–376
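The table can be approximated with a short R sketch (assuming a two-sample t-test with n = 20 per group in the discovery study; these assumptions are mine, not the slides'): a discovery result at exactly p = 0.05 implies an observed effect of d = t_crit × √(2/n), and the replication's power is then computed at n, 2n and 4n, taking that observed effect as the truth.

```r
n <- 20                                   # assumed discovery group size
t_crit <- qt(0.975, df = 2 * n - 2)       # two-sided 5% critical value
d_obs  <- t_crit * sqrt(2 / n)            # effect implied by p = 0.05 exactly
sapply(c(1, 2, 4) * n, function(m)
  power.t.test(n = m, delta = d_obs, sd = 1, sig.level = 0.05)$power)
# Approximately 0.50, 0.80, 0.98, matching the p = 0.05 column above.
```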
6. INTERPRETING AND REPORTING RESULTS
Interpreting and reporting results
• Think ‘cumulative science’
• Describe sample, N(%) or M(SD) for all baseline
measures – pirate plot!
• Describe main variables in analysis in a
maximally informative way (e.g., outcome M
(SD) by main explanatory variables)
• Report results for inferential statistics as point
estimate, 95% Confidence Interval, and if
phrased as hypothesis testing, report p-value.
Try drafting your results tables BEFORE you collect data
• If you’re truly clear in your experimental design and analysis plan this
shouldn’t be a problem
• If it is, then arguably you’re not ready for data collection
Try writing your papers without using the phrase ‘statistically significant’
• Harder than you think!
• Forces you to write about effect size, precision, and direction
• p < 0.05 can occur in either direction!
Interpreting results, e.g. an association between exposure and outcome
1. Is it the result of chance?
   Pre-registration; p-values and confidence intervals
2. Is it due to bias?
   Selection bias? Experimenter bias? Lack of blinding, etc.
3. Is it due to confounding?
   An unmeasured third factor that is associated with both?
4. Is it an example of reverse causality?
   Recall bias?
5. Is it causal?
   Researchers use causal language all the time, but rarely have they adequately
   tested causation…
Adapted from Sara Brookes
Acknowledgements
Marcus Munafò
Kate Tilling
Sara Brookes
Daphne Kounali
University of Bristol, SSCM Short Course, Intro to Statistics
[email protected]
@ButtonKate
PRACTICAL: POWER FAILURE
Calculating statistical power
Using meta-analysis to estimate statistical power:
• Meta-analytic effect size
• Study sample size (N)
• Alpha significance criterion (α = 0.05)
Meta-analysis in neuroscience
Searched Web of Science
“neuroscience” &
“meta-analysis”
published in 2011
49 meta-analyses
730 individual studies
Low statistical power
Neuroscience: 49 meta-analyses, 730 studies; median power = 21%; expected 254 but
observed 349 “significant” studies, p < 0.0001 (Button et al (2013). Nature
Reviews Neuroscience, 14, 365–376)
Brain imaging: excess of significance in brain volume abnormalities; 41
meta-analyses, 461 studies; median power = 8% (Ioannidis (2011). Archives of
General Psychiatry, 14, 1105-1107)
Animals: meta-analysis of sex differences in rodent maze performance; 40 studies;
median power = 18%–31% (Jonasson (2005). Neurosci Biobehav Rev, 28, 811-825)
Exploring sample size bias and journal impact factor (IF)
p < 0.001
Munafò et al (2009). Molecular Psychiatry, 14, 119-120.
Estimate median statistical power in a field of…
1. Search ‘meta-analysis’ & [‘TOPIC’ & ‘JOURNALS’….]
2. Screen titles/abstracts against inclusion / exclusion criteria
3. Install G*Power or use R
4. For each meta-analysis, extract lead author, date, meta-estimate (SMD, OR)
5. For each study in the meta-analysis, extract lead author, date, total n, group
   ns, base rate (proportion of events in the control group, required for OR),
   p < 0.05 (y/n); each study’s power can then be calculated as sketched below
   • Other metrics? IF, #citations, etc.
6. Pair up to extract data in duplicate
Record the search strategy and data for a PRISMA diagram
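A sketch of the step-5 power calculation in R (the meta-analytic SMD and study sizes below are hypothetical placeholders): take the meta-estimate as the assumed true effect, compute each study's achieved power, then take the median.

```r
meta_smd <- 0.40                                                # hypothetical meta-estimate (SMD)
studies  <- data.frame(n1 = c(12, 25, 40), n2 = c(12, 30, 38))  # hypothetical group sizes
studies$power <- mapply(function(n1, n2) {
  # power.t.test assumes equal groups; using the smaller n is a conservative shortcut
  power.t.test(n = min(n1, n2), delta = meta_smd, sd = 1, sig.level = 0.05)$power
}, studies$n1, studies$n2)
median(studies$power)
```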
Power and other biomedical domains
[Figures: distributions of statistical power in other biomedical domains, shown
for all measures and with cognitive measures excluded; Dumas et al (2016)]
Effect size estimates
Two major groups of effect size estimate exist for continuous
outcomes:
Difference between means (effect size d)
Zero-order correlations (effect size r)
The odds ratio is also used, which compares the odds of a dichotomous outcome
between two groups
Standardized Mean Difference
The difference in means observed in two groups, standardized
across these groups:
The difference between means for two groups is calculated
This difference is standardized by dividing by the pooled standard deviation
SMD = (µ1 − µ2) / pooled SD
The SMD can be converted to absolute values using the pooled standard deviation
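A worked sketch in R (invented summary statistics) of the SMD from two groups' means and SDs:

```r
m1 <- 24.1; s1 <- 11.8; n1 <- 30   # group 1 mean, SD, n (illustrative)
m2 <- 20.2; s2 <- 10.9; n2 <- 32   # group 2
pooled_sd <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
(m1 - m2) / pooled_sd              # standardized mean difference
```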
Odds Ratio
The ratio of the odds of a dichotomous outcome in two groups:
• For example, the two groups may be a placebo arm and a treatment arm in a
clinical trial
• The dichotomous outcome may be the number of deaths (versus non-deaths)
An OR of 1.2 means the odds of the outcome are 1.2 times (i.e., 20%) higher in
the treatment arm than in the placebo arm
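A worked sketch in R (invented counts) of an odds ratio from a 2x2 table:

```r
deaths_treat <- 30; alive_treat <- 170   # treatment arm (illustrative counts)
deaths_ctrl  <- 25; alive_ctrl  <- 175   # placebo arm
(deaths_treat / alive_treat) / (deaths_ctrl / alive_ctrl)   # odds ratio
```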