Frequentist Statistics
Core principles for reproducible science
Katherine S. Button
[email protected] | @ButtonKate

Outline
1. Statistical inference is not intuitive
2. Basic principles of the frequentist approach
3. Null hypothesis significance testing
4. From NHST to effect estimation
5. Designing studies for reproducible results
6. Interpreting/reporting results for cumulative science

1. STATISTICAL INFERENCE IS NOT INTUITIVE
• Poor appreciation of the role of chance
• [Un]conscious biases

Fooled by chance…
Win! Win! Win! Win! Win! Lose!
Did Derren lie? No: she was selected at random, and he did have a system to guarantee winning… just not the system she thought.
Race 1: 7,776 people, each randomly allocated a horse
Race 2: 1,296 race-1 winners, randomly allocated a horse
Race 3: 216 race-2 winners, randomly allocated a horse
Race 4: 36 race-3 winners, randomly allocated a horse
Race 5: 6 race-4 winners, randomly allocated a horse
She was the 1 in 7,776 who, by chance alone, had 5 consecutive wins.

Fooled by chance
http://blogs.discovermagazine.com/neuroskeptic/2013/10/16/the-f-problem; Gelman & Loken 2014; Borges 1941; Fanelli (2010). PLOS ONE, 5, e10068.

Fooled by chance
Bem, D. J. J. Pers. Soc. Psych. 100, 407–425 (2011): students were more likely to remember words in the test if they had later practised them. Effect preceded cause.

Protecting yourself from being fooled by chance
[Image: Saul Perlmutter]
Study design, then statistical plan, then collect data; pre-register / registered reports.
Fanelli (2010). PLOS ONE, 5, e10068.

2. BASIC PRINCIPLES OF THE FREQUENTIST APPROACH
• Why use statistical inference
• Sampling distribution, standard error, confidence intervals

Why use statistical inference?
Population: the universe to which we wish to generalise.
• In the population we have parameters, the true values which we wish to estimate.
Sample: the finite study that we perform.
• From the sample we obtain estimates of the population parameters.
Sampling variation: samples, and thus their estimates, vary.
• Statistics aims to quantify this variation.

Frequentist approach
Use limited amounts of data to make general conclusions. Assume that an infinitely large population of values exists and that our data (our 'sample') was randomly selected from this population. Analyze our sample and use the rules of probability to make inferences about the overall population.
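A minimal R sketch of sampling variation (illustrative numbers only: a hypothetical population with mean 170 and SD 10, samples of n = 25). Repeated samples give varying means, and the spread of those means is what the next section formalises as the standard error.

```r
# Sampling variation: repeated samples from a known (hypothetical) population
set.seed(42)
mu <- 170; sigma <- 10; n <- 25

sample_means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

mean(sample_means)   # ~170: sample means centre on the true population mean
sd(sample_means)     # ~2: empirical SD of the sampling distribution
sigma / sqrt(n)      # = 2: the theoretical standard error, sigma / sqrt(n)
```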
The sampling distribution
[Figure: a population (mean µ, SD σ) and several samples, each with its own mean x̄ and SD]
The sampling distribution is the distribution of a statistic across an infinite number of samples; the mean of the sample means equals µ.
The SD of the sampling distribution is the Standard Error of the Mean: SE = σ / √n

The sampling distribution: proportions
[Figure: a population proportion p and several sample proportions p̄, each with its own SD]
Provided that the sample size is not too small, and the sample proportion not too close to 0 or 1, the sampling distribution will approximate to normal.
SE(p) = √( p(1 − p) / n )

Estimating a population parameter from a single study
• The sample statistic is unlikely to be exactly equal to the population parameter.
• We seldom know the population standard deviation, σ, so we use the sample standard deviation, SD, instead: SE = SD / √n
• The SE measures variability in the sample statistic, and how precisely it estimates the true population parameter.
• Precision depends on variation in the underlying population AND on sample size.

The normal distribution
The normal distribution is important in statistics:
• not because many variables (height, IQ, etc.) are normally distributed,
• but because the sampling distribution of many estimates is normal (Central Limit Theorem).

The standard normal distribution
z = (x − µ) / σ
• 95% of values fall within ±1.96 SD of the mean.
• If this were the sampling distribution (and thus the SD = SE), then 95% of study means would fall within ±1.96 SD of the mean.
• This is the basis of 95% confidence intervals: 95% Confidence Interval = mean ± 1.96 SE

Confidence intervals and reference ranges
Sample means: 95% Confidence Interval = ±1.96 SEM (inferential)
Individuals: 95% Reference Range = ±1.96 SD (descriptive)

Interpreting confidence intervals
• 95% of sample means are within 1.96 SE of the true population mean.
• The true population mean is within 1.96 SE of the sample mean 95% of the time.
• We can be 95% confident that the true population mean lies in our 95% CI.

Confidence intervals
Consider 25 simulated results of studies where µ = 10. The individual 95% CIs vary from study to study, but 23/25 of them (92%) contain the true population mean of 10. We will return to this example later… (a simulation sketch follows below).

Descriptive vs inferential statistics
We want to know (estimate) the population parameters, µ and σ. The mean (x̄) and SD describe what we observe in our sample… we use these descriptive/sample statistics to estimate the population parameters, and to estimate our uncertainty; these inferential statistics are what we base our conclusions about the true population on. This all starts with the SE, which is used to calculate the 95% CI, and is related to p-values.
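A minimal sketch in the spirit of the 25-study example (assumed values: µ = 10, σ = 5, n = 20 per study):

```r
# 95% CI coverage: simulate many studies where the true mean is mu = 10
set.seed(1)
mu <- 10; sigma <- 5; n <- 20
t_crit <- qt(0.975, df = n - 1)   # ~2.09 here; ~1.96 in the large-sample limit

covers <- replicate(10000, {
  x  <- rnorm(n, mean = mu, sd = sigma)
  ci <- mean(x) + c(-1, 1) * t_crit * sd(x) / sqrt(n)   # mean +/- t * SE
  ci[1] <= mu && mu <= ci[2]
})

mean(covers)   # ~0.95 long-run coverage; any single run of 25 may give e.g. 23/25
```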
3. NULL HYPOTHESIS SIGNIFICANCE TESTING
• Fisher vs Neyman-Pearson
• P-values, power, effect size, sample size
• Positive Predictive Value

A brief history of significance testing (according to Sterne and Davey Smith, 2001)
Introduced by RA Fisher: p-values as an index of the strength of evidence against the null hypothesis. Fisher advocated p < 0.05 as a standard level for concluding there was evidence against H0, but not as an absolute rule.

A brief history of significance testing
Neyman and Pearson thought Fisher's approach to p-values was subjective, and introduced "hypothesis tests" as an alternative, objective, decision-based approach. In addition to type I error (the focus of Fisher's approach), they argued that type II error (false negatives) warrants consideration. They argued that by fixing the rates of type I and II error in advance, mistakes over many different experiments would be limited, i.e., the basis of power calculations aimed at minimising error (considered across a number of studies…).

The null distribution
Data are compared to a hypothetical distribution, the null distribution. The null distribution is the probability distribution of the test statistic when the null hypothesis is true: that is, the sampling distribution of the statistic under the null hypothesis (this is where the SE comes in).

Null Hypothesis Significance Testing
[Figure: null and alternative distributions, with false positives (α), false negatives (β) and power marked; n.b. shown for a one-tailed test]

Statistical power (1 − β): the probability that a test will correctly reject the null hypothesis when the null hypothesis is false. It depends on:
• effect size
• sample size (N)
• alpha significance criterion (α = 0.05)

Type I and Type II error
• A Type I error is falsely rejecting the null hypothesis (a false positive).
• A Type II error is failing to reject a false null hypothesis (a false negative).

Statistical power
• The probability of committing a Type I error, given that H0 is true, is labelled α (alpha).
• The probability of committing a Type II error is labelled β (beta).
• Power is the probability of correctly rejecting H0 (1 − β).

Statistical power, effect size, sample size, alpha
The power of any test of statistical significance is affected by four main parameters:
• the effect size
• the sample size (N)
• the alpha significance criterion (α)
• statistical power (1 − β), or the chosen or implied beta (β)
All four parameters are mathematically related: if you know any three of them, you can work out the fourth. THIS IS THE BEDROCK OF META-SCIENCE!

Statistical power: effect size
Assume we are testing for a difference between two means, µ1 − µ2. Power increases as we increase the size of the difference:
Effect size = (µ1 − µ2) / pooled SD

Statistical power: effect size
Or we can reduce the variability (SD):
SMD = (µ1 − µ2) / pooled SD, and SE = SD / √n
This is often hard to do, but experimental design can help, e.g., within-subject / repeated-measures designs.

Statistical power: increases with sample size
As sample sizes increase, the means of those samples tend more towards the true population mean, so their variability decreases:
SE = SD / √n
Note: the SD itself remains unchanged.

Power and false positives
Suppose: in 90% of cases the null hypothesis is true; the significance level is set at 5%; and the average power of studies is 80%. If, in 1,000 studies, 100 true associations exist, we will detect 80 of them (80%). Of the remaining 900 non-associations, we will falsely declare 45 (5%) significant.
Sterne & Davey Smith (2001). British Medical Journal, 322, 226–231.

Power and false positives
Now suppose the average power of studies is only 20%. Of the 100 true associations, we will detect 20 (20%). Of the 900 non-associations, we will still falsely declare 45 (5%) significant. That is roughly a 1-in-3 chance that a statistically significant finding is true! This is known as the Positive Predictive Value (PPV).

Power, false positives and PPV
Lower power means a higher proportion of false positives, particularly for exploratory research.
Krzywinski & Altman (2013). Nat Methods, 10, 1139–1140.
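These worked examples are instances of the general formula PPV = (power × prior) / (power × prior + α × (1 − prior)), where "prior" is the pre-study probability that H1 is true. A minimal R sketch reproducing the numbers above:

```r
# Positive Predictive Value: P(effect is real | p < 0.05)
ppv <- function(power, alpha = 0.05, prior = 0.10) {
  (power * prior) / (power * prior + alpha * (1 - prior))
}

ppv(0.80)   # 80 true positives / (80 + 45 false positives) = 0.64
ppv(0.20)   # 20 / (20 + 45) ~ 0.31: roughly a 1-in-3 chance the finding is true
```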
Power and effect inflation
[Figure: null and alternative distributions with µ and α = 0.05 marked; only estimates exceeding the significance threshold are detected]
Small studies with p < 0.05 tend to overestimate the effect size (a simulation sketch at the end of section 4, below, illustrates this).

Implications of low power across studies
1. More false negatives (but these often go unpublished…)
2. More false positives
3. Bigger (but biased) effects
Button et al (2013). Nature Reviews Neuroscience, 14, 365–376.

Publication bias
Rosenthal (1979): "The most extreme view of this problem is that journals are filled with the 5% of studies that show Type I errors, while the file drawers back at the lab are filled with the 95% of the studies that show nonsignificant (i.e., p > 0.05) results."
Note that this would only be the case if H0 were always true (which it hopefully isn't).
Adapted from Marcus Munafò.

What's the prior probability of our H1 being true?
H1: men can be pregnant.
PPV = (0.8 × 0) / (0.8 × 0 + 0.05) = 0
H1: students will remember more words in a test if they later practised them.

4. FROM NHST TO EFFECT ESTIMATION
• The problem with p-values
• The new statistics: effect size estimation and confidence intervals

Limitations of NHST
• The nature of null hypothesis testing means 'non-significant' results are harder to interpret.
• Failure to reject H0 (i.e., p > 0.05) is not the same as accepting H0.
• 'Significant' results are seen as more interesting and publishable = publication bias.
• Focusing on p < 0.05 as a yes/no decision is not particularly informative (p = 0.06 is not so very different from p = 0.05), and it encourages LAZY CONCLUSIONS!
• 'Statistically significant' is often confused with 'clinically / theoretically / biologically significant'.

Interpreting null findings (Chalder et al 2014)
"We considered whether our result was sufficiently precise to rule out the possibility of a beneficial effect. The most statistically powerful analysis was using the Beck depression inventory [BDI] as a continuous outcome measure. …our primary analysis indicated an adjusted between group difference in mean BDI scores of −0.54 (95% confidence interval −3.06 to 1.99). It is difficult to define precisely what would constitute a clinically important treatment effect, but the National Institute for Health and Clinical Excellence guideline panel have suggested that this could correspond to around 3 points or 0.35 standard deviations at baseline on the Hamilton depression rating scale and close to the 0.33 standard deviations used in our power calculation. The equivalent difference in terms of Beck depression inventory score would be between 4.1 and 3.9 points, respectively, based on our observed standard deviation of 11.8 points at four months post-randomisation. This suggests that we have excluded the possibility, at least with 95% confidence, that the intervention added to usual care is clinically effective in improving symptoms of depression compared with usual care alone."
Far more informative than "there was no effect, p > 0.05"!

The 'New Statistics' (Cumming, 2013)
Estimation based on effect sizes, confidence intervals, and meta-analysis, rather than on p < 0.05. Confidence intervals are more informative than p-values.
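Before turning to study design, here is the promised minimal simulation of the effect-inflation ("winner's curse") point above. The numbers are assumed for illustration: a true standardized effect of d = 0.3 and n = 20 per group.

```r
# Winner's curse: among small studies, the significant ones overestimate the effect
set.seed(7)
d <- 0.3; n <- 20   # assumed true effect (in SD units) and per-group sample size

sims <- replicate(20000, {
  x <- rnorm(n, mean = 0); y <- rnorm(n, mean = d)
  c(est = mean(y) - mean(x),
    p   = t.test(y, x, var.equal = TRUE)$p.value)
})

mean(sims["est", ])                     # ~0.3: estimates are unbiased on average
mean(sims["est", sims["p", ] < 0.05])   # ~0.75: conditioning on p < 0.05 inflates d
```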
5. DESIGNING STUDIES FOR REPRODUCIBLE RESULTS
• Sample size calculations
• Realistic effect sizes

My favourite Fisher [mis?]-quote
"To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: [s]he may be able to say what the experiment died of." (Ronald Fisher)

Planning a study
• Formulate a clear working research question, which specifies: the population of interest; the primary outcome / dependent variable (e.g., depression); the main exposure / intervention / explanatory variable; the nature of the effect under investigation (causal? correlational?); the setting (lab? clinic?); and important confounders / covariates (e.g., gender, handedness).
• Design the study to test the primary hypothesis / research question.
• Pre-specify the statistical plan to test the primary hypothesis / research question.
• Calculate the sample size required to yield sufficient power for the primary analysis to be informative.
• Pre-register the study (Registered Report, OSF, etc.).

Sample size calculation
Decide how many participants / animals / samples are to be studied.
• It is unethical to study too few: the waste of time and resources, and any risk to participants, are unjustified if the study is unable to answer the question posed.
• It is unethical to study too many: withholding potentially beneficial treatments from patients, and wasting time and money.
We use sample size calculations when we want to estimate some statistic with a particular precision, demonstrate the equivalence of two groups, or detect a difference of a given size between two groups.
Adapted from Kate Tilling.

Sample size calculations: effect size
The true population effect size is unknown (that's why we are doing the study!), BUT we must work hard to estimate a realistic effect size, OR decide what constitutes the minimum clinically/theoretically relevant effect.
• Effect size: estimate realistically, or use the minimum relevant effect; we can also design studies to reduce the SD (within-subject / repeated-measures designs).
• Sample size (N): the MAIN thing we can change in our design.
• Alpha significance criterion (α): fixed, usually 5%.
• Statistical power (1 − β): fixed, hopefully 80–90% (β = 10–20%).

Sample size calculation
What size sample do we require to detect our target difference between two groups, with pre-specified power and significance level, given the variation in our sample? THIS MUST BE PERFORMED PRIOR TO THE STUDY! (Or, as we will do later, using an effect size estimate that is INDEPENDENT of the individual study.)
Example adapted from Kate Tilling.

Set the power and significance level to minimise error:

                 Null true            Null false
  Reject null    Type I error (α)     Correct (1 − β)
  Accept null    Correct (1 − α)      Type II error (β)

Alpha significance criterion (α) = Pr(Type I error) = 5%; statistical power = 1 − Pr(Type II error) = 80%.
Minimising type I and type II error to maximise profit: Lindborg et al 2014.

The target difference must:
1. be determined from previous experience / the literature;
2. be sufficiently large to be clinically / theoretically / biologically important;
3. be sufficiently small to be realistic.
So ask yourself: what is the minimum clinically/theoretically/biologically important difference? Admittedly, this is not trivial… (see the sample size sketch below).
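Once a target difference and SD are in hand, the calculation itself is short in base R: power.t.test solves for whichever of n, delta and power is left unspecified. A sketch using illustrative figures echoing the earlier BDI example (a 3-point target difference, SD 11.8; not a prescription):

```r
# Sample size for a two-group comparison, solved with base R's power.t.test
power.t.test(delta = 3, sd = 11.8, sig.level = 0.05, power = 0.80)
# => n ~ 244 per group

# The same function solves for power when n is fixed:
power.t.test(n = 20, delta = 3, sd = 11.8, sig.level = 0.05)$power
# => ~0.12: a typical n = 20 per group study is badly underpowered here
```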
Anchor / gold standard: ask the patients. How do you feel: better, the same, or worse?

MCID
[Figure: distributions of change in BDI score for patients rating themselves "better" vs "not better"]
ROC analysis / signal detection (JND) is used to find the change in BDI score that optimally classifies patients as better/not better:
Youden's J = max(sensitivity + specificity − 1)

Baseline dependency: GLM
[Figure: difference in change of BDI score (absolute scale) and difference in % change (ratio scale) against baseline BDI score, all studies pooled, for the better−same, better−worse and same−worse contrasts]

  Interaction term   Mean (*)   SD     2.5%    97.5%
  Absolute scale     −0.48      0.19   −0.85   −0.09
  Ratio scale        −0.21      0.22   −0.65    0.23
(*) The difference in mean change from baseline (on the absolute or ratio scale, respectively) for those feeling better relative to those feeling worse, per unit change of baseline.

  Fixed-effects GLM models for "observed" change     AIC
  Absolute, no interaction                           6649
  Absolute, with interaction                         6596
  Ratio, no interaction (i.e., log link)             6469

Minimal clinically important difference
MCID ≈ a 20% reduction in scores.
Implications: the NICE guidance of ≥3 BDI points for a significant treatment effect does not account for baseline dependency. A between-group difference (i.e., treatment effect) of 3 BDI points would be trivial in a sample with an average BDI score of 60, but more relevant in a sample averaging BDI scores of 14. Use the MCID to inform sample sizes for assessing clinically meaningful treatment effects.

Sample size calculation
What size sample do we require to detect our target difference between two groups, with pre-specified power and significance level, given the variation in our sample?
Variation (SD):
• depends on the individuals we recruit to the study;
• but we can estimate it in advance from a similar study in a similar population, or from a pilot study;
• if you have no idea about the variation, then arguably you are not ready for NHST!

Realistic effect sizes
The median experimental study in social psychology has an N of 20 per group. What can you reliably detect with this sample size?
• Men are taller than women (N = 6)
• People above median age are closer to retirement (N = 10)
N = 20 per group is not enough to detect that:
• People who like spicy food are more likely to like Indian food (N = 27 per cell)
• Men weigh more than women (N = 47 per cell)
Is the effect we are studying likely to be bigger than "men weigh more than women"?
Credit: slide adapted from Simmons, J., Nelson, L., & Simonsohn, U., "Life after p-hacking", presentation at the Meeting of the Society for Personality and Social Psychology, New Orleans, LA, 17–19 January 2013.

Realistic effect sizes: ethical implications
Principle of the 'three Rs' (reduce, refine, replace).
80% power = a 20% chance of false negatives; 20% power = an 80% chance of false negatives. Low-powered studies are inefficient and wasteful; we need to consider both too many and too few animals.
Button et al (2013). Nature Reviews Neuroscience, 14, 365–376.

Powering replication
To achieve adequate power, a replication study will need a sample size roughly twice that of the original (where the original p ≈ 0.05). Power of the replication, by replication sample size and discovery p-value (a short sketch below reproduces the p = 0.05 column):

  Replication sample    p = 0.05    p = 0.01    p = 0.005
  n                     50%         74%         81%
  2n                    80%         96%         98%
  4n                    98%         100%        100%

Button et al (2013). Nature Reviews Neuroscience, 14, 365–376.
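The p = 0.05 column follows from first principles. A normal-approximation sketch, assuming the discovery study landed exactly at two-tailed p = 0.05 (observed z of 1.96) and the replication tests in the observed direction:

```r
# Replication power when the discovery result sits exactly at p = 0.05
z_obs <- qnorm(1 - 0.05 / 2)              # observed z in the discovery study, 1.96
k     <- c(1, 2, 4)                       # replication n as a multiple of original n
power <- pnorm(sqrt(k) * z_obs - z_obs)   # multiplying n by k scales z by sqrt(k)
round(power, 2)                           # 0.50 0.79 0.98, matching the table's 50/80/98%
```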
6. INTERPRETING AND REPORTING RESULTS

Interpreting and reporting results
• Think 'cumulative science'.
• Describe the sample, giving N (%) or M (SD) for all baseline measures (pirate plot!).
• Describe the main variables in the analysis in a maximally informative way (e.g., outcome M (SD) by the main explanatory variables).
• Report results for inferential statistics as a point estimate and 95% confidence interval and, if phrased as hypothesis testing, report the p-value.

Try drafting your results tables BEFORE you collect data
• If you're truly clear in your experimental design and analysis plan, this shouldn't be a problem.
• If it is, then arguably you're not ready for data collection.

Try writing your papers without using the phrase 'statistically significant'
• Harder than you think!
• It forces you to write about effect size, precision, and direction.
• p < 0.05 can occur in either direction!

Interpreting results (e.g., an association between exposure and outcome)
1. Is it the result of chance? (Pre-registration; p-values and confidence intervals.)
2. Is it due to bias? (Selection bias? Experimenter bias? Lack of blinding, etc.)
3. Is it due to confounding? (An unmeasured third factor that is associated with both?)
4. Is it an example of reverse causality? Recall bias?
5. Is it causal? Researchers use causal language all the time, but rarely have they adequately tested causation…
Adapted from Sara Brookes.

Acknowledgements
Marcus Munafò (University of Bristol); Kate Tilling (University of Bristol, SSCM Short Course, Intro to Statistics); Sara Brookes (University of Bristol, SSCM Short Course, Intro to Statistics); Daphne Kounali (University of Bristol).
[email protected] | @ButtonKate

PRACTICAL: POWER FAILURE

Calculating statistical power
Using meta-analysis to estimate statistical power requires:
• the meta-analytic effect size
• each study's sample size (N)
• the alpha significance criterion (α = 0.05)

Meta-analysis in neuroscience
Searched Web of Science for "neuroscience" & "meta-analysis" published in 2011: 49 meta-analyses, 730 individual studies.

Low statistical power
• Neuroscience: 49 meta-analyses, 730 studies; median power = 21%. Button et al (2013). Nature Reviews Neuroscience, 14, 365–376.
• Brain imaging: 41 meta-analyses, 461 studies; median power = 8%; excess of significance in brain volume abnormalities (expected 254, observed 349 "significant" studies, p < 0.0001). Ioannidis (2011). Archives of General Psychiatry, 14, 1105–1107.
• Animals: meta-analysis of sex differences in rodent maze performance; 40 studies, median power = 18–31%. Jonasson (2005). Neurosci Biobehav Rev, 28, 811–825.
• Sample size bias and journal impact factor, p < 0.001. Munafò et al (2009). Molecular Psychiatry, 14, 119–120.

Estimate the median statistical power in a field of your choice
1. Search 'meta-analysis' & ['TOPIC' & 'JOURNALS'…].
2. Screen titles/abstracts against inclusion/exclusion criteria.
3. Install G*Power or use R (see the sketch below).
4. For each meta-analysis, extract the lead author, date, and meta-estimate (SMD, OR).
5. For each study in the meta-analysis, extract the lead author, date, total n, group ns, base rate (the proportion of events in the control group, required for OR), and whether p < 0.05 (y/n). Other metrics? IF, number of citations, etc.
6. Pair up to extract data in duplicate. Record the search strategy and data for a PRISMA diagram.

Power and other biomedical domains
[Figures: median power across other biomedical domains, shown for all measures and with cognitive measures excluded. Dumas et al (2016).]
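For steps 3–5, each study's achieved power under the meta-analytic effect size can be computed in base R. A sketch using the equal-group approximation; the SMD and per-group sample sizes below are hypothetical:

```r
# Achieved power of individual studies under the meta-analytic effect size
# Hypothetical inputs: meta-analytic SMD d = 0.4; per-group ns for three studies
d  <- 0.4
ns <- c(12, 25, 60)

sapply(ns, function(n)
  power.t.test(n = n, delta = d, sd = 1, sig.level = 0.05)$power)
# => ~0.15 0.29 0.58; the field's median power is the median of such values
# (for unequal group sizes, pwr::pwr.t2n.test(n1, n2, d) does the same job)
```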
Effect size estimates
Two major groups of effect size estimate exist for continuous outcomes:
• the difference between means (effect size d)
• zero-order correlations (effect size r)
The odds ratio is also used, for dichotomous outcomes.

Standardized Mean Difference
The difference in means observed in two groups, standardized across those groups: the difference between the two group means is calculated, and this difference is standardized by dividing by the pooled standard deviation:
SMD = (µ1 − µ2) / pooled SD
The SMD can be converted back to absolute values using the pooled standard deviation.

Odds Ratio
The difference in the odds of a dichotomous outcome between two groups. For example, the two groups may be the placebo arm and the treatment arm of a clinical trial, and the dichotomous outcome may be the number of deaths (versus non-deaths). An OR of 1.2 means the odds of the outcome are 1.2 times (i.e., 20%) higher in the treatment arm.
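Both quantities are easy to compute directly. A minimal sketch with made-up data for illustration:

```r
# SMD (Cohen's d) from two groups, and an odds ratio from a 2x2 table
# (all data below are made up for illustration)
set.seed(3)
g1 <- rnorm(30, mean = 12, sd = 4)
g2 <- rnorm(30, mean = 10, sd = 4)

pooled_sd <- sqrt(((length(g1) - 1) * var(g1) + (length(g2) - 1) * var(g2)) /
                  (length(g1) + length(g2) - 2))
smd <- (mean(g1) - mean(g2)) / pooled_sd   # standardized mean difference
smd * pooled_sd                            # back to absolute (raw-score) units

# Odds ratio: deaths vs non-deaths in treatment and placebo arms (made-up counts)
deaths <- c(treatment = 12, placebo = 10)
alive  <- c(treatment = 88, placebo = 90)
or <- (deaths["treatment"] / alive["treatment"]) /
      (deaths["placebo"]   / alive["placebo"])
unname(or)   # ~1.23: odds of death ~23% higher in the treatment arm
```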