Planning, Performing, and Publishing Research with Confidence Limits

A tutorial lecture given at the annual meeting of the American College of Sports Medicine, Seattle, June 4, 1999.
© Will G Hopkins, Physiology and Physical Education, University of Otago, Dunedin, NZ. [email protected]

Outline
- Definitions and mis/interpretations
- Planning: sample size
- Performing: sample size "on the fly"
- Publishing: Methods, Results, Discussion; meta-analysis; publishing non-significant outcomes
- Conclusions: dis/advantages

Definitions and Mis/interpretations

Confidence limits: definitions
- "Margin of error". Example: in a survey of 1000 voters, Democrats polled 43% and Republicans 33%; the margin of error is ±3% (for a result of 50%...).
- Likely range of the true value. "Likely" is usually 95%; "true value" = population value = the value you would get if you studied the entire population.
- Example: survey of 1000 voters: Democrats 43% (likely range 40 to 46%); Democrats minus Republicans 10% (likely range 5 to 15%).
- Example: in a study of 64 subjects, the correlation between height and weight was 0.68 (likely range 0.52 to 0.79). [Figure: the observed value with its lower and upper confidence limits marked on a correlation axis running from 0.00 to 1.]
- Confidence interval: the difference between the upper and lower confidence limits.

Amazing facts about confidence intervals (for normally distributed statistics)
- To halve the interval, you have to quadruple the sample size.
- A 99% interval is 1.3 times wider than a 95% interval; you need 1.7 times the sample size for the same width.
- A 90% interval is 0.8 of the width of a 95% interval; you need 0.7 times the sample size for the same width.

How to derive confidence limits
- Find a function(true value, observed value, data) with a known probability distribution, e.g. (n-1)s²/σ², which has a chi-squared (χ²) distribution.
- Calculate a critical value such that, for 2.5% of the time, function(true value, observed value, data) < critical value. [Figure: the probability distribution of the function, with probability area 0.025 below the critical value.]
- Rearranging: for 2.5% of the time, true value > function'(observed value, data, critical value) = the upper confidence limit.
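To make the recipe concrete, here is a minimal sketch (not part of the original lecture) applying it to a standard deviation: since (n-1)s²/σ² follows a chi-squared distribution with n-1 degrees of freedom, rearranging its 2.5% and 97.5% critical values gives the limits. The sample values are invented for illustration.

```python
import math
from scipy import stats

def sd_ci(s, n, level=0.95):
    """Confidence limits for a population SD, from a sample SD s of size n."""
    df = n - 1
    alpha = 1 - level
    hi_crit = stats.chi2.ppf(1 - alpha / 2, df)  # 97.5% point -> lower limit
    lo_crit = stats.chi2.ppf(alpha / 2, df)      # 2.5% point  -> upper limit
    return s * math.sqrt(df / hi_crit), s * math.sqrt(df / lo_crit)

# Hypothetical example: sample SD of 1.0 from 10 subjects.
print(sd_ci(1.0, 10))  # ~ (0.69, 1.83): note the skewed interval
```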
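The same idea gives the likely range quoted in the height-weight example above: the Fisher z transform of a correlation is approximately normal with standard error 1/√(n-3) (the same transform the Publishing section below recommends for correlations). A sketch, again not from the lecture itself:

```python
import math
from scipy import stats

def correlation_ci(r, n, level=0.95):
    """Approximate confidence limits for a Pearson correlation."""
    z = math.atanh(r)                 # Fisher z transform
    se = 1.0 / math.sqrt(n - 3)       # standard error of z
    crit = stats.norm.ppf(1 - (1 - level) / 2)
    # Back-transform the limits on z to the correlation scale.
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

print(correlation_ci(0.68, 64))  # ~ (0.52, 0.79), as in the example above
```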
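The "amazing facts" above also follow directly: for a normally distributed statistic the interval width is proportional to the normal quantile for the chosen level and to 1/√n, so widths scale as ratios of quantiles and sample sizes as their squares. A quick check:

```python
from scipy import stats

z95 = stats.norm.ppf(0.975)  # 1.96
z99 = stats.norm.ppf(0.995)  # 2.58
z90 = stats.norm.ppf(0.950)  # 1.64

print(z99 / z95, (z99 / z95) ** 2)  # ~1.3x wider, ~1.7x the sample size
print(z90 / z95, (z90 / z95) ** 2)  # ~0.8 of the width, ~0.7x the sample size
```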
Mis/interpretation of confidence limits
- Confidence limits for simple proportions and correlation coefficients are hard to misinterpret; those for changes in means are easier to get wrong.
- Example: the change in blood volume in a study was 0.52 L (likely range 0.12 to 0.92 L). Candidate readings:
  - "For 95% of subjects, the change was/would be between 0.12 and 0.92 L."
  - "The average change in the population would be between 0.12 and 0.92 L."
  - "The change for the average subject would be between 0.12 and 0.92 L."
- Only the population-average readings are safe: the limits say nothing about the spread of individual responses, and there may be individual differences in the change.

P value: definition
- The probability of a more extreme absolute value than the observed one if the true value were zero (null).
- Example: 20 subjects, correlation r = 0.25, p = 0.29. [Figure: the distribution of correlations for no effect and n = 20, with the observed effect (r = 0.25) marked and the two tail areas summing to the p value of 0.29.] (A computational check of this example appears after the sample-size slides below.)

"Statistically significant": definitions
- P < 0.05.
- Equivalently, zero lies outside the confidence interval.
- Examples: four correlations for samples of size 20 [shown as intervals on a correlation axis from -0.50 to 1]:

  r     likely range     P
  0.70  0.37 to 0.87     0.007
  0.44  0.00 to 0.74     0.05
  0.25  -0.22 to 0.62    0.29
  0.00  -0.44 to 0.44    1.00

Incredibly interesting information about statistical significance and confidence intervals
- Two independent estimates of a normally distributed statistic with equal confidence intervals are significantly different at the 5% level if the overlap of their intervals is less than 1 - √2/2 ≈ 0.29 of the length of one interval. [Figure: pairs of equal intervals illustrating p < 0.05, p = 0.05, and p > 0.05.]
- If the intervals are very unequal, the overlap rule changes. [Figure: pairs of unequal intervals illustrating p < 0.05, p = 0.05, and p > 0.05.]

Type I and II errors
- You could be wrong about significance or lack of it.
- Type I error = false alarm. Rate = 5% for a zero real effect.
- Type II error = failed alarm. Traditional acceptable rate = 20% for the smallest worthwhile effect.
- Lots of tests for significance implies more chance of at least one false alarm: "inflated Type I error". Ditto Type II error?
- Deal with inflated Type I error by reducing the p value. Should we adjust confidence intervals? No.

Mis/interpretation of P < 0.05 (for an observed positive effect)
- "The effect is probably big."
- "There's a < 5% chance the effect is zero."
- "There's a < 2.5% chance the effect is < zero."
- "There's a high chance the effect is > zero."
- "The effect is publishable."

Mis/interpretation of P > 0.05 (for an observed positive effect)
- "The effect is not publishable."
- "There is no effect."
- "The effect is probably zero or trivial."
- "There's a reasonable chance the effect is < zero."

Planning Research

Sample size via statistical significance
- The sample must be big enough to be sure you will detect the smallest worthwhile effect: "to be sure" = 80% of the time; "detect" = P < 0.05.
- Smallest worthwhile effects (what impacts your subjects):
  - correlation = 0.10
  - relative risk = 1.2 (or frequency difference = 10%)
  - difference in means = 0.2 of a between-subject standard deviation
  - change in means = 0.5 of a within-subject standard deviation
- Example: 760 subjects to detect a correlation of 0.10.
- Example: 68 subjects to detect a 0.5% change in a crossover study when the within-subject variation is 1%.
- But a 95% likely range doesn't work properly with traditional sample-size estimation (maybe). Example: a correlation of 0.06 with a sample size of 760. With a symmetrical 47.5% + 47.5% (= 95%) likely range, the interval crosses zero yet extends beyond 0.1: not significant, but could be substantial. Huh? With an asymmetrical 47.5% + 30% likely range, the interval crosses zero but stays within 0.1: not significant, and can't be substantial. OK! [Figure: both intervals on a correlation axis from -0.1 to 0.1.]

Sample size via confidence limits
- The sample must be big enough for acceptable precision of the effect: "precision" = 95% confidence limits; "acceptable" = any value of the effect within these limits would not impact your subjects.
- Example: you need 380 subjects to delimit a correlation of zero. [Figure: the confidence interval for N = 380 just fits between the smallest worthwhile effects at -0.10 and +0.10 on the correlation axis.]
- But the sample size needed to detect or delimit the smallest effect is overkill for larger effects. Example: with a sample size of 760, the confidence interval for a correlation of 0.80 is far narrower than that for a correlation of 0.10. [Figure: both intervals on a correlation axis from -0.1 to 1.]
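As a rough sketch of where numbers like these come from for a correlation (not from the lecture; conventions for rounding and the exact formula differ, so the results land near, rather than exactly on, the 760 and 380 quoted above):

```python
import math
from scipy import stats

def n_to_detect(r, alpha=0.05, power=0.80):
    """Subjects needed for 80% power to detect r at p < alpha (two-tailed)."""
    z_r = math.atanh(r)                  # effect size on the Fisher z scale
    z_a = stats.norm.ppf(1 - alpha / 2)  # 1.96
    z_b = stats.norm.ppf(power)          # 0.84
    return math.ceil(((z_a + z_b) / z_r) ** 2 + 3)

def n_to_delimit(half_width, level=0.95):
    """Subjects needed so the CI of r ~ 0 is no wider than +/- half_width."""
    crit = stats.norm.ppf(1 - (1 - level) / 2)
    return math.ceil((crit / math.atanh(half_width)) ** 2 + 3)

print(n_to_detect(0.10))   # 783 here; the lecture quotes ~760
print(n_to_delimit(0.10))  # 385 here; the lecture quotes 380
```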
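And the computational check promised under the p-value slide above: under the null hypothesis of zero correlation, t = r√(n-2)/√(1-r²) follows a t distribution with n-2 degrees of freedom. A minimal sketch:

```python
import math
from scipy import stats

def correlation_p(r, n):
    """Two-tailed p value for a Pearson correlation under the null of r = 0."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    return 2 * stats.t.sf(abs(t), n - 2)

print(correlation_p(0.25, 20))  # ~0.29, as on the slide
```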
So why not start with a smaller sample and do more subjects only if necessary? Yes, I call it...

Performing Research

Sample size "on the fly"
- Start with a small sample; add subjects until you get acceptable precision for the effect.
- Acceptable precision is defined as before, and you need a qualitative scale for magnitudes of effects.
- Example: sample sizes to delimit correlations. [Figure: the magnitude scale for correlations (trivial < 0.1, small 0.1-0.3, moderate 0.3-0.5, large 0.5-0.7, very large 0.7-0.9, nearly perfect 0.9-1.0), with approximate sample sizes to delimit a correlation at each point: ~380 at 0, ~350 at 0.1, ~270 at 0.3, ~155 at 0.5, ~46 at 0.7.]

Problems with sampling on the fly
- Do not sample until you get statistical significance: the resulting outcomes are biased larger than life. Sampling until the confidence interval is acceptable also produces a bias, but it is negligible.
- But researchers will rush into print as soon as they get statistical significance.
- And funding agencies prefer to give money once (but you could give some back!).
- And all the big effects have been researched anyway? No, not really.

Publishing Research

In the Methods
"We show the precision of our estimates of outcome statistics as 95% confidence limits (which define the likely range of the true value in the population from which we drew our sample)."

Amazingly useful tips on calculating confidence limits
- Simple differences between means: stats program.
- Other normally distributed statistics: from the observed value and its p value (a sketch of this trick appears after the bootstrapping slides below).
- Relative risks: stats program.
- Correlations: Fisher's z transform.
- Standard deviations and other root-mean-square variations: chi-squared distribution.
- Coefficients of variation: the standard deviation of 100 x the natural log of the variable; back-transform when the CV > 5%. Use the adjustment of Tate and Klett to get shorter intervals for SDs and CVs from small samples. [Figure: usual vs adjusted confidence limits for the coefficient of variation of 10 subjects in 2 tests, on an axis from 0 to 3%.]
- Ratios of independent standard deviations: F distribution.
- R² (variance explained): convert to a correlation.
- Use the spreadsheet at sportsci.org/stats for all the above.
- Effect sizes (mean/standard deviation): non-central F distribution, or bootstrapping.
- Really awful statistics: bootstrapping.

Bootstrapping (resampling) for confidence limits
- Use it for difficult statistics, e.g. for grossly non-normal repeated measures with missing values. Here's how:
  - For a large-enough sample, you can recreate (sort of) the population by duplicating the sample endlessly.
  - Draw 1000 samples (of the same size as your original) from this population.
  - Calculate your outcome statistic for each of these samples, rank the 1000 values, then find the 25th and 975th placegetters. These are the confidence limits.
- Problems: painful to generate, and no good for infrequent levels of nominal variables.
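A minimal sketch of that percentile bootstrap, with an invented skewed sample and the median standing in for a "difficult" statistic:

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=50)  # hypothetical data

# Resample with replacement 1000 times and compute the statistic each time.
boot = np.array([
    np.median(rng.choice(sample, size=sample.size, replace=True))
    for _ in range(1000)
])
boot.sort()
lower, upper = boot[24], boot[974]  # the 25th and 975th placegetters
print(f"median = {np.median(sample):.2f}, 95% limits {lower:.2f} to {upper:.2f}")
```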
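And the observed-value-plus-p-value trick promised in the tips above: for a normally distributed statistic, the two-tailed p value fixes the standard error, which in turn gives the limits. A sketch; the p value here is hypothetical, chosen so the output reproduces the blood-volume example from earlier.

```python
from scipy import stats

def ci_from_p(effect, p, level=0.95):
    """Confidence limits for a normally distributed effect from its p value."""
    se = abs(effect) / stats.norm.ppf(1 - p / 2)  # p value implies the SE
    crit = stats.norm.ppf(1 - (1 - level) / 2)
    return effect - crit * se, effect + crit * se

# Hypothetical: a 0.52 L change with two-tailed p = 0.011.
print(ci_from_p(0.52, 0.011))  # ~ (0.12, 0.92), as in the earlier example
```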
In the Results

In TEXT
- Change or difference in means, first mention: "...0.42 (95% confidence/likely limits/range, -0.09 to 0.93)" or "...0.42 (95% confidence/likely limits/range, ±0.51)". Thereafter: "...2.6 (1.4 to 3.8)" or "...2.6 (±1.2)", etc.
- Correlations, relative risks, odds ratios, standard deviations, and ratios of standard deviations: you can't use ± because the confidence interval is skewed: "...a correlation of 0.90 (0.67 to 0.97)...", "...a coefficient of variation of 1.3% (0.9 to 1.9%)..."

In TABLES
- Confidence intervals:

              r     likely range
  Variable A  0.70  0.37 to 0.87
  Variable B  0.44  0.00 to 0.74
  Variable C  0.25  -0.22 to 0.62
  Variable D  0.00  -0.44 to 0.44

- P values:

              r     p
  Variable A  0.70  0.007
  Variable B  0.44  0.05
  Variable C  0.25  0.29
  Variable D  0.00  1.00

- Asterisks:

              r
  Variable A  0.70**
  Variable B  0.44*
  Variable C  0.25
  Variable D  0.00

In FIGURES
- [Figure: change in power (%) for subjects told carbohydrate, told placebo, and not told; bars are 95% likely ranges.]
- [Figure: change in 5000-m time (%) over 14 weeks of training (sea level, altitude, sea level) for live-low/train-low, live-high/train-high, and live-high/train-low groups, with the likely range of the true change shown.]

In the Discussion
- Interpret the observed effect and its 95% confidence limits qualitatively.
- Example: you observed a moderate correlation, but the true value of the correlation could be anything between trivial and very strong. [Figure: the observed correlation and its likely range on the magnitude scale: trivial, small, moderate, large, very large, nearly perfect.]

Meta-Analysis
- Deriving a single estimate and confidence interval for an effect from several studies. [Figures: how it works for two studies, with equal confidence intervals and with unequal confidence intervals: Study 1 and Study 2 combine into a narrower Study 1+2 interval.] (A computational sketch appears at the end of this document.)

Publishing non-significant outcomes
- Publishing only significant effects from small-scale studies leads to publication bias. Publishing effects with their confidence limits, regardless of magnitude, is free of that bias.
- Many smaller studies are probably better than a few larger ones anyway.
- So bully the editor into accepting the paper about your seemingly inconclusive small-scale study.

Conclusions

Disadvantages of statistical significance
- Emphasizes testing of hypotheses: the aim is to detect an effect, and effects are zero until proven otherwise.
- You have to understand Type I and II errors.
- Hard to understand; easy to misinterpret.
- You have to consider sample size.
- Focuses attention on statistically significant effects.

Advantages of statistical significance
- Familiar.
- All stats programs give p values.
- Easy to put asterisks in tables and figures.

Disadvantages of confidence limits
- Unfamiliar.
- Not always available in stats programs.
- Cluttersome in tables.
- Display in time series can be a challenge.

Advantages of confidence limits
- Emphasizes precision of estimation: the aim is to delimit an effect, and effects are never zero.
- Only one kind of "error".
- Meaning is reasonably clear, even to lay readers.
- No confusion between significance and magnitude.
- Journals now require them.
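The computational sketch promised under Meta-Analysis above: a minimal fixed-effect (inverse-variance) combination of two study estimates, one standard way to produce the combined intervals pictured there. The study numbers are invented for illustration.

```python
from scipy import stats

def combine(estimates, ses, level=0.95):
    """Fixed-effect meta-analysis: weight each study by 1/SE^2."""
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = (1 / sum(weights)) ** 0.5  # narrower than either study alone
    crit = stats.norm.ppf(1 - (1 - level) / 2)
    return pooled, pooled - crit * pooled_se, pooled + crit * pooled_se

# Hypothetical studies: the wider interval (study 2) gets less weight.
print(combine([0.40, 0.10], [0.15, 0.30]))  # ~ (0.34, 0.08, 0.60)
```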