Law & Cognition
Prof. Dan Kahan
Yale Law School
Fall 2016

Some Observations on Significance

What is significance? And how significant is it?

1. p-values.

Social science (like science generally) involves presenting empirical evidence from which we can draw inferences about how the world works. The evidence is typically in the form of some sort of statistical correlation between two or more variables or observations. “Statistical significance” is a threshold value used to identify the acceptable risk that an observed correlation could have occurred by chance even if in fact the variables are not genuinely correlated. The conventional threshold is 0.05, which means the probability is less than 0.05—hence “p < 0.05”—that a correlation as large as (or larger than) the one in question would have been observed if in fact the correlation is zero. If a researcher reports a result that is “significant at p < 0.10,” then there is no more than a 0.10, or 10%, probability that one would have observed a correlation that large even if the “real” correlation is zero; if she reports “p = 0.07,” then there is a 7% chance a correlation of that magnitude or larger would have occurred by chance even assuming there really isn’t any.

2. The basic idea.

“Statistical significance” treats the behavior of a random process as the baseline for assessing the risk of error in an empirical test. See generally Robert P. Abelson, Statistics as Principled Argument 18-38 (1995). A random process is one that involves the interplay of various dynamics the existence and impact of which cannot be measured with certainty (e.g., all the forces that combine to determine how many times my cat will walk across the keyboard of my computer on a given day). As you likely know, the outcomes associated with such a process normally (not invariably!) form a bell-shaped pattern.
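That bell-shaped pattern can be seen in a quick simulation (a minimal sketch with made-up numbers, not real data): when an outcome is the sum of many small, independent influences, the distribution of outcomes is approximately normal, and roughly 95% of them fall within two standard deviations of the mean.

```python
# Minimal sketch (hypothetical numbers): summing many small, independent
# influences produces an approximately bell-shaped (normal) distribution.
import random
import statistics

random.seed(1)

# Each "day" is the sum of 50 small random influences on the cat's behavior.
outcomes = [sum(random.uniform(-1, 1) for _ in range(50))
            for _ in range(10_000)]

mean = statistics.fmean(outcomes)
sd = statistics.stdev(outcomes)

# For a roughly normal distribution, about 95% of outcomes lie within 2 SDs.
within_2sd = sum(abs(x - mean) <= 2 * sd for x in outcomes) / len(outcomes)
print(round(mean, 2), round(sd, 2), round(within_2sd, 2))
```

The "95% within 2 SDs" property is exactly what significance testing exploits: it lets us say how rarely a random process would produce an outcome a given distance from the mean.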
If one knows the mean or average value (the one at the very top or center of the bell) and the “standard deviation” (a number that characterizes how gradually or steeply the bell curve slopes downward on either side of the mean), one can calculate the fraction of the outcomes that can be expected to have values that exceed or fall short of the mean by any specified amount.

The bell curve distribution associated with a random process gives us a way to test our belief that we understand a particular process (or at least do to some extent). We are likely to conclude that we understand something about it if we can successfully predict how variance in one of its dynamics will affect its outcome values (e.g., “hey, look—the more often I pull my cat’s tail one day, the greater the number of times she will walk across the keyboard on the next!”). But to deal with the skeptic (within or without), we can test our claim of knowledge by seeing how unlikely it would be that we would have observed such results if in fact the process were simply random. Looking at the bell curve that has the right parameters (mean and standard deviation) for the process under examination, we can calculate the fraction of the outcomes we would expect to be as far apart as the ones we observed if they occurred by the chance operations of a random process. If we find that fewer than 5% would be, then we can say, “Ha! If the dynamic I have identified does not have the effect that I say it does—if this process is operating randomly with respect to it—then the probability of seeing a correlation at least this strong would be less than 5%!”

Fig. 1. Random process, normal distribution, and p-value. Imagine the mean difference in how many times my cat walks across the keyboard on any two consecutive days is 0 with a standard deviation of 3.
I perform an experiment and note that the difference between the number of times (2) she walks on my keyboard the day after I pulled her tail 4 times and the number (9) the day after I pulled her tail 8 times is 7, which is 2.33 SDs. The likelihood that the difference between the numbers of times she walked on my keyboard on consecutive days would be that large by chance is only 2% (p = 0.02, using a two-tail test, so to speak)! Image: http://syque.com/quality_tools/toolbook/Variation/measuring_spread.htm.

3. Type I vs. Type II errors.

By design, the statistical significance testing strategy is conservative. If we select a threshold value of p ≤ 0.05, then we are declaring ourselves unwilling to accept a risk any greater than 1 chance in 20—and if p ≤ 0.10, 1 chance in 10—that the observed correlation would have occurred by chance. At the same time, we are tolerating the risk that we will fail to recognize, as unlikely to be the product of chance, correlations that are in fact quite unlikely—only 15% or 20% or 25% probable—to occur by chance. The former type of risk—“recognizing” a correlation to exist when in fact it is due to chance—is known as Type I error, and the latter—“failing to recognize” a correlation that is genuinely not due to chance—as Type II error. The p < (or ≤) 0.05 standard reflects a relative aversion to Type I errors. Scientific craft norms prize modesty; it is as if science were saying, “better that 19 (or 9) true insights go unrecognized than that 1 false insight be declared.” This sort of reticence might make sense for science (or might not; consider Jacob Cohen, The Earth Is Round (p < .05), 49 American Psychologist 997 (1994)). But this sort of asymmetry between Type I and Type II errors would be idiotic in many domains in which the costs of both are comparably high. Accordingly, in many practical realms of life—from public health to finance—people act on the basis of correlations that have p-values > 0.05.
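Both halves of the tradeoff can be made concrete with the Fig. 1 numbers (a sketch using the figure's hypothetical mean difference of 0 and SD of 3; the "true effect of 4" in the second part is likewise invented): first the two-tail p-value for the observed difference of 7, then the Type II side, i.e., how often a test at p < 0.05 would miss a real but modest effect.

```python
# Minimal sketch using the hypothetical Fig. 1 numbers (mean diff 0, SD 3).
import math
import random

SD = 3

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# --- Fig. 1: how unlikely is a difference of 7 under pure chance? ---
z = 7 / SD                            # about 2.33 standard deviations
p_two_tail = 2 * (1 - normal_cdf(z))
print(round(p_two_tail, 2))           # matches the p = 0.02 in Fig. 1

# --- Type II error: suppose tail-pulling really adds 4 walks a day. ---
random.seed(3)
TRUE_EFFECT = 4                       # assumed (invented) real effect
CUTOFF = 1.96 * SD                    # two-tail significance at p < 0.05

trials = 10_000
detected = sum(abs(random.gauss(TRUE_EFFECT, SD)) > CUTOFF
               for _ in range(trials))
type_ii_rate = 1 - detected / trials
print(round(type_ii_rate, 2))
```

Under these assumptions the real effect is missed roughly 70% of the time: a vivid illustration of why a p ≤ 0.05 convention, conservative about Type I errors, can be profligate about Type II errors.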
For this reason, the law, quite sensibly, is willing to consider correlational evidence that has a p-value > 0.05 where the failure to do so would impose a substantial risk of Type II error. See generally Matrixx Initiatives, Inc. v. Siracusano, 131 S. Ct. 1309, 1319-21 (2011).

Within science, moreover, the failure to appreciate the relationship between p-values and the relative aversion to Type I and Type II errors sometimes leads researchers to make an embarrassing blunder. Because a finding of nonsignificance at p < 0.05 (or even p < 0.10) is consistent with a very high likelihood that a correlation exists, it is manifestly incorrect to treat the failure of a correlation to reach the threshold of significance as evidence that one variable is not correlated with another. Yet some researchers, particularly when they set out to test a hypothesis that no such correlation exists, make the mistake of proclaiming exactly that. In effect, these researchers are treating science’s aversion to Type I errors as a license to make Type II errors.

If one wants to test the hypothesis that no correlation exists between some process and an outcome, then one needs a statistical test for the absence of a correlation that reflects the same aversion to mistakenly claiming to know something one doesn’t. The procedure for such testing involves constructing a design discerning enough to rule out the possibility that one failed to observe a significant correlation that actually exists. The most important consideration is “statistical power”—sample size, essentially—since the likelihood of observing a significant result at p < 0.05 (or any other threshold) diminishes as the sample size goes down. See generally David L. Streiner, Unicorns Do Exist: A Tutorial on “Proving” the Null Hypothesis, 48 Canadian Journal of Psychiatry 756 (2003).

4. Effect size vs. significance.
Another hazard associated with the convention of statistical significance is the conflation of p-value and effect size. Lots of correlations that are significant at p < 0.05 (particularly where the sample is large) are too small to matter for any practical purpose (including insight into how the world works). The right thing to do, then, is to report a suitable measure of the size of the correlation. Nevertheless, fixation on statistical significance (the misunderstanding, really, of what it means) has traditionally led researchers to overemphasize significance measures, sometimes reporting only those (particularly in social psychology); the same fetishization of significance lies behind the (fortunately declining) practice of using multiple asterisks to designate progressively lower p-values—implying that lower-p-value findings are in and of themselves of greater consequence (they aren’t). See generally Cohen, supra.

5. Graphic reporting and confidence intervals.

The intelligent use of graphic reporting of results can preserve the value of significance testing while avoiding some of the hazards of p-value fetishism and related bad practices. Good graphic reporting consists of essentially three things: first, the selection of intuitively comprehensible and relevant units for the predictor and outcome variables; second, a display format that draws attention to important effects and omits needless and likely distracting details and frills; and third, an informative measure of precision, such as 0.95 confidence intervals. The first two features make it possible for readers to understand, in a way that textual reporting of results and statistical output tables rarely do, the practical significance of the study findings.
The third feature not only conveys the information necessary to assess statistical significance at p < 0.05 (the result is significant at that level unless the interval includes 0); it also allows the reader to see the probabilistic range of values associated with the relevant correlation estimate. She can then decide for herself, given her own interests and tolerance for error, what to make of the observed correlation. Indeed, when reported with confidence intervals, results that are not statistically significant can be represented in a manner that permits the reader to get a sense of the likelihood of Type II error (the closer 0 is to the end of the interval, the more likely the “true” correlation is not 0), and to make whatever she will of results that are not different from 0 at p < 0.05 but that are significantly different from many other potential non-0 values that may be of interest. See generally Andrew Gelman, Cristian Pasarica & Rahul Dodhia, Let’s Practice What We Preach: Turning Tables into Graphs, 56 Am. Stat. 121 (2002); Lee Epstein, Andrew Martin & Christina Boyd, On the Effective Communication of the Results of Empirical Studies, Part II, 60 Vand. L. Rev. 79 (2007); Lee Epstein, Andrew Martin & Matthew Schneider, On the Effective Communication of the Results of Empirical Studies, Part I, 59 Vand. L. Rev. 1811-71 (2007).

Fig. 2. Graphic presentation of data. The table on the left becomes the graphic on the right via the graphic reporting methods discussed in Gelman et al., Epstein et al., and others. From Dan M. Kahan, Culture, Cognition, and Consent: Who Perceives What, and Why, in ‘Acquaintance Rape’ Cases, 158 U. Pa. L. Rev. 729 (2010).

6. Causation and validity vs. significance.

Finally, statistical significance has nothing to do with study validity! The p-value quantifies the measurement error associated with the estimates derived from the statistical model used to analyze the data.
Defective study designs result in model error. Model error might or might not be quantifiable, but the error of a misspecified model is not compensated for by the statistical significance or precision of its estimates. Accordingly, when trying to satisfy yourself that a design convincingly supports the inference that the researcher is drawing (internal validity) and is uncovering a dynamic that one can expect to affect the real-world process that the study is modeling (external validity), don’t give even a gram of credit to a low p-value.

The causal interpretation of correlations in an empirical study is one important example of this point. The basis for inferring causation between correlated variables is always independent of the data (always, whether the study is experimental or observational!); it is only after one is satisfied that the design supports a causal inference that one turns to statistical significance testing to assure oneself that there is a sufficiently low risk that the observed effect would have occurred by chance.

7. Null hypothesis testing.

Statistical significance is integral to the “null hypothesis testing” paradigm in the social (and much of the physical) sciences. The paradigm involves conducting a test to see whether the relationship between some influence of interest and some outcome of interest differs from zero by an amount that is “statistically significant.” If so, one can “reject the null hypothesis”—the hypothesis that the influence and the outcome are in fact not systematically related. There are lots of problems with this mode of empirical investigation, and even more with the thoughtless dominance that NHT has assumed in social science research. But we’ll talk about that another time!
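The NHT logic can be sketched with a simple permutation test (all numbers invented for illustration): under the null hypothesis that tail-pulling and keyboard-walking are unrelated, the group labels are interchangeable, so we count how often randomly relabeled data produce a difference in means at least as large as the one observed.

```python
# Minimal sketch of null-hypothesis testing via a permutation test
# (hypothetical data, not from any real study).
import random

random.seed(2)

treated = [9, 7, 8, 10, 6, 9, 8]   # keyboard walks after tail-pulling days
control = [2, 4, 3, 5, 3, 4, 2]    # keyboard walks after ordinary days

observed = abs(sum(treated) / len(treated) - sum(control) / len(control))
pooled = treated + control
n = len(treated)

# Under the null, any relabeling of the 14 days is equally likely, so the
# fraction of shuffles at least as extreme as the observed difference
# approximates the p-value.
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / (len(pooled) - n))
    if diff >= observed:
        extreme += 1

p_value = extreme / trials
# If p_value < 0.05, one "rejects the null hypothesis" of no relationship.
print(p_value)
```

With these (invented) fully separated groups, almost no random relabeling matches the observed difference, so the p-value is far below 0.05 and the null is rejected; note that this says nothing about validity or effect size, which is precisely the limitation sections 4 and 6 describe.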