Law & Cognition
Prof. Dan Kahan
Yale Law School
Fall 2016
Some Observations on Significance
What is significance? And how significant is it?
1. p-values. Social science (like science generally) involves presenting empirical evidence from which
we can draw inferences about how the world works. The evidence is typically in the form of some sort of
statistical correlation between two or more variables or observations. “Statistical significance” is a
threshold value used to identify the acceptable risk that an observed correlation could have occurred by
chance even if in fact the variables are not genuinely correlated. The conventional threshold is 0.05,
which means the probability is less than 0.05 (hence “p < 0.05”) that a correlation as large as (or larger than) the one in question would have been observed if in fact the true correlation is zero. If a researcher
reports a result that is “significant at p < 0.10,” then there is no more than a 0.10 or 10% probability one
would have observed a correlation that large even if the “real” correlation is zero; if she reports “p =
0.07,” then there is a 7% chance a correlation of that magnitude or larger would have occurred by chance
even assuming there really isn’t any.
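To see what the p-value threshold is promising, here is a minimal Python sketch (assuming numpy and scipy are available; the sample size of 50 and the 1,000 replications are arbitrary illustrative choices). When two variables are truly uncorrelated, a test at p < 0.05 should flag a “significant” correlation in only about 5% of samples:

    # Simulate many datasets in which the true correlation is zero and count
    # how often the observed correlation is nonetheless "significant" at p < 0.05.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    false_alarms = 0
    for _ in range(1000):
        x = rng.normal(size=50)   # two variables that are, by construction,
        y = rng.normal(size=50)   # completely unrelated to one another
        r, p = pearsonr(x, y)
        if p < 0.05:
            false_alarms += 1
    print(false_alarms / 1000)    # roughly 0.05, as the threshold promises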
2. The basic idea. “Statistical significance” treats the behavior of a random process as the baseline for
assessing the risk of error in an empirical test. See generally Robert P. Abelson, Statistics as Principled
Argument 18-38 (1995). A random process is one that involves the interplay of various dynamics the
existence and impact of which cannot be measured with certainty (e.g., all the forces that combine to
determine how many times my cat will walk across the keyboard of my computer on a given day). As
you likely know, the outcomes associated with such a process normally (not invariably!) form a bell-shaped pattern. If one knows the mean or average value (the one at the very top or center of the bell) and
the “standard deviation” (a number that characterizes how gradually or steeply the bell curve slopes
downward on either side of the mean), one can calculate the fraction of the outcomes that can be expected
to have values that exceed or fall short of the mean by any specified amount. The bell curve distribution
associated with a random process gives us a way to test our belief that we understand a particular process
(or do, at least to some extent). We are likely to conclude that we understand something about it if we can
successfully predict how variance in one of its dynamics will affect its outcome values (e.g., “hey, look—
the more often I pull my cat’s tail one day, the greater the number of times she will walk across the
keyboard on the next!”). But to deal with the skeptic (within or without), we can test our claim of
knowledge by seeing how unlikely it would be that we would have observed such results if in fact the
process were simply random. Looking at the bell curve that has the right parameters (mean and standard
deviation) for the process under examination, we can calculate the fraction of the outcomes we would
expect to be as far apart as the ones we observed if they occurred by the chance operations of a random
process. If we find that fewer than 5% would be, then we can say, “Ha! If the dynamic I am pointing to does not have the effect that I say it does (if this process is operating randomly with respect to it), then the probability of seeing a correlation at least this strong would be less than 5%!”
Fig. 1 Random process, normal distribution, and p-value. Imagine the mean difference in how many times my
cat walks across the keyboard on any two consecutive days is 0 with a standard deviation of 3. I perform an
experiment and note that the difference between the number of times (2) she walks on my keyboard the day after I
pulled her tail 4 times and the number (9) the day after I pulled her tail 8 times is 7, which is 2.33 SDs. The
likelihood that the difference between the numbers of times she walked on my keyboard on consecutive days would be that large by chance is only 2% (p = 0.02, using a two-tailed test, so to speak)! Image: http://syque.com/quality_tools/toolbook/Variation/measuring_spread.htm.
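The arithmetic in Fig. 1 can be reproduced with a short Python sketch (assuming scipy is available; the mean of 0, standard deviation of 3, and observed difference of 7 are the figure’s own hypothetical numbers):

    # Reproduce the Fig. 1 calculation: a difference of 7 is 2.33 standard
    # deviations from a mean of 0, and the two-tailed p-value is about 0.02.
    from scipy.stats import norm

    mean_diff = 0        # assumed mean difference between consecutive days
    sd_diff = 3          # assumed standard deviation of that difference
    observed = 9 - 2     # the observed difference of 7

    z = (observed - mean_diff) / sd_diff        # about 2.33 standard deviations
    p_two_tailed = 2 * norm.sf(z)               # area in both tails beyond |z|
    print(round(z, 2), round(p_two_tailed, 2))  # 2.33 0.02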
3. Type I vs. type II errors. By design, the statistical significance testing strategy is conservative. If we
select a threshold value of p ≤ 0.05, then we are declaring ourselves unwilling to accept a risk any greater
than 1 chance in 20 (or, if p ≤ 0.10, 1 chance in 10) that the correlation would have been
observed by chance. At the same time, then, we are tolerating the risk that we will fail to recognize, as unlikely to be a result of chance, correlations that are still very unlikely (15%, 20%, or 25%) to have occurred by chance. The former type of risk, that of “recognizing” a correlation to exist when in fact it is due to chance, is known as Type I error; the latter, that of “failing to recognize” a correlation that is genuinely not due to chance, is known as Type II error.
The p < (or ≤) 0.05 standard reflects a relative aversion to Type I errors. Scientific craft norms prize modesty; it is as if science were saying, “better 94 (or 89) true insights go unrecognized than that 1 false
insight be declared.”
This sort of reticence might make sense for science (or might not; consider Jacob Cohen, The Earth Is
Round (P < .05), 49 American Psychologist 997 (1994)). But this sort of asymmetry toward Type I and
Type II errors would be idiotic in many domains in which the costs of both are comparably high.
Accordingly, in many practical domains of life, from public health to finance, people act on the basis of
correlations that have p-values > 0.05. For this reason, the law, quite sensibly, is willing to consider
correlational evidence that has a p-value > 0.05 where the failure to do so would impose a substantial risk
of Type II error. See generally Matrixx Initiatives, Inc. v. Siracusano, 131 S. Ct. 1309, 1319-21 (2011).
Within science, moreover, the failure to appreciate the relationship between p-values and relative aversion
to Type I and Type II errors sometimes leads researchers to make an embarrassing blunder. Because a
finding of nonsignificance at p < 0.05 (or even p < 0.10) is consistent with a very high likelihood that a
correlation exists, it is manifestly incorrect to treat the failure of a correlation to reach the threshold of
significance as evidence that some dynamic is not correlated with another. Yet some researchers,
particularly when they set out to test a hypothesis that no such correlation exists, make the mistake of
proclaiming exactly that. In effect, these researchers are treating science’s aversion to Type I errors as a
license to make Type II errors. If one wants to test the hypothesis that no correlation exists between some
process and an outcome, then one needs a statistical test for the absence of significance that reflects the
same aversion to mistakenly claiming to know something one doesn’t. The procedure for such testing,
then, involves constructing a sufficiently discerning design to rule out the possibility that one failed to
observe a significant correlation that actually exists. The most important consideration is “statistical
power”—sample size, essentially—since the likelihood of observing a significant result at p < 0.05 (or
any other threshold) diminishes as the sample size goes down. See generally David L Streiner, Unicorns
Do Exist: A Tutorial on “Proving” the Null Hypothesis, 48 Canadian Journal of Psychiatry 756 (2003).
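The role of statistical power can be illustrated with a rough Python simulation (assuming numpy and scipy; the 0.3-standard-deviation effect, the sample sizes, and the 2,000 replications are arbitrary choices). When a modest effect really exists, small samples frequently fail to reach p < 0.05, which is exactly the Type II risk described above:

    # With a real but modest group difference, the share of experiments that
    # reach p < 0.05 (the test's power) grows as the sample size grows.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    true_effect = 0.3                        # group difference, in standard deviations

    for n in (20, 80, 320):
        hits = 0
        for _ in range(2000):
            control = rng.normal(0.0, 1.0, n)
            treated = rng.normal(true_effect, 1.0, n)
            _, p = ttest_ind(control, treated)
            if p < 0.05:
                hits += 1
        print(n, hits / 2000)                # the fraction significant rises sharply with n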
4. Effect size vs. significance. Another hazard associated with the convention of statistical significance is
the conflation of p-value and effect size. Essentially, lots of correlations that are significant at p < 0.05
(particularly where the sample is large) can be too small to matter for any practical purpose (including
insight into how the world works). The right thing to do, then, is to report a suitable measure of the size
of the correlation. Nevertheless, fixation on statistical significance (the misunderstanding, really, of what
it means) has traditionally led researchers to overemphasize significance measures, sometimes reporting
only those (particularly in social psychology); the same fetishization of significance lies behind the
(fortunately declining) practice of using multiple asterisks to designate progressively lower p-values—
implying that “lower” p-value findings are in and of themselves of greater consequence (they aren’t). See
generally Cohen, supra.
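The gap between statistical and practical significance is easy to demonstrate with a Python sketch (assuming numpy and scipy; the sample size of 100,000 and the tiny true relationship are arbitrary illustrative choices):

    # A trivially small correlation becomes "statistically significant" once the
    # sample is large enough, even though it is of no practical consequence.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(2)
    n = 100_000
    x = rng.normal(size=n)
    y = 0.02 * x + rng.normal(size=n)   # outcome barely related to x
    r, p = pearsonr(x, y)
    print(round(r, 3), p)               # r is tiny (around 0.02), yet p is far below 0.05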
5. Graphic reporting and confidence intervals. The intelligent use of graphic reporting of results can
conserve the value of significance testing while avoiding some of the hazards of p-value fetishism and
related bad practices. Good graphic reporting consists of essentially three things: first, the selection of
intuitively comprehensible and relevant units for the predictor and outcome variables; second, a display
format that draws attention to important effects and omits needless and likely distracting details and frills;
and third, an informative measure of precision, such as 0.95 confidence intervals. The first two features make it possible for readers to understand, in a way that textual reporting of results and statistical output tables rarely do, the practical significance of the study findings. The third feature not only conveys the information necessary to assess statistical significance at p < 0.05 (the result is significant at that level unless the interval spans 0); it also allows the reader to see the probabilistic range
of values associated with the relevant correlation estimate. She can then decide for herself, given her own
interests and tolerance for error, what to make of the observed correlation. Indeed, when reported with
confidence intervals, results that are not statistically significant can be represented in a manner that
permits the reader to get a sense of the likelihood of Type II error (the closer 0 is to the end of the
interval, the more likely the “true” correlation is not 0), and to make whatever she will of results that are
not different from 0 at p < 0.05 but that are significantly different from many other potential non-0 values
that may be of interest. See generally Andrew Gelman, Cristian Pasarica & Raul Dodhia, Let's Practice
What We Preach: Turning Tables into Graphs, 56 Am. Stat. 121 (2002); Lee Epstein, Andrew Martin &
Christina Boyd, On the Effective Communication of the Results of Empirical Studies, Part II, 60 Vand. L.
Rev. 79 (2007); Lee Epstein, Andrew Martin & Mathew Schneider, On the Effective Communication of
the Results of Empirical Studies, Part I, 59 Vand. L. Rev. 1811-71 (2007).
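The confidence-interval check described above can be sketched in a few lines of Python (assuming numpy and scipy; the simulated groups, the assumed difference of 0.4, and the pooled degrees of freedom are illustrative simplifications):

    # Compute a 0.95 confidence interval for a difference in means and check
    # whether it spans 0, which is the graphical analogue of testing at p < 0.05.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    control = rng.normal(0.0, 1.0, 100)
    treated = rng.normal(0.4, 1.0, 100)      # assumed true difference of 0.4

    diff = treated.mean() - control.mean()
    se = np.sqrt(treated.var(ddof=1) / 100 + control.var(ddof=1) / 100)
    t_crit = stats.t.ppf(0.975, df=198)      # two-tailed 0.95 critical value (pooled df)
    low, high = diff - t_crit * se, diff + t_crit * se
    print(f"difference = {diff:.2f}, 0.95 CI [{low:.2f}, {high:.2f}]")
    print("excludes 0 (significant at p < 0.05)" if low > 0 or high < 0
          else "spans 0 (not significant at p < 0.05)")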
Fig. 2 Graphic presentation of data. The table on the left becomes the graphic on the right via use of graphic
reporting methods discussed in Gelman, Epstein et al. and others. From Dan M. Kahan, Culture, Cognition, and
Consent: Who Perceives What, and Why, in 'Acquaintance Rape' Cases, 158 U. Pa. L. Rev. 729 (2010).
6. Causation and validity vs. significance.
Finally, statistical significance has nothing to do with study validity! The p-value quantifies measurement
error associated with the estimates derived from the statistical model used to analyze the data. Defective study designs result in model error; model error might or might not be quantifiable, but the defects of an erroneous model are not compensated for by the statistical significance or precision of its estimates.
Accordingly, when trying to satisfy yourself that a design convincingly supports the inference that the researcher is drawing (internal validity) and is uncovering a dynamic that one can expect to affect the real-world process that the study is modeling (external validity), don’t give even a gram of credit to a low p-value.
The causal interpretation of correlations in an empirical study is one important example of this point. The
basis for inferring causation between correlated variables always is independent of the data (always,
whether the study is experimental or observational!); it’s only after one is satisfied that the design
supports a causal inference that one turns to statistical significance testing to assure oneself that there is a
sufficiently low risk that the observed effect could have occurred merely by chance.
7. Null hypothesis testing.
Statistical significance is integral to the “null hypothesis testing” paradigm in the social (and much of the
physical) sciences. The paradigm involves conducting a test to see whether the relationship between
some influence of interest and some outcome of interest differs from “zero” by an amount that is
“statistically significant.” If so, one can “reject the null hypothesis” that the influence and the outcome are not in fact systematically related.
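The routine just described can be sketched in Python (assuming numpy and scipy; the simulated “influence,” the true slope of 0.25, and the sample of 200 are illustrative assumptions): estimate the relationship, compare the p-value to the threshold, and either reject the null or decline to do so.

    # Estimate the influence-outcome relationship, then apply the null hypothesis
    # testing rule: reject the null of "no relationship" only if p < 0.05.
    import numpy as np
    from scipy.stats import linregress

    rng = np.random.default_rng(4)
    influence = rng.normal(size=200)
    outcome = 0.25 * influence + rng.normal(size=200)   # assumed true slope of 0.25

    fit = linregress(influence, outcome)
    if fit.pvalue < 0.05:
        print(f"reject the null: slope = {fit.slope:.2f}, p = {fit.pvalue:.4f}")
    else:
        print(f"cannot reject the null: slope = {fit.slope:.2f}, p = {fit.pvalue:.4f}")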
There are lots of problems with this mode of empirical investigation and even more with the thoughtless
dominance that NHT has assumed in social science research.
But we’ll talk about that another time!