Error Statistics and Learning from Error: Making a Virtue of Necessity

Deborah G. Mayo†
Virginia Tech

†Department of Philosophy, Virginia Tech, Blacksburg, VA 24061; [email protected]. I thank E. L. Lehmann for several important error statistical insights.

Source: Philosophy of Science, Vol. 64, Supplement. Proceedings of the 1996 Biennial Meetings of the Philosophy of Science Association. Part II: Symposia Papers (Dec., 1997), pp. S195-S212. Published by The University of Chicago Press on behalf of the Philosophy of Science Association. Stable URL: http://www.jstor.org/stable/188403

The error statistical account of testing uses statistical considerations, not to provide a measure of probability of hypotheses, but to model patterns of irregularity that are useful for controlling, distinguishing, and learning from errors. The aim of this paper is (1) to explain the main points of contrast between the error statistical and the subjective Bayesian approach and (2) to elucidate the key errors that underlie the central objection raised by Colin Howson at our PSA 96 Symposium.

1. Introduction.

    The two main attitudes held to-day towards the theory of probability both result from an attempt to define the probability number scale so that it may readily be put in gear with common processes of rational thought. For one school, the degree of confidence in a proposition ... provides the basic notion to which the numerical scale should be adjusted. The other school notes how in ordinary life a knowledge of the relative frequency of occurrence of a particular class of events in a series of repetitions has again and again an influence on conduct; it therefore suggests that it is through its link with relative frequency that a numerical probability measure has the most direct meaning for the human mind. (Pearson 1950, 228)

The two main attitudes of which Pearson here speaks correspond to two distinct views of the task of a theory of statistics: the first we may call the evidential-relation (E-R) view, and the second, the error statistical view.
This difference corresponds to fundamental differences in the idea of how probabilistic considerations enter in scientific inference and thereby in the goal of philosophy of statistics. Evidential-relationship approaches grew quite naturally from what was traditionally thought to be required by a "logic" of confirmation or induction. Most commonly, such approaches seek quantitative measures of the bearing of evidence on hypotheses. What I call error statistical approaches, in contrast, focus their attention on finding general methods or procedures of testing with certain good properties.

In the E-R view, the task of a theory of statistics is to say, for given evidence and hypotheses, how well evidence confirms or supports hypotheses (whether absolutely or comparatively). In this view, the role of statistics is that of furnishing a set of formal rules or a "logic" relating given evidence to hypotheses. The dominant example of such an approach on the contemporary philosophical scene is based on one or another Bayesian measure of support or confirmation. With the Bayesian approach, what we have learned about a hypothesis H from evidence e is measured by the conditional probability of H given e using Bayes's theorem. The cornerstone of the Bayesian approach is the use of prior probability assignments to hypotheses, generally interpreted as an agent's subjective degrees of belief.

In contrast, the methods and models of classical and Neyman-Pearson statistics (e.g., statistical significance tests, confidence interval methods) are primary examples of error probability approaches. These eschew the use of prior probabilities where these are not based on objective frequencies. Probability enters instead as a way of characterizing the experimental or testing process itself: to express how reliably it discriminates between alternative hypotheses and how well it facilitates learning from error. These probabilistic properties of experimental procedures are error probabilities.

Several familiar uses of statistics that we read about daily are based on error statistical methods and models: in polls inferring that the proportion likely to vote for a given candidate equals p% plus or minus some percentage points, in reports of statistically significant differences between treated and control groups, in data analyses in physics, astronomy and elsewhere in order to distinguish "signal" from "noise." Despite the prevalence of error statistical methods in scientific practice, the Bayesian Way has been regarded as the model of choice among philosophers looking to statistical methodology. By and large, philosophers of science who consult philosophers of statistics receive the impression that all but Bayesian statistics is discredited.

Jerzy Neyman (co-developer of Neyman and Pearson methods) expressed surprise at the ardor with which subjectivists attacked Neyman-Pearson tests and confidence interval estimation methods back in the 1970s:

    I feel a degree of amusement when reading an exchange between an authority in 'subjectivistic statistics' and a practicing statistician, more or less to this effect:

    The Authority: 'You must not use confidence intervals; they are discredited!'
    Practicing Statistician: 'I use confidence intervals because they correspond exactly to certain needs of applied work.' (Neyman 1977, 97)

Neyman's remarks hold true today. The subjective Bayesian is still regarded, in many philosophy of science circles, as "the authority" in statistical inference, and scientists from increasingly diverse fields still regard NP methods as corresponding exactly to their needs.

It may seem surprising, given the current climate in philosophy of science, to find philosophers (still) declaring invalid a standard set of experimental methods, rather than trying to understand or explain why scientists evidently (still) find them so useful. I think it is surprising. In any event, I believe it is time to remedy the situation. A genuinely adequate philosophy of experimental inference will only emerge if it is not at odds with statistical practice in science. My position is that the error statistical approach is at the heart of the widespread applications of statistical ideas in scientific inquiry, and that it offers a fruitful basis for a philosophy of experimental inference. Although my account builds upon several methods and models from classical and Neyman-Pearson (NP) statistics, it does so in ways that depart sufficiently from what is typically associated with these approaches as to warrant some new label. Nevertheless, I retain the chief feature of Neyman-Pearson methods--the centrality of error probabilities--hence the label "error-statistics."

Colin Howson (this issue) argues that error probabilistic methods are in error. After all, scrutinizing an experimental result by considering the error probabilities of the procedures that produced it is anathema to Bayesians, considering as it does outcomes other than the one actually observed. But this just brings out a key point at which the error statistician is at loggerheads with the Bayesian, and is not an argument that the Bayesian Way offers a better account of experimental learning in science. Granting that a very strict (behavioristic) construal of the NP approach--where all that matters is low error rates in the long run--can seem to license counterintuitive inferences, I have sought to erect an error statistical approach that avoids them. In addition to solving a cluster of problems and misinterpretations, this approach attempts to set out a conception of experimental inquiry in which the error statistical tools provide us with just the tools we need for learning in the face of error.

2. A Fundamental Difference in Aims. The error statistical approach seeks tools that can cope with the necessary limitations, and the inevitable slings and arrows, of actual experimental inquiry. The subjective Bayesian upholds a different standard of virtue. Take the very definition of inductive logic, stated by Howson and Urbach:

    Inductive logic--which is how we regard the subjective Bayesian theory--is the theory of inference from some exogenously given data and prior distribution of belief to a posterior distribution. (1989, 290)

Inductive inference from evidence is a matter of updating one's degree of belief to yield a posterior degree of belief (via Bayes's theorem). Where does one get the prior probabilities and the likelihoods required to apply Bayes's theorem? Howson and Urbach (1989, 273) reply that "we are under no obligation to legislate concerning the methods people adopt for assigning prior probabilities. These are supposed merely to characterise their beliefs subject to the sole constraint of consistency with the probability calculus."
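In schematic form, writing e for the accepted evidence and not-H for the catchall of alternatives to H, the updating this definition describes is just Bayes's theorem:

\[
P(H \mid e) \;=\; \frac{P(e \mid H)\,P(H)}{P(e \mid H)\,P(H) + P(e \mid \lnot H)\,P(\lnot H)},
\]

so the inputs taken as exogenously given are precisely the prior P(H) and the likelihoods P(e | H) and P(e | not-H).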
What about the grounds for accepting the statements of evidence? Just as with arriving at prior probabilities, the evidence is something you need to start out with:

    The Bayesian theory we are proposing is a theory of inference from data; we say nothing about whether it is correct to accept the data.... The Bayesian theory of support is a theory of how the acceptance as true of some evidential statement affects your belief in some hypothesis. How you came to accept the truth of the evidence, and whether you are correct in accepting it as true, are matters which, from the point of view of the theory, are simply irrelevant. (Howson and Urbach 1989, 272, emphasis added)

This conception of inductive inference no doubt has its virtues. It has the simplicity and cleanness of a deductive logic, virtues that, admittedly, are absent from the error statistician's view of things. Error statisticians willingly forgo grand and unified schemes for relating their beliefs, preferring a hodgepodge of methods that let them set sail with the kind of information they actually tend to have. Error statisticians appeal to statistical tools as protection from the many ways they know they can be misled by data as well as by their own beliefs and desires. The value of statistical tools for them is to develop strategies that capitalize on their knowledge of mistakes: strategies for collecting and modeling data, for efficiently checking an assortment of errors, and for communicating results in a form that promotes their scrutiny and their extension by others. Once it is recognized that there is a big difference between our aims, we can agree to disagree with subjective Bayesians as to what virtues our account of scientific inference should possess.

Let me highlight some aspects of the error statistical approach that I regard as virtues. To begin with, rather than starting its work with evidence or data (as Bayesian and other evidential-relationship accounts do), our error statistical approach includes the task of arriving at data--a task that it recognizes as calling for its own inferences. A second point of contrast is that we do not seek to equate the scientific inference with a direct application of some statistical inference scheme. For example, philosophers often suppose that to apply NP statistics in philosophy of science, scientific inference must be viewed as a matter of accepting or rejecting hypotheses according to whether outcomes fall in acceptance or rejection regions of NP tests. Finding examples where this kind of automatic "accept-reject rule" distorts scientific inference, it is concluded that NP statistics is inappropriate for building an account of inference in science. This conclusion is unwarranted because it overlooks the ways in which these methods are actually used in science. What I am calling the error statistical account, I believe, reflects these actual uses, and shows what is behind the claim of Neyman's scientist, that these methods correspond precisely to certain needs of applied work.

2.1. A Framework of Inquiry. To get at the use of these methods in science, I propose that experimental inference be understood within a framework of inquiry. You cannot just throw some "evidence" at the error statistician and expect an informative answer to the question of what hypothesis it warrants. But neither does the error statistician need to begin with neat and tidy data to get started. A framework of inquiry incorporates methods of experimental design, data generation, modeling, and testing.
For each experimental inquiry we can delineate three types of models: models of primary scientific hypotheses (or questions), models of data, and models of experiment.[1] A substantive scientific inquiry is to be broken down into one or more local or "topical" hypotheses that make up the primary questions or problems of separate inquiries.[2] Typically, primary problems take the form of estimating quantities of a model or theory, or of testing hypothesized values of these quantities. These local problems often correspond to questions framed in terms of one or more standard or canonical errors: about parameter values, about causes, about accidental effects, and about assumptions involved in testing other errors. The experimental models serve as the key linkage models connecting the primary model to the data, links that require, not the raw data itself, but appropriately modeled data.

Note 1. This is akin to the delineation of a hierarchy of models proposed by Patrick Suppes (1969).
Note 2. The term "topical hypotheses" is coined by Hacking (1992). Like topical creams, they are to be contrasted with deeply penetrating theories.

[Figure 1: Models of Experimental Inquiry -- Primary Model, Experimental Model, Data Model.]

2.2. A Piecemeal Account of Testing. In the error statistical account, formal statistical methods relate to experimental hypotheses, hypotheses framed in the experimental model of a given inquiry. Relating inferences about experimental hypotheses to primary scientific claims is, except in special cases, a distinct step. Yet a third step is called for to link raw data to data models--the real material of experimental inference. So, for example, an inference from data to a primary hypothesis may fail to be warranted either because the experimental inference that is licensed fails to answer the primary question, or it may fail because the assumptions of the experimental data are not met sufficiently by the actual data. In short, there is a sequence of errors that this account directs you to check and utilize along the way, in a kind of piecemeal approach.

Now Howson charges (in unpublished comments for our PSA symposium) that I am "promoting a methodology of piecemeal testing ... in an attempt to save the game ... this is merely making a virtue out of necessity" (emphasis mine). I accept this charge, for it is a game well worth saving. Where data are inexact, noisy, and incomplete, where extraneous factors are uncontrolled or physically uncontrollable, we simply cannot aspire to the virtues the Bayesian theory demands. We do not have an exhaustive list of hypotheses and probabilities on them, we cannot predict the future course of science, which, to paraphrase Wesley Salmon (1991, 329), would seem to be required to assign the likelihood to the Bayesian catchall factor (i.e., P(e | not-H)). Nor can we, in science, go along with something because individuals strongly believe it, nor can we wait for swamping out of priors to adjudicate disagreements right now about the evidence in front of us.

Nor need we. As a matter of actual fact, we are rather good at finding things out with a lot less information--especially where the threats would be the greatest were we unable to do so. If I cannot actually hold one factor constant to see the effects of others, if I cannot literally manipulate or change, I may still be able to find out what it would be like if I could do those things.
If I cannot test everything at once, I may be clever enough to test piecemeal. If an event is very rare, I may be able to amplify its effects sufficiently to detect its presence. If I find myself threatened with error, then I need to become a shrewd inquisitor of error. If I cannot face up to these necessary features of experiment by the kind of "white glove" logical analysis of evidence and hypotheses envisioned by E-R approaches, then I am going to have to get down to the nitty-gritty details of the data collection, modeling, and analysis. Statistics, as I see it, is the conglomeration of systematic tools for carrying out these aims--for making virtues out of necessities. What is being systematized by these statistical tools is a reflection of familiar, day-to-day learning from errors.

3. Learning From Errors. How do we learn from error? Let me outline in a very sketchy way the kinds of answers that may be found.[3]

Note 3. A far more complete discussion of these and other aspects of the error statistical approach may be found in Mayo 1996.

1. After-trial checking (correcting myself). By "after-trial" I mean after the data or evidence to be used in some inference is at hand. A tentative conclusion may be considered, and we want to check if it is correct. Having made mistakes in reaching a type of inference in the past, we often learn techniques that can be applied the next time to check if we are committing the same error. In addition to techniques for catching ourselves in error there are techniques for correcting errors. Especially important error-correcting techniques are those designed to go from less accurate to more accurate results, such as taking several measurements and averaging them.

2. Before-trial planning. Knowledge of past mistakes gives rise to efforts to avoid the errors ahead of time, before running an experiment or obtaining data. For example, teachers who suspect that knowing the author of a paper may influence their grading may go out of their way to ensure anonymity before starting to grade. This is an informal analogue to techniques of astute experimental design, such as the use of control groups, double-blinding, and large sample size.

3. An error repertoire. The history of mistakes made in a type of inquiry gives rise to a list of mistakes to either work to avoid (before-trial planning) or check if committed (after-trial checking), for example, a list of the familiar mistakes when inferring a cause of a correlation: Is the correlation spurious? Is it due to an extraneous factor? Am I confusing cause and effect? More homely examples are familiar from past efforts at fixing a car or a computer, or at cooking.

4. The effects of mistakes. Through the study of mistakes we learn about the kind and extent of the effect attributable to different errors. This information is then utilized in subsequent inquiries or criticisms. The key is to be able to discriminate effects. Perhaps putting in too much water causes the rice to be softer, but not saltier. Knowledge of the effects of mistakes is often exploited to "subtract out" their influences after the trial. If the effects of different factors can be sufficiently distinguished or subtracted out later, then the inferences are not threatened by a failure to control for them. Thus knowing the effects of mistakes is often the key to justifying inferences.
5. Simulating errors. An important way to glean information about the effects of mistakes is by utilizing techniques (real or artificial) to display what it would be like if a given error were committed or a given factor is operative. Such simulations can be used both to rule out and pinpoint factors responsible. Observing an antibiotic capsule in a glass of water over several days revealed, by the condition of the coating, how an ulceration likely occurred when its coating stuck in my throat. In the same vein, we find scientists appealing to familiar chance mechanisms (e.g., coin tossing) to simulate what would be expected if a result were due to experimental artifacts. Statistical models are valuable because they perform this simulation function formally, by way of (probabilistic) distributions.

6. Amplifying and listening to error patterns. One way of learning from error is through techniques for magnifying their effects. I can detect a tiny systematic error in my odometer by driving far enough to a place of known distance. Likewise, a pattern may be gleaned from "noisy" data by introducing a known standard and studying the deviations from that standard. By studying the pattern of discrepancy and by magnifying the effects of distortions, the nature of residuals, and so forth, such deviations can be made to speak volumes.

7. Robustness. From all of this information, we also learn when violating certain recommendations or background assumptions does not pose any problem, does not vitiate specific inferences. Such outcomes or inferences are said to be robust against such mistakes. An important strategy for checking robustness is to deliberately vary the assumptions and see if the result or argument still holds. This strategy often allows for the argument that the inference is sound, despite violations, that inaccuracies in underlying factors cannot be responsible for a result. For were they responsible, we would not have been able to consistently obtain the same results despite variations.

8. Severely probing error. The above seven points form the basis of learning to detect errors. We can put together so potent an arsenal for unearthing a given error that when we fail to find it we have excellent grounds for concluding that the error is absent. The same kind of reasoning is at the heart of experimental testing. I call it arguing from error. After learning enough about certain types of mistakes, we may construct a testing procedure with an overwhelmingly good chance of revealing the presence of a specific error, if it exists--but not otherwise. Such a testing procedure may be called a severe (or reliable) test, or a severe error probe. If a hypothesized error is not detected by a test that has an overwhelmingly high chance of detecting it, if instead the test yields a result that accords well with no error, then there are grounds for the claim that the error is absent. We can infer something positive, that the particular error is absent (or is no greater than a certain amount). The informal pattern of such an argument from error is guided by the following thesis:

    It is learned that an error is absent when (and only to the extent that) a procedure of inquiry (which may include several tests taken together) that has a very high probability of detecting an error if (and only if) it exists, nevertheless detects no error.
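Put schematically, with E standing for the error of concern and T for the procedure of inquiry (labels introduced only as shorthand for the thesis above), the claim is that the error's absence is learned when

\[
P(\,T \text{ detects } E \;\mid\; E \text{ is present}\,) \;\text{is very high},
\quad\text{and yet}\quad
T \text{ detects no } E.
\]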
The procedure's failing to detect the error means it produces a result (or set of results) that accords with the absence of the error. Alternatively, the argument from error can be described in terms of a test of a hypothesis H, that a given error is absent.[4] The evidence indicates the correctness of H when H passes a severe test--one with a very high probability of failing H, if H is false. An analogous argument is used to infer the presence of an error.

Note 4. In terms of a hypothesis H, the argument from error may be construed as follows: Evidence in accordance with hypothesis H indicates the correctness of H when (and only to the extent that) the evidence results from a procedure that with high probability would have produced a result more discordant from H, were H incorrect.

Let me make some remarks on this idea of severity, although I cannot elaborate here in the detail that is merited. First, severity always refers to a particular inference reached or hypothesis passed--a test may be severe for one hypothesis and not another. Second, the statement of "high probability" need not be obtained by reference to a statistical calculation: some of the strongest arguments from error are based on entirely qualitative assessments of severity. This links to the third point, that the formal statement of severity, while a useful summary, is a pale reflection of the actual, real life flesh and blood argument from error. The substantive argument really refers to how extraordinary the set of circumstances would have to be in order for an error to continually remain hidden from several well-understood detection techniques.

In order to construct severe probes of error, experimental design directs inquiries to be broken down into piecemeal questions. The situation is broken down so that each hypothesis is a local assertion about a particular error in a given experimental framework. It is important to see that the claim that "H is false" in assessing severity is not the Bayesian catchall hypothesis. Within an experimental testing model, the falsity of a primary hypothesis H takes on a very specific meaning. How to construe it depends on the particular error being ruled out in affirming H (e.g., hypothesis H asserts a given error is absent, "H is false" asserts that it is present). If H states a parameter is greater than some value c, "H is false" states it is less than c; if H states that factor x is responsible for at least p% of an effect, its denial states it is responsible for less than p%; if H states an effect is caused by factor f--say an artifact of the instrument--"H is false" may say an artifact could not be responsible; if H states the effect is systematic, of the sort brought about more often than by chance, then "H is false" states it is due to chance. This approach lets me test one piece at a time, and there is no subliminal assignment of prior probabilities to the hypotheses that are not being tested by a given test. If I test, seeking to explain your sore throat, whether it is due to strep or not, I am not assigning zero probability to all the other hypotheses that could explain your sore throat. I am just running a test that discriminates strep from no-strep. It is true that to keep alternatives out of a Bayesian appraisal they are effectively assigned a zero probability. That is because the Bayesian appraisal considers a single probability pie, as it were. Appraising any single hypothesis is necessarily a function of all the alternatives in the so-called catchall hypothesis.
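Spelled out: if mutually exclusive alternatives H_1, ..., H_n are taken to exhaust not-H (including, in principle, hypotheses not yet conceived), then the likelihood attached to the catchall is an average over all of them,

\[
P(e \mid \lnot H) \;=\; \sum_{i=1}^{n} P(e \mid H_i)\,\frac{P(H_i)}{P(\lnot H)},
\]

so a Bayesian appraisal of H cannot proceed without priors and likelihoods for every member of the catchall.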
Lacking this information and desiring to begin to learn something, the scientist, making a virtue of necessity, calls for tools that do not require this information.

4. Do Error Statistical Tests License Unsound Inferences? Howson criticizes a capsulized version of my idea of arguing from error. He refers to it as (*): e is a good indication of H to the extent that H has passed a severe test with e. His argument that (*) is unsound rests on describing a situation in which there is a hypothesis H that is indicated according to (*) and yet, intuitively, e does not indicate H. Both the assumed situation and the intuitions, however, are Bayesian ones. He supposes, in particular, that (i) a certain disease has a very small incidence, say p%, in a given population; and (ii) any randomly chosen test subject from this population has a (prior) probability of having the disease equal to p. Howson's criticism is that evidence that an error statistician would allegedly take as indicating hypothesis H, the disease is present, yields a very low posterior probability to H--thanks to its low prior probability: "the error probability conditions for a severe test of that particular hypothesis H are clearly satisfied; equally clearly, passing the test provides no indication of its correctness" (Howson, this issue, emphasis added).

To begin, it is important to distinguish between questioning the soundness of my rule (*) and questioning the soundness of the formal theory of NP tests, although Howson runs the two together. The error probabilistic calculations of NP tests and confidence intervals are as deductively sound as the Bayesian's calculations. (*), by contrast, is an ampliative and not a deductive rule, and its scrutiny would have to be in contrast to an analogous ampliative Bayesian rule, if only Howson will give us one. Implicitly, he does. Underlying Howson's charge that H's "passing the test provides no indication of its correctness" is something like the following rule:

    Howson's (implicit) rule: There is a good indication or strong evidence for the correctness of hypothesis H just to the extent that it has a high posterior probability, which may either be a degree of belief or a relative frequency.

We will see that Howson has provided no counterexample to (*)--when it is correctly applied--but rather an illustration of the ways our different rules may conflict.

4.1. The Case With a Frequentist Prior. Before turning to the example, recall that the error statistical account is based upon frequentist methods such as NP tests, and these methods developed precisely for situations in which no frequentist prior is available or even meaningful, as with the majority of scientific hypotheses of interest. Instead, the hypotheses are regarded as unknown constants and only error probabilities given one or another hypothesis are considered. The virtue of these methods is their ability to control and learn from these error probabilities without regard to the frequencies with which the hypotheses are true--frequencies that could only make sense where the hypotheses may be regarded as random variables.[5]

Note 5. Although hypotheses regarded as unknown constants are viewed as random variables by Bayesians, Howson is incorrect to maintain that this is also the case for frequentists. It is not (see, for example, Neyman 1952, Ch. 1).

But if H is a random variable, and a frequentist prior is available,
the error statistician can use it too.[6] However, Howson erroneously supposes that the probability of randomly selecting a person with the disease from a given population gives a frequentist prior appropriate for the error statistician's question, in the kind of test he describes. It does not.[7] Where the error tester requires positive evidence of having ruled out the presence of disease before issuing a clean bill of health, Howson's analysis always concludes that there is no indication that the disease is present!

Note 6. In such cases, the usual error characteristics of tests can themselves be seen as random variables whose expectations may be assessed. See Neyman 1971, 3.
Note 7. For further discussion of the error involved in this type of example, see Mayo 1997.

The disease example. In Howson's example, the test or null hypothesis H asserts that a disease is present, and alternative not-H asserts the disease is absent. In this highly artificial example there are only two outputs: an abnormal reading or a normal reading. The test fails to reject H just when an abnormal reading e occurs.[8]

Note 8. Howson refers to the two test outputs as "positive" and "negative," where "positive" is the abnormal result, but this may be confusing because a positive test result is commonly equated with rejecting the null. Here, he wants a "positive" (i.e., abnormal) output to fail to reject H.

The case of breast cancer screening offers an example where one can find statistics strikingly close to Howson's, and it will help to clarify the intuitions upon which this puzzle rests. The null or test hypothesis is H: breast cancer is present; while "not-H" is that breast cancer is absent. However, "not-H" is a disjunction of hypotheses ranging from the presence of precancerous conditions, to a variety of benign breast diseases, all the way to the absence of breast disease, and in order for us to calculate error probabilities we need to consider specific alternatives under "not-H". We can accommodate Howson's example by focusing on the following null and alternative hypotheses, respectively:

    H: breast cancer is present;  J: breast disease is absent.

In a typical quantitatively modeled test, the null hypothesis H asserts some parameter μ is equal to some value μ0, and the test rejects H just in case some random variable X is observed to be sufficiently large. Something analogous can be done in modeling the screening for breast cancer. We can imagine that each test involves a set of diagnostic procedures (e.g., mammogram, tumor marker, ultrasound) and X records the number of these that find nothing suspicious. To approximate Howson's example, which supposes that if H is true (breast cancer is present) then an abnormal reading e is assured--i.e., that there is a 0 probability of a Type I error--we can imagine that if even one procedure shows suspicion, then the overall result is abnormal, and H is not rejected. With today's tests, we could actually get close to Howson's 0 probability of a Type I error, but more realistically, let us suppose:

    P(e | H: breast cancer) = practically 1.

We are to suppose further that if a person is disease-free, then she very rarely gets an abnormal reading e. Let

    P(e | J) = very low, say .01.

Howson's criticism may be spelled out as follows:

1. An abnormal result e is taken as failing to reject H (i.e., as "accepting H"); while rejecting J, that no breast disease exists.
2. H passes a severe test and thus H is indicated according to (*).
3. But the disease is so rare in the population[9] (from which the patient was randomly sampled) that the posterior probability of H given e is still very low (and that of J given e is still very high).
4. Therefore, "intuitively", H is not indicated but rather J is.
5. Therefore (*) is unsound.

Note 9. Howson's example takes the rate of disease to be .001, but we can consider the population of women in their 30s to arrive at this lower incidence rate of .0005. This allows us to use the more realistic probability of P(abnormal result | no disease) = .01 and still get the low posterior probability of H given e (.05) that Howson needs for his argument to go through.

There are two serious problems with this argument: First, premise 2 results from misapplying (*), and second, premise 4 assumes the intuitions of Howson's Bayesian rule. I would deny both premises.

How to Tell the Truth about Failures to Reject H. I grant that there are NP tests that might license the objectionable inference to H, but I deny that they do so severely. As I have already stressed, the error statistician does not use NP tests as automatic accept or reject rules. Rather, one infers those hypotheses that pass severe tests, and to calculate severity correctly requires being clear as to the type and extent of the particular error being probed. Although this is not reducible to a recipe, we can articulate systematic ("metastatistical") rules in order to avoid classic misinterpretations of both rejections of H and failures to reject H in standard testing situations. Howson overlooks these rules and misapplies (*).

Whenever a test result fails to reject null hypothesis H, a standard problem we must be on the lookout for is that the test did not have a good enough chance to reject H even if H is false. We take a lesson from formal tests of H: failing to reject the null hypothesis H that μ = μ0 does not license the inference that μ is exactly μ0, because the test may rarely have rejected H even under situations where H is strictly false (i.e., where there are discrepancies from μ0 in the direction of alternative J). H would only pass with low severity.[10]

Note 10. For further discussion of the "rule of acceptance" of H see Mayo 1985, 1989, 1996.

The rule for scrutinizing failures to reject H follows the pattern of arguing from error. Let me try to apply it to Howson's example (although without a specific designation of random variable X, this can only be a rough approximation):

a. An abnormal result e is a poor indication of the presence of disease more extensive than d if such an abnormal result is probable even with the presence of disease no more extensive than d.

b. An abnormal result e is a good indication of the presence of disease as extensive as d if it is very improbable that such an abnormal result would have occurred if a lesser extent of disease were present.

We see that a failure to reject H with e (an abnormal result) does not indicate H so long as there are alternatives to H that would very often produce such a result. For example, the test might often yield a positive result even if only a benign type of breast condition exists. In that case, the severity assessment shows that e is not grounds for denying that the condition is of the benign sort: asserting the presence of a disease more serious than a benign one fails to pass a severe test with e. Thus we deny premise 2 of Howson's argument, and notice, we have done so without making use of the prior probability of H. Only discriminations based on error statistical calculations were needed.

Unsoundness? No. A Conflict of Aims? Yes.
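For concreteness, here is the arithmetic behind the posterior probabilities on which Howson's premises 3 and 4 turn, using the figures from note 9 and treating the disease-free state J as the complement of H (the simplification Howson's two-hypothesis setup in effect makes):

\[
P(H \mid e) \;=\; \frac{P(e \mid H)\,P(H)}{P(e \mid H)\,P(H) + P(e \mid J)\,P(J)}
\;\approx\; \frac{(1)(.0005)}{(1)(.0005) + (.01)(.9995)} \;\approx\; .05,
\]

so that P(J | e) is approximately .95 despite the abnormal reading.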
The second part of the rule for scrutinizing failures to reject H, however, allows that the test is warranted in denying hypothesis J: the absence of disease. That is because if J were true, the test would almost surely (99% of the time) not have given the abnormal result it did. But cannot Howson's argument be leveled against our allowing "not-J", i.e., some disease, to severely pass? Let us grant that it can. The argument would go like this: Failing to reject H with e is taken as indicating the denial of J according to (*), but the disease is so rare that the posterior probability of J given e is still very high. Therefore, "intuitively," J is indicated (and denying J is not indicated), and thus (*) gets it wrong. However, this conclusion rests upon Howson's Bayesian intuition, instantiated in premise 4 above.

I readily grant that the Bayesian and the error statistician have very different intuitions here and they stem from the difference in aims sketched earlier. Let us be clear on how an error statistician understands what is being demanded by the test that Howson has specified. (The question of whether it is an appropriate test for some substantive primary question is a distinct problem from the one being confronted just now.) In specifying such a test, with H as the null, and with the 0 or virtually 0 probability of a Type I error, the error tester is saying that we are primarily concerned to avoid giving a woman a false sense of security. We do not want to reject H: the presence of disease, unless we have done an extremely good job ruling out the ways in which it may be a mistake to hold that J: a disease-free condition exists. We must do a good job ruling out the ways of erroneously inferring J before J is licensed. Finding an abnormal result e clearly does not rule out these mistakes, thus e does not warrant J. In other words, in failing to reject H with this test what we mean is: we do not have grounds (of the extent this test is demanding) to assert the absence of breast disease.

But Howson says that we should be primarily concerned to calculate the posterior probability of J given e--which is .95. The appropriate report, he thinks, is that the evidence is a good indication of the absence of the disease.[11] In other words, the Bayesian lab will output a clean bill of health even on the basis of an abnormal reading, on the grounds that the incidence of the disease is so low (in the group from which the subject was randomly taken) that the posterior probability is still high that the disease is absent. In the Bayesian screening, no women could ever have breast disease indicated by this test. If the result is normal, the Bayesian infers (with probability approaching 1) there is no breast disease; if the result is abnormal, as we just saw, he also infers no breast disease--although the posterior probability has gone down a little. The Bayesian calculations are correct--the problem is that Howson's lab has not done the job demanded by the error statistical test for breast cancer!

Summary of the Error Statistical Report. Although in practice we are not limited to the artificial "abnormal-normal" dichotomy of Howson's example, even within this limitation we can see how the use of error probabilistic considerations conveys what the results indicate. The abnormal result, one can see, is virtually certain among women
with breast cancer, while it is quite rare, probability .01, among women with no breast disease. Although the abnormal result, we said, did not do a good enough job at discriminating malignant from benign conditions, it clearly did not give positive assurance of a disease-free state. In practice, when such a vague index of suspicion of breast cancer results, one of the new highly-sensitive imaging techniques may be indicated to distinguish malignant tumors from various benign breast diseases. But the indication for this further scrutiny hinges upon the soundness of the initial indication--that this result speaks against a disease-free condition.

Note 11. Newsweek (Feb. 24, 1997, p. 56) recently reported that only 2.5% of women in their 40s who obtain abnormal mammograms are found to have breast cancer. So P(breast cancer | abnormal mammogram) = .025--quite like Howson's made-up example. Using Howson's construal of the evidence, such an abnormal mammogram gives confidence of the absence of breast cancer. So the follow-up that discovered these cancers would not have been warranted.

Of course if there is adequate information on the rates of benign breast disease, the error statistician may report it along with the indication given by the error probabilistic assessments (e.g., "It is very likely that the condition, if any, will prove to be benign"). In practice, however, there is considerable uncertainty as to whether any population from which a given woman is randomly selected provides the appropriate reference class for assessing her particular risk. (Have they considered her age, genetic background, occupation, diet, weight, age of menstruation, etc.?) Of course there are some situations with frequentist priors where the posterior probabilities are the numbers sought, but those cases involve asking about a very different type of error, and (*) gives the right indication for that error (see Note 6).

4.2. The Case With Subjective Priors. Now all this was when the Bayesian uses frequentist prior probabilities and likelihoods. The most troubling problem for the Bayesian account is its use of probabilities construed only as the subjective degrees of belief of a given agent. Although in defending their methods from criticism Bayesians are quick to turn to statistical contexts with frequentist priors (as in the example above), it should not be forgotten that, except for these special cases, the Bayesian obtains the numbers that he says we really want (i.e., posterior probabilities in hypotheses) only by countenancing probabilities understood as subjective degrees of belief. Our analysis of the previous example lets us see how a sufficiently high subjective prior for a hypothesis J countenances high Bayesian confirmation for J, even in the face of evidence that is anomalous for J.

A familiar illustration is found in the subjective Bayesian "solution" to Duhem's problem of where to lay the blame in the face of an anomaly. The situation may parallel the disease example: H entails e whereas e is very improbable given alternative hypothesis J. Prior probabilities again come to J's rescue, now in the form of a high enough prior degree of belief in J. The posterior in J remains high even in the face of anomalous result e. This "warrants" the scientist in deflecting the anomaly from J and instead discrediting rival hypotheses. To the error statistician, finding an anomaly for J hardly counts as having done work to rule out the ways in which it can be an error to suppose J is correct.
The agent's degrees of belief in J have nothing to do with it.[12] Indeed, so cavalier a treatment of anomalies can be shown to allow hypothesis J to pass with low and even minimal severity. But Bayesians are not required to satisfy these error probabilistic requirements. Howson sees this as a great virtue, heralding the return of common sense:

    Error statisticians ... have for decades given us something quite different from what we want.... Only recently has commonsense returned--commonsense reduced to a calculus ... now known as the Bayesian theory. (Howson, this issue)

But the Bayesian Way has dominated in philosophy of science for some time. As a result, important aspects of scientific practice are misunderstood or overlooked by philosophers, because these practices reflect error statistical principles that are widespread in science. Howson goes so far as to issue a warning against a turn to error statistics: "The message is clear: rely on error probabilities only at your peril" (this issue). But when we really want to learn what our test results are saying, whether about our bodies or about this world, the peril is in relying on the subjective Bayesian screening.

Note 12. See Mayo 1997.

REFERENCES

Hacking, I. (1992), "The Self-Vindication of the Laboratory Sciences", in A. Pickering (ed.), Science as Practice and Culture. Chicago: University of Chicago Press, pp. 29-64.
Howson, C. (1997), "Error Probabilities in Error", Philosophy of Science 64 (Proceedings): this issue.
Howson, C. and P. Urbach (1989), Scientific Reasoning: The Bayesian Approach. La Salle: Open Court.
Mayo, D. (1985), "Increasing Public Participation in Controversies Involving Hazards: The Value of Metastatistical Rules", Science, Technology, and Human Values 10: 55-68.
Mayo, D. (1989), "Toward a More Objective Understanding of the Evidence of Carcinogenic Risk", in A. Fine and J. Leplin (eds.), PSA 1988, Vol. 2. East Lansing, MI: Philosophy of Science Association, pp. 489-503.
Mayo, D. (1996), Error and the Growth of Experimental Knowledge. Chicago: The University of Chicago Press.
Mayo, D. (1997), "Duhem's Problem, The Bayesian Way, and Error Statistics, or 'What's Belief Got To Do With It?'" and "Response to Howson and Laudan", Philosophy of Science (June 1997), in press.
Neyman, J. (1952), Lectures and Conferences on Mathematical Statistics and Probability, 2nd ed. Washington, DC: U.S. Department of Agriculture.
Neyman, J. (1971), "Foundations of Behavioristic Statistics", in V. P. Godambe and D. A. Sprott (eds.), Foundations of Statistical Inference. Toronto: Holt, Rinehart and Winston of Canada, pp. 1-13 (comments and reply, pp. 14-19).
Neyman, J. (1977), "Frequentist Probability and Frequentist Statistics", Synthese 36: 97-131.
Newsweek (February 24, 1997), "The Mammogram War", pp. 54-58.
Pearson, E. S. (1950), "On Questions Raised by the Combination of Tests Based on Discontinuous Distributions", Biometrika 37: 383-398; reprinted in E. S. Pearson (1966), The Selected Papers of E. S. Pearson. Berkeley: University of California Press, pp. 217-232.
Salmon, W. (1991), "The Appraisal of Theories: Kuhn Meets Bayes", in A. Fine, M. Forbes, and L. Wessels (eds.), PSA 1990, Vol. 2. East Lansing, MI: Philosophy of Science Association, pp. 325-332.
Suppes, P. (1969), "Models of Data", in his Studies in the Methodology and Foundations of Science. Dordrecht: D. Reidel, pp. 24-35.