Data analysis: Frequently Bayesian

The well-established mathematics of probability theory notwithstanding, assessing the validity of a scientific hypothesis remains a thorny proposition.

Glen Cowan
April 2007, page 82

All experiments have uncertain or random aspects, and quantifying randomness is what probability is all about. The mathematical axioms of probability provide rules for manipulating the numbers, and yet pinning down exactly what a probability means can be difficult. Attempts at clarification have resulted in two main schools of statistical inference, frequentist and Bayesian, and years of raging debate.

[Figure: A Bayesian interprets data.]

A role for subjectivity?

For frequentists, a probability is something associated with the outcome of an observation that is at least in principle repeatable, such as the number of nuclei that decay in a certain time. After many repetitions of a measurement under the same conditions, the fraction of times one sees a certain outcome—for example, 5 decays in a minute—tends toward a fixed value. The idea of probability as a limiting frequency is perhaps the most widely used interpretation encountered in a physics lab, but it is not really what people mean when they say, "What is the probability that the Higgs boson exists?" Viewed as a limiting frequency, the answer is either 0% or 100%, though one may not know which. Nevertheless, one can answer with a subjective probability, a numerical measure of an individual's state of knowledge, and in that sense a value of 50% can be a perfectly reasonable response. The term "degree of belief" is used in the field to describe that subjective measure.

Both the frequentist and subjective interpretations provoke some criticism. How can scientists repeat an experiment an infinite number of times under identical conditions, and would the empirical frequency be anything that a mathematician would recognize as a mathematical limit? On the other hand, it surely seems suspect to inject subjective judgments into an experimental investigation. Shouldn't scientists analyze their results as objectively as possible and without prejudice?

Regardless of interpretation, any probability must obey an important theorem published by Thomas Bayes in 1763. Suppose A and B represent two things to which probabilities are to be assigned. They may be outcomes of a repeatable observation or perhaps hypotheses to be ascribed a degree of belief. As long as the probability of B, P(B), is nonzero, the conditional probability of A given B, P(A|B), may be defined as P(A|B) = P(A and B)/P(B). Here P(A and B) means the probability that both A and B are true. Consider, for example, rolling a die. A could mean "the outcome is even" and B "the outcome is less than 3." Then "A and B" is satisfied only with a roll of two. The imposed condition, B, says that the space of possible outcomes is to be regarded as some subset of those initially specified. As a special case, B could be the initially specified set; in that sense all probabilities are conditional.
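As a concrete check of that definition, the following minimal Python sketch enumerates the die example just described; the helper names are purely illustrative, and the result, P(A|B) = 1/2, follows directly from the definition above.

```python
from fractions import Fraction

# The six equally likely outcomes of a fair die.
outcomes = set(range(1, 7))
A = {n for n in outcomes if n % 2 == 0}   # "the outcome is even"
B = {n for n in outcomes if n < 3}        # "the outcome is less than 3"

def prob(event):
    """Probability of an event, as a fraction of the equally likely outcomes."""
    return Fraction(len(event), len(outcomes))

# Conditional probability P(A|B) = P(A and B)/P(B); "A and B" holds only for a roll of 2.
print(prob(A & B) / prob(B))   # 1/2
```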
Now A and B are arbitrary labels; so as long as P(A) is nonzero, one can reverse the labels A and B in the equation defining P(A|B) to obtain P(B|A) = P(B and A)/P(A). But the stipulation "A and B" is exactly the same as "B and A," so their probabilities must also be equal. Therefore, one can solve the respective conditional-probability equations for P(A and B) and P(B and A), set them equal, and arrive at Bayes's theorem,

P(A|B) = P(B|A) P(A) / P(B).

The theorem itself is not a subject of debate, and it finds application in both frequentist and Bayesian methods. The controversy stems from how it is applied and, in particular, whether one extends probability to include degree of belief.

The frequentist school restricts probabilities to outcomes of repeatable measurements. Its approach to statistical testing is to reject a hypothesis if the observed data fall in a predefined "critical region," chosen to encompass data that are uncharacteristic for the hypothesis in question and better accounted for by an alternative explanation. The discussion of what models are good and bad revolves around how often, after many repetitions of the measurement, one would reject a true hypothesis and fail to reject a false one. For the case of measuring a parameter—say, the mass of the top quark—a frequentist would choose the value that maximizes the so-called likelihood, the probability of obtaining data close to what is actually seen. Hypothesis tests and the method of maximum likelihood are among the most widely used tools in the analysis of experimental data. But notice that frequentists only talk about probabilities of data, not the probability of a hypothesis or a parameter. The somewhat contorted phrasing that their methods necessitate seems to avoid the questions one really wants to ask, namely, "What is the probability that the parameter is in a given range?" or "What is the probability that my theory is correct?"

Bayesian learning

The main idea of Bayesian statistics is to use subjective probability to quantify degree of belief in different models. Bayes's theorem can be written as

P(θ|x) ∝ P(x|θ) P(θ),

where, instead of A and B, one has a parameter θ to represent the hypothesis and a data value x to represent the outcome of an observation. The quantity P(x|θ) on the right-hand side is the probability to obtain x for a given θ. But given empirical data, one can plug the values into P(x|θ) and consider it to be a function of θ. In that case, the function is called the likelihood—the same quantity mentioned in connection with frequentist methods. The likelihood multiplies P(θ), called the prior probability, which reflects the degree of belief before an experimenter makes measurements. The requirement that P(θ|x) be normalized to unity when integrated over all values of θ determines the constant of proportionality 1/P(x). The left-hand side of the theorem gives the posterior probability for θ, that is, the probability for θ deduced after seeing the outcome of the measurement x.

Bayes's theorem tells experimenters how to learn from their measurements; the figure presents a couple of graphical examples. But the learning requires an input: a prior degree of belief about the hypothesis, P(θ). Bayesian analysis provides no golden rule for prior probabilities; they might be based on previous measurements, symmetry arguments, or physical intuition. But once they are given, Bayes's theorem specifies uniquely how those probabilities should change in light of the data.

In many cases of practical interest—for example, a large data sample and only vague initial judgments—Bayesian and frequentist methods yield essentially the same numerical results. Still, the interpretations of those results have subtle but significant differences. In important cases involving small data samples, the differences are apparent both philosophically and numerically.
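The update rule P(θ|x) ∝ P(x|θ)P(θ) can be made concrete with a short numerical sketch. The Python fragment below discretizes θ on a grid, multiplies an assumed prior by the likelihood of an assumed observation, and normalizes; the particular prior center, widths, and observed value are invented for illustration and are not from the article.

```python
import numpy as np

# A grid of parameter values theta (illustrative range and spacing).
theta = np.linspace(0.0, 10.0, 1001)

# Assumed prior degree of belief P(theta): a broad Gaussian centered at 5.
prior = np.exp(-0.5 * ((theta - 5.0) / 3.0) ** 2)
prior /= prior.sum()

# Likelihood P(x|theta): probability of the observed x = 4.2, viewed as a function of theta.
x_obs, sigma_x = 4.2, 0.5
likelihood = np.exp(-0.5 * ((x_obs - theta) / sigma_x) ** 2)

# Bayes's theorem: posterior proportional to likelihood times prior, then normalized.
posterior = likelihood * prior
posterior /= posterior.sum()

print(theta[np.argmax(posterior)])   # posterior mode, pulled slightly toward the prior
```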
What you knew and when you knew it

The difficulties in a Bayesian analysis usually stem from the requirement to specify prior probabilities. Before measuring a parameter θ, say, a particle mass, one might be tempted to profess ignorance and assign a noninformative prior probability, such as a uniform probability density from 0 to some large mass. An important problem is that specification of ignorance for a continuous parameter is not unique. For example, a model may be parameterized not by θ but instead by λ = ln θ. A constant probability density for one parameter would imply a nonconstant density for the other; a flat density in θ, for instance, corresponds to a density proportional to e^λ in λ. Nevertheless, one often uses uniform prior probabilities not because they represent real prior judgments but because they provide a convenient point of reference.

Difficulties with noninformative priors diminish if one can write down probabilities that rationally reflect prior input. The problem is that judgments of what to incorporate and how to do it can vary widely among individuals, and one would like experimental results to be relevant to the entire scientific community, not just to scientists whose prior probabilities coincide with those of the experimenter. So to be of broader value, a Bayesian analysis needs to show how the posterior probabilities change under a reasonable variation of assumed priors.

Scientists should not be required to label themselves as frequentists or Bayesians. The two approaches answer different but related questions, and a presentation of an experimental result should naturally involve both. Most of the time, one wants to summarize the results of a measurement without explicit reference to prior probabilities; in those cases the frequentist approach will be most visible. It often boils down to reporting the likelihood function or an appropriate summary of it, such as the parameter value for which it is maximized and the standard deviation of that so-called maximum-likelihood estimator. But if parts of the problem require assignment of probabilities to nonrepeatable phenomena, then Bayesian tools will be used. In general, experiments involve systematic uncertainties due to various parameters whose values are not precisely known, but which are assumed not to fluctuate with repeated measurements. If information is available that constrains those parameters, it can be incorporated into prior probabilities and used in Bayes's theorem. For many, it is natural to take the results of an experiment and fold in both the likelihood of obtaining the specific data and prior judgments about models or hypotheses. Anyone who follows that approach is thinking like a Bayesian.

Glen Cowan is a senior lecturer in the physics department at Royal Holloway, University of London.

Additional resources
T. Bayes, Philos. Trans. 53, 370 (1763); reprinted in Biometrika 45, 293 (1958).
R. D. Cousins, Am. J. Phys. 63, 398 (1995).
P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences: A Comparative Approach with Mathematica Support, Cambridge U. Press, New York (2005).
D. S. Sivia with J. Skilling, Data Analysis: A Bayesian Tutorial, 2nd ed., Oxford U. Press, New York (2006).

Supplemental resources
A. O'Hagan, Kendall's Advanced Theory of Statistics, vol. 2B, Bayesian Inference, Oxford U. Press, New York (1994).
E. T. Jaynes, Probability Theory: The Logic of Science, Cambridge U. Press, New York (2003).
S. Press, Subjective and Objective Bayesian Statistics: Principles, Models, and Applications, 2nd ed., Wiley, Hoboken, NJ (2003).
P. Saha, Principles of Data Analysis, Cappella Archive, Great Malvern, UK (2003), available online.
W. Bolstad, Introduction to Bayesian Statistics, Wiley, Hoboken, NJ (2004).
A. Gelman, J. B. Carlin, H. S. Stern, D. B. Rubin, Bayesian Data Analysis, 2nd ed., Chapman & Hall/CRC, Boca Raton, FL (2004).
P. M. Lee, Bayesian Statistics: An Introduction, 3rd ed., Wiley, Hoboken, NJ (2004).
C. P. Robert, The Bayesian Choice, Springer, New York (2005).

An example of Bayesian inference about a parameter

Supplemental material for the Quick Study "Data analysis: Frequently Bayesian," PHYSICS TODAY, April 2007, page 82.

A number of important features of Bayesian statistics can be illustrated in a simple example where one infers the value of a parameter θ on the basis of a single measurement, x. The parameter could represent some constant of nature that one wants to estimate, such as the mass of an elementary particle. The quantity x could represent an estimate of θ, but having a value that is subject to random effects. Often such a measurement can be modeled as a random variable following a Gaussian distribution with a standard deviation σx, centered about θ. That is, the probability for x (strictly speaking, the probability density function, or pdf) is given by

P(x|θ) = (1/√(2πσx²)) exp[−(x − θ)²/(2σx²)].   (1)

Here the probability has been written as a conditional probability density for x given θ. The likelihood is found by evaluating P(x|θ) with the value of x observed and then regarding it as a function of θ. By using Bayes's theorem, this can be related to the conditional probability for θ given x, which will encapsulate the knowledge about θ after the measurement has been made.

To infer anything about θ using Bayes's theorem, one also needs the prior probability, which reflects the knowledge about θ that is available before the observation of x. The prior will depend in general on the problem at hand, but in many cases this could be a Gaussian distribution centered about some value θ0, having a standard deviation σ0,

P(θ) = (1/√(2πσ0²)) exp[−(θ − θ0)²/(2σ0²)].   (2)

If very little is known about θ before carrying out the measurement, then one would have a very broad prior with σ0 ≫ σx. On the other hand, the prior knowledge could be based on a previous measurement of θ that was reported as a value θ0 and standard deviation σ0. In general the prior distribution can be of whatever shape best summarizes the analyst's knowledge about the parameter, including constraints—for example, that θ be bounded to lie within a certain range. Bayesian statistics provides no golden rule for the prior probabilities, but only says how they should be modified when new observations are considered.

The analyst's knowledge about θ is updated in light of the measurement x by using Bayes's theorem to find the posterior probability,

P(θ|x) ∝ P(x|θ) P(θ).   (3)

Using the likelihood and prior probabilities from above gives

P(θ|x) ∝ exp[−(x − θ)²/(2σx²) − (θ − θ0)²/(2σ0²)].

Multiplying out the arguments of the exponentials and completing the square, one finds a term that depends on θ with a Gaussian form, multiplied by a factor that does not depend on θ. The posterior probability can therefore be written as

P(θ|x) = (1/√(2πσp²)) exp[−(θ − θp)²/(2σp²)],   (4)

where θp is

θp = (x/σx² + θ0/σ0²) / (1/σx² + 1/σ0²),   (5)

and the posterior standard deviation σp is related to σx and σ0 by

1/σp² = 1/σx² + 1/σ0².   (6)

For the posterior probability in equation 4, the constant of proportionality was determined by the requirement that P(θ|x) be normalized to unity when integrated over all values of θ. Since the part of the expression that depends on θ has the form of a Gaussian, the known normalization factor for that distribution, 1/√(2πσp²), gives the desired result. So in this special case of a Gaussian distributed measurement and a Gaussian prior, the posterior probability is found to be a Gaussian as well.
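Equations 4–6 can be checked with a few lines of Python. The function below is a sketch under the same Gaussian assumptions; the numerical inputs are invented for illustration and are not taken from the article.

```python
def gaussian_posterior(x, sigma_x, theta0, sigma0):
    """Posterior mean theta_p and standard deviation sigma_p for a Gaussian
    measurement (x, sigma_x) combined with a Gaussian prior (theta0, sigma0),
    following equations 4-6 above."""
    w_x = 1.0 / sigma_x**2          # precision of the measurement
    w_0 = 1.0 / sigma0**2           # precision of the prior
    theta_p = (x * w_x + theta0 * w_0) / (w_x + w_0)   # precision-weighted average
    sigma_p = (w_x + w_0) ** -0.5
    return theta_p, sigma_p

# A broad prior barely moves the result away from the measurement ...
print(gaussian_posterior(4.2, 0.5, 5.0, 3.0))   # roughly (4.22, 0.49)
# ... while a prior of comparable width pulls it halfway toward theta0.
print(gaussian_posterior(4.2, 0.5, 5.0, 0.5))   # (4.6, about 0.35)
```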
In general, life is not so simple, but one can nevertheless see some important features of a Bayesian analysis from this example. The figure shows the likelihood function P(x|θ), prior P(θ), and posterior probability P(θ|x); all of the curves have a Gaussian form. A Bayesian would usually summarize the type of analysis given above by reporting θp and σp—the mean and standard deviation of the posterior distribution. In other cases where the posterior is more complicated, one could either report the entire function or summarize it with an appropriate set of parameters, such as its mean, median, or mode, its standard deviation, and perhaps other constants that characterize its skewness or tails. A short numerical sketch of such a summary follows the figure caption below.

In panel a of the figure, the prior distribution is relatively broad compared to the likelihood. One can see that in this case, the likelihood and posterior are very close. That accord is characteristic of cases where the prior contributes relatively little information compared to what is given by the measurement. In the limiting case of a constant prior, the posterior has exactly the same form as the likelihood. Then the mode (peak position) of the posterior distribution corresponds to the parameter value that maximizes the likelihood function; it is the maximum-likelihood estimator used in frequentist statistics. In panel b, however, one sees that the prior information is comparable to that brought by the measurement, since the prior distribution has a width comparable to that of the likelihood. Bayes's theorem gives the appropriate compromise between the two, mixing together the information from different sources according to the rules of probability.

Figure 1. The likelihood (red), prior (green), and posterior (blue) probabilities for two distinct cases. (a) The prior is very broad. (b) The prior and likelihood have comparable widths.
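The summary numbers mentioned above can be computed directly from a posterior tabulated on a grid, whatever its shape. The sketch below shows one way to do that; the grid and the assumed posterior are purely illustrative and not taken from the article.

```python
import numpy as np

# An illustrative posterior, tabulated on an evenly spaced grid of theta values.
theta = np.linspace(0.0, 10.0, 2001)
dx = theta[1] - theta[0]
posterior = np.exp(-0.5 * ((theta - 4.6) / 0.35) ** 2)
posterior /= posterior.sum() * dx            # normalize as a probability density

mean = (theta * posterior).sum() * dx
std = np.sqrt(((theta - mean) ** 2 * posterior).sum() * dx)
mode = theta[np.argmax(posterior)]

cdf = np.cumsum(posterior) * dx              # cumulative probability
median = theta[np.searchsorted(cdf, 0.5)]

print(mean, median, mode, std)               # all close to 4.6 and 0.35 for this symmetric case
```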
A worked-out numerical example

Supplemental material for the Quick Study "Data analysis: Frequently Bayesian," PHYSICS TODAY, April 2007, page 82.

The Quick Study "Data analysis: Frequently Bayesian" presented Bayes's theorem in the form P(θ|x) ∝ P(x|θ)P(θ), where the parameter θ represents a hypothesis and the data value x represents the outcome of an observation. The various terms in the proportionality have an associated vocabulary. The quantity P(x|θ) on the right-hand side is the probability to obtain x for a given θ. But given empirical data, one can plug the values into P(x|θ) and consider it to be a function of θ; in that case, the function is called the likelihood. The likelihood multiplies P(θ), called the prior probability, which reflects an experimenter's knowledge of θ before making a measurement. The left-hand side of the theorem gives the posterior probability for θ, that is, the probability for θ deduced after seeing the outcome of the measurement x. The requirement that P(θ|x) be normalized to unity when integrated over all values of θ determines the constant of proportionality 1/P(x). The following example, though admittedly artificial, may serve to help make concrete the abstract probabilities just defined.

Consider a parameter—in the spirit of concreteness, think of an atomic mass—that has an unknown integer value in the range 1–20. An experimenter entertains two hypotheses; in other words, the parameter θ assumes one of two values. One of them, call it H1, posits that the mass is in the range 1–5. The complementary hypothesis H2 says the mass is in the range 6–20. Suppose that the data value x may likewise return one of two possibilities: Either the mass is prime (including 1) or it is composite. Once the measurement has been made, a Bayesian will calculate the likelihood function and multiply it by a prior probability to obtain a refined posterior probability.

In the absence of other knowledge, the experimenter may well set the prior P(H1) = 1/4, because H1 represents one-quarter of the possible mass values, and P(H2) = 3/4. Now it's time to make a measurement, and the result is . . . prime. The likelihood values are P(prime|H1) = 4/5 and P(prime|H2) = 1/3. The normalization constant needed in Bayes's theorem is 1/P(prime) = 20/9, which reflects that 9 of the first 20 integers are prime. The Bayesian combines the likelihoods, prior probabilities, and normalization to obtain the posterior probabilities P(H1|prime) = 4/5 × 1/4 × 20/9 = 4/9 and P(H2|prime) = 1/3 × 3/4 × 20/9 = 5/9. (A short script that reproduces this arithmetic follows at the end of this supplement.)

An important message that should not get lost in the numbers is the significant consequence of the prior probability. Given the data value of "prime," H1 returned a significantly higher likelihood than H2. But the Bayesian prior was biased in favor of H2, and, even given the likelihood function, the Bayesian's posterior probability slightly favored H2 over H1.

Steven K. Blau
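As referenced above, here is a short Python script that reproduces the arithmetic of the worked example. It simply restates the numbers given in the text (treating 1 as prime, as the example does); the function and variable names are illustrative.

```python
from fractions import Fraction

# Primes among 1-20, with 1 counted as prime, following the text.
primes = {1, 2, 3, 5, 7, 11, 13, 17, 19}

def p_prime_given(lo, hi):
    """Likelihood of observing "prime" if the mass is equally likely to be any integer in lo..hi."""
    values = range(lo, hi + 1)
    return Fraction(sum(v in primes for v in values), len(values))

prior = {"H1": Fraction(1, 4), "H2": Fraction(3, 4)}
likelihood = {"H1": p_prime_given(1, 5), "H2": p_prime_given(6, 20)}   # 4/5 and 1/3

p_prime = sum(likelihood[h] * prior[h] for h in prior)                 # 9/20
posterior = {h: likelihood[h] * prior[h] / p_prime for h in prior}

print(posterior)   # {'H1': Fraction(4, 9), 'H2': Fraction(5, 9)}
```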