Data analysis: Frequently Bayesian
The well-established mathematics of probability theory notwithstanding, assessing the validity of a
scientific hypothesis remains a thorny proposition.
Glen Cowan
April 2007, page 82
All experiments have uncertain or random aspects, and
quantifying randomness is what probability is all about. The
mathematical axioms of probability provide rules for
manipulating the numbers, and yet pinning down exactly what
a probability means can be difficult. Attempts at clarification
have resulted in two main schools of statistical inference,
frequentist and Bayesian, and years of raging debate.
Figure: A Bayesian interprets data.
A role for subjectivity?
For frequentists, a probability is something associated with the outcome of an observation that is at
least in principle repeatable, such as the number of nuclei that decay in a certain time. After many
repetitions of a measurement under the same conditions, the fraction of times one sees a certain
outcome—for example, 5 decays in a minute—tends toward a fixed value.
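To make the limiting-frequency idea concrete, here is a minimal simulation sketch in Python. The Poisson counting model and the mean rate of 5 decays per minute are illustrative assumptions chosen for the example, not values from the article.

```python
# Sketch: probability as a limiting frequency, illustrated by simulating
# many repetitions of a one-minute decay-counting measurement.
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(1)
mean_rate = 5.0           # assumed mean number of decays per minute
n_repetitions = 100_000   # many repetitions of the one-minute measurement

counts = rng.poisson(mean_rate, size=n_repetitions)
empirical_freq = np.mean(counts == 5)     # fraction of runs with exactly 5 decays
analytic_prob = exp(-mean_rate) * mean_rate**5 / factorial(5)

print(f"empirical frequency of 5 decays: {empirical_freq:.4f}")
print(f"Poisson probability of 5 decays: {analytic_prob:.4f}")
```

As the number of repetitions grows, the empirical fraction settles toward the fixed analytic value.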
The idea of probability as a limiting frequency is perhaps the most widely used interpretation
encountered in a physics lab, but it is not really what people mean when they ask, "What is the
probability that the Higgs boson exists?" Viewed as a limiting frequency, the answer is either 0% or
100%, though one may not know which. Nevertheless, one can answer with a subjective
probability, a numerical measure of an individual's state of knowledge, and in that sense a value of
50% can be a perfectly reasonable response. The term "degree of belief" is used in the field to
describe that subjective measure.
Both the frequentist and subjective interpretations provoke some criticism. How can scientists
repeat an experiment an infinite number of times under identical conditions, and would the
empirical frequency be anything that a mathematician would recognize as a mathematical limit? On
the other hand, it surely seems suspect to inject subjective judgments into an experimental
investigation. Shouldn't scientists analyze their results as objectively as possible and without
prejudice?
Regardless of interpretation, any probability must obey an important theorem published by Thomas
Bayes in 1763. Suppose A and B represent two things to which probabilities are to be assigned.
They may be outcomes of a repeatable observation or perhaps hypotheses to be ascribed a degree of
belief. As long as the probability of B, P(B), is nonzero, the conditional probability of A given B,
P(A|B), may be defined as P(A|B) = P(A and B)/P(B).
Here P(A and B) means the probability that both A and B are true. Consider, for example, rolling a
die. A could mean "the outcome is even" and B "the outcome is less than 3." Then "A and B" is
satisfied only with a roll of two. The imposed condition, B, says that the space of possible outcomes
is to be regarded as some subset of those initially specified. As a special case, B could be the
initially specified set; in that sense all probabilities are conditional.
Now A and B are arbitrary labels; so as long as P(A) is nonzero, one can reverse the labels A and B
in the equation defining P(A|B) to obtain P(B|A) = P(B and A)/P(A). But the stipulation "A and B" is
exactly the same as "B and A," so their probabilities must also be equal. Therefore, one can solve
the respective conditional-probability equations for P(A and B) and P(B and A), set them equal, and
arrive at Bayes's theorem,

P(A|B) = P(B|A) P(A) / P(B).
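As a quick numerical check, the sketch below enumerates the die example from the paragraph above and verifies the definition of conditional probability and Bayes's theorem exactly; the code and its variable names are illustrative, not part of the original article.

```python
# Sketch: checking Bayes's theorem on the die example by exact enumeration
# of the six equally likely outcomes.
from fractions import Fraction

outcomes = range(1, 7)
A = {n for n in outcomes if n % 2 == 0}   # "the outcome is even"
B = {n for n in outcomes if n < 3}        # "the outcome is less than 3"

def prob(event):
    return Fraction(len(event), 6)

P_A, P_B = prob(A), prob(B)
P_A_and_B = prob(A & B)                   # only a roll of 2 satisfies both

P_A_given_B = P_A_and_B / P_B             # definition of conditional probability
P_B_given_A = P_A_and_B / P_A

# Bayes's theorem: P(A|B) = P(B|A) P(A) / P(B)
assert P_A_given_B == P_B_given_A * P_A / P_B
print(P_A_given_B)                        # 1/2
```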
The theorem itself is not a subject of debate, and it finds application in both frequentist and
Bayesian methods. The controversy stems from how it is applied and, in particular, whether one
extends probability to include degree of belief.
The frequentist school restricts probabilities to outcomes of repeatable measurements. Its approach
to statistical testing is to reject a hypothesis if the observed data fall in a predefined "critical
region," chosen to encompass data that are uncharacteristic for the hypothesis in question and better
accounted for by an alternative explanation. The discussion of what models are good and bad
revolves around how often, after many repetitions of the measurement, one would reject a true
hypothesis and not reject a false one. For the case of measuring a parameter—say, the mass of the
top quark—a frequentist would choose the value that maximizes the so-called likelihood, or
probability of obtaining data close to what is actually seen.
Hypothesis tests and the method of maximum likelihood are among the most widely used tools in
the analysis of experimental data. But notice that frequentists only talk about probabilities of data,
not the probability of a hypothesis or a parameter. The somewhat contorted phrasing that their
methods necessitate seems to avoid the questions one really wants to ask, namely, "What is the
probability that the parameter is in a given range?" or "What is the probability that my theory is
correct?"
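As a hedged illustration of the maximum-likelihood method just mentioned, the sketch below forms the likelihood for a parameter measured with Gaussian resolution and maximizes it by a simple scan. The "true" mass of 172 GeV, the 10 GeV resolution, and the sample size are numbers invented for the example, not measured values.

```python
# Sketch: a frequentist maximum-likelihood estimate for a parameter
# measured with Gaussian resolution.
import numpy as np

rng = np.random.default_rng(2)
true_mass, resolution, n_events = 172.0, 10.0, 400   # assumed values
data = rng.normal(true_mass, resolution, size=n_events)

# Scan the negative log-likelihood (up to a constant) over candidate masses.
masses = np.linspace(160.0, 185.0, 2001)
nll = [0.5 * np.sum((data - m) ** 2) / resolution**2 for m in masses]
mle = masses[np.argmin(nll)]

# For this Gaussian model the MLE is the sample mean, and the standard
# deviation of the estimator is resolution / sqrt(n_events).
print(f"MLE from scan: {mle:.2f}, sample mean: {data.mean():.2f}")
print(f"standard deviation of the estimator: {resolution/np.sqrt(n_events):.2f}")
```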
Bayesian learning
The main idea of Bayesian statistics is to use subjective probability to quantify degree of belief in
different models. Bayes's theorem can be written as P(θ|x) ∝ P(x|θ)P(θ), where, instead of A and B,
one has a parameter θ to represent the hypothesis and a data value x to represent the outcome of an
observation.
The quantity P(x|θ) on the right-hand side is the probability to obtain x for a given θ. But given
empirical data, one can plug the values into P(x|θ) and consider it to be a function of θ. In that case,
the function is called the likelihood—the same quantity as mentioned in connection with frequentist
methods. The likelihood multiplies P(θ), called the prior probability, which reflects the degree of
belief before an experimenter makes measurements. The requirement that P(θ|x) be normalized to
unity when integrated over all values of θ determines the constant of proportionality 1/P(x).
The left-hand side of the theorem gives the posterior probability for θ, that is, the probability for θ
deduced after seeing the outcome of the measurement x. Bayes's theorem tells experimenters how to
learn from their measurements; the figure presents a couple of graphical examples. But the learning
requires an input: a prior degree of belief about the hypothesis, P(θ). Bayesian analysis provides no
golden rule for prior probabilities; they might be based on previous measurements, symmetry
arguments, or physical intuition. But once they are given, Bayes's theorem specifies uniquely how
those probabilities should change in light of the data.
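A small numerical sketch of the learning rule just stated: the posterior is proportional to likelihood times prior, with the constant fixed afterward by normalization. The Poisson counting model, the observed count of 7, and the particular vague prior are illustrative assumptions, not quantities from the article.

```python
# Sketch: Bayesian updating on a grid of parameter values.
import numpy as np
from math import factorial

theta = np.linspace(0.01, 30.0, 3000)       # candidate values of the parameter θ
dtheta = theta[1] - theta[0]

prior = np.exp(-theta / 10.0)               # an assumed, rather vague prior P(θ)
prior /= prior.sum() * dtheta               # normalize the prior density

x_observed = 7                              # assumed outcome of a counting measurement
likelihood = np.exp(-theta) * theta**x_observed / factorial(x_observed)  # Poisson P(x|θ)

posterior = likelihood * prior              # Bayes's theorem, up to a constant
posterior /= posterior.sum() * dtheta       # the factor 1/P(x), fixed by normalization

print(f"posterior mean of θ: {(theta * posterior).sum() * dtheta:.2f}")
```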
In many cases of practical interest—for example, a large data sample and only vague initial
judgments—Bayesian and frequentist methods yield essentially the same numerical results. Still,
the interpretations of those results have subtle but significant differences. In important cases
involving small data samples, the differences are apparent both philosophically and numerically.
What you knew and when you knew it
The difficulties in a Bayesian analysis usually stem from the need to specify prior probabilities.
Before measuring a parameter θ, say, a particle mass, one might be tempted to profess ignorance
and assign a noninformative prior probability, such as a uniform probability density from 0 to some
large mass.
An important problem is that specification of ignorance for a continuous parameter is not unique.
For example, a model may be parameterized not by θ but instead by λ = ln θ. A constant probability
for one parameter would imply a nonconstant probability for the other. Nevertheless, one often uses
uniform prior probabilities not because they represent real prior judgments but because they provide
a convenient point of reference.
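The non-uniqueness argument above can be made visible numerically: a density that is flat in θ is not flat in λ = ln θ. The sketch below samples from a uniform density in θ and histograms ln θ; the parameter range (1 to 100) is an arbitrary assumption.

```python
# Sketch: a prior uniform in θ induces a non-uniform density in λ = ln θ.
import numpy as np

rng = np.random.default_rng(3)
theta = rng.uniform(1.0, 100.0, size=200_000)   # samples from a flat density in θ
lam = np.log(theta)                             # the same samples expressed as λ = ln θ

# The induced density in λ is proportional to exp(λ), not constant:
density, edges = np.histogram(lam, bins=10, range=(0.0, np.log(100.0)), density=True)
print(np.round(density, 3))   # rises steeply from low λ to high λ
```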
Difficulties with noninformative priors diminish if one can write down probabilities that rationally
reflect prior input. The problem is that judgments of what to incorporate and how to do it can vary
widely among individuals, and one would like experimental results to be relevant to the entire
scientific community, not just to scientists whose prior probabilities coincide with those of the
experimenter. So to be of broader value, a Bayesian analysis needs to show how the posterior
probabilities change under a reasonable variation of assumed priors.
Scientists should not be required to label themselves as frequentists or Bayesians. The two
approaches answer different but related questions, and a presentation of an experimental result
should naturally involve both. Most of the time, one wants to summarize the results of a
measurement without explicit reference to prior probabilities; in those cases the frequentist
approach will be most visible. It often boils down to reporting the likelihood function or an
appropriate summary of it, such as the parameter value for which it is maximized and the standard
deviation of that so-called maximum-likelihood estimator.
But if parts of the problem require assigning probabilities to nonrepeatable phenomena, then
Bayesian tools will be used. In general, experiments involve systematic uncertainties due to various
parameters whose values are not precisely known, but which are assumed not to fluctuate with
repeated measurements. If information is available that constrains those parameters, it can be
incorporated into prior probabilities and used in Bayes's theorem.
For many, it is natural to take the results of an experiment and fold in both the likelihood of
obtaining the specific data and prior judgments about models or hypotheses. Anyone who follows
that approach is thinking like a Bayesian.
Glen Cowan is a senior lecturer in the physics department at Royal Holloway, University of
London.
Additional resources

T. Bayes, Philos. Trans. 53, 370 (1763); reprinted in Biometrika 45, 293 (1958).
R. D. Cousins, Am. J. Phys. 63, 398 (1995).
P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences: A Comparative Approach with Mathematica Support, Cambridge U. Press, New York (2005).
D. S. Sivia with J. Skilling, Data Analysis: A Bayesian Tutorial, 2nd ed., Oxford U. Press, New York (2006).
Supplemental resources

A. O'Hagan, Kendall's Advanced Theory of Statistics, vol. 2B, Bayesian Inference, Oxford U. Press, New York (1994).
E. T. Jaynes, Probability Theory: The Logic of Science, Cambridge U. Press, New York (2003).
S. Press, Subjective and Objective Bayesian Statistics: Principles, Models, and Applications, 2nd ed., Wiley, Hoboken, NJ (2003).
P. Saha, Principles of Data Analysis, Cappella Archive, Great Malvern, UK (2003), available online at [LINK].
W. Bolstad, Introduction to Bayesian Statistics, Wiley, Hoboken, NJ (2004).
A. Gelman, J. B. Carlin, H. S. Stern, D. B. Rubin, Bayesian Data Analysis, 2nd ed., Chapman & Hall/CRC, Boca Raton, FL (2004).
P. M. Lee, Bayesian Statistics: An Introduction, 3rd ed., Wiley, Hoboken, NJ (2004).
C. P. Robert, The Bayesian Choice, Springer, New York (2005).
An example of Bayesian inference about a parameter
Supplemental material for the Quick Study "Data analysis: Frequently Bayesian" PHYSICS
TODAY, April 2007, page 82.
A number of important features of Bayesian statistics can be illustrated in a simple example where
one infers the value of a parameter θ on the basis of a single measurement, x. The parameter could
represent some constant of nature that one wants to estimate, such as the mass of an elementary
particle. The quantity x could represent an estimate of θ, but having a value that is subject to
random effects. Often such a measurement can be modeled as a random variable following a
Gaussian distribution with a standard deviation σx, centered about θ. That is, the probability for x (strictly speaking, the probability density function, or pdf) is given by

P(x|θ) = (1/(√(2π) σx)) exp[−(x − θ)²/(2σx²)].   (1)
Here the probability has been written as a conditional probability density for x given θ. The
likelihood is found by evaluating P(x|θ) with the value of x observed and then regarding it as a
function of θ. By using Bayes's theorem, this can be related to the conditional probability for θ
given x, which will encapsulate the knowledge about θ after the measurement has been made.
To infer anything about θ using Bayes's theorem, one also needs the prior probability, which
reflects the knowledge about θ that is available before the observation of x. The prior will depend in
general on the problem at hand, but in many cases this could be a Gaussian distribution centered
about some value θ0, having a standard deviation σ0,

P(θ) = (1/(√(2π) σ0)) exp[−(θ − θ0)²/(2σ0²)].   (2)
If very little is known about θ before carrying out the measurement, then one would have a very
broad prior with σ0 ≫ σx. On the other hand, the prior knowledge could be based on a previous
measurement of θ that was reported as a value θ0 and standard deviation σ0. In general the prior
distribution can be of whatever shape best summarizes the analyst's knowledge about the parameter,
including constraints—for example, that θ be bounded to lie within a certain range. Bayesian
statistics provides no golden rule for the prior probabilities, but only says how they should be
modified when new observations are considered.
The analyst's knowledge about θ is updated in light of the measurement x by using Bayes's theorem
to find the posterior probability, P(θ|x) ∝ P(x|θ)P(θ). Using the likelihood and prior probabilities
from above gives

P(θ|x) ∝ exp[−(x − θ)²/(2σx²)] exp[−(θ − θ0)²/(2σ0²)].   (3)
Multiplying out the arguments of the exponentials and completing the square, one finds a term that
depends on θ with a Gaussian form, multiplied by a factor that does not depend on θ. The posterior
probability can therefore be written as

P(θ|x) = (1/(√(2π) σp)) exp[−(θ − θp)²/(2σp²)],   (4)

where θp is

θp = (x/σx² + θ0/σ0²) / (1/σx² + 1/σ0²),   (5)

and the standard deviation σp is related to σx and σ0 by

1/σp² = 1/σx² + 1/σ0².   (6)
For the posterior probability in equation 4, the constant of proportionality was determined by the
requirement that P(θ|x) be normalized to unity when integrated over all values of θ. Since the part
of the expression that depends on θ has the form of a Gaussian, the known normalization factor for
this distribution, 1/(√(2π) σp), gives the desired result.
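The formulas above are easy to check numerically. The sketch below evaluates θp and σp for an assumed set of illustrative numbers (x = 3.0, σx = 1.0, θ0 = 0.0, σ0 = 2.0) and compares them with a brute-force posterior computed on a grid; none of the numbers come from the article.

```python
# Sketch: the Gaussian-measurement, Gaussian-prior update written out numerically.
import numpy as np

x, sigma_x = 3.0, 1.0        # assumed measured value and its standard deviation
theta0, sigma0 = 0.0, 2.0    # assumed prior mean and prior standard deviation

# Posterior parameters from equations 5 and 6:
sigma_p = 1.0 / np.sqrt(1.0 / sigma_x**2 + 1.0 / sigma0**2)
theta_p = (x / sigma_x**2 + theta0 / sigma0**2) * sigma_p**2

# Cross-check against a brute-force grid posterior, proportional to likelihood times prior.
theta = np.linspace(-10.0, 10.0, 20001)
dtheta = theta[1] - theta[0]
post = (np.exp(-(x - theta)**2 / (2 * sigma_x**2))
        * np.exp(-(theta - theta0)**2 / (2 * sigma0**2)))
post /= post.sum() * dtheta                       # normalize numerically
grid_mean = (theta * post).sum() * dtheta
grid_std = np.sqrt(((theta - grid_mean)**2 * post).sum() * dtheta)

print(f"theta_p = {theta_p:.3f}, sigma_p = {sigma_p:.3f}")        # 2.400, 0.894
print(f"grid mean = {grid_mean:.3f}, grid std = {grid_std:.3f}")  # agrees with the formulas
```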
So in this special case of a Gaussian distributed measurement and a Gaussian prior, the posterior
probability is found to be a Gaussian as well. In general, life is not so simple, but one can
nevertheless see some important features of a Bayesian analysis from this
example.
The figure shows the likelihood function P(x|θ), prior P(θ), and posterior
probability P(θ|x); all of the curves have a Gaussian form. A Bayesian
would usually summarize the type of analysis given above by reporting θp
and σp—the mean and standard deviation of the posterior distribution. In
other cases where the posterior is more complicated, one could either
report the entire function or summarize it with an appropriate set of
parameters, such as its mean, median or mode, its standard deviation, and
perhaps other constants that characterize its skewness or tails.
Figure 1a
In panel a of the figure, the prior distribution is relatively broad compared to the likelihood. One
can see that in this case, the likelihood and posterior are very close. That accord is characteristic of
cases where the prior contributes relatively little information compared to what is given by the
measurement. In the limiting case of a constant prior, the posterior has exactly the same form as the
likelihood. Then the mode (peak position) of the posterior distribution corresponds to the parameter
that maximizes the likelihood function; it is the maximum likelihood estimator used in frequentist
statistics.
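Continuing the numerical sketch above, letting the prior width σ0 grow shows the limiting behavior described here: the posterior mean θp approaches the measured value x, which is the maximum-likelihood estimate. The numbers are again illustrative assumptions.

```python
# Sketch: broad-prior limit of the Gaussian example above; as sigma0 grows,
# theta_p approaches the measured value x (the maximum-likelihood estimate).
x, sigma_x, theta0 = 3.0, 1.0, 0.0     # illustrative numbers, as before
for sigma0 in (2.0, 10.0, 100.0, 1e4):
    precision = 1.0 / sigma_x**2 + 1.0 / sigma0**2
    theta_p = (x / sigma_x**2 + theta0 / sigma0**2) / precision
    print(f"sigma0 = {sigma0:8.1f}  ->  theta_p = {theta_p:.5f}")
```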
In panel b, however, one sees that the prior information is comparable to
that brought by the measurement, since the prior distribution has a width
comparable to that of the likelihood. Bayes's theorem gives the appropriate
compromise between the two, mixing together the information from
different sources according to the rules of probability.
The likelihood (red), prior (green), and posterior (blue) probabilities for
two distinct cases. (a) The prior is very broad. (b) The prior and likelihood
have comparable widths.
Figure 1b
A worked-out numerical example
Supplemental material for the Quick Study "Data analysis: Frequently Bayesian" PHYSICS
TODAY, April 2007, page 82.
The Quick Study "Data analysis: Frequently Bayesian" presented Bayes's theorem in the form
P(θ|x) proportional to P(x|θ)P(θ), where the parameter θ represents a hypothesis and the data value
x represents the outcome of an observation. The various terms in the proportionality have an
associated vocabulary. The quantity P(x|θ) on the right-hand side is the probability to obtain x for a
given θ. But given empirical data, one can plug the values into P(x|θ) and consider it to be a
function of θ. In that case, the function is called the likelihood.
The likelihood multiplies P(θ), called the prior probability, which reflects an experimenter's
knowledge of θ before making a measurement. The left-hand side of the theorem gives the posterior
probability for θ, that is, the probability for θ deduced after seeing the outcome of the measurement
x. The requirement that P(θ|x) be normalized to unity when integrated over all values of θ
determines the constant of proportionality 1/P(x).
The following example, though admittedly artificial, may serve to help make concrete the abstract
probabilities just defined. Consider a parameter—in the spirit of concreteness think of an atomic
mass—that has an unknown integer value in the range 1–20. An experimenter entertains two
hypotheses; in other words, the parameter θ assumes one of two values. One of them, call it H1,
posits that the mass is in the range 1–5. The complementary hypothesis H2 says the mass is in the
range 6–20. Suppose that the data value x may likewise return one of two possibilities: Either the
mass is prime (including 1) or it is composite.
Once the measurement has been made, a Bayesian will calculate the likelihood function and
multiply it by a prior probability to obtain a refined posterior probability. In the absence of other
knowledge, the experimenter may well set the prior P(H1) = 1/4, because H1 represents one-quarter
of the possible mass values, and P(H2) = 3/4.
Now it's time to make a measurement and the result is . . . prime. The likelihood values are
P(prime|H1) = 4/5 and P(prime|H2) = 1/3. The normalization constant needed in Bayes's theorem is
1/P(prime) = 20/9, which reflects that 9 of the first 20 integers are prime. The Bayesian combines
the likelihoods, prior probabilities, and normalization to obtain the posterior probabilities
P(H1|prime) = 4/5 × 1/4 × 20/9 = 4/9 and P(H2|prime) = 1/3 × 3/4 × 20/9 = 5/9.
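The arithmetic above can be reproduced by exact counting. The short sketch below does so with Python fractions, using the example's convention that 1 counts as prime; the helper name is illustrative.

```python
# Sketch: reproducing the prime/composite posterior probabilities by counting.
from fractions import Fraction

def is_prime_here(n):            # the example's convention: 1 counts as prime
    return n == 1 or (n > 1 and all(n % d for d in range(2, n)))

H1 = range(1, 6)                 # masses 1-5
H2 = range(6, 21)                # masses 6-20
prior_H1, prior_H2 = Fraction(1, 4), Fraction(3, 4)

L1 = Fraction(sum(map(is_prime_here, H1)), len(H1))            # P(prime|H1) = 4/5
L2 = Fraction(sum(map(is_prime_here, H2)), len(H2))            # P(prime|H2) = 1/3
P_prime = Fraction(sum(map(is_prime_here, range(1, 21))), 20)  # P(prime) = 9/20

print(L1 * prior_H1 / P_prime)   # P(H1|prime) = 4/9
print(L2 * prior_H2 / P_prime)   # P(H2|prime) = 5/9
```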
An important message that should not get lost in the numbers is the significant consequence of the
prior probability. Given the data value of "prime," H1 returned a significantly higher likelihood than
H2. The Bayesian prior was biased in favor of H2, and, even given the likelihood function, the
Bayesian's posterior probability slightly favored H2 over H1.
Steven K. Blau