Bayes' Theorem/Rule, A First Intro
Until the mid-1700s, the theory of probabilities (as distinct from theories of valuation like expected utility theory) was focused almost entirely on estimating the likelihood of uncertain future events: lotteries, coin flips, or life expectancies. This class of probability estimate is often called aleatory probability, from the Latin aleator, meaning gambler. In law, aleatory contracts are those in which the signatories to the contract both risk loss or gain in the face of future uncertainty; a life insurance policy is an example of an aleatory contract.
Aleatory uncertainties are exactly the kind of probabilistic events that Pascal had envisioned as the subject of a calculus of probabilities. Regardless of whether or not the world is as truly deterministic as Descartes and Galileo hoped, we often do not know what will happen in the future. We do not know when a particular individual will die, or whether a particular coin will land heads or tails up if it is flipped. Pascal's probability theory was designed to model events of this type. In the second half of the eighteenth century, two men revolutionized the calculus of probability when they realized that this probability theory could be applied not just to assess the likelihood of future events, but also to assess the likelihood of past events. While this may seem a small thing, it changed the way Europeans thought about the mathematics of probability and opened the way to a more formal theory of decision making.
Consider an uncertain situation which was of tremendous interest to both the English reverend Thomas Bayes and the French mathematician Pierre-Simon Laplace. An astronomer measures the angular altitude of Jupiter six times in rapid succession and gets a slightly different number each time. Jupiter has a single altitude, but we have six imperfect observations of that altitude, all of which differ. What, we might ask, was the most likely actual altitude of Jupiter at the time that we made our observations? It was Thomas Bayes' insight, published posthumously in 1763, that probability theory could be extended to answer questions of this type as well. Bayes reasoned that if one knew the distribution of errors induced by the astronomer's instruments, then one could mathematically infer the most likely true altitude of Jupiter when the observations were made. It is important to note that there is nothing aleatory about this kind of probability. At the time the measurement was made, Jupiter certainly had an altitude. The only uncertainty derives from our own lack of knowledge; the limitation that we face in this example is entirely epistemological. Bayes was suggesting that probability theory could be used to describe epistemological uncertainty as well as aleatory uncertainty.
Unfortunately, little is known about the historical Thomas Bayes. We do know that he was a rural Protestant theologian and minister who was not a member of the Church of England, a Dissenter. He published only two works during his life: a theological work entitled Divine Benevolence, or an Attempt to Prove That the Principal End of the Divine Providence and Government Is the Happiness of His Creatures, and a more mathematical work, An Introduction to the Doctrine of Fluxions, and a Defence of the Mathematicians Against the Objections of the Author of The Analyst, in which he defended Newton's calculus against an attack by the philosopher Bishop George Berkeley. After his death, Bayes' friend and executor Richard Price discovered amongst his papers a third manuscript, entitled Essay Towards Solving a Problem in the Doctrine of Chances. Price presented that paper at the Royal Society in London in 1763, and it is upon that work that Bayes' quite considerable fame entirely rests.
Today Bayes is such a towering name in mathematics that it seems astonishing we know so little about him. We do not, for example, know why he was elected a fellow of the Royal Society before his death. In fact, the only picture of Bayes that we have may not even be a portrait of him. The historical Bayes is almost a total mystery. To his contemporaries that may not have been terribly surprising; the posthumous publication of Bayes' essay in the Philosophical Transactions of the Royal Society had almost no impact until Laplace rediscovered it about ten years later.
Bayes' insight was profound. He realized that there are many events about which we have only partial or inaccurate knowledge, events which truly happened but about which we are, because of our limited knowledge, uncertain. It was Bayes who first realized that a mathematically complete kind of inverse probability could be used to infer the most likely values or properties of those events1.
The Bayesian theorem provides the basis for a fundamentally statistical approach to this kind of epistemological uncertainty. It does this by putting on rigorous mathematical footing the process of predicting the likelihood of all possible previous states of the world given one's available observations. Put in English, Bayes' theorem allows us to ask the following question: given my knowledge of how often I have observed that the world appeared to be in state x, and my knowledge of how well correlated my current sensory data are with the actual world state x, precisely how likely is it that the world was actually in state x?
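Stated compactly, this question and its answer take the standard form of Bayes' rule. Writing x for a world state and d for the current sensory data (the symbols here are mine, chosen to match the example that follows):

P(x|d) = P(d|x)P(x) / P(d)

where P(x) is the prior probability of the state, P(d|x) is the probability of observing the data when the world is in that state, and P(d) is the overall probability of observing the data.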
Bayes' theorem is so important that I want to digress here to present a fairly complete example of how the mathematics of the theorem work. Imagine that you are a monkey trained to fixate a spot of light while two eccentric spots of light are also illuminated, just as in the example presented in chapter five. In this experiment, however, the central fixation light changes color to indicate which of the two eccentric target lights, the left one or the right one, will serve as your goal on this trial. If you can decide which target is the goal, and look at it, you receive a raisin as a reward. However, the color of the central fixation light (or, more precisely, the wavelength of the light emitted by the central stimulus) can be any one of a hundred different hues (or wavelengths). We can begin our Bayesian description of this task by saying that there are two possible world states: one state in which a leftward eye movement will be rewarded and one state in which a rightward eye movement will be rewarded.
In mathematical notation we designate these two world states as w1 and w2. State w1 is when a leftward eye movement, or saccade, will be rewarded, and state w2 is when a rightward saccade will be rewarded. After observing 100 trials we discover that on 25% of trials a leftward movement was rewarded, irrespective of the color of the fixation light, and on 75% of trials the rightward movement was rewarded. Based upon this observation we can say that the prior probability that world state w1 will occur (known as P(w1)) is 0.25, and the prior probability of world state w2 is 0.75.
1. As Stephen Stigler has pointed out, Thomas Simpson was really the first mathematician to propose the idea of inverse probabilities, but it was Bayes who developed the mathematical approach on which modern inverse probabilities are based (Stigler, 1989).
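To make this prior-estimation step concrete, here is a minimal sketch in Python. The trial list and variable names are illustrative, not from the original text:

```python
# Estimate the prior probabilities of the two world states from observed trials.
# "L" marks a trial on which a leftward movement was rewarded (state w1),
# "R" a trial on which a rightward movement was rewarded (state w2).
trials = ["L"] * 25 + ["R"] * 75  # 100 hypothetical trials, as in the text

p_w1 = trials.count("L") / len(trials)  # prior P(w1) = 0.25
p_w2 = trials.count("R") / len(trials)  # prior P(w2) = 0.75
print(p_w1, p_w2)
```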
To make these prior probabilities more accurate estimates of the state of the world, we next have to take into account the color of the central fixation stimulus and the correlation of that stimulus color with each of the world states. To do that we need to generate a graph which plots the probability that we will encounter a particular stimulus wavelength (which we will call λ) when the world is in state w1. Figure 8.5a plots an example of such a probability density function1, showing the likelihood of each value of λ when the world is in state w1 and when it is in state w2. We refer to this as the conditional probability density function for λ in world state w1, or P(λ|w1).
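A sketch of how such conditional distributions might be estimated from data, assuming (as the footnote below notes) that wavelength takes one of 100 discrete values. The data here are simulated purely for illustration; the cluster centers are my own choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stimulus hues (0-99), one per trial. On w1 trials the hue is
# drawn near 30, on w2 trials near 70 (illustrative assumptions only).
lam_w1 = np.clip(rng.normal(30, 8, size=250).round(), 0, 99).astype(int)
lam_w2 = np.clip(rng.normal(70, 8, size=750).round(), 0, 99).astype(int)

# Conditional probability functions P(lambda | w): normalized counts per hue.
p_lam_given_w1 = np.bincount(lam_w1, minlength=100) / len(lam_w1)
p_lam_given_w2 = np.bincount(lam_w2, minlength=100) / len(lam_w2)
```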
Next, in order to get the two graphs in Figure 8.5a to tell us how likely it is that we will see a given λ and the world will be in a given state, we have to correct these graphs for the overall likelihood that the world is in either state w1 or state w2. To do that we multiply each point on the graphs by the prior probability of that world state. The graph on the left thus becomes P(λ|w1)P(w1), where P(w1) is the prior probability for world state w1 as described above. Note in Figure 8.5b that this has the effect of re-scaling the graphs that appeared in Figure 8.5a.
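In symbols, this re-scaling step yields the joint probability of seeing wavelength λ and being in world state w1 (and likewise for w2), by the standard identity:

P(λ, w1) = P(λ|w1)P(w1)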
Finally, we have to determine how likely it is that any given value of λ will occur regardless of world state. To do this we simply count up all the times that we have seen λ at a specific value and then plot the probability density function for all values of λ (irrespective of which movement was rewarded), as shown in Figure 8.5c.
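Although the text describes this as a fresh count, the same curve can also be computed from the quantities already in hand, by the law of total probability:

P(λ) = P(λ|w1)P(w1) + P(λ|w2)P(w2)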
Now we are ready to ask: when we see a given wavelength of light, what is the likelihood that on this trial a leftward movement will be rewarded (that we are in world state w1), and what is the likelihood that a rightward movement will be rewarded (world state w2)? To compute these likelihoods we divide the curves shown in Figure 8.5b by the curve shown in Figure 8.5c. This essentially corrects the likelihood that one would see a particular λ in a particular world state for the overall likelihood that one would ever have seen that wavelength λ. This is the essence of the Bayesian theorem, given by the equation:

P(w1|λ) = P(λ|w1)P(w1) / P(λ)

To restate this in English, one could say: the best possible estimate of the probability that a leftward movement will be rewarded is equal to the probability that the central stimulus would be this color on a leftward trial, times the overall probability of a leftward trial, divided by the probability that this particular color would ever be observed. The result is usually referred to as a posterior probability, and it reports, in principle, the best estimate that you can derive for this likelihood. Therein lies the absolute beauty of Bayes' theorem. Bayes' theorem provides a mechanical tool which can report the best possible estimate of the likelihood of an event. No other method, no matter how sophisticated, can provide a more accurate estimate of the likelihood of an uncertain event. The Bayesian theorem is a critical advance because no decision process which must estimate the likelihood of an uncertain outcome can ever do better than a Bayesian estimate of that probability. The Bayesian theorem is a tool for reducing epistemological uncertainty to a minimal level and then for assigning probabilities to world states.

1. I should point out that P(λ) in this specific example is actually a probability function, not a probability density function, because wavelength is treated as a discrete variable. In practice this makes little difference to this exposition, but it is, in fairness, an abuse of notation which more mathematical readers may find annoying.
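To make the whole computation concrete, here is a minimal end-to-end sketch in Python, re-using the simulated, illustrative data from the earlier snippet. None of this code is from the original text; it is one way the steps of Figure 8.5 could be carried out:

```python
import numpy as np

rng = np.random.default_rng(0)

# Conditional distributions P(lambda | w) over 100 discrete hues,
# built from simulated trials as in the earlier sketch (Figure 8.5a).
lam_w1 = np.clip(rng.normal(30, 8, size=250).round(), 0, 99).astype(int)
lam_w2 = np.clip(rng.normal(70, 8, size=750).round(), 0, 99).astype(int)
p_lam_given_w1 = np.bincount(lam_w1, minlength=100) / len(lam_w1)
p_lam_given_w2 = np.bincount(lam_w2, minlength=100) / len(lam_w2)

p_w1, p_w2 = 0.25, 0.75  # priors from the text

# Figure 8.5b: joint probability of each hue and each world state.
joint_w1 = p_lam_given_w1 * p_w1
joint_w2 = p_lam_given_w2 * p_w2

# Figure 8.5c: overall probability of each hue (law of total probability).
p_lam = joint_w1 + joint_w2

# Figure 8.5d: posterior probability of each state given the observed hue.
# Hues never observed have P(lambda) = 0, so their posteriors are undefined (NaN).
post_w1 = np.divide(joint_w1, p_lam, out=np.full_like(p_lam, np.nan), where=p_lam > 0)
post_w2 = np.divide(joint_w2, p_lam, out=np.full_like(p_lam, np.nan), where=p_lam > 0)

lam_obs = 35  # a hue observed on the current trial
print(post_w1[lam_obs], post_w2[lam_obs])  # best estimates for leftward (w1) and rightward (w2)
```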
[Figure 8.5. Caption: Bayes' Theorem. Source: PWG.
Panel A, prior probability of seeing a given wavelength: if the world is in state w1, p(λ|w1); if the world is in state w2, p(λ|w2). Both plotted as probability density against wavelength.
Panel B, given total probability of w1 = 0.25 and total probability of w2 = 0.75, the probability of seeing a given wavelength and seeing it in a given world state: p(λ|w1)P(w1) and p(λ|w2)P(w2), plotted against wavelength.
Panel C, the probability of seeing a given wavelength regardless of world state: P(λ), plotted against wavelength.
Panel D, the likelihood that you are in a particular world state as a function of wavelength: p(λ|w1)P(w1)/P(λ) and p(λ|w2)P(w2)/P(λ).]