COMP 2208
Bayes’ Theorem,
Bayesian Reasoning,
and Bayesian Networks
Dr. Long Tran-Thanh
University of Southampton
Logic / rule based
• build up basic rules (axioms) using some form of logic
• Other rules (reasoning) can be derived from the above
• Functional or declarative programming (LISP, ML, Prolog, etc…)
Stochastic reasoning
• Frequentist (non-Bayesian)
• Bayesian
Some bridging efforts:
• E.g., Markov logic (see, e.g., Pedro Domingos)
The right way to do reasoning?
Debate 1: logic based vs. stochastic
• E.g., Noam Chomsky vs. Peter Norvig
Debate 2: frequentist vs. Bayesian
• Many vs. many
Today we talk about Bayesian (because it’s simple to understand and
The Bayesian way
• Bayes’ Theorem
• Bayesian belief update
• Inference in Bayesian networks
Some probability theory
• Space of all possible world models = area equal to 1
Some probability theory
• Probability of event A = fraction of worlds in which A happens
• What does it mean that P(A) = 0.2?
• What does it mean that P(A) = 0? or = 1?
Some probability theory
• Probability of A not happening = complement of P(A): P(not A) = 1 - P(A)
Some probability theory
Basic axioms in probability theory
Domain of the probability value: 0 ≤ P(A) ≤ 1, with P(true) = 1 and P(false) = 0
Connection of AND and OR: P(A or B) = P(A) + P(B) - P(A and B)
Conditional probability
Only consider worlds in which A happens -> new space of worlds
Consider worlds in which B happens, but only within the new space
P(B|A) = fraction of worlds with B within the new space
Conditional probability
We have: P(B|A) = P(A and B) / P(A)
Chain rule: P(A and B) = P(B|A) P(A)
Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
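The "fraction of worlds" picture can be checked by brute-force counting. The sketch below is illustrative only: the world space (two fair dice) and the events A and B are assumptions chosen for the example, not taken from the slides.

```python
from fractions import Fraction
from itertools import product

# Hypothetical world space: the 36 equally likely outcomes of two fair dice.
worlds = list(product(range(1, 7), repeat=2))

def prob(event, space=worlds):
    """Fraction of worlds in `space` in which `event` holds."""
    return Fraction(sum(1 for w in space if event(w)), len(space))

def cond_prob(b, a):
    """P(B|A): keep only the worlds where A holds, then count B inside them."""
    new_space = [w for w in worlds if a(w)]
    return prob(b, new_space)

# Illustrative events (assumptions for this example).
A = lambda w: w[0] % 2 == 0        # first die is even
B = lambda w: w[0] + w[1] >= 9     # total is at least 9
not_A = lambda w: not A(w)
A_and_B = lambda w: A(w) and B(w)

p_b_given_a = cond_prob(B, A)
# Chain rule: P(A and B) = P(B|A) P(A)
chain = p_b_given_a * prob(A)
# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
total = cond_prob(B, A) * prob(A) + cond_prob(B, not_A) * prob(not_A)

print(p_b_given_a, chain == prob(A_and_B), total == prob(B))
```

Using `Fraction` keeps the arithmetic exact, so the two identities can be compared with `==` rather than with a tolerance.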
Bayes’ Theorem (Bayes’ rule)
Use chain rule twice for P(A and B):
P(A and B) = P(A|B) P(B)
P(A and B) = P(B|A) P(A)
The right hand sides must be the same!
Bayes’ rule: P(B|A) = P(A|B) P(B) / P(A)
The beauty of Bayes’ Theorem
A = evidence (observation); B = hypothesis
Prior P(B): captures our prior knowledge/belief
Likelihood P(A|B): how likely we are to observe the evidence, if the hypothesis were true
P(A) = P(evidence) = probability of observing the evidence in general (aggregated
over all possible hypotheses)
Posterior P(B|A): we update our belief after observing some evidence
Example: the Monty Hall problem
The game:
At the beginning, all doors are closed
The prize is behind 1 door (with equal probability)
You choose 1 door (say Door 1)
The host opens a door (say Door 2) which has a goat behind it
Let’s make a deal: would you swap your choice to Door 3?
Solution of the Monty Hall problem
Your choice: Door 1; Offer: choose Door 3 instead
What are the chances of each option for winning (getting the prize)?
A = hypothesis: prize behind Door 1
B = host chooses Door 2 to open (and we see a goat)
Prior: P(A) = 1/3
Likelihood: P(B|A) = 1/2
Bayes’ rule: P(A|B) = P(B|A) P(A) / P(B)
P(B) = ½ (why?)
Chance of winning the prize for staying with Door 1 = 1/3
Chance of winning the prize for switching to Door 3 = 2/3
Calculating the denominator
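Bayes’ rule: P(A|B) = P(B|A) P(A) / P(B)
A = prize behind Door 1; B = host chooses Door 2 (between Doors 2 and 3)
Use law of total probability: X = door with prize
P(B) = sum over X of P(B|X) P(X)
X = Door 1: P(B|X) P(X) = 1/2 * 1/3 = 1/6 (host picks at random between Doors 2 and 3)
X = Door 2: P(B|X) P(X) = 0 * 1/3 = 0 (host never opens the door with the prize)
X = Door 3: P(B|X) P(X) = 1 * 1/3 = 1/3 (Door 2 is the only door the host can open)
P(B) = 1/6 + 0 + 1/3 = 1/2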
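The Monty Hall calculation can be reproduced exactly and then sanity-checked by simulation. This is a minimal sketch under the usual assumptions: the player picks Door 1, the host never opens the chosen door or the prize door, and he picks uniformly at random when he has a choice. The function names are made up for this example.

```python
from fractions import Fraction
import random

DOORS = (1, 2, 3)

def p_host_opens(opened, prize, chosen=1):
    """P(host opens `opened` | prize behind `prize`), player chose `chosen`."""
    options = [d for d in DOORS if d != chosen and d != prize]
    return Fraction(1, len(options)) if opened in options else Fraction(0)

prior = {d: Fraction(1, 3) for d in DOORS}

# Law of total probability for B = "host opens Door 2".
p_b = sum(p_host_opens(2, x) * prior[x] for x in DOORS)

# Bayes' rule for every possible prize location.
posterior = {x: p_host_opens(2, x) * prior[x] / p_b for x in DOORS}
print(p_b, posterior[1], posterior[3])  # 1/2 1/3 2/3

def simulate(trials=100_000, seed=0):
    """Empirical win rates for staying with Door 1 vs. switching."""
    rng = random.Random(seed)
    stay = switch = 0
    for _ in range(trials):
        prize = rng.choice(DOORS)
        chosen = 1
        opened = rng.choice([d for d in DOORS if d != chosen and d != prize])
        other = next(d for d in DOORS if d not in (chosen, opened))
        stay += prize == chosen
        switch += prize == other
    return stay / trials, switch / trials
```

Calling `simulate()` should give win rates close to 1/3 for staying and 2/3 for switching.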
Example 2: the HIV test problem
HIV lab tests are quite accurate:
• 99% sensitivity: if a patient is HIV+, then probability that the test has
positive results is 0.99
• 99% specificity: if a patient is HIV-, then probability that the test has
negative results is also 0.99
HIV is rare in patients in our population: about 1 out of 1000 (even among
those who get tested)
Situation: A patient takes an HIV test and gets a positive result
Question: what are the chances that the patient is indeed HIV+?
Solution for the HIV test problem
A = test was positive; B = patient is HIV+
We want to calculate P(B|A)
Prior: P(B) = 0.001 (1 out of 1000 is HIV+)
Likelihood: P(A|B) = 0.99 (99% sensitivity)
What about P(A) ?
Solution for the HIV test problem
Calculation of P(A)
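Use law of total probability: P(A) = P(A|B) P(B) + P(A|not B) P(not B)
Term 1: P(A|B) P(B) = 0.99 * 0.001 = 0.00099
Term 2: P(A|not B) P(not B) = 0.01 * 0.999 = 0.00999
P(A) = 0.00099 + 0.00999 = 0.01098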
Solution for HIV test
Calculating P(B|A)
P(B) = 0.001; P(A|B) = 0.99; P(A) = 0.01098
P(B|A) = P(A|B) P(B) / P(A) = 0.00099 / 0.01098 ≈ 0.09
This means that even if the test is positive, there is only about a 9% chance that the patient is
actually HIV+
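The same calculation as a short Python sketch. The function name and parameter names are made up for this example. The numbers are the ones stated in the problem: prevalence 1 in 1000, 99% sensitivity, 99% specificity.

```python
def positive_test_posterior(prevalence, sensitivity, specificity):
    """Return (P(A), P(B|A)) with A = test is positive, B = patient is HIV+."""
    p_pos_given_sick = sensitivity            # P(A|B)
    p_pos_given_healthy = 1.0 - specificity   # P(A|not B), false positive rate
    # Law of total probability: P(A) = P(A|B)P(B) + P(A|not B)P(not B)
    p_pos = (p_pos_given_sick * prevalence
             + p_pos_given_healthy * (1.0 - prevalence))
    # Bayes' rule: P(B|A) = P(A|B)P(B) / P(A)
    return p_pos, p_pos_given_sick * prevalence / p_pos

p_a, p_b_given_a = positive_test_posterior(0.001, 0.99, 0.99)
print(f"P(A) = {p_a:.5f}, P(B|A) = {p_b_given_a:.3f}")
# P(A) = 0.01098, P(B|A) = 0.090
```

Changing `prevalence` shows how strongly the posterior depends on the base rate, which is the point of the example.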
Some discussions
Only 15% of doctors get this right
Most doctors think that if an HIV test is positive, there’s a high chance that the
patient is HIV+
• They typically focus on the accuracy (sensitivity) of the test
• They're neglecting the background or base rate of HIV prevalence (prior)
Bonus question
Russian roulette with 2 bullets
• You put 2 bullets into a revolver, such
that they are next to each other
• Your opponent spins the cylinder and pulls the trigger
• … and survives. Now it’s your turn!
• Question: should you spin the cylinder again before pulling the trigger, or not?
• Question 2: what if there’s only 1 bullet in the revolver?
Belief update
London – Rome flight
Belief update
You don’t know over which area you are flying …
But you know that you’ve been flying for a while
You look out the window, and you see…
…and sea
…and high mountains
Candidate locations: near London, near Rome, near Paris, near Monaco
Bayesian belief update
Belief: probability distribution over all the possible models
• Captures our knowledge + uncertainty about the true world model
How to update our belief after each observation? Use Bayes’ rule:
P(model | observation) = P(observation | model) P(model) / P(observation)
Prior: probability that the model is true (before the observation)
Likelihood: how likely the observed event is, if the model were true
Denominator: marginal likelihood (or model evidence)
Left hand side = posterior: probability that the model is true, after we
have seen the observations
Bayesian belief update
Prior probability
Prior distribution = prior belief
Posterior probability
Posterior distribution = posterior belief
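A belief update over a finite set of models can be sketched in a few lines. The example reuses the flight scenario, but all likelihood values below are hypothetical numbers invented for illustration. The slides do not give any.

```python
def update_belief(prior, likelihood):
    """One Bayesian update.

    prior:      dict model -> P(model)
    likelihood: dict model -> P(observation | model)
    returns:    dict model -> P(model | observation)
    """
    unnormalised = {m: likelihood[m] * p for m, p in prior.items()}
    evidence = sum(unnormalised.values())   # marginal likelihood P(observation)
    if evidence == 0:
        raise ValueError("observation has zero probability under every model")
    return {m: v / evidence for m, v in unnormalised.items()}

# Uniform prior belief over where the plane currently is.
belief = {"London": 0.25, "Paris": 0.25, "Monaco": 0.25, "Rome": 0.25}

# Hypothetical likelihoods of each observation at each location.
p_see_sea = {"London": 0.30, "Paris": 0.05, "Monaco": 0.90, "Rome": 0.60}
p_see_mountains = {"London": 0.01, "Paris": 0.02, "Monaco": 0.80, "Rome": 0.20}

# The posterior after one observation becomes the prior for the next one.
belief = update_belief(belief, p_see_sea)
belief = update_belief(belief, p_see_mountains)
best = max(belief, key=belief.get)
print(best, round(belief[best], 3))
```

Treating the two observations as separate updates assumes they are conditionally independent given the location, which is a simplification made for this sketch.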
Bayesian belief update example
Search for a crashed airplane using Bayesian updating
• Imagine you're designing a search-and-rescue UAV. Its job is
to autonomously look for aircraft wreckage
• It is easier to detect wreckage in some terrain types than in others
Bayesian belief update example
• Difficulty model: what’s the probability of finding the wreckage in a given area
Bayesian belief update example
• Prior belief: based on the last known location
Bayesian belief update example
• Search: always go to the point with highest probability
• [Figures: belief map at the beginning and after 10, 50, 250, 500, 1000, and 2000 steps]
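One way to implement the search loop is sketched below. It assumes a discretised map (a flat list of cells), a known detection probability per cell (the difficulty model), and the standard Bayesian update after an unsuccessful look. The slides show only the resulting maps, so the update formula and all numbers here are assumptions.

```python
def search_step(belief, detect_prob):
    """Search the most probable cell and update the belief on a failed search.

    belief[i]      = P(wreckage is in cell i)
    detect_prob[i] = P(we detect it | it is in cell i and we search cell i)
    Returns (searched cell, posterior belief given nothing was found).
    """
    cell = max(range(len(belief)), key=belief.__getitem__)
    p_found = belief[cell] * detect_prob[cell]
    if p_found >= 1.0:
        raise ValueError("search is certain to succeed; no failure update")
    # Bayes' rule with evidence "searched `cell`, found nothing":
    # every other cell is rescaled by 1 / P(not found) ...
    posterior = [p / (1.0 - p_found) for p in belief]
    # ... and the searched cell is also multiplied by the miss probability.
    posterior[cell] = belief[cell] * (1.0 - detect_prob[cell]) / (1.0 - p_found)
    return cell, posterior

# Hypothetical 3-cell map: cell 0 is likely but hard terrain (low detection).
belief = [0.5, 0.3, 0.2]
detect_prob = [0.2, 0.9, 0.9]
visited = []
for _ in range(5):
    cell, belief = search_step(belief, detect_prob)
    visited.append(cell)
print(visited, [round(p, 3) for p in belief])
```

Each failed search lowers the belief in the searched cell and raises it elsewhere, so the greedy rule eventually moves on even from a cell that started with the highest prior.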
Complex knowledge representation
• We use probabilities to capture uncertainty in our knowledge
• So far we have dealt with simple dependencies between a few events
• What if we have a much more complicated network of dependencies?
Inference: derive extra information/conclusions from observed data
• Inference in simple systems: use Bayes’ rule
• How to do inference in complex networks? How to use Bayes’ rule there?
Inference with joint distribution
But before Bayesian nets …
• The simplest way to do inference is to look at the joint distribution of the
random variables
• Joint distribution captures all the interconnections + dependencies
An example from employment survey
• M: is the person male?
• L: does the person work long hours?
• R: is the person rich?
Inference with joint distribution
Truth table, if all the variables are binary: one probability for each of the 2^3 = 8 combinations of M, L and R
Inference with joint distribution
P(the person is rich) = ?
P(R) = 0.13 + 0.10 + 0.01 + 0.02 = 0.26
Inference with joint distribution
P(L | M) = ?
P(L|M) = P(L and M)/P(M) = (0.13+0.11)/(0.13+0.11+0.10+0.34) = 0.24/0.68 = 0.35
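Both queries amount to summing rows of the joint table. In the sketch below, the four rows for M = true follow from the sums quoted above. The four rows for M = false are hypothetical fill-ins: the text only shows that the two "rich" rows are 0.01 and 0.02 (which one has L = true is an assumption) and that the remaining two rows must sum to 0.29.

```python
# Joint distribution over (M, L, R). Keys are (male, long_hours, rich).
joint = {
    (True,  True,  True):  0.13,   # from the text
    (True,  True,  False): 0.11,   # from the text
    (True,  False, True):  0.10,   # from the text
    (True,  False, False): 0.34,   # from the text
    (False, True,  True):  0.01,   # value from the text, L assignment assumed
    (False, False, True):  0.02,   # value from the text, L assignment assumed
    (False, True,  False): 0.04,   # hypothetical
    (False, False, False): 0.25,   # hypothetical
}

def prob(event):
    """P(event): sum the rows of the joint table where `event` holds."""
    return sum(p for (m, l, r), p in joint.items() if event(m, l, r))

def cond_prob(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    both = prob(lambda m, l, r: event(m, l, r) and given(m, l, r))
    return both / prob(given)

p_rich = prob(lambda m, l, r: r)
p_long_given_male = cond_prob(lambda m, l, r: l, lambda m, l, r: m)
print(round(p_rich, 2), round(p_long_given_male, 2))  # 0.26 0.35
```

The two printed values do not depend on the hypothetical rows, because P(R) only uses the "rich" rows and P(L|M) only uses the rows with M = true.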
Inference with joint distribution
• We can do any inference from the joint distribution
• Issue in practice: it doesn’t scale well (brute force solution)
• E.g., with 30 binary variables, we need 2^30 probabilities … (about 1 billion)
• Issue in theory: it doesn’t show the relationships between the variables
• We might want to exploit the structure of relationships to simplify the
• E.g., if R (rich) is independent from M (male) and L (working long
hours), then we can drop M and L when we do inference about R
Definition: Two random variables are independent if their joint probability
is the product of their probabilities: P(A and B) = P(A) P(B)
Another property: If P(A), P(B) > 0, then A and B are independent exactly when P(A|B) = P(A) and P(B|A) = P(B)
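The product definition can be tested directly on a small joint table. This is a minimal sketch for two binary variables. Both example tables are made up for illustration.

```python
def is_independent(joint, tol=1e-9):
    """Check P(A=a, B=b) == P(A=a) * P(B=b) for all values of binary A and B.

    joint: dict (a, b) -> P(A=a, B=b), with a and b in {True, False}.
    """
    values = (True, False)
    # Marginals, obtained by summing out the other variable.
    p_a = {a: sum(joint[(a, b)] for b in values) for a in values}
    p_b = {b: sum(joint[(a, b)] for a in values) for b in values}
    return all(abs(joint[(a, b)] - p_a[a] * p_b[b]) <= tol
               for a in values for b in values)

# Built as a product of marginals 0.3 and 0.8, so independent by construction.
independent_joint = {
    (True, True): 0.3 * 0.8, (True, False): 0.3 * 0.2,
    (False, True): 0.7 * 0.8, (False, False): 0.7 * 0.2,
}
# Hypothetical dependent table: A and B tend to agree.
dependent_joint = {
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.1, (False, False): 0.4,
}
print(is_independent(independent_joint), is_independent(dependent_joint))
# True False
```

The tolerance `tol` is there because the products are floating-point numbers. With exact fractions a plain equality test would do.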
Bayesian networks
Use graphical representation to capture the dependencies between the
random variables
Nodes (cX denotes “not X”):
• S: Studied for the exam
• M: Lecturer is in good mood
• H: High exam result
Edges: S → H and M → H
P(S) = 0.8, P(cS) = 0.2
P(M) = 0.3, P(cM) = 0.7
P(H|M, S) = 0.9
P(H|M, cS) = 0.5
P(H|cM,S) = 0.4
P(H|cM,cS) = 0.05
Bayesian networks
Inference: given high exam result, what is the probability that the lecturer
was in a good mood? (P(M|H) = ?)
Bayesian networks
P(M| H) = ?
P(M) = 0.3
P(H|M) = P(H|M, S)P(S) + P(H|M,cS)P(cS)
= 0.9*0.8 + 0.5*0.2 = 0.72 + 0.10 = 0.82
P(H) = ?
Bayesian networks
P(H) = P(H|M,S)P(M and S)
+ P(H|M,cS)P(M and cS)
+ P(H|cM,S)P(cM and S)
+ P(H|cM,cS)P(cM and cS)
M and S are independent, so P(M and S) = P(M)P(S), and similarly for the other terms:
P(H) = P(H|M,S)P(M)P(S)
+ P(H|M,cS)P(M)P(cS)
+ P(H|cM,S)P(cM)P(S)
+ P(H|cM,cS)P(cM)P(cS)
P(H) = 0.9*0.3*0.8
+ 0.5*0.3*0.2
+ 0.4*0.7*0.8
+ 0.05*0.7*0.2 = 0.216 + 0.03 + 0.224 + 0.007 = 0.477
Bayesian networks
P(M| H) = ?
P(M) = 0.3
P(H|M) = 0.9*0.8 + 0.5*0.2 = 0.82
P(H) = 0.477
P(M| H) = P(H|M)P(M)/P(H) = 0.82*0.3/0.477 = 0.516
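The exam network is small enough to do inference by enumeration: build each joint entry from the network's tables, then sum. The sketch below uses the probabilities given in the network. The variable and function names are made up for this example.

```python
from itertools import product

# Tables from the network: S = studied, M = lecturer in good mood.
P_S = {True: 0.8, False: 0.2}
P_M = {True: 0.3, False: 0.7}
# P(H = true | M, S), keyed by (m, s).
P_H_GIVEN = {
    (True, True): 0.9, (True, False): 0.5,
    (False, True): 0.4, (False, False): 0.05,
}

def joint(m, s, h):
    """P(M=m, S=s, H=h) = P(M) P(S) P(H | M, S); M and S have no parents."""
    p_h = P_H_GIVEN[(m, s)]
    return P_M[m] * P_S[s] * (p_h if h else 1.0 - p_h)

# P(H): sum the joint over all values of M and S.
p_h = sum(joint(m, s, True) for m, s in product((True, False), repeat=2))
# P(H and M): sum over S only.
p_h_and_m = sum(joint(True, s, True) for s in (True, False))

p_h_given_m = p_h_and_m / P_M[True]   # 0.9*0.8 + 0.5*0.2
p_m_given_h = p_h_and_m / p_h         # Bayes' rule

print(round(p_h, 3), round(p_h_given_m, 3), round(p_m_given_h, 3))
# 0.477 0.82 0.516
```

Enumeration is exponential in the number of variables, so this only illustrates the calculation and is not a scalable inference method.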
Building Bayesian networks
With domain expert:
• Bayesian nets are sometimes built manually, consulting domain experts
for structure and probabilities.
• More often, the structure is supplied by domain experts (i.e., they specify
what affects what) but the probabilities are learned from data.
Building from data:
• Sometimes both structure and probabilities are learned from data.
• Difficult problem: puts the AI program in a similar position to a scientist
trying out different hypotheses.
• Need a method to reward the proposed net structure for matching the
data, but to penalize excessive complexity (Occam’s razor).
Properties of Bayesian networks
• Bayesian networks must be directed acyclic graphs.
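• The major efficiency of a Bayesian network is that it economizes on
memory: we store one small conditional probability table per node instead of the full joint distribution.
• They are also easier for human beings to interpret than the raw joint
distribution.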
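The memory saving can be made concrete by counting stored probabilities. The sketch below assumes binary variables and counts one number per row of each table (the complementary probability is implied). The 30-variable network at the end is a made-up example.

```python
def joint_table_size(n_vars):
    """Independent entries in a full joint table over n binary variables."""
    return 2 ** n_vars - 1

def network_size(parents):
    """Entries stored by a Bayesian network over binary variables.

    parents: dict node -> list of parent nodes.
    Each node needs one probability per combination of its parents' values.
    """
    return sum(2 ** len(ps) for ps in parents.values())

# The exam network: S and M have no parents, H depends on both.
exam_net = {"S": [], "M": [], "H": ["M", "S"]}
print(network_size(exam_net), joint_table_size(3))  # 6 7

# Hypothetical chain of 30 variables, each with at most one parent.
chain_net = {f"X{i}": ([f"X{i-1}"] if i > 0 else []) for i in range(30)}
print(network_size(chain_net), joint_table_size(30))  # 59 1073741823
```

The saving is negligible for three variables but grows quickly when each variable has only a few parents.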
Summing up:
Why is it good to use the Bayesian approach?
• It captures the uncertainty of our knowledge about the environment in a
very elegant and simple way.
• We can integrate our prior knowledge into the reasoning process by using
the prior distribution (which represents our prior knowledge, as the name suggests).
• We can always update our belief about the world by using Bayes’
Theorem, as the more we observe
Summing up:
And why is it bad to use the Bayesian approach?
• If we use a wrong prior, we will get lots of difficulties in getting the right
• The calculation involves integrals or sums over all the possible
situations, which is typically computationally very expensive.
• There are no theoretical guarantees that what we are doing will
in fact give us the correct answer (no theoretical proofs yet).