COMP 2208
Bayes’ Theorem,
Bayesian Reasoning,
and Bayesian Networks
Dr. Long Tran-Thanh
[email protected]
University of Southampton
Classification
[Diagram: perception–action loop between the agent and the Environment, with internal steps: Categorize inputs, Update belief model, Decision making, Update decision-making policy. This slide highlights the classification step.]
Reasoning
[Same diagram, this time highlighting the reasoning (belief-update) step.]
Logic / rule based
• Build up basic rules (axioms) using some form of logic
• Other rules (reasoning) can be derived from the above
• Functional or declarative programming (LISP, ML, Prolog, etc.)
Stochastic reasoning
• Frequentist (non-Bayesian)
• Bayesian
Some bridging efforts:
• E.g., Markov logic (see, e.g., Pedro Domingos)
The right way to do reasoning?
Debate 1: logic based vs. stochastic
• E.g., Noam Chomsky vs. Peter Norvig
Debate 2: frequentist vs. Bayesian
• Many vs. many
Today we talk about the Bayesian approach (because it is simple to understand and
elegant)
The Bayesian way
• Bayes’ Theorem
• Bayesian belief update
• Inference in Bayesian networks
Some probability theory
• The space of all possible world models is represented as an area of total size 1
Some probability theory
• Probability of event A = fraction of worlds in which A happens
• What does it mean that P(A) = 0.2?
• What does it mean that P(A) = 0? or = 1?
Some probability theory
• Probability of A not happening: P(¬A) = 1 - P(A) (the complement of P(A))
Some probability theory
[Venn diagram: two overlapping events A and B within the space of all worlds]
Basic axioms in probability theory
Domain of the probability value: 0 ≤ P(A) ≤ 1
Constants: P(true) = 1, P(false) = 0
Connection of AND and OR: P(A or B) = P(A) + P(B) - P(A and B)
Conditional probability
Only consider worlds in which A happens -> new space of worlds
Consider worlds in which B happens, but only within the new space
P(B|A) = fraction of worlds with B within the new space
[Venn diagram: within the worlds where A happens (the new space), B|A is the fraction in which B also happens]
Conditional probability
P(B|A) = P(A and B) / P(A)
Conditional probability
We have: P(B|A) = P(A and B) / P(A)
Chain rule: P(A and B) = P(B|A) P(A)
Law of total probability: P(B) = P(B|A) P(A) + P(B|¬A) P(¬A)
Bayes’ Theorem (Bayes’ rule)
Use the chain rule twice for P(A and B): P(A and B) = P(B|A) P(A) = P(A|B) P(B)
The right hand sides must be the same!
Bayes’ rule: P(B|A) = P(A|B) P(B) / P(A)
The beauty of Bayes’ Theorem
A = evidence (observation); B = hypothesis
P(hypothesis | evidence) = P(evidence | hypothesis) P(hypothesis) / P(evidence)
Prior P(hypothesis): captures our prior knowledge/belief
Likelihood P(evidence | hypothesis): how likely we are to observe the evidence, if the hypothesis were true
P(evidence) = probability of observing the evidence in general (aggregated over all possible hypotheses)
We update our belief after observing some evidence
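As a concrete illustration of these pieces, here is a minimal Python sketch (not from the slides) of Bayes’ rule for a single binary hypothesis, with the denominator computed via the law of total probability; the numbers in the example call are made up.

```python
def posterior(prior, likelihood, likelihood_if_false):
    """Bayes' rule for one binary hypothesis.

    prior              : P(hypothesis)
    likelihood         : P(evidence | hypothesis)
    likelihood_if_false: P(evidence | not hypothesis)
    """
    # Law of total probability gives the denominator P(evidence)
    evidence = likelihood * prior + likelihood_if_false * (1 - prior)
    return likelihood * prior / evidence

# Made-up example: prior 0.3, P(evidence | H) = 0.8, P(evidence | not H) = 0.1
print(posterior(0.3, 0.8, 0.1))  # ~0.77
```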
Example: the Monty Hall problem
The game:
At the beginning, all doors are closed
The prize is behind 1 door (with equal probability)
You choose 1 door (say Door 1)
The host opens a door (say Door 2) which has a goat behind it
Let’s make a deal: would you swap your choice to Door 3?
Solution of the Monty Hall problem
Your choice: Door 1; Offer: choose Door 3 instead
What are the chances of each option for winning (getting the prize)?
A = hypothesis: prize is behind Door 1
B = evidence: the host chooses Door 2 to open (and we see a goat)
Prior: P(A) = 1/3
Likelihood: P(B|A) = 1/2 (if the prize is behind Door 1, the host can open Door 2 or Door 3)
Bayes’ rule: P(A|B) = P(B|A) P(A) / P(B)
Denominator: P(B) = 1/2 (why? see the next slide)
P(A|B) = (1/2 × 1/3) / (1/2) = 1/3
Chance of winning the prize for staying with Door 1 = 1/3
Chance of winning the prize for switching to Door 3 = 2/3
Calculating the denominator
Bayes’ rule: P(A|B) = P(B|A) P(A) / P(B)
A = prize behind Door 1; B = host chooses Door 2 (between Doors 2 and 3)
Use the law of total probability, with X = door with prize:
P(B) = P(B|X = Door 1) P(X = Door 1) + P(B|X = Door 2) P(X = Door 2) + P(B|X = Door 3) P(X = Door 3)
X = Door 1: P(B|X) = 1/2 (the host can open Door 2 or Door 3), so the term is 1/2 × 1/3 = 1/6
X = Door 2: P(B|X) = 0 (the host never opens the prize door), so the term is 0
X = Door 3: P(B|X) = 1 (the host must open Door 2), so the term is 1 × 1/3 = 1/3
P(B) = 1/6 + 0 + 1/3 = 1/2
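To double-check the 1/3 vs. 2/3 conclusion, here is a quick Monte Carlo sketch in Python (not part of the slides) that simulates the game exactly as described above.

```python
import random

def monty_hall(trials=100_000):
    """Estimate the probability of winning by staying vs. switching."""
    stay_wins = switch_wins = 0
    for _ in range(trials):
        prize = random.randint(1, 3)
        choice = 1  # we always pick Door 1, as in the slides
        if prize == choice:
            opened = random.choice((2, 3))                   # host opens either goat door
        else:
            opened = next(d for d in (2, 3) if d != prize)   # host must avoid the prize
        switched = next(d for d in (1, 2, 3) if d not in (choice, opened))
        stay_wins += (choice == prize)
        switch_wins += (switched == prize)
    return stay_wins / trials, switch_wins / trials

print(monty_hall())  # roughly (0.33, 0.67)
```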
Example 2: the HIV test problem
HIV lab tests are quite accurate:
• 99% sensitivity: if a patient is HIV+, then probability that the test has
positive results is 0.99
• 99% specificity: if a patient is HIV-, then probability that the test has
negative results is also 0.99
HIV is rare in patients in our population: about 1 out of 1000 (even among
those who get tested)
Situation: A patient takes an HIV test and gets a positive result
Question: what are the chances that the patient is indeed HIV+?
Solution for the HIV test problem
A = test was positive; B = patient is HIV+
We want to calculate P(B|A)
Prior: P(B) = 0.001 (1 out of 1000 is HIV+)
Likelihood: P(A|B) = 0.99 (99% sensitivity)
What about P(A) ?
Solution for the HIV test problem
Calculation of P(A)
Use the law of total probability: P(A) = P(A|B) P(B) + P(A|¬B) P(¬B)
Term 1: P(A|B) P(B) = 0.99 × 0.001 = 0.00099
Term 2: P(A|¬B) P(¬B) = 0.01 × 0.999 = 0.00999
P(A) = 0.00099 + 0.00999 = 0.01098
Solution for HIV test
Calculating P(B|A)
P(B) = 0.001; P(A|B) = 0.99; P(A) = 0.01098
P(B|A) = P(A|B) P(B) / P(A) = 0.99 × 0.001 / 0.01098 ≈ 0.09
This means that even if the test is positive, there is only about a 9% chance that the patient is HIV+
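The same numbers can be checked with a few lines of Python (not from the slides):

```python
sensitivity = 0.99   # P(test positive | HIV+)
specificity = 0.99   # P(test negative | HIV-)
prior = 0.001        # P(HIV+) among the people who get tested

# Law of total probability: P(positive) aggregated over both hypotheses
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
p_hiv_given_positive = sensitivity * prior / p_positive

print(p_positive)            # 0.01098
print(p_hiv_given_positive)  # ~0.09
```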
Some discussions
Only 15% of doctors get this right
Most doctors think that if an HIV test is positive, there is a high chance that the
patient is HIV+
Why?
• They typically focus on the accuracy (sensitivity) of the test
• They're neglecting the background or base rate of HIV prevalence (prior)
Bonus question
Russian roulette with 2 bullets
• You put 2 bullets into a revolver, such
that they are next to each other
• Your opponent spins and pulls the trigger
• … and survives. Now it’s your turn!
• Question: should you spin the revolver again, or should you pull the trigger without spinning?
• Question 2: what if there’s only 1 bullet in the revolver?
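One way to attack the bonus question is to enumerate the chamber positions directly. The sketch below (not worked out on the slides, and assuming a standard 6-chamber revolver, 0-indexed) compares spinning with not spinning for both cases.

```python
def survival_probabilities(bullet_chambers, n_chambers=6):
    """Given that the previous pull (after a spin) was survived, compare
    P(survive) for spinning again vs. not spinning.

    bullet_chambers: set of loaded chamber indices (0-based).
    """
    empty = [c for c in range(n_chambers) if c not in bullet_chambers]
    # Not spinning: the cylinder advances one position, so the next chamber fires.
    # Condition on the previously fired chamber being one of the empty ones.
    survive_no_spin = sum((c + 1) % n_chambers not in bullet_chambers for c in empty) / len(empty)
    # Spinning: every chamber is equally likely again.
    survive_spin = len(empty) / n_chambers
    return survive_spin, survive_no_spin

print(survival_probabilities({0, 1}))  # two adjacent bullets: (~0.67 spin, 0.75 no spin)
print(survival_probabilities({0}))     # one bullet:           (~0.83 spin, 0.80 no spin)
```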
Belief update
London – Rome flight
Belief update
You don’t know which area you are flying over …
But you know that you’ve been flying for a while
You look out the window, and you see land … and sea … and high mountains.
How plausible is each area after each successive observation?

                land       … and sea   … and high mountains
Near London     unlikely   unlikely    unlikely
Near Rome       maybe      maybe       unlikely
Near Paris      maybe      unlikely    unlikely
Near Monaco     maybe      maybe       probably
Bayesian belief update
Belief: probability distribution over all the possible models
• Captures our knowledge + uncertainty about the true world model
How do we update our belief after each observation?
P(model | observation) = P(observation | model) P(model) / P(observation)
Prior P(model): probability that the model is true (before the observation)
Likelihood P(observation | model): how likely the observed event is, if the model were true
Denominator P(observation): marginal likelihood (or model evidence)
Left hand side = posterior P(model | observation): probability that the model is true, after we
have seen the observations
Bayesian belief update
Prior probability
Prior distribution = prior belief
Posterior probability
Posterior distribution = posterior belief
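To make the update concrete, here is a small Python sketch (not from the slides) that runs the London–Rome example as a sequence of Bayesian updates; every number below is an assumed, made-up value that only mimics the qualitative "unlikely / maybe / probably" table.

```python
# Assumed likelihoods P(observation | location) for (land, sea, high mountains)
likelihoods = {
    "London": (0.9, 0.1, 0.05),
    "Rome":   (0.9, 0.6, 0.10),
    "Paris":  (0.9, 0.1, 0.05),
    "Monaco": (0.9, 0.7, 0.60),
}

# Assumed prior belief: you know you have been flying for a while, so London is unlikely
belief = {"London": 0.05, "Rome": 0.4, "Paris": 0.25, "Monaco": 0.3}

for i, obs in enumerate(["land", "sea", "mountains"]):
    # Bayes' rule: posterior is proportional to likelihood x prior, then normalize
    unnormalized = {loc: likelihoods[loc][i] * belief[loc] for loc in belief}
    total = sum(unnormalized.values())        # the marginal likelihood P(observation)
    belief = {loc: p / total for loc, p in unnormalized.items()}
    print(obs, {loc: round(p, 2) for loc, p in belief.items()})
```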
Bayesian belief update example
Search for a crashed airplane using Bayesian updating
• Imagine you’re designing a search-and-rescue UAV. Its job is
to autonomously look for aircraft wreckage
• It is easier to detect wreckage in some terrain types than
others
Bayesian belief update example
• Difficulty model: what is the probability of detecting the wreckage in each type of area?
Bayesian belief update example
• Prior belief: based on the last known location
Bayesian belief update example
• Search: always go to the point with the highest probability
• [Sequence of belief maps: at the beginning and after 10, 50, 250, 500, 1000, and 2000 search steps]
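A rough Python sketch of this search loop (not the actual system from the slides; the grid size, detection probabilities, and wreckage location are all made up) could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
size = 20
# Assumed "difficulty model": detection probability per cell, by terrain type
p_detect = rng.choice([0.9, 0.5, 0.2], size=(size, size))
belief = np.ones((size, size)) / size**2      # flat prior, for simplicity
true_location = (12, 7)                       # unknown to the searcher (made up)

for step in range(2000):
    cell = np.unravel_index(np.argmax(belief), belief.shape)  # most probable cell
    if cell == true_location and rng.random() < p_detect[cell]:
        print(f"Found the wreckage at {cell} after {step + 1} steps")
        break
    # Bayes' rule after a failed look at `cell`: the probability that the
    # wreckage is there shrinks by the factor (1 - detection probability)
    belief[cell] *= 1 - p_detect[cell]
    belief /= belief.sum()                    # renormalize the posterior
```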
Complex knowledge representation
• We use probabilities to capture uncertainty in our knowledge
• So far we have dealt with simple correlations between two probabilities (A and B)
• What if we have a much more complicated network of correlations?
Inference: derive extra information/conclusions from observed data
• Inference in simple systems: use Bayes’ rule
• How to do inference in complex networks? How to use Bayes’ rule there?
Inference with joint distribution
But before Bayesian nets …
• The simplest way to do inference is to look at the joint distribution of the
random variables
• Joint distribution captures all the interconnections + dependencies
An example from employment survey
• M: is the person male?
• L: does the person work long hours?
• R: is the person rich?
Inference with joint distribution
Truth table: if all the variables are binary, the joint distribution is a table with one probability for each combination of (M, L, R), here 2^3 = 8 rows
Inference with joint distribution
P(the person is rich) = ? Sum over all rows where R is true: 0.13 + 0.10 + 0.01 + 0.02 = 0.26
Inference with joint distribution
P(L | M) = ? P(L|M) = P(L and M) / P(M) = (0.13 + 0.11) / (0.13 + 0.11 + 0.10 + 0.34) ≈ 0.35
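These queries can be answered mechanically from the joint table, for example with the small Python sketch below (not from the slides). The rows where M is true reproduce the numbers used above; the rows where M is false are assumed values, chosen only so that the table sums to 1.

```python
# Joint distribution P(M, L, R), keyed by (male, long_hours, rich).
# M=True entries match the sums used in the slides; M=False entries are assumed.
joint = {
    (True,  True,  True ): 0.13, (True,  True,  False): 0.11,
    (True,  False, True ): 0.10, (True,  False, False): 0.34,
    (False, True,  True ): 0.01, (False, True,  False): 0.09,  # assumed
    (False, False, True ): 0.02, (False, False, False): 0.20,  # assumed
}
assert abs(sum(joint.values()) - 1.0) < 1e-9

def prob(predicate):
    """Sum the joint over every world (m, l, r) where `predicate` holds."""
    return sum(p for world, p in joint.items() if predicate(*world))

print(prob(lambda m, l, r: r))                                   # P(rich) = 0.26
print(prob(lambda m, l, r: l and m) / prob(lambda m, l, r: m))   # P(L | M) ~ 0.35
```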
Inference with joint distribution
• We can do any inference from the joint distribution
• Issue: it doesn’t scale well in practice (it is a brute-force solution)
• E.g., with 30 binary variables, we need 2^30 probabilities (about 1 billion)
• It also doesn’t show the structure of the relationships between the variables
• We might want to exploit the structure of relationships to simplify the
calculations
• E.g., if R (rich) is independent from M (male) and L (working long
hours), then we can drop M and L when we do inference about R
Independence
Definition: two random variables A and B are independent if their joint probability
is the product of their probabilities: P(A and B) = P(A) P(B)
Another property: if P(A), P(B) > 0, then P(A|B) = P(A)
Similarly: P(B|A) = P(B)
Bayesian networks
Use a graphical representation to capture the dependencies between the
random variables.

[Network diagram: “Studied for the exam” (S) and “Lecturer is in good mood” (M) both point to “High exam result” (H)]

P(S) = 0.8, P(cS) = 0.2 (cS denotes “not S”)
P(M) = 0.3, P(cM) = 0.7
P(H|M, S) = 0.9
P(H|M, cS) = 0.5
P(H|cM, S) = 0.4
P(H|cM, cS) = 0.05
Bayesian networks
Inference: given high exam result, what is the probability that the lecturer
was in a good mood? (P(M|H) = ?)
[Same network diagram and probability tables as above]
Bayesian networks
[Same network diagram and probability tables as above]
P(M|H) = ?
Bayes’ rule: P(M|H) = P(H|M) P(M) / P(H)
P(M) = 0.3
P(H|M) = P(H|M, S) P(S) + P(H|M, cS) P(cS) = 0.9 × 0.8 + 0.5 × 0.2 = 0.82
(since S and M are independent, P(S|M) = P(S))
P(H) = ?
Bayesian networks
P(H) = P(H|M,S) P(M and S) + P(H|M,cS) P(M and cS) + P(H|cM,S) P(cM and S) + P(H|cM,cS) P(cM and cS)

M and S are independent, so each joint term factorizes, e.g. P(M and S) = P(M) P(S):

P(H) = P(H|M,S) P(M) P(S) + P(H|M,cS) P(M) P(cS) + P(H|cM,S) P(cM) P(S) + P(H|cM,cS) P(cM) P(cS)
     = 0.9 × 0.3 × 0.8 + 0.5 × 0.3 × 0.2 + 0.4 × 0.7 × 0.8 + 0.05 × 0.7 × 0.2
     = 0.216 + 0.03 + 0.224 + 0.007 = 0.477
Bayesian networks
[Same network diagram and probability tables as above]
P(M|H) = ?
P(M) = 0.3; P(H|M) = 0.82; P(H) = 0.477
P(M|H) = P(H|M) P(M) / P(H) = 0.82 × 0.3 / 0.477 ≈ 0.516
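As a sanity check, here is a brute-force enumeration of this network in Python (a sketch, not the slides’ own code); it recomputes P(H) and P(M|H) directly from the tables above.

```python
from itertools import product

# CPTs from the example network (True means the event happens)
p_s = {True: 0.8, False: 0.2}                    # P(S): studied for the exam
p_m = {True: 0.3, False: 0.7}                    # P(M): lecturer in good mood
p_h = {(True, True): 0.9, (True, False): 0.5,    # P(H = True | M, S)
       (False, True): 0.4, (False, False): 0.05}

def joint(m, s, h):
    """P(M=m, S=s, H=h) using the network's factorization P(M) P(S) P(H|M,S)."""
    return p_m[m] * p_s[s] * (p_h[(m, s)] if h else 1 - p_h[(m, s)])

# Marginalize out the hidden variable S
p_h_true = sum(joint(m, s, True) for m, s in product((True, False), repeat=2))
p_m_and_h = sum(joint(True, s, True) for s in (True, False))

print(p_h_true)              # 0.477
print(p_m_and_h / p_h_true)  # ~0.516
```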
Building Bayesian networks
With domain expert:
• Bayesian nets are sometimes built manually, consulting domain experts
for structure and probabilities.
• More often, the structure is supplied by domain experts (i.e., they specify
what affects what) but the probabilities are learned from data.
Building from data:
• Sometimes both structure and probabilities are learned from data.
• Difficult problem: puts the AI program in a similar position to a scientist
trying out different hypotheses.
• Need a method to reward the proposed net structure for matching the
data, but to penalize excessive complexity (Occam’s razor).
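For the common case where the structure is given and the probabilities are learned from data, the estimation can be as simple as counting. The sketch below (not from the slides) fits the CPT P(H | M, S) of the exam network by maximum likelihood on synthetic data generated from the lecture’s own numbers; the data set itself is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Synthetic, fully observed samples of (M, S, H), generated from the
# lecture's probabilities (this data set is made up for illustration)
m = rng.random(n) < 0.3
s = rng.random(n) < 0.8
p_h_true = np.select([m & s, m & ~s, ~m & s], [0.9, 0.5, 0.4], default=0.05)
h = rng.random(n) < p_h_true

# Maximum-likelihood estimate of P(H | M, S): relative frequencies per parent setting
for mv in (True, False):
    for sv in (True, False):
        rows = (m == mv) & (s == sv)
        print(f"P(H | M={mv}, S={sv}) ~ {h[rows].mean():.2f}")
```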
Properties of Bayesian networks
• Bayesian networks must be directed acyclic graphs.
• The major efficiency gain of a Bayesian network is memory: we store a few small
conditional probability tables instead of the full joint distribution.
• They are also easier for human beings to interpret than the raw joint
distribution.
Summing up:
Why is it good to use the Bayesian approach?
• It captures the uncertainty of our knowledge about the environment in a
very elegant and simple way.
• We can integrate our prior knowledge into the reasoning process by using
the prior distribution (which represents our prior knowledge, as the name
suggests).
• We can always update our belief about the world by using Bayes’
Theorem as we observe more evidence.
Summing up:
And why is it bad to use the Bayesian approach?
• If we use a wrong prior, it can be very hard to arrive at the right
answers.
• The calculations involve integrating or summing over all possible
situations, which is typically very expensive computationally.
• There are no general theoretical guarantees that what we are doing will
in fact give us the correct answer.