* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Download Slides
Survey
Document related concepts
Transcript
CMSC 671
Fall 2010
Class #18/19 – Wednesday, November 3 /
Monday, November 8
Some material borrowed with
permission from Lise Getoor
1
Next two classes
• Probability theory (quick review!)
• Bayesian networks
– Network structure
– Conditional probability tables
– Conditional independence
• Bayesian inference
– From the joint distribution
– Using independence/factoring
– From sources of evidence
2
Bayesian Reasoning
Chapter 13
3
Sources of uncertainty
• Uncertain inputs
– Missing data
– Noisy data
• Uncertain knowledge
– Multiple causes lead to multiple effects
– Incomplete enumeration of conditions or effects
– Incomplete knowledge of causality in the domain
– Probabilistic/stochastic effects
• Uncertain outputs
– Abduction and induction are inherently uncertain
– Default reasoning, even in deductive fashion, is uncertain
– Incomplete deductive inference may be uncertain
Probabilistic reasoning only gives probabilistic
results (summarizes uncertainty from various sources)
4
Decision making with uncertainty
• Rational behavior:
– For each possible action, identify the possible outcomes
– Compute the probability of each outcome
– Compute the utility of each outcome
– Compute the probability-weighted (expected) utility
over possible outcomes for each action
– Select the action with the highest expected utility
(principle of Maximum Expected Utility)
5
Why probabilities anyway?
•
Kolmogorov showed that three simple axioms lead to the
rules of probability theory
– De Finetti, Cox, and Carnap have also provided compelling
arguments for these axioms
1. All probabilities are between 0 and 1:
•
0 ≤ P(a) ≤ 1
2. Valid propositions (tautologies) have probability 1, and
unsatisfiable propositions have probability 0:
•
P(true) = 1 ; P(false) = 0
3. The probability of a disjunction is given by:
•
P(a b) = P(a) + P(b) – P(a b)
a
ab
b
6
Probability theory
• Random variables
– Domain
• Alarm, Burglary, Earthquake
– Boolean (like these), discrete,
continuous
• Atomic event: complete
specification of state
• Alarm=True Burglary=True
Earthquake=False
alarm burglary ¬earthquake
• Prior probability: degree
of belief without any other
evidence
• Joint probability: matrix
of combined probabilities
of a set of variables
• P(Burglary) = .1
• P(Alarm, Burglary) =
alarm
¬alarm
burglary
.09
.01
¬burglary
.1
.8
7
Probability theory (cont.)
• Conditional probability:
probability of effect given causes
• Computing conditional probs:
– P(a | b) = P(a b) / P(b)
– P(b): normalizing constant
• Product rule:
– P(a b) = P(a | b) P(b)
• Marginalizing:
– P(B) = ΣaP(B, a)
– P(B) = ΣaP(B | a) P(a)
(conditioning)
• P(burglary | alarm) = .47
P(alarm | burglary) = .9
• P(burglary | alarm) =
P(burglary alarm) / P(alarm)
= .09 / .19 = .47
• P(burglary alarm) =
P(burglary | alarm) P(alarm) =
.47 * .19 = .09
• P(alarm) =
P(alarm burglary) +
P(alarm ¬burglary) =
.09+.1 = .19
8
Example: Inference from the joint
alarm
¬alarm
earthquake
¬earthquake
earthquake
¬earthquake
burglary
.01
.08
.001
.009
¬burglary
.01
.09
.01
.79
P(Burglary | alarm) = α P(Burglary, alarm)
= α [P(Burglary, alarm, earthquake) + P(Burglary, alarm, ¬earthquake)
= α [ (.01, .01) + (.08, .09) ]
= α [ (.09, .1) ]
Since P(burglary | alarm) + P(¬burglary | alarm) = 1, α = 1/(.09+.1) = 5.26
(i.e., P(alarm) = 1/α = .19 – quizlet: how can you verify this?)
P(burglary | alarm) = .09 * 5.26 = .474
P(¬burglary | alarm) = .1 * 5.26 = .526
9
Exercise: Inference from the joint
smart
smart
p(smart
study prep) study study
study
study
prepared
.432
.16
.084
.008
prepared
.048
.16
.036
.072
• Queries:
– What is the prior probability of smart?
– What is the prior probability of study?
– What is the conditional probability of prepared, given
study and smart?
• Save these answers for next time!
10
Independence
• When two sets of propositions do not affect each others’
probabilities, we call them independent, and can easily
compute their joint and conditional probability:
– Independent (A, B) → P(A B) = P(A) P(B), P(A | B) = P(A)
• For example, {moon-phase, light-level} might be
independent of {burglary, alarm, earthquake}
– Then again, it might not: Burglars might be more likely to
burglarize houses when there’s a new moon (and hence little light)
– But if we know the light level, the moon phase doesn’t affect
whether we are burglarized
– Once we’re burglarized, light level doesn’t affect whether the alarm
goes off
• We need a more complex notion of independence, and
methods for reasoning about these kinds of relationships
11
Exercise: Independence
smart
smart
p(smart
study prep) study study
study
study
prepared
.432
.16
.084
.008
prepared
.048
.16
.036
.072
• Queries:
– Is smart independent of study?
– Is prepared independent of study?
12
Conditional independence
• Absolute independence:
– A and B are independent if P(A B) = P(A) P(B); equivalently,
P(A) = P(A | B) and P(B) = P(B | A)
• A and B are conditionally independent given C if
– P(A B | C) = P(A | C) P(B | C)
• This lets us decompose the joint distribution:
– P(A B C) = P(A | C) P(B | C) P(C)
• Moon-Phase and Burglary are conditionally independent
given Light-Level
• Conditional independence is weaker than absolute
independence, but still useful in decomposing the full joint
probability distribution
13
Exercise: Conditional independence
smart
smart
p(smart
study prep) study study
study
study
prepared
.432
.16
.084
.008
prepared
.048
.16
.036
.072
• Queries:
– Is smart conditionally independent of prepared, given
study?
– Is study conditionally independent of prepared, given
smart?
14
Bayes’s rule
• Bayes’s rule is derived from the product rule:
– P(Y | X) = P(X | Y) P(Y) / P(X)
• Often useful for diagnosis:
– If X are (observed) effects and Y are (hidden) causes,
– We may have a model for how causes lead to effects (P(X | Y))
– We may also have prior beliefs (based on experience) about the
frequency of occurrence of effects (P(Y))
– Which allows us to reason abductively from effects to causes (P(Y |
X)).
15
Bayesian inference
• In the setting of diagnostic/evidential reasoning
H i P(Hi )
hypotheses
P(E j | Hi )
E1
Ej
Em
evidence/m anifestati ons
– Know prior probability of hypothesis
conditional probability
– Want to compute the posterior probability
P(Hi )
P(E j | Hi )
P(Hi | E j )
• Bayes’ theorem (formula 1):
P(Hi | E j ) P(Hi )P(E j | Hi ) / P(E j )
16
Simple Bayesian diagnostic reasoning
• Knowledge base:
– Evidence / manifestations:
– Hypotheses / disorders:
E1, … Em
H1, … H n
• Ej and Hi are binary; hypotheses are mutually exclusive (nonoverlapping) and exhaustive (cover all possible cases)
– Conditional probabilities:
P(Ej | Hi), i = 1, … n; j = 1, … m
• Cases (evidence for a particular instance): E1, …, El
• Goal: Find the hypothesis Hi with the highest posterior
– Maxi P(Hi | E1, …, El)
17
Bayesian diagnostic reasoning II
• Bayes’ rule says that
– P(Hi | E1, …, El) = P(E1, …, El | Hi) P(Hi) / P(E1, …, El)
• Assume each piece of evidence Ei is conditionally
independent of the others, given a hypothesis Hi, then:
– P(E1, …, El | Hi) = lj=1 P(Ej | Hi)
• If we only care about relative probabilities for the Hi, then
we have:
– P(Hi | E1, …, El) = α P(Hi) lj=1 P(Ej | Hi)
18
Limitations of simple
Bayesian inference
• Cannot easily handle multi-fault situation, nor cases where
intermediate (hidden) causes exist:
– Disease D causes syndrome S, which causes correlated
manifestations M1 and M2
• Consider a composite hypothesis H1 H2, where H1 and H2
are independent. What is the relative posterior?
– P(H1 H2 | E1, …, El) = α P(E1, …, El | H1 H2) P(H1 H2)
= α P(E1, …, El | H1 H2) P(H1) P(H2)
= α lj=1 P(Ej | H1 H2) P(H1) P(H2)
• How do we compute P(Ej | H1 H2) ??
19
Limitations of simple Bayesian
inference II
• Assume H1 and H2 are independent, given E1, …, El?
– P(H1 H2 | E1, …, El) = P(H1 | E1, …, El) P(H2 | E1, …, El)
• This is a very unreasonable assumption
– Earthquake and Burglar are independent, but not given Alarm:
• P(burglar | alarm, earthquake) << P(burglar | alarm)
• Another limitation is that simple application of Bayes’s rule doesn’t
allow us to handle causal chaining:
– A: this year’s weather; B: cotton production; C: next year’s cotton price
– A influences C indirectly: A→ B → C
– P(C | B, A) = P(C | B)
• Need a richer representation to model interacting hypotheses,
conditional independence, and causal chaining
• Next time: conditional independence and Bayesian networks!
20
Bayesian Networks
Chapter 14.1-14.3
Some material borrowed
from Lise Getoor
21
Bayesian Belief Networks (BNs)
• Definition: BN = (DAG, CPD)
– DAG: directed acyclic graph (BN’s structure)
• Nodes: random variables (typically binary or discrete, but
methods also exist to handle continuous variables)
• Arcs: indicate probabilistic dependencies between nodes
(lack of link signifies conditional independence)
– CPD: conditional probability distribution (BN’s parameters)
• Conditional probabilities at each node, usually stored as a table
(conditional probability table, or CPT)
P ( xi | i ) where i is the set of all parent nodes of xi
– Root nodes are a special case – no parents, so just use priors
in CPD:
i , so P ( xi | i ) P ( xi )
22
Example BN
P(A) = 0.001
a
P(B|A) = 0.3
P(B|A) = 0.001
b
P(C|A) = 0.2
P(C|A) = 0.005
c
d
P(D|B,C) = 0.1
P(D|B,C) = 0.01
P(D|B,C) = 0.01
P(D|B,C) = 0.00001
e
P(E|C) = 0.4
P(E|C) = 0.002
Note that we only specify P(A) etc., not P(¬A), since they have to add to one
23
Conditional independence and
chaining
• Conditional independence assumption
– P ( x i | i , q) P ( x i | i )
i
where q is any set of variables
q
(nodes) other than x i and its successors
xi
– i blocks influence of other nodes on x i
and its successors (q influences x i only
through variables in i )
– With this assumption, the complete joint probability distribution of all
variables in the network can be represented by (recovered from) local
CPDs by chaining these CPDs:
P ( x1 ,..., xn ) ni1 P ( xi | i )
24
Chaining: Example
a
b
c
d
e
Computing the joint probability for all variables is easy:
P(a, b, c, d, e)
= P(e | a, b, c, d) P(a, b, c, d)
by the product rule
= P(e | c) P(a, b, c, d)
by cond. indep. assumption
= P(e | c) P(d | a, b, c) P(a, b, c)
= P(e | c) P(d | b, c) P(c | a, b) P(a, b)
= P(e | c) P(d | b, c) P(c | a) P(b | a) P(a)
25
Topological semantics
• A node is conditionally independent of its nondescendants given its parents
• A node is conditionally independent of all other nodes in
the network given its parents, children, and children’s
parents (also known as its Markov blanket)
• The method called d-separation can be applied to decide
whether a set of nodes X is independent of another set Y,
given a third set Z
26
Inference in Bayesian
Networks
Chapter 14.4-14.5
Some material borrowed
from Lise Getoor
28
Inference tasks
• Simple queries: Computer posterior marginal P(Xi | E=e)
– E.g., P(NoGas | Gauge=empty, Lights=on, Starts=false)
• Conjunctive queries:
– P(Xi, Xj | E=e) = P(Xi | e=e) P(Xj | Xi, E=e)
• Optimal decisions: Decision networks include utility
information; probabilistic inference is required to find
P(outcome | action, evidence)
• Value of information: Which evidence should we seek next?
• Sensitivity analysis: Which probability values are most
critical?
• Explanation: Why do I need a new starter motor?
29
Approaches to inference
• Exact inference
–
–
–
–
Enumeration
Belief propagation in polytrees
Variable elimination
Clustering / join tree algorithms
• Approximate inference
–
–
–
–
–
–
Stochastic simulation / sampling methods
Markov chain Monte Carlo methods
Genetic algorithms
Neural networks
Simulated annealing
Mean field theory
30
Direct inference with BNs
• Instead of computing the joint, suppose we just want the
probability for one variable
• Exact methods of computation:
– Enumeration
– Variable elimination
– Join trees: get the probabilities associated with every query variable
31
Inference by enumeration
• Add all of the terms (atomic event probabilities) from the
full joint distribution
• If E are the evidence (observed) variables and Y are the
other (unobserved) variables, then:
P(X|e) = α P(X, E) = α ∑ P(X, E, Y)
• Each P(X, E, Y) term can be computed using the chain rule
• Computationally expensive!
32
Example: Enumeration
a
b
•
•
•
•
c
d
e
P(xi) = Σ πi P(xi | πi) P(πi)
Suppose we want P(D=true), and only the value of E is
given as true
P (d|e) = ΣABCP(a, b, c, d, e)
= ΣABCP(a) P(b|a) P(c|a) P(d|b,c) P(e|c)
With simple iteration to compute this expression, there’s
going to be a lot of repetition (e.g., P(e|c) has to be
recomputed every time we iterate over C=true)
33
Exercise: Enumeration
p(smart)=.8
p(study)=.6
smart
study
p(fair)=.9
prepared
fair
p(prep|…) smart smart
pass
p(pass|…)
study
.9
.7
study
.5
.1
smart
smart
prep
prep
prep
prep
fair
.9
.7
.7
.2
fair
.1
.1
.1
.1
Query: What is the
probability that a student
studied, given that they pass
the exam?
34
Variable elimination
• Basically just enumeration, but with caching of local
calculations
• Linear for polytrees (singly connected BNs)
• Potentially exponential for multiply connected BNs
Exact inference in Bayesian networks is NP-hard!
• Join tree algorithms are an extension of variable elimination
methods that compute posterior probabilities for all nodes
in a BN simultaneously
35
Variable elimination
General idea:
• Write query in the form
P( X n , e ) P( xi | pai )
• Iteratively
xk
x3
x2
i
– Move all irrelevant terms outside of innermost sum
– Perform innermost sum, getting a new term
– Insert the new term into the product
36
Variable elimination: Example
Cloudy
Rain
Sprinkler
WetGrass
P( w ) P( w | r, s)P(r | c)P(s | c)P(c)
r ,s , c
P( w | r, s) P(r | c)P(s | c)P(c)
r ,s
c
P ( w | r , s ) f1 ( r , s )
r ,s
f1 (r, s)
37
A more complex example
• “Asia” network:
Visit to
Asia
Tuberculosis
Smoking
Lung Cancer
Abnormality
in Chest
X-Ray
Bronchitis
Dyspnea
39
• We want to compute P(d)
• Need to eliminate: v,s,x,t,l,a,b
S
V
L
T
B
A
Initial factors
X
D
P (v )P (s )P (t |v )P (l | s )P (b | s )P (a |t ,l )P (x | a )P (d | a , b )
40
S
V
• We want to compute P(d)
• Need to eliminate: v,s,x,t,l,a,b
L
T
B
A
Initial factors
X
D
P (v )P (s )P (t |v )P (l | s )P (b | s )P (a |t ,l )P (x | a )P (d | a , b )
Eliminate: v
Compute:
fv (t ) P (v )P (t |v )
v
fv (t )P (s )P (l | s )P (b | s )P (a |t , l )P (x | a )P (d | a , b )
Note: fv(t) = P(t)
In general, result of elimination is not necessarily a probability
term
41
S
V
• We want to compute P(d)
• Need to eliminate: s,x,t,l,a,b
L
T
• Initial factors
B
A
X
D
P (v )P (s )P (t |v )P (l | s )P (b | s )P (a |t ,l )P (x | a )P (d | a , b )
fv (t )P (s )P (l | s )P (b | s )P (a |t , l )P (x | a )P (d | a , b )
Eliminate: s
Compute:
fs (b,l ) P (s )P (b | s )P (l | s )
s
fv (t )fs (b, l )P (a |t , l )P (x | a )P (d | a , b )
Summing on s results in a factor with two arguments fs(b,l)
In general, result of elimination may be a function of several
variables
42
S
V
• We want to compute P(d)
• Need to eliminate: x,t,l,a,b
L
T
• Initial factors
B
A
X
D
P (v )P (s )P (t |v )P (l | s )P (b | s )P (a |t ,l )P (x | a )P (d | a , b )
fv (t )P (s )P (l | s )P (b | s )P (a |t , l )P (x | a )P (d | a , b )
fv (t )fs (b, l )P (a |t , l )P (x | a )P (d | a , b )
Eliminate: x
Compute:
fx (a ) P (x | a )
x
fv (t )fs (b, l )fx (a )P (a |t , l )P (d | a , b )
Note: fx(a) = 1 for all values of a !!
43
S
V
• We want to compute P(d)
• Need to eliminate: t,l,a,b
L
T
• Initial factors
B
A
X
D
P (v )P (s )P (t |v )P (l | s )P (b | s )P (a |t ,l )P (x | a )P (d | a , b )
fv (t )P (s )P (l | s )P (b | s )P (a |t , l )P (x | a )P (d | a , b )
fv (t )fs (b, l )P (a |t , l )P (x | a )P (d | a , b )
fv (t )fs (b, l )fx (a )P (a |t , l )P (d | a , b )
Eliminate: t
Compute:
ft (a ,l ) fv (t )P (a |t , l )
t
fs (b, l )fx (a )ft (a , l )P (d | a , b )
44
S
V
• We want to compute P(d)
• Need to eliminate: l,a,b
L
T
• Initial factors
B
A
X
D
P (v )P (s )P (t |v )P (l | s )P (b | s )P (a |t ,l )P (x | a )P (d | a , b )
fv (t )P (s )P (l | s )P (b | s )P (a |t , l )P (x | a )P (d | a , b )
fv (t )fs (b, l )P (a |t , l )P (x | a )P (d | a , b )
fv (t )fs (b, l )fx (a )P (a |t , l )P (d | a , b )
fs (b, l )fx (a )ft (a , l )P (d | a , b )
Eliminate: l
Compute:
fl (a , b ) fs (b,l )ft (a ,l )
fl (a , b )fx (a )P (d | a , b )
l
45
• We want to compute P(d)
• Need to eliminate: b
S
V
L
T
• Initial factors
B
A
X
D
P (v )P (s )P (t |v )P (l | s )P (b | s )P (a |t ,l )P (x | a )P (d | a , b )
fv (t )P (s )P (l | s )P (b | s )P (a |t , l )P (x | a )P (d | a , b )
fv (t )fs (b, l )P (a |t , l )P (x | a )P (d | a , b )
fv (t )fs (b, l )fx (a )P (a |t , l )P (d | a , b )
fs (b, l )fx (a )ft (a , l )P (d | a , b )
fl (a , b )fx (a )P (d | a , b ) fa (b,d ) fb (d )
Eliminate: a,b
Compute:
fa (b,d ) fl (a , b )fx (a ) p (d | a , b )
a
fb (d ) fa (b,d )
b
46
S
V
L
T
Dealing with evidence
• How do we deal with evidence?
B
A
X
D
• Suppose we are give evidence V = t, S = f, D = t
• We want to compute P(L, V = t, S = f, D = t)
47
S
V
L
T
Dealing with evidence
• We start by writing the factors:
B
A
X
D
P (v )P (s )P (t |v )P (l | s )P (b | s )P (a |t ,l )P (x | a )P (d | a , b )
•
•
Since we know that V = t, we don’t need to eliminate V
Instead, we can replace the factors P(V) and P(T|V) with
fP (V ) P (V t ) fp (T |V ) (T ) P (T |V t )
•
•
These “select” the appropriate parts of the original factors given the evidence
Note that fp(V) is a constant, and thus does not appear in elimination of other variables
48
Dealing with evidence
•
•
•
Given evidence V = t, S = f, D = t
Compute P(L, V = t, S = f, D = t )
Initial factors, after setting evidence:
S
V
L
T
B
A
X
D
fP (v )fP ( s )fP (t|v ) (t )fP (l |s ) (l )fP (b|s ) (b )P (a |t ,l )P (x | a )fP (d |a ,b ) (a , b )
49
Dealing with evidence
•
•
•
Given evidence V = t, S = f, D = t
Compute P(L, V = t, S = f, D = t )
Initial factors, after setting evidence:
S
V
L
T
B
A
X
D
fP (v )fP ( s )fP (t|v ) (t )fP (l |s ) (l )fP (b|s ) (b )P (a |t ,l )P (x | a )fP (d |a ,b ) (a , b )
•
Eliminating x, we get
fP (v )fP ( s )fP (t|v ) (t )fP (l |s ) (l )fP (b|s ) (b )P (a |t ,l )fx (a )fP (d |a ,b ) (a , b )
50
Dealing with evidence
L
T
•
•
•
Given evidence V = t, S = f, D = t
Compute P(L, V = t, S = f, D = t )
Initial factors, after setting evidence:
S
V
B
A
X
D
fP (v )fP ( s )fP (t|v ) (t )fP (l |s ) (l )fP (b|s ) (b )P (a |t ,l )P (x | a )fP (d |a ,b ) (a , b )
•
Eliminating x, we get
fP (v )fP ( s )fP (t|v ) (t )fP (l |s ) (l )fP (b|s ) (b )P (a |t ,l )fx (a )fP (d |a ,b ) (a , b )
•
Eliminating t, we get
fP (v )fP (s )fP (l |s ) (l )fP (b|s ) (b )ft (a ,l )fx (a )fP (d |a ,b ) (a , b )
51
Dealing with evidence
L
T
•
•
•
Given evidence V = t, S = f, D = t
Compute P(L, V = t, S = f, D = t )
Initial factors, after setting evidence:
S
V
B
A
X
D
fP (v )fP ( s )fP (t|v ) (t )fP (l |s ) (l )fP (b|s ) (b )P (a |t ,l )P (x | a )fP (d |a ,b ) (a , b )
•
Eliminating x, we get
fP (v )fP ( s )fP (t|v ) (t )fP (l |s ) (l )fP (b|s ) (b )P (a |t ,l )fx (a )fP (d |a ,b ) (a , b )
•
Eliminating t, we get
fP (v )fP (s )fP (l |s ) (l )fP (b|s ) (b )ft (a ,l )fx (a )fP (d |a ,b ) (a , b )
•
Eliminating a, we get
fP (v )fP (s )fP (l |s ) (l )fP (b|s ) (b )fa (b,l )
52
Dealing with evidence
L
T
•
•
•
Given evidence V = t, S = f, D = t
Compute P(L, V = t, S = f, D = t )
Initial factors, after setting evidence:
S
V
B
A
X
D
fP (v )fP ( s )fP (t|v ) (t )fP (l |s ) (l )fP (b|s ) (b )P (a |t ,l )P (x | a )fP (d |a ,b ) (a , b )
•
Eliminating x, we get
fP (v )fP ( s )fP (t|v ) (t )fP (l |s ) (l )fP (b|s ) (b )P (a |t ,l )fx (a )fP (d |a ,b ) (a , b )
•
Eliminating t, we get
fP (v )fP (s )fP (l |s ) (l )fP (b|s ) (b )ft (a ,l )fx (a )fP (d |a ,b ) (a , b )
•
•
Eliminating a, we get
fP (v )fP (s )fP (l |s ) (l )fP (b|s ) (b )fa (b,l )
Eliminating b, we get
fP (v )fP (s )fP (l |s ) (l )fb (l )
53
Variable elimination algorithm
•
Let X1,…, Xm be an ordering on the non-query variables
•
For i = m, …, 1
... P(X
X1
X2
Xm
j
| Parents ( X j ))
j
– Leave in the summation for Xi only factors mentioning Xi
– Multiply the factors, getting a factor that contains a number for each value of the
variables mentioned, including Xi
– Sum out Xi, getting a factor f that contains a number for each value of the variables
mentioned, not including Xi
– Replace the multiplied factor in the summation
54
Complexity of variable elimination
Suppose in one elimination step we compute
fx (y1 ,, yk ) f 'x (x , y1 ,, yk )
x
m
f 'x (x , y1 , , y k ) fi (x , y1,1, , y1,li )
This requires
m Val(X ) Val(Yi )
i 1
i
multiplications (for each value for x, y1, …, yk, we do m multiplications) and
Val(X ) Val(Yi )
i
additions (for each value of y1, …, yk , we do |Val(X)| additions)
►Complexity is exponential in the number of variables in the intermediate factors
►Finding an optimal ordering is NP-hard
55
Exercise: Variable elimination
p(smart)=.8
p(study)=.6
smart
study
p(fair)=.9
prepared
fair
p(prep|…) smart smart
pass
p(pass|…)
study
.9
.7
study
.5
.1
smart
smart
prep
prep
prep
prep
fair
.9
.7
.7
.2
fair
.1
.1
.1
.1
Query: What is the
probability that a student is
smart, given that they pass
the exam?
56
Conditioning
a
b
c
d
e
• Conditioning: Find the network’s smallest cutset S (a set of nodes
whose removal renders the network singly connected)
– In this network, S = {A} or {B} or {C} or {D}
• For each instantiation of S, compute the belief update with your favorite
inference algorithm
• Combine the results from all instantiations of S
• Computationally expensive (finding the smallest cutset is in general NPhard, and the total number of possible instantiations of S is O(2|S|))
57
Approximate inference:
Direct sampling
• Suppose you are given values for some subset of the
variables, E, and want to infer values for unknown
variables, Z
• Randomly generate a very large number of instantiations
from the BN
– Generate instantiations for all variables – start at root variables and
work your way “forward” in topological order
• Rejection sampling: Only keep those instantiations that are
consistent with the values for E
• Use the frequency of values for Z to get estimated
probabilities
• Accuracy of the results depends on the size of the sample
(asymptotically approaches exact results)
58
Exercise: Direct sampling
p(smart)=.8
p(study)=.6
smart
study
p(fair)=.9
prepared
fair
p(prep|…) smart smart
pass
p(pass|…)
smart
smart
prep
prep
prep
prep
fair
.9
.7
.7
.2
fair
.1
.1
.1
.1
study
.9
.7
study
.5
.1
Topological order = …?
Random number
generator: .35, .76, .51, .44,
.08, .28, .03, .92, .02, .42
59
Likelihood weighting
• Idea: Don’t generate samples that need to be rejected in the
first place!
• Sample only from the unknown variables Z
• Weight each sample according to the likelihood that it
would occur, given the evidence E
60
Markov chain Monte Carlo algorithm
• So called because
– Markov chain – each instance generated in the sample is dependent
on the previous instance
– Monte Carlo – statistical sampling method
• Perform a random walk through variable assignment space,
collecting statistics as you go
– Start with a random instantiation, consistent with evidence variables
– At each step, for some nonevidence variable, randomly sample its
value, consistent with the other current assignments
• Given enough samples, MCMC gives an accurate estimate
of the true distribution of values
61
Exercise: MCMC sampling
p(smart)=.8
p(study)=.6
smart
study
p(fair)=.9
prepared
fair
p(prep|…) smart smart
pass
p(pass|…)
smart
smart
prep
prep
prep
prep
fair
.9
.7
.7
.2
fair
.1
.1
.1
.1
study
.9
.7
study
.5
.1
Topological order = …?
Random number
generator: .35, .76, .51, .44,
.08, .28, .03, .92, .02, .42
62
Summary
• Bayes nets
– Structure
– Parameters
– Conditional independence
– Chaining
• BN inference
– Enumeration
– Variable elimination
– Sampling methods
63