PGM 2002/3 – Tirgul 6
Approximate Inference: Sampling
Approximation
Until now, we examined exact computation.
In many applications, approximations are sufficient.
Example: P(X = x|e) = 0.3183098861838
Maybe P(X = x|e) ≈ 0.3 is a good enough approximation,
e.g., if we take action only when P(X = x|e) > 0.5.
Can we find good approximation algorithms?
Types of Approximations
Absolute error:
An estimate q of P(X = x|e) has absolute error ε if
  P(X = x|e) − ε ≤ q ≤ P(X = x|e) + ε
equivalently
  q − ε ≤ P(X = x|e) ≤ q + ε
[Figure: the interval [0, 1] with q marked and a band of width 2ε around it]
Absolute error is not always what we want:
If P(X = x|e) = 0.0001, then an absolute error of 0.001 is unacceptable.
If P(X = x|e) = 0.3, then an absolute error of 0.001 is overly precise.
Types of Approximations
Relative error:
An estimate q of P(X = x|e) has relative error ε if
  P(X = x|e)(1 − ε) ≤ q ≤ P(X = x|e)(1 + ε)
equivalently
  q/(1 + ε) ≤ P(X = x|e) ≤ q/(1 − ε)
[Figure: the interval [0, 1] with q marked between q/(1 + ε) and q/(1 − ε)]
The sensitivity of the approximation depends on the actual value of the desired result.
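As a quick illustration (the numbers here are made up, not from the slides), a minimal Python check of the two error notions shows why a loose absolute error is useless for tiny probabilities:

def within_absolute(q, p, eps):
    # q is within absolute error eps of p:  p - eps <= q <= p + eps
    return abs(q - p) <= eps

def within_relative(q, p, eps):
    # q is within relative error eps of p:  p(1 - eps) <= q <= p(1 + eps)
    return p * (1 - eps) <= q <= p * (1 + eps)

# p = 0.0001: the estimate q = 0.0011 passes the absolute test with eps = 0.001,
# yet is off by a factor of 11; the 10% relative test rejects it.
print(within_absolute(0.0011, 0.0001, 0.001))  # True
print(within_relative(0.0011, 0.0001, 0.1))    # False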
Complexity
Recall, exact inference is NP-hard.
Is approximate inference any easier?
Recall the construction for exact inference:
Input: a 3-SAT formula φ
Output: a BN such that P(X = t) > 0 iff φ is satisfiable
Complexity: Relative Error
Suppose that q is an ε-relative error estimate of P(X = t).
If φ is not satisfiable, then
  0 = P(X = t)(1 − ε) ≤ q ≤ P(X = t)(1 + ε) = 0
Thus, if q > 0, then φ is satisfiable.
An immediate consequence:
Thm: Given ε, finding an ε-relative error approximation is NP-hard.
Complexity: Absolute Error
We can find absolute error approximations to P(X = x);
we will see such algorithms shortly.
However, once we have evidence, the problem is harder.
Thm: If ε < 0.5, then finding an estimate of P(X = x|e) with absolute error ε is NP-hard.
Proof
Recall our construction:
[Figure: the 3-SAT network — query variables Q1, Q2, Q3, Q4, …, Qn feed clause
nodes 1, 2, 3, …, k, which are combined by AND nodes A1, A2, …, Ak/2 into the
output node X]
Proof (cont.)
Suppose we can estimate such conditional probabilities with absolute error ε < 0.5.
Let p1 be the estimate of P(Q1 = t | X = t); assign q1 = t if p1 > 0.5, else q1 = f.
Let p2 be the estimate of P(Q2 = t | X = t, Q1 = q1); assign q2 = t if p2 > 0.5, else q2 = f.
…
Let pn be the estimate of P(Qn = t | X = t, Q1 = q1, …, Qn−1 = qn−1); assign qn = t if pn > 0.5, else qn = f.
Proof (cont.)
Claim: if φ is satisfiable, then q1, …, qn is a satisfying assignment.
Suppose φ is satisfiable. We show by induction on i that there is a satisfying assignment with Q1 = q1, …, Qi = qi.
Base case:
If Q1 = t in all satisfying assignments, then
  P(Q1 = t | X = t) = 1, so p1 ≥ 1 − ε > 0.5, hence q1 = t.
If Q1 = f in all satisfying assignments, then similarly q1 = f.
Otherwise, the statement holds for either choice of q1.
Proof (cont.)
Claim: if φ is satisfiable, then q1, …, qn is a satisfying assignment.
Induction step:
If Qi+1 = t in all satisfying assignments s.t. Q1 = q1, …, Qi = qi, then
  P(Qi+1 = t | X = t, Q1 = q1, …, Qi = qi) = 1, so pi+1 ≥ 1 − ε > 0.5, hence qi+1 = t.
If Qi+1 = f in all such satisfying assignments, then similarly qi+1 = f.
Proof (cont.)
We can efficiently check whether q1, …, qn is a satisfying assignment (linear time).
If it is, then φ is satisfiable; if it is not, then by the claim φ is not satisfiable.
So if we had an approximation procedure with absolute error ε < 0.5,
we could decide 3-SAT with n procedure calls.
⇒ absolute error approximation is NP-hard. (See the sketch below.)
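A minimal sketch of this decision procedure, assuming a hypothetical oracle approx_cond_prob(i, partial) that returns an estimate of P(Qi = t | X = t, Q1 = q1, …, Qi−1 = qi−1) with absolute error ε < 0.5 (the oracle and its interface are illustrative, not part of the slides):

def decode_assignment(n, approx_cond_prob):
    # Round each oracle answer to build a candidate assignment q1..qn.
    q = {}
    for i in range(1, n + 1):
        p_i = approx_cond_prob(i, dict(q))  # estimate of P(Qi = t | X = t, q so far)
        q[i] = p_i > 0.5                    # qi = t iff the estimate exceeds 0.5
    return q  # then verify q against the 3-SAT formula in linear time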
Search Algorithms
Idea: search for high probability instances.
Suppose x[1], …, x[N] are instances with high mass. We can approximate:

  P(Y = y | e) ≈ Σi P(Y = y, e | x[i]) P(x[i]) / Σi P(e | x[i]) P(x[i])

If x[i] is a complete instantiation, then P(e|x[i]) is 0 or 1.
Search Algorithms (cont.)

  P(Y = y | e) ≈ Σi P(Y = y, e | x[i]) P(x[i]) / Σi P(e | x[i]) P(x[i])

Instances that do not satisfy e do not play a role in the approximation.
We need to focus the search on instances that do satisfy e.
Clearly, in some cases this is hard (e.g., the construction from our NP-hardness result).
Stochastic Simulation
Suppose we can sample instances <x1,…,xn> according to P(X1,…,Xn).
What is the probability that a random sample <x1,…,xn> satisfies e?
This is exactly P(e).
We can view each sample as tossing a biased coin with probability P(e) of “Heads”.
Stochastic Sampling
Intuition: given a sufficient number of samples x[1],…,x[N], we can estimate

  P(e) ≈ #Heads / N = (1/N) Σi P(e|x[i])

The law of large numbers implies that as N grows, our estimate converges to P(e) with high probability.
How many samples do we need to get a reliable estimate?
Use Chernoff’s bound for binomial distributions. (A sketch follows.)
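A minimal sketch of this coin-tossing view, together with the sample-size calculation; the bound used here is the additive Chernoff–Hoeffding form, one standard instantiation of the bound mentioned above:

import math, random

def estimate_prob(sample_event, n_samples):
    # Fraction of "Heads": sample_event() returns True with probability P(e).
    heads = sum(sample_event() for _ in range(n_samples))
    return heads / n_samples

def samples_needed(eps, delta):
    # Additive Chernoff-Hoeffding bound: N >= ln(2/delta) / (2 eps^2)
    # guarantees P(|estimate - P(e)| > eps) <= delta.
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

N = samples_needed(eps=0.01, delta=0.05)                 # 18445 samples
p_hat = estimate_prob(lambda: random.random() < 0.3, N)  # toy coin with P(e) = 0.3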
Sampling a Bayesian Network
If P(X1,…,Xn) is represented by a Bayesian network, can we efficiently sample from it?
Idea: sample according to the structure of the network:
write the distribution using the chain rule, and then sample each variable given its parents.
Logic sampling
[Figure (animation over five slides): the Burglary network — Burglary and
Earthquake are roots, Alarm has parents Burglary and Earthquake, Call has
parent Alarm, Radio has parent Earthquake. CPTs: P(b) = 0.03, P(e) = 0.001;
P(a|b,e) = 0.98, P(a|b,¬e) = 0.7, P(a|¬b,e) = 0.4, P(a|¬b,¬e) = 0.01;
P(c|a) = 0.8, P(c|¬a) = 0.05; P(r|e) = 0.3, P(r|¬e) = 0.001.
Each variable is sampled in turn in the order B, E, A, C, R, growing one
row b, e, a, c, r in the table of samples.]
Logic Sampling
X1, …, Xn be order of variables consistent with
arc direction
for i = 1, …, n do
sample xi from P(Xi | pai )
(Note: since Pai {X1,…,Xi-1}, we already
assigned values to them)
return x1, …,xn
Let
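A minimal Python sketch of this procedure for the Burglary network of the example, with the CPTs read off the slides (each table maps a tuple of parent values to P(variable = true)):

import random

PARENTS = {"B": (), "E": (), "A": ("B", "E"), "C": ("A",), "R": ("E",)}
CPT = {
    "B": {(): 0.03},
    "E": {(): 0.001},
    "A": {(True, True): 0.98, (True, False): 0.7,    # parents (B, E)
          (False, True): 0.4, (False, False): 0.01},
    "C": {(True,): 0.8, (False,): 0.05},             # parent (A,)
    "R": {(True,): 0.3, (False,): 0.001},            # parent (E,)
}
ORDER = ["B", "E", "A", "C", "R"]  # consistent with arc direction

def logic_sample():
    x = {}
    for var in ORDER:
        pa = tuple(x[p] for p in PARENTS[var])   # parents are already sampled
        x[var] = random.random() < CPT[var][pa]  # sample xi from P(Xi | pai)
    return x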
Logic Sampling
Sampling a complete instance is linear in the number of variables,
regardless of the structure of the network.
However, if P(e) is small, we need many samples to get a decent estimate.
Can we sample from P(X1,…,Xn | e)?
If the evidence is in the roots of the network, easily.
If the evidence is in the leaves, we have a problem:
our sampling method proceeds according to the order of the nodes in the graph.
Note, we can use arc reversal to make the evidence nodes roots.
In some networks, however, this will create exponentially large tables...
Likelihood Weighting
Can we ensure that all of our samples satisfy e?
One simple solution:
when we need to sample a variable that is assigned a value by e, use the specified value.
For example, in the two-node network X → Y, suppose we know Y = 1:
sample X from P(X), then take Y = 1.
Is this a sample from P(X,Y | Y = 1)?
Likelihood Weighting
Problem: these samples of X are from P(X).
Solution: penalize samples in which P(Y=1|X) is small.
We now sample as follows (a numeric sketch follows):
Let x[i] be a sample from P(X).
Let w[i] = P(Y = 1 | X = x[i]).

  P(X = x | Y = 1) ≈ Σi w[i] P(X = x | x[i]) / Σi w[i]

(Here P(X = x | x[i]) is simply 1 if x[i] = x and 0 otherwise.)
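A numeric sketch of this two-node scheme (the CPT numbers here are made up for illustration):

import random

# X -> Y with P(X = 1) = 0.2, P(Y = 1 | X = 1) = 0.9, P(Y = 1 | X = 0) = 0.1
P_X1, P_Y1 = 0.2, {1: 0.9, 0: 0.1}

num = den = 0.0
for _ in range(100_000):
    x = 1 if random.random() < P_X1 else 0  # x[i] sampled from P(X)
    w = P_Y1[x]                             # w[i] = P(Y = 1 | X = x[i])
    num += w * (x == 1)
    den += w
print(num / den)  # estimates P(X = 1 | Y = 1); exact value 0.18/0.26 ≈ 0.69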
Likelihood Weighting
Why does this make sense?
When N is large, we expect to sample about N·P(X = x) samples with x[i] = x. Thus,

  Σ{i : x[i] = x} w[i] ≈ N · P(X = x) · P(Y = 1 | X = x) = N · P(X = x, Y = 1)

When we normalize, we get an approximation of the conditional probability.
Likelihood Weighting
[Figure (animation over five slides): the same Burglary network, now with
evidence A = a and R = r. The evidence variables are never sampled: they are
fixed to their observed values, and the sample’s weight accumulates the
corresponding CPT entries given the sampled parents. The slides sample B and
E, fix A = a (weight 0.6), sample C, and fix R = r (weight ×0.3), yielding
one sample row in the table with total weight 0.6 · 0.3.]
Likelihood Weighting
Let X1, …, Xn be an order of the variables consistent with arc direction.
w = 1
for i = 1, …, n do
  if Xi = xi has been observed
    w ← w · P(Xi = xi | pai)
  else
    sample xi from P(Xi | pai)
return x1, …, xn, and w
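A minimal sketch of this loop, reusing CPT, PARENTS and ORDER from the logic-sampling sketch above; the query at the end is an illustrative choice, not from the slides:

import random

def lw_sample(evidence):
    # evidence maps a variable name to its observed boolean value.
    x, w = {}, 1.0
    for var in ORDER:
        pa = tuple(x[p] for p in PARENTS[var])
        p_true = CPT[var][pa]
        if var in evidence:                       # observed: fix it and weight
            x[var] = evidence[var]
            w *= p_true if x[var] else 1.0 - p_true
        else:                                     # hidden: sample as usual
            x[var] = random.random() < p_true
    return x, w

# e.g. estimate P(B = true | A = true, R = true):
num = den = 0.0
for _ in range(100_000):
    x, w = lw_sample({"A": True, "R": True})
    num += w * x["B"]
    den += w
print(num / den)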
Importance Sampling
A general method for evaluating ⟨f⟩P(X) when we cannot sample from P(X).
Idea: choose an approximating distribution Q(X) and sample from it.
Writing w(X) = P(X)/Q(X):

  ⟨f⟩P = ∫ f(x) P(x) dx = ∫ f(x) P(x) (Q(x)/Q(x)) dx = ∫ f(x) w(x) Q(x) dx = ⟨f·w⟩Q

Using this, we can now sample from Q:

  If we could generate samples from P(X):   ⟨f⟩P ≈ (1/M) Σm=1..M f(x[m])
  Now that we generate samples from Q(X):   ⟨f⟩P ≈ (1/M) Σm=1..M f(x[m]) w(m)
(Unnormalized) Importance Sampling
1. For m = 1:M
   Sample x[m] from Q(X)
   Calculate w(m) = P(x[m]) / Q(x[m])
2. Estimate the expectation of f(X) using

  ⟨f⟩P ≈ (1/M) Σm=1..M f(x[m]) w(m)

Requirements:
P(x) > 0 ⇒ Q(x) > 0 (do not ignore possible scenarios)
It is possible to calculate P(x), Q(x) for a specific x
It is possible to sample from Q(X)
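A minimal generic sketch of this procedure; the target and proposal densities used below (two exponentials) are made-up placeholders that satisfy the three requirements:

import math, random

def unnormalized_is(f, sample_q, p_pdf, q_pdf, M):
    # <f>_P ~ (1/M) sum_m f(x[m]) w(m),  with w(m) = P(x[m]) / Q(x[m])
    total = 0.0
    for _ in range(M):
        x = sample_q()
        total += f(x) * p_pdf(x) / q_pdf(x)
    return total / M

# Target P = Exp(1), proposal Q = Exp(0.5); E_P[x] should come out near 1.
est = unnormalized_is(
    f=lambda x: x,
    sample_q=lambda: random.expovariate(0.5),
    p_pdf=lambda x: math.exp(-x),
    q_pdf=lambda x: 0.5 * math.exp(-0.5 * x),
    M=100_000,
)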
Normalized Importance Sampling
Assume that we cannot now even evalute P(X=x) but
can evaluate P’(X=x) = P(X=x)
(for example we can evaluate P(X) but not P(X|e) in a Bayesian network)
We define w’(X) = P’(X)/Q(X). We can then evaluate
:
w '(X )
Q (X )
P '(X )
Q ( X )
P '(x ) α
Q (X )
x
x
and then:
f (x )
P (X )
f (x )P (x )dx f (x )P (x )
x
x
1
Q (X )
1
f
(
x
)
P
'
(
x
)
dx
f (X )w '(X )
αx
Q (X )
α
Q (X )
dx
Q (X )
Q (X )
f (X )w '(X )
w '(X )
Q (X )
Q (X )
where in the last step we simply replace with the above equation
Normalized Importance Sampling
We can now estimate the expectation of f(X) similarly to unnormalized importance sampling, by sampling from Q(X) and then using

  ⟨f⟩P ≈ Σm=1..M f(x[m]) w’(m) / Σm=1..M w’(m)

(hence the name “normalized”)
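The same sketch adapted to the normalized estimator; p_tilde stands for the unnormalized density P’(x) = αP(x), known only up to the constant α:

def normalized_is(f, sample_q, p_tilde, q_pdf, M):
    # <f>_P ~ sum_m f(x[m]) w'(m) / sum_m w'(m),  w'(m) = P'(x[m]) / Q(x[m])
    num = den = 0.0
    for _ in range(M):
        x = sample_q()
        w = p_tilde(x) / q_pdf(x)
        num += f(x) * w
        den += w
    return num / den  # the unknown alpha cancels in the ratio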
Importance Sampling to LW
We want to compute P(Y = y | e)
(X is the set of random variables in the network and Y is some subset we are interested in).
1) Define a mutilated Bayesian network BZ=z to be a network where:
• all variables in Z are disconnected from their parents and are deterministically set to z
• all other variables remain unchanged
2) Choose Q to be BE=e.
Convince yourself that w’(X) = P’(X)/Q(X) is then exactly the likelihood weight:
the product of the CPT entries P(xi | pai) of the evidence variables.
3) Choose f(x) to be the indicator 1(x⟨Y⟩ = y).
4) Plug into the normalized estimator and you get exactly Likelihood Weighting.
Likelihood weighting is correct!