PGM 2002/3 – Tirgul 6
Approximate Inference: Sampling

Approximation

Until now we examined exact computation. In many applications an approximation is sufficient.
Example: P(X = x | e) = 0.3183098861838. Maybe P(X = x | e) ≈ 0.3 is a good enough approximation, e.g., if we take action only when P(X = x | e) > 0.5.
Can we find good approximation algorithms?

Types of Approximations: Absolute error

An estimate q of P(X = x | e) has absolute error $\epsilon$ if
$$P(X = x \mid e) - \epsilon \le q \le P(X = x \mid e) + \epsilon,$$
equivalently
$$q - \epsilon \le P(X = x \mid e) \le q + \epsilon.$$
Absolute error is not always what we want:
- If P(X = x | e) = 0.0001, then an absolute error of 0.001 is unacceptable.
- If P(X = x | e) = 0.3, then an absolute error of 0.001 is overly precise.

Types of Approximations: Relative error

An estimate q of P(X = x | e) has relative error $\epsilon$ if
$$P(X = x \mid e)(1 - \epsilon) \le q \le P(X = x \mid e)(1 + \epsilon),$$
equivalently
$$\frac{q}{1 + \epsilon} \le P(X = x \mid e) \le \frac{q}{1 - \epsilon}.$$
The sensitivity of the approximation thus depends on the actual value of the desired result.
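The two definitions are easy to operationalize. Here is a minimal Python sketch (the function names are mine, not from the slides) that checks each criterion and reproduces the point about small probabilities:

import math  # not strictly needed; shown for self-containment

def within_absolute_error(p, q, eps):
    # q has absolute error eps iff p - eps <= q <= p + eps
    return p - eps <= q <= p + eps

def within_relative_error(p, q, eps):
    # q has relative error eps iff p(1 - eps) <= q <= p(1 + eps)
    return p * (1 - eps) <= q <= p * (1 + eps)

# For p = 0.0001, an absolute error of 0.001 is useless: even q = 0 passes.
print(within_absolute_error(0.0001, 0.0, 0.001))  # True
print(within_relative_error(0.0001, 0.0, 0.001))  # False: relative error rejects q = 0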
Complexity

Recall that exact inference is NP-hard. Is approximate inference any easier?
Construction for exact inference: Input: a 3-SAT formula φ. Output: a BN such that P(X = t) > 0 iff φ is satisfiable.

Complexity: Relative Error

Suppose q is an $\epsilon$-relative-error estimate of P(X = t). If φ is not satisfiable, then
$$0 = P(X = t)(1 - \epsilon) \le q \le P(X = t)(1 + \epsilon) = 0.$$
Thus, if q > 0, then φ is satisfiable.
An immediate consequence:
Thm: Given $\epsilon$, finding an $\epsilon$-relative-error approximation is NP-hard.

Complexity: Absolute error

We can find absolute-error approximations to P(X = x); we will see such algorithms shortly. However, once we have evidence, the problem is harder.
Thm: If $\epsilon < 0.5$, then finding an estimate of P(X = x | e) with absolute error $\epsilon$ is NP-hard.

Proof

[Figure: the 3-SAT network, with query nodes Q1, ..., Qn feeding clause nodes 1, ..., k, which are combined by AND nodes A1, ..., Ak/2 into the output node X.]
Suppose we can estimate conditional probabilities with absolute error $\epsilon < 0.5$:
Let $p_1 \approx P(Q_1 = t \mid X = t)$; assign $q_1 = t$ if $p_1 > 0.5$, else $q_1 = f$.
Let $p_2 \approx P(Q_2 = t \mid X = t, Q_1 = q_1)$; assign $q_2 = t$ if $p_2 > 0.5$, else $q_2 = f$.
...
Let $p_n \approx P(Q_n = t \mid X = t, Q_1 = q_1, \ldots, Q_{n-1} = q_{n-1})$; assign $q_n = t$ if $p_n > 0.5$, else $q_n = f$.

Proof (cont.)

Claim: if φ is satisfiable, then q1, ..., qn is a satisfying assignment.
Suppose φ is satisfiable. We show by induction on i that there is a satisfying assignment with Q1 = q1, ..., Qi = qi.
Base case: If Q1 = t in all satisfying assignments, then $P(Q_1 = t \mid X = t) = 1$, so $p_1 \ge 1 - \epsilon > 0.5$ and hence q1 = t. If Q1 = f in all satisfying assignments, then symmetrically q1 = f. Otherwise, the statement holds for either choice of q1.
Induction step: If Qi+1 = t in all satisfying assignments with Q1 = q1, ..., Qi = qi, then $P(Q_{i+1} = t \mid X = t, Q_1 = q_1, \ldots, Q_i = q_i) = 1$, so $p_{i+1} \ge 1 - \epsilon > 0.5$ and hence qi+1 = t. If Qi+1 = f in all such satisfying assignments, then qi+1 = f.

Proof (cont.)

We can efficiently check whether q1, ..., qn is a satisfying assignment (linear time). If it is, then φ is satisfiable; if it is not, then φ is not satisfiable.
So given an approximation procedure with absolute error $\epsilon < 0.5$, we can decide 3-SAT with n procedure calls. Hence this approximation is NP-hard.

Search Algorithms

Idea: search for high-probability instances. Suppose x[1], ..., x[N] are instances with high mass. We can approximate:
$$P(Y = y \mid e) \approx \frac{\sum_i P(Y = y, e \mid x[i])\, P(x[i])}{\sum_i P(e \mid x[i])\, P(x[i])}$$
If x[i] is a complete instantiation, then P(e | x[i]) is 0 or 1.

Search Algorithms (cont.)

Instances that do not satisfy e play no role in the approximation, so we need to focus the search on instances that do satisfy e. Clearly, in some cases this is hard (e.g., in the construction from our NP-hardness result).

Stochastic Simulation

Suppose we can sample instances <x1, ..., xn> according to P(X1, ..., Xn). What is the probability that a random sample <x1, ..., xn> satisfies e? It is exactly P(e). We can therefore view each sample as tossing a biased coin with probability P(e) of "Heads".

Stochastic Sampling

Intuition: given a sufficient number of samples x[1], ..., x[N], we can estimate
$$P(e) \approx \frac{\#\text{Heads}}{N} = \frac{\sum_i P(e \mid x[i])}{N}$$
The law of large numbers implies that as N grows, our estimate converges to P(e) with high probability.
How many samples do we need to get a reliable estimate? Use the Chernoff bound for binomial distributions.

Sampling a Bayesian Network

If P(X1, ..., Xn) is represented by a Bayesian network, can we efficiently sample from it?
Idea: sample according to the structure of the network. Write the distribution using the chain rule, and then sample each variable given its parents.

Logic sampling

[Figure: the Burglary/Earthquake/Alarm/Call/Radio network, stepped through over five animation frames.] The CPTs are:
P(b) = 0.03, P(e) = 0.001
P(a | b, e) = 0.98, P(a | b, ¬e) = 0.7, P(a | ¬b, e) = 0.4, P(a | ¬b, ¬e) = 0.01
P(c | a) = 0.8, P(c | ¬a) = 0.05
P(r | e) = 0.3, P(r | ¬e) = 0.001
The frames build one sample column by column: first B is drawn from P(B), then E from P(E), then A from P(A | B, E), then C from P(C | A), and finally R from P(R | E), yielding one complete instantiation <b, e, a, c, r>.

Logic Sampling

Let X1, ..., Xn be an order of the variables consistent with arc direction.
for i = 1, ..., n do
    sample xi from P(Xi | pai)
    (Note: since Pai ⊆ {X1, ..., Xi-1}, we have already assigned values to them.)
return x1, ..., xn

Logic Sampling (cont.)

Sampling a complete instance is linear in the number of variables, regardless of the structure of the network. However, if P(e) is small, we need many samples to get a decent estimate.
Can we sample from P(X1, ..., Xn | e)? If the evidence is at the roots of the network, easily. If the evidence is at the leaves, we have a problem: our sampling method proceeds according to the order of the nodes in the graph. Note that we can use arc reversal to make the evidence nodes roots; in some networks, however, this will create exponentially large tables...
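To make the logic-sampling loop concrete, here is a minimal Python sketch for the Burglary/Earthquake network above. The CPT numbers follow the slides' tables; the dictionary encoding and function names are mine.

import random

P_b = 0.03
P_e = 0.001
P_a = {(True, True): 0.98, (True, False): 0.70,   # P(a | B, E)
       (False, True): 0.40, (False, False): 0.01}
P_c = {True: 0.80, False: 0.05}                    # P(c | A)
P_r = {True: 0.30, False: 0.001}                   # P(r | E)

def bernoulli(p):
    return random.random() < p

def logic_sample():
    # Sample in an order consistent with arc direction:
    # parents are always assigned before their children.
    b = bernoulli(P_b)
    e = bernoulli(P_e)
    a = bernoulli(P_a[(b, e)])
    c = bernoulli(P_c[a])
    r = bernoulli(P_r[e])
    return b, e, a, c, r

# Estimate P(e) for the evidence "Call = t" by counting "Heads":
N = 100_000
heads = sum(sample[3] for sample in (logic_sample() for _ in range(N)))
print("P(C = t) ~=", heads / N)

Note how this runs into exactly the problem described above: for rare evidence (say Radio = t, whose marginal probability is on the order of 0.001), almost every sample is wasted.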
Likelihood Weighting

Can we ensure that all of our samples satisfy e? One simple solution: when we need to sample a variable that is assigned a value by e, use the specified value.
For example, suppose we know Y = 1. Sample X from P(X), then take Y = 1. Is this a sample from P(X, Y | Y = 1)?

Likelihood Weighting (cont.)

Problem: these samples of X are from P(X), not from P(X | Y = 1). Solution: penalize samples in which P(Y = 1 | X) is small.
We now sample as follows: let x[i] be a sample from P(X), and let w[i] = P(Y = 1 | X = x[i]). Then
$$P(X = x \mid Y = 1) \approx \frac{\sum_i w[i]\, \mathbf{1}(x[i] = x)}{\sum_i w[i]}$$

Likelihood Weighting (cont.)

Why does this make sense? When N is large, we expect to sample about N·P(X = x) samples with x[i] = x. Thus
$$\sum_{i:\, x[i] = x} w[i] \approx N\, P(X = x)\, P(Y = 1 \mid X = x) = N\, P(X = x, Y = 1)$$
When we normalize, we get an approximation of the conditional probability.

Likelihood Weighting (example)

[Figure: the same Burglary/Earthquake network, stepped through over five animation frames, now with evidence A = a and R = r.] B and E are sampled as before; A and R are fixed to their observed values, and each observation multiplies the sample's weight by its CPT entry given the sampled parent values (the frames show the weight accumulating as 0.6 and then 0.6 · 0.3). C is then sampled from P(C | A = a) as usual.

Likelihood Weighting

Let X1, ..., Xn be an order of the variables consistent with arc direction.
w = 1
for i = 1, ..., n do
    if Xi = xi has been observed
        w ← w · P(Xi = xi | pai)
    else
        sample xi from P(Xi | pai)
return x1, ..., xn, and w
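A minimal Python sketch of this loop for the same network, with evidence on Alarm and Radio as in the example frames (the CPT encoding and names follow the logic-sampling sketch above and are mine):

import random

P_b, P_e = 0.03, 0.001
P_a = {(True, True): 0.98, (True, False): 0.70,
       (False, True): 0.40, (False, False): 0.01}
P_c = {True: 0.80, False: 0.05}
P_r = {True: 0.30, False: 0.001}

def bernoulli(p):
    return random.random() < p

def weighted_sample(a_obs, r_obs):
    # Traverse in topological order; fix observed variables and
    # multiply their CPT probability into the weight.
    w = 1.0
    b = bernoulli(P_b)
    e = bernoulli(P_e)
    a = a_obs
    w *= P_a[(b, e)] if a_obs else 1.0 - P_a[(b, e)]
    c = bernoulli(P_c[a])
    r = r_obs
    w *= P_r[e] if r_obs else 1.0 - P_r[e]
    return (b, e, a, c, r), w

# Estimate P(B = t | A = t, R = t) as a weighted frequency.
N = 100_000
num = den = 0.0
for _ in range(N):
    (b, *_), w = weighted_sample(True, True)
    num += w * b
    den += w
print("P(B = t | A = t, R = t) ~=", num / den)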
Importance Sampling

A general method for evaluating $\langle f(X) \rangle_{P(X)}$ when we cannot sample from P(X). Idea: choose an approximating distribution Q(X), sample from it, and correct with the weight w(X) = P(X)/Q(X):
$$\langle f(X) \rangle_{P(X)} = \int f(x)\, P(x)\, dx = \int f(x)\, \frac{P(x)}{Q(x)}\, Q(x)\, dx = \left\langle f(X)\, \frac{P(X)}{Q(X)} \right\rangle_{Q(X)}$$
If we could generate samples from P(X):
$$\langle f(X) \rangle_{P(X)} \approx \frac{1}{M} \sum_{m=1}^{M} f(x[m])$$
Now that we generate the samples from Q(X):
$$\langle f(X) \rangle_{P(X)} \approx \frac{1}{M} \sum_{m=1}^{M} f(x[m])\, w(m)$$

(Unnormalized) Importance Sampling

1. For m = 1:M: sample x[m] from Q(X) and calculate w(m) = P(x[m]) / Q(x[m]).
2. Estimate the expectation of f(X) using
$$\langle f(X) \rangle_{P(X)} \approx \frac{1}{M} \sum_{m=1}^{M} f(x[m])\, w(m)$$
Requirements:
- P(x) > 0 ⇒ Q(x) > 0 (do not ignore possible scenarios)
- it is possible to calculate P(x) and Q(x) for a specific x
- it is possible to sample from Q(X)

Normalized Importance Sampling

Assume that we cannot now even evaluate P(X = x), but can evaluate P'(X = x) = αP(X = x) for an unknown constant α (for example, in a Bayesian network we can evaluate P(X) but not P(X | e)). We define w'(X) = P'(X)/Q(X). We can then evaluate α:
$$\langle w'(X) \rangle_{Q(X)} = \sum_x Q(x)\, \frac{P'(x)}{Q(x)} = \sum_x P'(x) = \alpha$$
and then:
$$\langle f(X) \rangle_{P(X)} = \int f(x)\, P(x)\, dx = \frac{1}{\alpha} \int f(x)\, \frac{P'(x)}{Q(x)}\, Q(x)\, dx = \frac{1}{\alpha}\, \langle f(X)\, w'(X) \rangle_{Q(X)} = \frac{\langle f(X)\, w'(X) \rangle_{Q(X)}}{\langle w'(X) \rangle_{Q(X)}}$$
where in the last step we simply replaced α with the expression from the equation above.

Normalized Importance Sampling

We can now estimate the expectation of f(X), similarly to unnormalized importance sampling, by sampling from Q(X) and computing
$$\langle f(X) \rangle_{P(X)} \approx \frac{\sum_{m=1}^{M} f(x[m])\, w'(m)}{\sum_{m=1}^{M} w'(m)}$$
(hence the name "normalized").

Importance Sampling to LW

We want to compute P(Y = y | e), where X is the set of random variables in the network and Y is some subset we are interested in.
1) Define a mutilated Bayesian network B_{Z=z} to be a network where:
- all variables in Z are disconnected from their parents and are deterministically set to z
- all other variables remain unchanged
2) Choose Q to be B_{E=e}; convince yourself that P'(x)/Q(x) is exactly the likelihood weight w(x), the product of the CPT entries of the evidence variables given their sampled parents.
3) Choose f(x) to be the indicator 1(y(x) = y).
4) Plug these into the normalized importance sampling formula, and you get exactly Likelihood Weighting.
Likelihood weighting is correct!!!
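To close the loop, here is a generic normalized-importance-sampling sketch in Python (all names are mine). Given a sampler for Q, an evaluator for Q, and an evaluator for the unnormalized P', it computes the ratio estimator above; plugging in the mutilated-network proposal, the original network's joint as P', and an indicator f recovers the likelihood-weighting computation.

import math
import random

def normalized_is(sample_q, q_pdf, p_unnorm, f, M):
    # <f>_P ~= sum_m f(x[m]) w'(m) / sum_m w'(m),  where w'(x) = P'(x)/Q(x)
    num = den = 0.0
    for _ in range(M):
        x = sample_q()
        w = p_unnorm(x) / q_pdf(x)
        num += f(x) * w
        den += w
    return num / den

# Toy usage: estimate E[X] under P(x) proportional to exp(-x^2/2) on [0, 1],
# using Q = Uniform(0, 1); the normalizer of P is never computed explicitly.
est = normalized_is(sample_q=random.random,
                    q_pdf=lambda x: 1.0,
                    p_unnorm=lambda x: math.exp(-x * x / 2),
                    f=lambda x: x,
                    M=100_000)
print("E[X] under P ~=", est)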