Approximate Inference
Edited from slides by Nir Friedman.

Complexity of Inference

Theorem: Computing P(X = x) in a Bayesian network is NP-hard.
This is not surprising, since we can simulate Boolean gates.

Proof: We reduce 3-SAT to Bayesian network computation. Assume we are given a 3-SAT problem: let q1,…,qn be propositions and Λ1,…,Λk be clauses such that Λi = li1 ∨ li2 ∨ li3, where each lij is a literal over q1,…,qn, and let φ = Λ1 ∧ … ∧ Λk. We construct a Bayesian network such that P(X = t) > 0 iff φ is satisfiable.

The network has root variables Q1,…,Qn with P(Qi = true) = 0.5; a clause variable Λi for each clause, whose parents are the propositions Qi, Qj, Ql appearing in the clause, with P(Λi = true | Qi, Qj, Ql) = 1 iff the values of Qi, Qj, Ql satisfy clause Λi; and A1, A2, …, which are simple binary AND gates chained over the clause variables, with X the output of the last gate.

It is easy to check that:
- the network has a polynomial number of variables;
- each local probability table is small (at most 8 parameters);
- P(X = true) > 0 if and only if there exists a satisfying assignment to Q1,…,Qn.
Conclusion: this is a polynomial reduction from 3-SAT.

Note: the construction also shows that computing P(X = t) is harder than NP: 2ⁿ · P(X = t) is the number of satisfying assignments to φ. Thus, the problem is #P-hard.

Hardness: Notes

We used deterministic relations in our construction. The same construction works if we use (1-ε, ε) instead of (1, 0) in each gate, for any ε < 0.5.

Hardness does not mean we cannot solve inference. It implies that we cannot find a general procedure that works efficiently for all networks. For particular families of networks, we can have provably efficient procedures; we have seen such families in the course: HMMs and evolutionary trees.

Approximation

Until now we have examined exact computation. In many applications, approximations are sufficient.
Example: P(X = x|e) = 0.3183098861838. Perhaps P(X = x|e) ≈ 0.3 is a good enough approximation, e.g., if we take action only when P(X = x|e) > 0.5.
Can we find good approximation algorithms?

Types of Approximations

Absolute error: an estimate q of P(X = x | e) has absolute error ε if
  P(X = x|e) - ε ≤ q ≤ P(X = x|e) + ε,
or equivalently,
  q - ε ≤ P(X = x|e) ≤ q + ε.
Absolute error is not always what we want:
- if P(X = x | e) = 0.0001, then an absolute error of 0.001 is unacceptable;
- if P(X = x | e) = 0.3, then an absolute error of 0.001 is overly precise.

Relative error: an estimate q of P(X = x | e) has relative error ε if
  P(X = x|e)(1 - ε) ≤ q ≤ P(X = x|e)(1 + ε),
or equivalently,
  q/(1 + ε) ≤ P(X = x|e) ≤ q/(1 - ε).
The sensitivity of the approximation thus depends on the actual value of the desired result.

Complexity of Approximation

Exact inference is NP-hard; is approximate inference any easier? Recall the construction for exact inference: given a 3-SAT problem φ, we output a BN such that P(X = t) > 0 iff φ is satisfiable.

Complexity: Relative Error

Theorem: Given ε, finding an ε-relative error approximation is NP-hard.
Suppose q is an ε-relative error estimate of P(X = t) = 0. Then
  0 = P(X = t)(1 - ε) ≤ q ≤ P(X = t)(1 + ε) = 0,
so q = 0. Thus ε-relative error estimation and exact computation coincide at the value 0.

Complexity: Absolute Error

Theorem: If ε < 0.5, then finding an estimate of P(X = x|e) with absolute error ε is NP-hard.

Proof: Recall our construction (roots Q1,…,Qn, clause variables Λ1,…,Λk, AND gates A1, A2, …, output X). Suppose we can estimate conditional probabilities with absolute error ε. Then:
Let p1 ≈ P(Q1 = t | X = t); assign q1 = t if p1 > 0.5, else q1 = f.
Let p2 ≈ P(Q2 = t | X = t, Q1 = q1); assign q2 = t if p2 > 0.5, else q2 = f.
…
Let pn ≈ P(Qn = t | X = t, Q1 = q1, …, Qn-1 = qn-1); assign qn = t if pn > 0.5, else qn = f.

Claim: if φ is satisfiable, then q1,…,qn is a satisfying assignment.
Suppose φ is satisfiable. We show by induction on i that there is a satisfying assignment with Q1 = q1, …, Qi = qi.
Base case: if Q1 = t in all satisfying assignments, then P(Q1 = t | X = t) = 1, so p1 ≥ 1 - ε > 0.5 and hence q1 = t. If Q1 = f in all satisfying assignments, then symmetrically q1 = f. Otherwise, both values of Q1 extend to satisfying assignments, and the statement holds for either choice of q1.
Induction step: if Qi+1 = t in all satisfying assignments with Q1 = q1, …, Qi = qi, then P(Qi+1 = t | X = t, Q1 = q1, …, Qi = qi) = 1, so pi+1 ≥ 1 - ε > 0.5 and qi+1 = t. If Qi+1 = f in all such satisfying assignments, then qi+1 = f. Otherwise, the statement holds for either choice of qi+1.

Finally, we can check in linear time whether q1,…,qn satisfies φ. If it does, φ is satisfiable; if it does not, then by the claim φ is unsatisfiable. So with an approximation procedure of absolute error ε < 0.5 we can decide 3-SAT using n procedure calls: generate the assignment as above, then test it. If a satisfying assignment exists, this procedure finds one; if none exists, it cannot. Hence approximation with absolute error ε < 0.5 is NP-hard.
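To make the reduction concrete, here is a minimal sketch in Python. The clause encoding (DIMACS-style signed integers) and the names CLAUSES, N, clause_node, and x_node are illustrative assumptions, not part of the original slides; the exhaustive enumeration is only there to check the identity 2ⁿ · P(X = t) = #satisfying assignments on a toy instance — the reduction itself merely writes down the polynomial-size network.

```python
from itertools import product

# Hypothetical 3-CNF instance over q1..q4, DIMACS-style encoding:
# each clause is a tuple of nonzero ints; +i means qi, -i means NOT qi.
CLAUSES = [(1, -2, 3), (-1, 2, 4), (2, -3, -4)]
N = 4

def clause_node(clause, assignment):
    """Deterministic CPD of a clause variable:
    P(Lambda = true | parents) = 1 iff the parents satisfy the clause."""
    return int(any(assignment[abs(l) - 1] == (l > 0) for l in clause))

def x_node(assignment):
    """X is the output of the AND-gate cascade over all clause variables."""
    return int(all(clause_node(c, assignment) for c in CLAUSES))

# The roots Q1..Qn are uniform, so P(X = t) = (#satisfying assignments) / 2^n.
# (Enumeration is exponential; it is here only to check the identity.)
count = sum(x_node(a) for a in product([False, True], repeat=N))
print(f"P(X=t) = {count / 2**N}  ->  satisfiable: {count > 0}")
print(f"2^n * P(X=t) = {count} satisfying assignments (the #P connection)")
```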
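The decoding loop from the proof can also be written directly against an inference oracle. The sketch below (reusing CLAUSES, N, x_node, and the product import from the previous sketch) substitutes exact enumeration for the assumed approximator; in the proof, any oracle with absolute error ε < 0.5 suffices, because thresholding at 0.5 still picks a value consistent with some satisfying assignment whenever one exists.

```python
def oracle(i, prefix):
    """Stand-in for the assumed approximator: returns
    P(Q_{i+1} = t | X = t, Q_1 = prefix[0], ..., Q_i = prefix[i-1]),
    computed exactly by enumeration. The proof only needs this value
    up to absolute error eps < 0.5."""
    num = den = 0
    for a in product([False, True], repeat=N):
        if list(a[:len(prefix)]) == prefix and x_node(a):
            den += 1
            num += a[i]
    return num / den if den else 0.0

def decode():
    """Threshold each oracle call at 0.5, as in the hardness proof."""
    prefix = []
    for i in range(N):
        prefix.append(oracle(i, prefix) > 0.5)   # q_i = t iff p_i > 0.5
    return prefix

q = decode()
print(q, "satisfies phi" if x_node(tuple(q)) else "-> phi is unsatisfiable")
```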
When Can We Hope to Approximate?

Two situations:
- "Peaked" distributions: improbable values can be ignored.
- Highly stochastic distributions: "far" evidence can be discarded (e.g., far markers in genetic linkage analysis).
[Figure: a peaked probability distribution; y-axis: Probability, 0 to 0.16.]

Stochastic Simulation

Suppose we can sample instances <x1,…,xn> according to P(X1,…,Xn). What is the probability that a random sample <x1,…,xn> satisfies e? It is exactly P(e). We can thus view each sample as a toss of a biased coin with probability P(e) of "heads".

Stochastic Sampling

Intuition: given a sufficient number of samples x[1],…,x[N], we can estimate
  P̂(e) = #Heads / N = (1/N) Σᵢ P(e | x[i]),
where each term P(e | x[i]) is 1 or 0, since x[i] is a complete instance. The law of large numbers implies that as N grows, the estimate converges to P(e) with high probability.

Sampling a Bayesian Network

If P(X1,…,Xn) is represented by a Bayesian network, can we efficiently sample from it? Yes: sample according to the structure of the network, i.e., sample each variable given its already-sampled parents.

Logic Sampling

Example: the Burglary network (Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call) with CPTs:

  P(b) = 0.03                P(e) = 0.001
  P(a | b, e) = 0.98         P(a | b, ¬e) = 0.7
  P(a | ¬b, e) = 0.4         P(a | ¬b, ¬e) = 0.01
  P(c | a) = 0.8             P(c | ¬a) = 0.05
  P(r | e) = 0.3             P(r | ¬e) = 0.001

To draw a sample: sample B from P(B), then E from P(E), then A from P(A | B, E), then C from P(C | A), then R from P(R | E), yielding one complete instance over <B, E, A, C, R>.

The algorithm:
  Let X1, …, Xn be an ordering of the variables consistent with arc direction.
  for i = 1, …, n do
    sample xi from P(Xi | pai)
    (since Pai ⊆ {X1,…,Xi-1}, values for the parents have already been assigned)
  return x1, …, xn

Sampling a complete instance is linear in the number of variables, regardless of the structure of the network. However, if P(e) is small, we need many samples to get a decent estimate.
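A runnable sketch of logic sampling on this network, with CPT values taken from the tables above (the variable names and the bernoulli helper are mine). It draws N forward samples and estimates P(e) as the fraction of samples satisfying e, i.e., the biased-coin estimator:

```python
import random

# CPTs for the Burglary network above (True stands for b, e, a, c, r).
P_B = 0.03
P_E = 0.001
P_A = {(True, True): 0.98, (True, False): 0.7,
       (False, True): 0.4, (False, False): 0.01}
P_C = {True: 0.8, False: 0.05}
P_R = {True: 0.3, False: 0.001}

def bernoulli(p):
    """Toss a biased coin: True with probability p."""
    return random.random() < p

def forward_sample():
    """One pass of logic sampling: each variable is drawn given its
    already-sampled parents, in an order consistent with the arcs."""
    b = bernoulli(P_B)
    e = bernoulli(P_E)
    a = bernoulli(P_A[(b, e)])
    c = bernoulli(P_C[a])
    r = bernoulli(P_R[e])
    return {"B": b, "E": e, "A": a, "C": c, "R": r}

def estimate_prob(event, n_samples=100_000):
    """P-hat(e) = #Heads / N: the fraction of samples satisfying e."""
    hits = sum(event(forward_sample()) for _ in range(n_samples))
    return hits / n_samples

# e.g. estimate P(C = c); rarer evidence would need far more samples.
print(estimate_prob(lambda s: s["C"]))
```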
Can we sample from P(X1,…,Xn | e)? If the evidence is in the roots of the network, we can clamp the evidence variables and sample as before. If the evidence is in the leaves of the network, we have a problem: our sampling method proceeds according to the order of the nodes in the graph, so we must retain only those samples that match e, and matching e might be a rare event.

Likelihood Weighting

Can we ensure that all of our samples are used? One wrong (but fixable) approach: whenever we need to sample a variable that is assigned a value by e, use that specified value. For example, in the two-node network X → Y, suppose we know Y = 1: sample X from P(X), then take Y = 1. This is NOT a sample from P(X, Y | Y = 1)!

Problem: these samples of X are from P(X).
Solution: penalize samples in which P(Y = 1 | X) is small. We now sample as follows: let x[i] be a sample from P(X), and let w[i] = P(Y = 1 | X = x[i]). Then
  P̂(X = x | Y = 1) = Σᵢ w[i] · 1(x[i] = x) / Σᵢ w[i],
where the indicator 1(x[i] = x) is 1 or 0.

Example (the Burglary network with evidence A = a, R = r): start with weight w = 1; sample B from P(B) and E from P(E); set A = a and multiply w by P(a | B, E); sample C from P(C | a); set R = r and multiply w by P(r | E). Each run yields one complete instance together with its weight.

The algorithm:
  Let X1, …, Xn be an ordering of the variables consistent with arc direction.
  w = 1
  for i = 1, …, n do
    if Xi = xi has been observed
      w ← w · P(Xi = xi | pai)
    else
      sample xi from P(Xi | pai)
  return x1, …, xn and w

Why does this make sense? When N is large, we expect roughly N · P(X = x) samples with x[i] = x. Thus
  Σ_{i: x[i]=x} P(Y = 1 | X = x[i]) / Σᵢ P(Y = 1 | X = x[i])
  ≈ (N · P(X = x) · P(Y = 1 | X = x)) / (N · P(Y = 1))
  = P(X = x, Y = 1) / P(Y = 1)
  = P(X = x | Y = 1).

What can we say about the quality of the answer? Intuitively, the weight of a sample reflects its probability given the evidence, so we need to collect enough weight mass for the estimate to be accurate. Another factor is the "extremeness" of the CPDs.

Theorem (Dagum & Luby, AIJ 1993): if P(Xi | Pai) ∈ [l, u] for all local probability tables, and
  Σᵢ w[i] ≥ (4(1+ε)/ε²) · ln(2/δ) · uⁿ,
then with probability 1-δ, the estimate is an ε-relative error approximation.
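A matching sketch of likelihood weighting on the same network, reusing the CPT dictionaries and the bernoulli helper from the logic-sampling sketch above. Observed variables are clamped and contribute their CPT entry to the weight; the query is answered with the weighted estimator. The evidence (A = a, R = r) and the query follow the running example:

```python
def weighted_sample(evidence):
    """One likelihood-weighting pass over the Burglary network:
    observed variables are clamped to their evidence values and
    contribute P(value | parents) to the weight; unobserved
    variables are sampled given their already-assigned parents.
    (For simplicity, the roots B and E are always sampled here.)"""
    w = 1.0
    s = {"B": bernoulli(P_B), "E": bernoulli(P_E)}
    for var, table, parents in (
        ("A", P_A, lambda s: (s["B"], s["E"])),
        ("C", P_C, lambda s: s["A"]),
        ("R", P_R, lambda s: s["E"]),
    ):
        p_true = table[parents(s)]
        if var in evidence:
            s[var] = evidence[var]
            w *= p_true if evidence[var] else 1.0 - p_true
        else:
            s[var] = bernoulli(p_true)
    return s, w

def lw_estimate(query, evidence, n_samples=100_000):
    """P-hat(x | e) = (sum of w[i] where the query holds) / (sum of w[i])."""
    num = den = 0.0
    for _ in range(n_samples):
        s, w = weighted_sample(evidence)
        den += w
        if query(s):
            num += w
    return num / den

# e.g. P(B = b | A = a, R = r), the running example's evidence.
print(lw_estimate(lambda s: s["B"], {"A": True, "R": True}))
```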
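As a sanity check on the estimator (and on the derivation above), the exact posterior for this tiny network can be computed by brute-force enumeration of all 2⁵ complete assignments; the likelihood-weighting estimate should converge to this value as the weight mass grows. This is feasible here only because the network has five variables:

```python
from itertools import product

def joint(s):
    """Joint probability of a complete assignment under the CPTs above."""
    p = P_B if s["B"] else 1.0 - P_B
    p *= P_E if s["E"] else 1.0 - P_E
    for var, table, key in (("A", P_A, (s["B"], s["E"])),
                            ("C", P_C, s["A"]),
                            ("R", P_R, s["E"])):
        p_true = table[key]
        p *= p_true if s[var] else 1.0 - p_true
    return p

def exact_posterior(query, evidence):
    """P(x | e) by summing the joint over all 2^5 complete assignments."""
    num = den = 0.0
    for vals in product([False, True], repeat=5):
        s = dict(zip("BEACR", vals))
        if all(s[k] == v for k, v in evidence.items()):
            p = joint(s)
            den += p
            if query(s):
                num += p
    return num / den

print(exact_posterior(lambda s: s["B"], {"A": True, "R": True}))
```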