QUIZ!!
* T/F: You can always (theoretically) do BN inference by enumeration. TRUE
* T/F: In VE, always first marginalize, then join. FALSE
* T/F: VE is a lot faster than enumeration, but not always exact. FALSE
* T/F: The running time of VE is independent of the order you pick the r.v.s. FALSE
* T/F: The more evidence, the faster VE runs. TRUE

* P(X|Y) sums to ... |Y|
* P(x|Y) sums to ... ??
* P(X|y) sums to ... 1
* P(x|y) sums to ... P(x|y)
* P(x,Y) sums to ... P(x)
* P(X,Y) sums to ... 1

CSE 511a: Artificial Intelligence, Spring 2013
Lecture 17: Bayes' Nets IV – Approximate Inference (Sampling)
04/01/2013, Robert Pless
Via Kilian Weinberger, via Dan Klein – UC Berkeley

Announcements
* Project 4 out soon.
* Project 3 due at midnight.

Exact Inference: Variable Elimination

General Variable Elimination
* Query: P(Q | e_1, ..., e_k)
* Start with initial factors: local CPTs (but instantiated by evidence)
* While there are still hidden variables (not Q or evidence):
  * Pick a hidden variable H
  * Join all factors mentioning H
  * Eliminate (sum out) H
* Join all remaining factors and normalize

Example: Variable Elimination
Network: attend, study, and fair are root variables; prepared depends on attend and study; pass depends on attend, prepared, and fair.
Priors: P(at) = 0.8, P(st) = 0.6, P(fa) = 0.9

P(pr | at, st):
  at st | P(pr|at,st)
  T  T  | 0.9
  T  F  | 0.5
  F  T  | 0.7
  F  F  | 0.1

P(pa | at, pr, fa):
  pr at fa | P(pa|at,pr,fa)
  T  T  T  | 0.9
  T  T  F  | 0.1
  T  F  T  | 0.7
  T  F  F  | 0.1
  F  T  T  | 0.7
  F  T  F  | 0.1
  F  F  T  | 0.2
  F  F  F  | 0.1

Query: What is the probability that a student attends class, given that they pass the exam?
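The general VE loop above can be sketched on exactly this example with a toy factor representation (a factor is a variable list plus a table mapping value tuples to numbers; all function and variable names here are invented for illustration, not from any library):

```python
# Toy variable-elimination sketch for the attend/study/fair/prepared/pass example.

def join(f1, f2):
    """Pointwise product of two factors over the union of their variables."""
    v1, t1 = f1
    v2, t2 = f2
    vs = list(v1) + [v for v in v2 if v not in v1]
    table = {}
    for a1, p1 in t1.items():
        for a2, p2 in t2.items():
            row1 = dict(zip(v1, a1))
            row2 = dict(zip(v2, a2))
            # keep only combinations that agree on shared variables
            if all(row1.get(v, row2[v]) == row2[v] for v in v2):
                row1.update(row2)
                table[tuple(row1[v] for v in vs)] = p1 * p2
    return vs, table

def sum_out(var, factor):
    """Eliminate a variable by summing it out of the factor."""
    vs, t = factor
    i = vs.index(var)
    out = {}
    for a, p in t.items():
        key = a[:i] + a[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return vs[:i] + vs[i + 1:], out

T, F = True, False
f_at = (['at'], {(T,): 0.8, (F,): 0.2})
f_st = (['st'], {(T,): 0.6, (F,): 0.4})
f_fa = (['fa'], {(T,): 0.9, (F,): 0.1})
f_pr = (['pr', 'at', 'st'], {
    (T, T, T): 0.9, (T, T, F): 0.5, (T, F, T): 0.7, (T, F, F): 0.1,
    (F, T, T): 0.1, (F, T, F): 0.5, (F, F, T): 0.3, (F, F, F): 0.9})
# P(pa | at, pr, fa), already instantiated with the evidence pa = T
f_pa = (['pr', 'at', 'fa'], {
    (T, T, T): 0.9, (T, T, F): 0.1, (T, F, T): 0.7, (T, F, F): 0.1,
    (F, T, T): 0.7, (F, T, F): 0.1, (F, F, T): 0.2, (F, F, F): 0.1})

g_pr = sum_out('st', join(f_st, f_pr))   # -> P(pr | at)
g_pa = sum_out('fa', join(f_fa, f_pa))   # -> P(pa=T | pr, at)
h = sum_out('pr', join(g_pr, g_pa))      # -> P(pa=T | at)
vs, t = join(f_at, h)                    # unnormalized P(at, pa=T)
total = sum(t.values())
posterior = {a[0]: p / total for a, p in t.items()}
print(posterior)   # P(at | pa=T): roughly {True: 0.886, False: 0.114}
```

Running this reproduces the hand computation worked through on the following slides: the intermediate factors are 0.7732 and 0.397, and the normalized answer is about 0.89.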
[based on slides taken from UMBC CMSC 671, 2005]

Join study factors
  pr at st | Original P(pr|at,st) | P(st) | Joint P(pr,st|at)
  T  T  T  | 0.9                  | 0.6   | 0.54
  T  T  F  | 0.5                  | 0.4   | 0.20
  T  F  T  | 0.7                  | 0.6   | 0.42
  T  F  F  | 0.1                  | 0.4   | 0.04
  F  T  T  | 0.1                  | 0.6   | 0.06
  F  T  F  | 0.5                  | 0.4   | 0.20
  F  F  T  | 0.3                  | 0.6   | 0.18
  F  F  F  | 0.9                  | 0.4   | 0.36

Marginalize out study
Summing the joint over study gives the marginal P(pr|at):
  pr at | P(pr|at)
  T  T  | 0.74
  T  F  | 0.46
  F  T  | 0.26
  F  F  | 0.54

Remove "study"
The variable study is now gone from the network; the remaining factors are P(at), P(fa), P(pr|at), and P(pa|at,pre,fa).

Join factors "fair"
(Only the pa = t rows are kept, since the evidence is that the student passed.)
  pa pre at fa | Original P(pa|at,pre,fa) | P(fa) | Joint P(pa,fa|at,pre)
  t  T   T  T  | 0.9                      | 0.9   | 0.81
  t  T   T  F  | 0.1                      | 0.1   | 0.01
  t  T   F  T  | 0.7                      | 0.9   | 0.63
  t  T   F  F  | 0.1                      | 0.1   | 0.01
  t  F   T  T  | 0.7                      | 0.9   | 0.63
  t  F   T  F  | 0.1                      | 0.1   | 0.01
  t  F   F  T  | 0.2                      | 0.9   | 0.18
  t  F   F  F  | 0.1                      | 0.1   | 0.01

Marginalize out "fair"
Summing the joint over fair gives the marginal P(pa|at,pre):
  pa pre at | P(pa|at,pre)
  t  T   T  | 0.82
  t  T   F  | 0.64
  t  F   T  | 0.64
  t  F   F  | 0.19
After removing fair, the remaining factors are P(at), P(pr|at), and P(pa|at,pre).

Join factors "prepared"
  pa pre at | Original P(pa|at,pr) | P(pr|at) | Joint P(pa,pr|at)
  t  T   T  | 0.82                 | 0.74     | 0.6068
  t  T   F  | 0.64                 | 0.46     | 0.2944
  t  F   T  | 0.64                 | 0.26     | 0.1664
  t  F   F  | 0.19                 | 0.54     | 0.1026

Marginalize out "prepared"
Summing over prepared gives P(pa|at):
  pa at | P(pa|at)
  t  T  | 0.7732
  t  F  | 0.397

Join factors "attend"
  pa at | Original P(pa|at) | P(at) | Joint P(pa,at) | Normalized P(at|pa)
  t  T  | 0.7732            | 0.8   | 0.61856        | 0.89
  t  F  | 0.397             | 0.2   | 0.0794         | 0.11

So the answer to the query: P(at = T | pa = T) ≈ 0.89.

Approximate Inference
Sampling (a particle-based method)

Sampling – the basics
* Scrooge McDuck gives you an ancient coin. He wants to know P(H).
* You have no homework, and nothing good is on television – so you toss it 1 million times.
* You obtain 700,000 heads and 300,000 tails. What is P(H)?
* Exactly: P(H) = 0.7. Why? P(H) = #Heads / #Tosses = 700,000 / 1,000,000.

Monte Carlo Method
* Who is more likely to win, Green or Purple? What is the probability that Green wins, P(G)?
* Two ways to solve this:
  1. Compute the exact probability.
  2. Play 100,000 games and see how many times Green wins.

Approximate Inference
* Simulation has a name: sampling
* Sampling is a hot topic in machine learning, and it's really simple
* Basic idea:
  * Draw N samples from a sampling distribution S
  * Compute an approximate posterior probability
  * Show this converges to the true probability P
* Why sample?
  * Learning: get samples from a distribution you don't know
  * Inference: getting a sample is often faster than computing the right answer (e.g., with variable elimination)

Forward Sampling [Excel Demo]
Network: Cloudy → Sprinkler, Cloudy → Rain, and Sprinkler, Rain → WetGrass.
  P(C):       +c 0.5, -c 0.5
  P(S | C):   +c: +s 0.1, -s 0.9;   -c: +s 0.5, -s 0.5
  P(R | C):   +c: +r 0.8, -r 0.2;   -c: +r 0.2, -r 0.8
  P(W | S,R): +s,+r: +w 0.99, -w 0.01;   +s,-r: +w 0.90, -w 0.10
              -s,+r: +w 0.90, -w 0.10;   -s,-r: +w 0.01, -w 0.99
Samples: +c, -s, +r, +w; -c, +s, -r, +w; …

Forward Sampling
This process generates samples with probability

  S_PS(x_1, …, x_n) = ∏_i P(x_i | Parents(X_i)) = P(x_1, …, x_n)

i.e., the BN's joint probability. Let the number of samples of an event be N_PS(x_1, …, x_n). Then

  lim_{N→∞} N_PS(x_1, …, x_n) / N = P(x_1, …, x_n)

i.e., the sampling procedure is consistent.

Example
We'll get a bunch of samples from the BN:
  +c, -s, +r, +w
  +c, +s, +r, +w
  -c, +s, +r, -w
  +c, -s, +r, +w
  -c, -s, -r, +w
* If we want to know P(W): we have counts <+w: 4, -w: 1>; normalize to get P(W) = <+w: 0.8, -w: 0.2>
* This will get closer to the true distribution with more samples
* Can estimate anything else, too. What about P(C | +w)? P(C | +r, +w)? P(C | -r, -w)?
* Fast: can use fewer samples if less time (what's the drawback?)

Rejection Sampling
* Let's say we want P(C): no point keeping all samples around; just tally counts of C as we go
* Let's say we want P(C | +s): same thing – tally C outcomes, but ignore (reject) samples which don't have S = +s
* This is called rejection sampling
* It is also consistent for conditional probabilities (i.e., correct in the limit)
  +c, -s, +r, +w
  +c, +s, +r, +w
  -c, +s, +r, -w
  +c, -s, +r, +w
  -c, -s, -r, +w

Sampling Example
There are 2 cups. The first contains 1 penny and 1 quarter; the second contains 2 quarters. Say I pick a cup uniformly at random, then pick a coin randomly from that cup. It's a quarter (yes!). What is the probability that the other coin in that cup is also a quarter?
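The forward- and rejection-sampling procedures above can be sketched on the sprinkler network from these slides (a minimal from-scratch sketch; the function name and structure are invented for illustration, only the CPT numbers come from the slides):

```python
import random

random.seed(0)

# CPTs from the Cloudy/Sprinkler/Rain/WetGrass slide (True = +, False = -).
def prior_sample():
    """Forward sampling: sample each variable given its already-sampled parents."""
    c = random.random() < 0.5
    s = random.random() < (0.1 if c else 0.5)
    r = random.random() < (0.8 if c else 0.2)
    p_w = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.01}[(s, r)]
    w = random.random() < p_w
    return c, s, r, w

# Rejection sampling for P(+c | +s): tally C, but reject samples without S = +s.
kept = [c for c, s, r, w in (prior_sample() for _ in range(100_000)) if s]
estimate = sum(kept) / len(kept)
print(estimate)   # exact answer is P(+c | +s) = 0.05 / 0.30 = 1/6
```

Note how wasteful rejection is here: roughly 70% of the samples are thrown away because P(+s) = 0.3, which is exactly the problem likelihood weighting addresses next.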
Likelihood Weighting
* Problem with rejection sampling: if the evidence is unlikely, you reject a lot of samples, and you don't exploit your evidence as you sample
* Consider P(B | +a) in a Burglary → Alarm network. Samples:
    -b, -a;  -b, -a;  -b, -a;  -b, -a;  +b, +a
* Idea: fix evidence variables and sample the rest:
    -b, +a;  -b, +a;  -b, +a;  -b, +a;  +b, +a
* Problem: the sample distribution is not consistent!
* Solution: weight each sample by the probability of the evidence given its parents

Likelihood Weighting
Sampling distribution if z is sampled and e is fixed evidence:

  S_WS(z, e) = ∏_i P(z_i | Parents(Z_i))

Now, samples have weights:

  w(z, e) = ∏_i P(e_i | Parents(E_i))

Together, the weighted sampling distribution is consistent:

  S_WS(z, e) · w(z, e) = ∏_i P(z_i | Parents(Z_i)) ∏_i P(e_i | Parents(E_i)) = P(z, e)

Likelihood Weighting
Same Cloudy / Sprinkler / Rain / WetGrass network and CPTs as in forward sampling. Samples: +c, +s, +r, +w; …

Likelihood Weighting Example
With the evidence Sprinkler = 1 and WetGrass = 1 fixed (1 = true, 0 = false):

  Sample | Cloudy | Rainy | Sprinkler | WetGrass | Weight
  1      | 0      | 1     | 1         | 1        | 0.495
  2      | 0      | 0     | 1         | 1        | 0.45
  3      | 0      | 0     | 1         | 1        | 0.45
  4      | 0      | 0     | 1         | 1        | 0.45
  5      | 1      | 0     | 1         | 1        | 0.09

Inference: sum over the weights that match the query value, and divide by the total sample weight:

  P(C | +s, +w) = Σ_{samples with C=1} sample_weight / Σ_{all samples} sample_weight

Likelihood Weighting
* Likelihood weighting is good:
  * We have taken evidence into account as we generate the sample
  * E.g., here W's value will get picked based on the evidence values of S, R
  * More of our samples will reflect the state of the world suggested by the evidence
* Likelihood weighting doesn't solve all our problems:
  * Evidence influences the choice of downstream variables, but not upstream ones (C isn't more likely to get a value matching the evidence)
  * We would like to consider evidence when we sample every variable

Markov Chain Monte Carlo*
* Idea: instead of sampling from scratch, create samples that are each like the last one.
* Procedure: resample one variable at a time, conditioned on all the rest, but keep the evidence fixed.
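The likelihood-weighting procedure described above can be sketched on the same sprinkler network, with evidence S = +s and W = +w: evidence variables are never sampled, and the sample's weight picks up a factor P(e_i | Parents(E_i)) each time one is passed. (The CPT numbers are from the slides; everything else is an illustrative sketch.)

```python
import random

random.seed(1)

def weighted_sample():
    """One likelihood-weighted sample with evidence S = +s, W = +w fixed."""
    weight = 1.0
    c = random.random() < 0.5                   # sample Cloudy from P(C)
    weight *= 0.1 if c else 0.5                 # evidence S = +s: weight by P(+s | c)
    r = random.random() < (0.8 if c else 0.2)   # sample Rain from P(R | c)
    weight *= 0.99 if r else 0.90               # evidence W = +w: weight by P(+w | +s, r)
    return c, weight

num = den = 0.0
for _ in range(100_000):
    c, wt = weighted_sample()
    den += wt
    if c:
        num += wt
estimate = num / den   # estimates P(+c | +s, +w); the exact value is about 0.175
print(estimate)
```

Unlike rejection sampling, every sample contributes here; samples that disagree with the evidence are merely down-weighted rather than discarded.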
* E.g., for P(b | c), successive samples: +b, +a, +c → -b, +a, +c → -b, -a, +c
* Properties: now the samples are not independent (in fact they're nearly identical), but the sample averages are still consistent estimators!
* What's the point: both upstream and downstream variables condition on evidence.

Random Walks [Explain on Blackboard]

Gibbs Sampling
1. Set all evidence E to e
2. Do forward sampling to obtain x_1, …, x_n
3. Repeat:
   1. Pick any variable X_i uniformly at random
   2. Resample x_i' from P(X_i | x_1, …, x_{i-1}, x_{i+1}, …, x_n)
   3. Set all other x_j' = x_j
   4. The new sample is x_1', …, x_n'

Markov Blanket
The Markov blanket of X consists of:
1. All parents of X
2. All children of X
3. All parents of children of X (except X itself)
X is conditionally independent from all other variables in the BN, given all variables in its Markov blanket (besides X).

Gibbs Sampling
1. Set all evidence E to e
2. Do forward sampling to obtain x_1, …, x_n
3. Repeat:
   1. Pick any variable X_i uniformly at random
   2. Resample x_i' from P(X_i | MarkovBlanket(X_i))
   3. Set all other x_j' = x_j
   4. The new sample is x_1', …, x_n'

Summary
* Sampling can be your salvation: the dominant approach to inference in BNs
* Approaches:
  * Forward (/Prior) Sampling
  * Rejection Sampling
  * Likelihood Weighted Sampling
  * Gibbs Sampling

Learning in Bayes Nets
* Task 1: Given the network structure and given data, where a data point is an observed setting for the variables, learn the CPTs for the Bayes net. Might also start with priors for CPT probabilities.
* Task 2: Given only the data (and possibly a prior over Bayes nets), learn the entire Bayes net (both net structure and CPTs).

Turing Award for Bayes Nets
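The Gibbs sampling loop summarized in this lecture can be sketched end-to-end on the sprinkler network, again for P(+c | +s, +w): the evidence stays fixed, and each step resamples one non-evidence variable from its Markov-blanket conditional. (The CPT numbers are the slides'; the function names and this minimal two-variable state are illustrative.)

```python
import random

random.seed(2)

# Gibbs sampling for P(+c | +s, +w).  Evidence S = +s, W = +w stays fixed;
# the non-evidence variables C and R are resampled one at a time.
def resample_c(r):
    # P(C | blanket) is proportional to P(c) * P(+s | c) * P(r | c)
    pt = 0.5 * 0.1 * (0.8 if r else 0.2)
    pf = 0.5 * 0.5 * (0.2 if r else 0.8)
    return random.random() < pt / (pt + pf)

def resample_r(c):
    # P(R | blanket) is proportional to P(r | c) * P(+w | +s, r)
    pt = (0.8 if c else 0.2) * 0.99
    pf = (0.2 if c else 0.8) * 0.90
    return random.random() < pt / (pt + pf)

c, r = True, True          # arbitrary initial state for the non-evidence variables
hits, steps = 0, 200_000
for _ in range(steps):
    if random.random() < 0.5:   # pick a variable uniformly at random
        c = resample_c(r)
    else:
        r = resample_r(c)
    hits += c
estimate = hits / steps    # estimates P(+c | +s, +w); the exact value is about 0.175
print(estimate)
```

Note that successive states differ in at most one variable, so consecutive samples are highly correlated, yet the long-run average still converges to the true posterior, exactly as the MCMC slide claims.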