CMSC 671 Fall 2010
Class #18/19 – Wednesday, November 3 / Monday, November 8
Some material borrowed with permission from Lise Getoor

Next two classes
• Probability theory (quick review!)
• Bayesian networks
  – Network structure
  – Conditional probability tables
  – Conditional independence
• Bayesian inference
  – From the joint distribution
  – Using independence/factoring
  – From sources of evidence

Bayesian Reasoning (Chapter 13)

Sources of uncertainty
• Uncertain inputs
  – Missing data
  – Noisy data
• Uncertain knowledge
  – Multiple causes lead to multiple effects
  – Incomplete enumeration of conditions or effects
  – Incomplete knowledge of causality in the domain
  – Probabilistic/stochastic effects
• Uncertain outputs
  – Abduction and induction are inherently uncertain
  – Default reasoning, even in deductive fashion, is uncertain
  – Incomplete deductive inference may be uncertain
• Probabilistic reasoning only gives probabilistic results (it summarizes uncertainty from various sources)

Decision making with uncertainty
• Rational behavior:
  – For each possible action, identify the possible outcomes
  – Compute the probability of each outcome
  – Compute the utility of each outcome
  – Compute the probability-weighted (expected) utility over possible outcomes for each action
  – Select the action with the highest expected utility (principle of Maximum Expected Utility)

Why probabilities anyway?
• Kolmogorov showed that three simple axioms lead to the rules of probability theory
  – De Finetti, Cox, and Carnap have also provided compelling arguments for these axioms
  1. All probabilities are between 0 and 1: 0 ≤ P(a) ≤ 1
  2. Valid propositions (tautologies) have probability 1, and unsatisfiable propositions have probability 0: P(true) = 1; P(false) = 0
  3. The probability of a disjunction is given by: P(a ∨ b) = P(a) + P(b) − P(a ∧ b)
  (Venn diagram of overlapping events a and b)

Probability theory
• Random variables
  – Domain, e.g., Alarm, Burglary, Earthquake
  – Boolean (like these), discrete, or continuous
• Atomic event: a complete specification of the state, e.g., Alarm=True ∧ Burglary=True ∧ Earthquake=False (alarm ∧ burglary ∧ ¬earthquake)
• Prior probability: degree of belief without any other evidence, e.g., P(Burglary) = .1
• Joint probability: matrix of combined probabilities of a set of variables, e.g., P(Alarm, Burglary) =

                alarm   ¬alarm
    burglary     .09      .01
    ¬burglary    .1       .8

Probability theory (cont.)
• Conditional probability: probability of effect given causes, e.g., P(burglary | alarm) = .47, P(alarm | burglary) = .9
• Computing conditional probabilities:
  – P(a | b) = P(a ∧ b) / P(b), where P(b) is a normalizing constant
  – e.g., P(burglary | alarm) = P(burglary ∧ alarm) / P(alarm) = .09 / .19 = .47
• Product rule:
  – P(a ∧ b) = P(a | b) P(b)
  – e.g., P(burglary ∧ alarm) = P(burglary | alarm) P(alarm) = .47 × .19 = .09
• Marginalizing:
  – P(B) = Σa P(B, a)
  – P(B) = Σa P(B | a) P(a) (conditioning)
  – e.g., P(alarm) = P(alarm ∧ burglary) + P(alarm ∧ ¬burglary) = .09 + .1 = .19

Example: Inference from the joint

  P(Burglary, Alarm, Earthquake):
                         alarm                       ¬alarm
                earthquake  ¬earthquake     earthquake  ¬earthquake
    burglary       .01         .08             .001        .009
    ¬burglary      .01         .09             .01         .79

P(Burglary | alarm) = α P(Burglary, alarm)
  = α [P(Burglary, alarm, earthquake) + P(Burglary, alarm, ¬earthquake)]
  = α [(.01, .01) + (.08, .09)]
  = α (.09, .1)
Since P(burglary | alarm) + P(¬burglary | alarm) = 1, α = 1/(.09 + .1) = 5.26
(i.e., P(alarm) = 1/α = .19 – quizlet: how can you verify this?)
P(burglary | alarm) = .09 × 5.26 = .474
P(¬burglary | alarm) = .1 × 5.26 = .526
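As a cross-check on the calculation above, here is a minimal Python sketch (not part of the original slides; the variable names and table encoding are illustrative) that computes P(Burglary | alarm) directly from the full joint by summing out Earthquake and normalizing.

```python
# Full joint P(Burglary, Alarm, Earthquake) from the slide, keyed by
# (burglary, alarm, earthquake) truth values.
joint_table = {
    (True,  True,  True):  .01,  (True,  True,  False): .08,
    (True,  False, True):  .001, (True,  False, False): .009,
    (False, True,  True):  .01,  (False, True,  False): .09,
    (False, False, True):  .01,  (False, False, False): .79,
}

def burglary_given_alarm():
    # Sum out Earthquake for each value of Burglary, keeping Alarm = True.
    unnormalized = {}
    for b in (True, False):
        unnormalized[b] = sum(joint_table[(b, True, e)] for e in (True, False))
    alpha = 1.0 / sum(unnormalized.values())      # 1 / P(alarm) = 5.26...
    return {b: alpha * p for b, p in unnormalized.items()}

print(burglary_given_alarm())
# {True: 0.4736..., False: 0.5263...} -- matches the .474 / .526 on the slide
```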
Exercise: Inference from the joint

  P(Smart, Study, Prepared):
                      smart                 ¬smart
                 study   ¬study        study   ¬study
    prepared      .432     .16          .084     .008
    ¬prepared     .048     .16          .036     .072

• Queries:
  – What is the prior probability of smart?
  – What is the prior probability of study?
  – What is the conditional probability of prepared, given study and smart?
• Save these answers for next time!

Independence
• When two sets of propositions do not affect each other's probabilities, we call them independent, and can easily compute their joint and conditional probabilities:
  – Independent(A, B) → P(A ∧ B) = P(A) P(B), P(A | B) = P(A)
• For example, {moon-phase, light-level} might be independent of {burglary, alarm, earthquake}
  – Then again, it might not: burglars might be more likely to burglarize houses when there's a new moon (and hence little light)
  – But if we know the light level, the moon phase doesn't affect whether we are burglarized
  – Once we're burglarized, the light level doesn't affect whether the alarm goes off
• We need a more complex notion of independence, and methods for reasoning about these kinds of relationships

Exercise: Independence
• Using the joint table P(Smart, Study, Prepared) above:
  – Is smart independent of study?
  – Is prepared independent of study?

Conditional independence
• Absolute independence:
  – A and B are independent if P(A ∧ B) = P(A) P(B); equivalently, P(A) = P(A | B) and P(B) = P(B | A)
• A and B are conditionally independent given C if
  – P(A ∧ B | C) = P(A | C) P(B | C)
• This lets us decompose the joint distribution:
  – P(A ∧ B ∧ C) = P(A | C) P(B | C) P(C)
• Moon-Phase and Burglary are conditionally independent given Light-Level
• Conditional independence is weaker than absolute independence, but still useful in decomposing the full joint probability distribution

Exercise: Conditional independence
• Using the joint table P(Smart, Study, Prepared) above:
  – Is smart conditionally independent of prepared, given study?
  – Is study conditionally independent of prepared, given smart?

Bayes's rule
• Bayes's rule is derived from the product rule:
  – P(Y | X) = P(X | Y) P(Y) / P(X)
• Often useful for diagnosis:
  – If X are (observed) effects and Y are (hidden) causes,
  – we may have a model for how causes lead to effects (P(X | Y)),
  – and we may also have prior beliefs (based on experience) about the frequency of occurrence of the causes (P(Y)),
  – which allows us to reason abductively from effects to causes (P(Y | X)).
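To make the diagnostic use of Bayes's rule concrete, here is a small Python sketch (not from the slides; the numbers are the burglary/alarm figures used earlier) that recovers P(cause | effect) from P(effect | cause), the prior on the cause, and the prior on the effect.

```python
def bayes_rule(p_effect_given_cause, p_cause, p_effect):
    """P(cause | effect) = P(effect | cause) * P(cause) / P(effect)."""
    return p_effect_given_cause * p_cause / p_effect

# Numbers from the earlier alarm example:
#   P(alarm | burglary) = .9, P(burglary) = .1, P(alarm) = .19
print(bayes_rule(p_effect_given_cause=0.9, p_cause=0.1, p_effect=0.19))
# ≈ 0.474, matching P(burglary | alarm) computed from the joint
```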
Bayesian inference
• In the setting of diagnostic/evidential reasoning, a hypothesis Hi is connected to the evidence/manifestations E1, …, Ej, …, Em it can produce
  – Know: the prior probability of each hypothesis, P(Hi), and the conditional probabilities P(Ej | Hi)
  – Want to compute: the posterior probability P(Hi | Ej)
• Bayes' theorem (formula 1): P(Hi | Ej) = P(Hi) P(Ej | Hi) / P(Ej)

Simple Bayesian diagnostic reasoning
• Knowledge base:
  – Evidence / manifestations: E1, …, Em
  – Hypotheses / disorders: H1, …, Hn
    • Ej and Hi are binary; hypotheses are mutually exclusive (nonoverlapping) and exhaustive (cover all possible cases)
  – Conditional probabilities: P(Ej | Hi), i = 1, …, n; j = 1, …, m
• Cases (evidence for a particular instance): E1, …, El
• Goal: Find the hypothesis Hi with the highest posterior
  – maxi P(Hi | E1, …, El)

Bayesian diagnostic reasoning II
• Bayes' rule says that
  – P(Hi | E1, …, El) = P(E1, …, El | Hi) P(Hi) / P(E1, …, El)
• Assume each piece of evidence Ej is conditionally independent of the others, given a hypothesis Hi; then:
  – P(E1, …, El | Hi) = Πj=1..l P(Ej | Hi)
• If we only care about relative probabilities for the Hi, then we have:
  – P(Hi | E1, …, El) = α P(Hi) Πj=1..l P(Ej | Hi)

Limitations of simple Bayesian inference
• Cannot easily handle multi-fault situations, nor cases where intermediate (hidden) causes exist:
  – Disease D causes syndrome S, which causes correlated manifestations M1 and M2
• Consider a composite hypothesis H1 ∧ H2, where H1 and H2 are independent. What is the relative posterior?
  – P(H1 ∧ H2 | E1, …, El) = α P(E1, …, El | H1 ∧ H2) P(H1 ∧ H2)
    = α P(E1, …, El | H1 ∧ H2) P(H1) P(H2)
    = α Πj=1..l P(Ej | H1 ∧ H2) P(H1) P(H2)
• How do we compute P(Ej | H1 ∧ H2)??

Limitations of simple Bayesian inference II
• Assume H1 and H2 are independent, given E1, …, El?
  – P(H1 ∧ H2 | E1, …, El) = P(H1 | E1, …, El) P(H2 | E1, …, El)
• This is a very unreasonable assumption
  – Earthquake and Burglar are independent, but not given Alarm:
    • P(burglar | alarm, earthquake) << P(burglar | alarm)
• Another limitation is that a simple application of Bayes's rule doesn't allow us to handle causal chaining:
  – A: this year's weather; B: cotton production; C: next year's cotton price
  – A influences C indirectly: A → B → C
  – P(C | B, A) = P(C | B)
• We need a richer representation to model interacting hypotheses, conditional independence, and causal chaining
• Next time: conditional independence and Bayesian networks!
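The relative-posterior formula P(Hi | E1, …, El) = α P(Hi) Πj P(Ej | Hi) from the diagnostic-reasoning slides above fits in a few lines of Python. The sketch below is not from the slides, and the two-disorder, two-symptom numbers are invented purely for illustration; it scores mutually exclusive hypotheses and normalizes.

```python
from math import prod

def posterior(priors, likelihoods, observed):
    """priors: {H: P(H)}; likelihoods: {H: {E: P(E | H)}};
    observed: evidence variables that are present.
    Returns P(H | observed) under the conditional-independence assumption."""
    scores = {h: priors[h] * prod(likelihoods[h][e] for e in observed)
              for h in priors}
    alpha = 1.0 / sum(scores.values())
    return {h: alpha * s for h, s in scores.items()}

# Hypothetical two-disorder example (numbers made up for illustration):
priors = {"flu": 0.1, "cold": 0.9}
likelihoods = {"flu":  {"fever": 0.9, "cough": 0.8},
               "cold": {"fever": 0.2, "cough": 0.7}}
print(posterior(priors, likelihoods, observed=["fever", "cough"]))
# flu: .1*.9*.8 = .072, cold: .9*.2*.7 = .126  ->  flu ≈ 0.364, cold ≈ 0.636
```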
Bayesian Networks (Chapter 14.1–14.3)
Some material borrowed from Lise Getoor

Bayesian Belief Networks (BNs)
• Definition: BN = (DAG, CPD)
  – DAG: directed acyclic graph (the BN's structure)
    • Nodes: random variables (typically binary or discrete, but methods also exist to handle continuous variables)
    • Arcs: indicate probabilistic dependencies between nodes (the lack of a link signifies conditional independence)
  – CPD: conditional probability distribution (the BN's parameters)
    • Conditional probabilities at each node, usually stored as a table (conditional probability table, or CPT): P(xi | πi), where πi is the set of all parent nodes of xi
  – Root nodes are a special case – no parents, so just use priors in the CPD: πi = ∅, so P(xi | πi) = P(xi)

Example BN
• Structure: a → b, a → c; b, c → d; c → e
• CPTs:
  – P(a) = 0.001
  – P(b | a) = 0.3, P(b | ¬a) = 0.001
  – P(c | a) = 0.2, P(c | ¬a) = 0.005
  – P(d | b, c) = 0.1, P(d | b, ¬c) = 0.01, P(d | ¬b, c) = 0.01, P(d | ¬b, ¬c) = 0.00001
  – P(e | c) = 0.4, P(e | ¬c) = 0.002
• Note that we only specify P(a) etc., not P(¬a), since the two have to add to one

Conditional independence and chaining
• Conditional independence assumption:
  – P(xi | πi, q) = P(xi | πi), where q is any set of variables (nodes) other than xi and its successors
  – πi blocks the influence of other nodes on xi and its successors (q influences xi only through variables in πi)
  – With this assumption, the complete joint probability distribution of all variables in the network can be represented by (recovered from) the local CPDs by chaining them together:
    P(x1, …, xn) = Πi=1..n P(xi | πi)

Chaining: Example
• For the network above (a → b, a → c; b, c → d; c → e), computing the joint probability for all variables is easy:
  P(a, b, c, d, e)
    = P(e | a, b, c, d) P(a, b, c, d)   by the product rule
    = P(e | c) P(a, b, c, d)            by the conditional independence assumption
    = P(e | c) P(d | a, b, c) P(a, b, c)
    = P(e | c) P(d | b, c) P(c | a, b) P(a, b)
    = P(e | c) P(d | b, c) P(c | a) P(b | a) P(a)

Topological semantics
• A node is conditionally independent of its non-descendants given its parents
• A node is conditionally independent of all other nodes in the network given its parents, children, and children's parents (also known as its Markov blanket)
• The method called d-separation can be applied to decide whether a set of nodes X is independent of another set Y, given a third set Z

Inference in Bayesian Networks (Chapter 14.4–14.5)
Some material borrowed from Lise Getoor

Inference tasks
• Simple queries: compute the posterior marginal P(Xi | E = e)
  – e.g., P(NoGas | Gauge=empty, Lights=on, Starts=false)
• Conjunctive queries:
  – P(Xi, Xj | E = e) = P(Xi | E = e) P(Xj | Xi, E = e)
• Optimal decisions: decision networks include utility information; probabilistic inference is required to find P(outcome | action, evidence)
• Value of information: Which evidence should we seek next?
• Sensitivity analysis: Which probability values are most critical?
• Explanation: Why do I need a new starter motor?
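As an illustration of how the chaining formula P(x1, …, xn) = Πi P(xi | πi) turns the example BN's CPTs into the full joint, here is a minimal Python sketch (not from the slides; the dictionary encoding of the CPTs is one possible choice).

```python
# Example BN from the slides: a -> b, a -> c; b, c -> d; c -> e.
# Each CPT maps a tuple of parent values to P(node = True | parents).
cpts = {
    "a": {(): 0.001},
    "b": {(True,): 0.3,  (False,): 0.001},          # parent: a
    "c": {(True,): 0.2,  (False,): 0.005},          # parent: a
    "d": {(True, True): 0.1, (True, False): 0.01,   # parents: b, c
          (False, True): 0.01, (False, False): 0.00001},
    "e": {(True,): 0.4,  (False,): 0.002},          # parent: c
}
parents = {"a": (), "b": ("a",), "c": ("a",), "d": ("b", "c"), "e": ("c",)}

def p_node(node, value, assignment):
    """P(node = value | its parents' values in `assignment`)."""
    p_true = cpts[node][tuple(assignment[p] for p in parents[node])]
    return p_true if value else 1.0 - p_true

def joint(assignment):
    """P(a, b, c, d, e) = product over nodes of P(x_i | parents(x_i))."""
    result = 1.0
    for node in ("a", "b", "c", "d", "e"):        # a topological order
        result *= p_node(node, assignment[node], assignment)
    return result

print(joint({"a": True, "b": True, "c": False, "d": True, "e": False}))
# = P(a) P(b|a) P(¬c|a) P(d|b,¬c) P(¬e|¬c) = .001 * .3 * .8 * .01 * .998
```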
Approaches to inference
• Exact inference
  – Enumeration
  – Belief propagation in polytrees
  – Variable elimination
  – Clustering / join tree algorithms
• Approximate inference
  – Stochastic simulation / sampling methods
  – Markov chain Monte Carlo methods
  – Genetic algorithms
  – Neural networks
  – Simulated annealing
  – Mean field theory

Direct inference with BNs
• Instead of computing the joint, suppose we just want the probability of one variable
• Exact methods of computation:
  – Enumeration
  – Variable elimination
  – Join trees: get the probabilities associated with every query variable

Inference by enumeration
• Add all of the terms (atomic event probabilities) from the full joint distribution
• If E are the evidence (observed) variables and Y are the other (unobserved) variables, then:
  P(X | e) = α P(X, e) = α Σy P(X, e, y)
• Each P(X, e, y) term can be computed using the chain rule
• Computationally expensive!

Example: Enumeration
• For the network a → b, a → c; b, c → d; c → e, with P(xi) = Σπi P(xi | πi) P(πi)
• Suppose we want P(D = true), and only the value of E is given as true
• P(d | e) = α ΣA,B,C P(a, b, c, d, e) = α ΣA,B,C P(a) P(b | a) P(c | a) P(d | b, c) P(e | c)
• With simple iteration to compute this expression, there's going to be a lot of repetition (e.g., P(e | c) has to be recomputed every time we iterate over C = true)

Exercise: Enumeration
• Student network: smart → prepared, study → prepared; smart, prepared, fair → pass
• CPTs:
  – P(smart) = .8, P(study) = .6, P(fair) = .9
  – P(prepared | …):
                     smart   ¬smart
         study        .9       .7
         ¬study       .5       .1
  – P(pass | …):
                      smart            ¬smart
                  prep   ¬prep      prep   ¬prep
         fair      .9      .7        .7      .2
         ¬fair     .1      .1        .1      .1
• Query: What is the probability that a student studied, given that they pass the exam?
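A minimal sketch of inference by enumeration for the earlier example network, reusing the `cpts`, `parents`, and `joint` definitions from the chaining sketch above (again an illustration rather than the slides' own code): sum the joint over the unobserved variables and normalize.

```python
from itertools import product

def enumerate_query(query_var, evidence):
    """P(query_var | evidence) by brute-force summation over the full joint."""
    variables = ("a", "b", "c", "d", "e")
    hidden = [v for v in variables if v != query_var and v not in evidence]
    unnormalized = {}
    for qval in (True, False):
        total = 0.0
        for values in product((True, False), repeat=len(hidden)):
            assignment = dict(evidence)
            assignment[query_var] = qval
            assignment.update(zip(hidden, values))
            total += joint(assignment)   # each term recomputes shared factors
        unnormalized[qval] = total
    alpha = 1.0 / sum(unnormalized.values())
    return {v: alpha * p for v, p in unnormalized.items()}

# P(D | E = true): sums the joint over A, B, C for each value of D.
print(enumerate_query("d", {"e": True}))
```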
Variable elimination
• Basically just enumeration, but with caching of local calculations
• Linear for polytrees (singly connected BNs)
• Potentially exponential for multiply connected BNs
  – Exact inference in Bayesian networks is NP-hard!
• Join tree algorithms are an extension of variable elimination methods that compute posterior probabilities for all nodes in a BN simultaneously

Variable elimination: General idea
• Write the query in the form
  P(Xn, e) = Σxk … Σx3 Σx2 Πi P(xi | pai)
• Iteratively:
  – Move all irrelevant terms outside of the innermost sum
  – Perform the innermost sum, getting a new term
  – Insert the new term into the product

Variable elimination: Example
• Network: Cloudy → Rain, Cloudy → Sprinkler; Rain, Sprinkler → WetGrass
  P(w) = Σr,s,c P(w | r, s) P(r | c) P(s | c) P(c)
       = Σr,s P(w | r, s) Σc P(r | c) P(s | c) P(c)
       = Σr,s P(w | r, s) f1(r, s)

A more complex example
• The "Asia" network: Visit to Asia (V) → Tuberculosis (T); Smoking (S) → Lung Cancer (L) and Bronchitis (B); T, L → Abnormality in Chest (A); A → X-Ray (X); A, B → Dyspnea (D)
• We want to compute P(d); we need to eliminate v, s, x, t, l, a, b
• Initial factors:
  P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
• Eliminate v: compute fv(t) = Σv P(v) P(t | v), leaving
  fv(t) P(s) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
  – Note: fv(t) = P(t); in general, the result of elimination is not necessarily a probability term
• Eliminate s: compute fs(b, l) = Σs P(s) P(b | s) P(l | s), leaving
  fv(t) fs(b, l) P(a | t, l) P(x | a) P(d | a, b)
  – Summing over s results in a factor with two arguments, fs(b, l); in general, the result of elimination may be a function of several variables
• Eliminate x: compute fx(a) = Σx P(x | a), leaving
  fv(t) fs(b, l) fx(a) P(a | t, l) P(d | a, b)
  – Note: fx(a) = 1 for all values of a!
• Eliminate t: compute ft(a, l) = Σt fv(t) P(a | t, l), leaving
  fs(b, l) fx(a) ft(a, l) P(d | a, b)
• Eliminate l: compute fl(a, b) = Σl fs(b, l) ft(a, l), leaving
  fl(a, b) fx(a) P(d | a, b)
• Eliminate a, then b: compute fa(b, d) = Σa fl(a, b) fx(a) P(d | a, b) and fb(d) = Σb fa(b, d), leaving fb(d)
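To show the "multiply the relevant factors, then sum out one variable" pattern concretely, here is a small Python sketch of factor multiplication and summing-out, applied to the Cloudy/Rain/Sprinkler/WetGrass query above. It is not from the slides, and the CPT numbers are assumed textbook-style values for that network (the slides do not give them).

```python
from itertools import product

# A factor is (variables, table), where table maps a tuple of True/False
# values (one per variable, in order) to a number.
def multiply(f1, f2):
    (v1, t1), (v2, t2) = f1, f2
    out_vars = v1 + tuple(v for v in v2 if v not in v1)
    table = {}
    for vals in product((True, False), repeat=len(out_vars)):
        a = dict(zip(out_vars, vals))
        table[vals] = t1[tuple(a[v] for v in v1)] * t2[tuple(a[v] for v in v2)]
    return (out_vars, table)

def sum_out(var, factor):
    vs, t = factor
    out_vars = tuple(v for v in vs if v != var)
    table = {}
    for vals, p in t.items():
        key = tuple(v for v, name in zip(vals, vs) if name != var)
        table[key] = table.get(key, 0.0) + p
    return (out_vars, table)

# Assumed CPTs for Cloudy (c), Rain (r), Sprinkler (s), WetGrass (w):
P_c = (("c",), {(True,): 0.5, (False,): 0.5})
P_r = (("r", "c"), {(True, True): 0.8, (True, False): 0.2,
                    (False, True): 0.2, (False, False): 0.8})
P_s = (("s", "c"), {(True, True): 0.1, (True, False): 0.5,
                    (False, True): 0.9, (False, False): 0.5})
P_w = (("w", "r", "s"), {(True, True, True): 0.99, (True, True, False): 0.9,
                         (True, False, True): 0.9, (True, False, False): 0.0,
                         (False, True, True): 0.01, (False, True, False): 0.1,
                         (False, False, True): 0.1, (False, False, False): 1.0})

# f1(r, s) = sum_c P(r|c) P(s|c) P(c); then P(w) = sum_{r,s} P(w|r,s) f1(r,s)
f1 = sum_out("c", multiply(multiply(P_r, P_s), P_c))
P_w_marginal = sum_out("s", sum_out("r", multiply(P_w, f1)))
print(P_w_marginal)   # (('w',), {(True,): ..., (False,): ...}), values sum to 1
```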
Dealing with evidence
• How do we deal with evidence?
• Suppose we are given evidence V = t, S = f, D = t, and we want to compute P(L, V = t, S = f, D = t)
• We start by writing the factors:
  P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
• Since we know that V = t, we don't need to eliminate V
• Instead, we can replace the factors P(V) and P(T | V) with
  fP(V) = P(V = t) and fP(T|V)(T) = P(T | V = t)
  – These "select" the appropriate parts of the original factors given the evidence
  – Note that fP(V) is a constant, and thus does not appear in the elimination of other variables
• Initial factors, after setting the evidence V = t, S = f, D = t:
  fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a | t, l) P(x | a) fP(d|a,b)(a, b)
• Eliminating x, we get
  fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a | t, l) fx(a) fP(d|a,b)(a, b)
• Eliminating t, we get
  fP(v) fP(s) fP(l|s)(l) fP(b|s)(b) ft(a, l) fx(a) fP(d|a,b)(a, b)
• Eliminating a, we get
  fP(v) fP(s) fP(l|s)(l) fP(b|s)(b) fa(b, l)
• Eliminating b, we get
  fP(v) fP(s) fP(l|s)(l) fb(l)

Variable elimination algorithm
• Let X1, …, Xm be an ordering on the non-query variables, so the query is
  ΣX1 ΣX2 … ΣXm Πj P(Xj | Parents(Xj))
• For i = m, …, 1:
  – Leave inside the summation for Xi only the factors mentioning Xi
  – Multiply those factors, getting a factor that contains a number for each value of the variables mentioned, including Xi
  – Sum out Xi, getting a factor f that contains a number for each value of the variables mentioned, not including Xi
  – Replace the multiplied factors in the summation with f

Complexity of variable elimination
• Suppose in one elimination step we compute
  fx(y1, …, yk) = Σx f'x(x, y1, …, yk), where f'x(x, y1, …, yk) = Πi=1..m fi(x, yi,1, …, yi,li)
• This requires m × |Val(X)| × Πi |Val(Yi)| multiplications (for each value of x, y1, …, yk, we do m multiplications) and |Val(X)| × Πi |Val(Yi)| additions (for each value of y1, …, yk, we do |Val(X)| additions)
► Complexity is exponential in the number of variables in the intermediate factors
► Finding an optimal ordering is NP-hard

Exercise: Variable elimination
• Using the student network and CPTs from the enumeration exercise above
• Query: What is the probability that a student is smart, given that they pass the exam?

Conditioning
• Conditioning: find the network's smallest cutset S (a set of nodes whose removal renders the network singly connected)
  – In the network a → b, a → c; b, c → d; c → e, S = {A} or {B} or {C} or {D}
• For each instantiation of S, compute the belief update with your favorite inference algorithm
• Combine the results from all instantiations of S
• Computationally expensive (finding the smallest cutset is in general NP-hard, and the total number of possible instantiations of S is O(2^|S|))

Approximate inference: Direct sampling
• Suppose you are given values for some subset of the variables, E, and want to infer values for the unknown variables, Z
• Randomly generate a very large number of instantiations from the BN
  – Generate instantiations for all variables – start at the root variables and work your way "forward" in topological order
• Rejection sampling: only keep those instantiations that are consistent with the values for E
• Use the frequency of values for Z to get estimated probabilities
• Accuracy of the results depends on the size of the sample (it asymptotically approaches the exact results)
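Here is a minimal rejection-sampling sketch for the student network used in the exercises (not from the slides): it draws each variable in topological order from its CPT, discards samples inconsistent with the evidence, and estimates the query from the surviving samples.

```python
import random

# Student network CPTs from the exercise slides:
# roots: smart, study, fair; prepared <- smart, study; pass <- smart, prepared, fair.
P_SMART, P_STUDY, P_FAIR = 0.8, 0.6, 0.9
P_PREP = {(True, True): 0.9, (False, True): 0.7,    # keyed by (smart, study)
          (True, False): 0.5, (False, False): 0.1}
P_PASS = {(True, True, True): 0.9, (True, False, True): 0.7,   # (smart, prep, fair)
          (False, True, True): 0.7, (False, False, True): 0.2,
          (True, True, False): 0.1, (True, False, False): 0.1,
          (False, True, False): 0.1, (False, False, False): 0.1}

def prior_sample():
    """Sample all variables in topological order: smart, study, fair, prepared, pass."""
    smart = random.random() < P_SMART
    study = random.random() < P_STUDY
    fair  = random.random() < P_FAIR
    prep  = random.random() < P_PREP[(smart, study)]
    passed = random.random() < P_PASS[(smart, prep, fair)]
    return {"smart": smart, "study": study, "fair": fair,
            "prepared": prep, "pass": passed}

def rejection_sample(query, evidence, n=100_000):
    counts = {True: 0, False: 0}
    for _ in range(n):
        s = prior_sample()
        if all(s[var] == val for var, val in evidence.items()):  # keep consistent samples
            counts[s[query]] += 1
    total = counts[True] + counts[False]
    return counts[True] / total if total else None

# Estimate of P(study | pass) -- compare with the enumeration exercise.
print(rejection_sample("study", {"pass": True}))
```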
Exercise: Direct sampling
• Using the student network and CPTs from the enumeration exercise above
• Topological order = …?
• Random number generator: .35, .76, .51, .44, .08, .28, .03, .92, .02, .42

Likelihood weighting
• Idea: don't generate samples that need to be rejected in the first place!
• Sample only from the unknown variables Z
• Weight each sample according to the likelihood that it would occur, given the evidence E

Markov chain Monte Carlo algorithm
• So called because
  – Markov chain – each instance generated in the sample is dependent on the previous instance
  – Monte Carlo – a statistical sampling method
• Perform a random walk through variable assignment space, collecting statistics as you go
  – Start with a random instantiation, consistent with the evidence variables
  – At each step, for some nonevidence variable, randomly sample its value, consistent with the other current assignments
• Given enough samples, MCMC gives an accurate estimate of the true distribution of values

Exercise: MCMC sampling
• Using the student network and CPTs from the enumeration exercise above
• Topological order = …?
• Random number generator: .35, .76, .51, .44, .08, .28, .03, .92, .02, .42

Summary
• Bayes nets
  – Structure
  – Parameters
  – Conditional independence
  – Chaining
• BN inference
  – Enumeration
  – Variable elimination
  – Sampling methods
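As a companion to the likelihood-weighting slide above, here is a minimal sketch for the same student network, reusing `random` and the CPT dictionaries (P_SMART, P_STUDY, P_FAIR, P_PREP, P_PASS) from the rejection-sampling sketch earlier. It is an illustration, not the course's reference code: evidence variables are fixed rather than sampled, and each sample is weighted by the probability of the evidence given its parents.

```python
def weighted_sample(evidence):
    """Sample nonevidence variables in topological order; weight by P(evidence | parents)."""
    s, weight = {}, 1.0
    for var, p_true in (("smart", P_SMART), ("study", P_STUDY), ("fair", P_FAIR)):
        if var in evidence:
            s[var] = evidence[var]
            weight *= p_true if evidence[var] else 1.0 - p_true
        else:
            s[var] = random.random() < p_true
    p_prep = P_PREP[(s["smart"], s["study"])]
    if "prepared" in evidence:
        s["prepared"] = evidence["prepared"]
        weight *= p_prep if evidence["prepared"] else 1.0 - p_prep
    else:
        s["prepared"] = random.random() < p_prep
    p_pass = P_PASS[(s["smart"], s["prepared"], s["fair"])]
    if "pass" in evidence:
        s["pass"] = evidence["pass"]
        weight *= p_pass if evidence["pass"] else 1.0 - p_pass
    else:
        s["pass"] = random.random() < p_pass
    return s, weight

def likelihood_weighting(query, evidence, n=100_000):
    totals = {True: 0.0, False: 0.0}
    for _ in range(n):
        s, w = weighted_sample(evidence)
        totals[s[query]] += w            # every sample counts, weighted by the evidence
    return totals[True] / (totals[True] + totals[False])

# Estimate of P(study | pass), now without rejecting any samples.
print(likelihood_weighting("study", {"pass": True}))
```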