CMSC 471 Fall 2002 – Class #19 – Monday, November 4

Today's class
• (Probability theory)
• Bayesian inference
  – From the joint distribution
  – Using independence/factoring
  – From sources of evidence
• Bayesian networks
  – Network structure
  – Conditional probability tables
  – Conditional independence
  – Inference in Bayesian networks

Bayesian Reasoning / Bayesian Networks
Chapters 14, 15.1-15.2

Why probabilities anyway?
• Kolmogorov showed that three simple axioms lead to the rules of probability theory
  – De Finetti, Cox, and Carnap have also provided compelling arguments for these axioms
  1. All probabilities are between 0 and 1:
     • 0 <= P(a) <= 1
  2. Valid propositions (tautologies) have probability 1, and unsatisfiable propositions have probability 0:
     • P(true) = 1 ; P(false) = 0
  3. The probability of a disjunction is given by:
     • P(a ∨ b) = P(a) + P(b) − P(a ∧ b)

Inference from the joint: Example
Full joint distribution over Burglary, Alarm, and Earthquake:

                  alarm                        ¬alarm
              earthquake  ¬earthquake   earthquake  ¬earthquake
  burglary       .001        .008          .0001       .0009
  ¬burglary      .01         .09           .001        .79

P(Burglary | alarm) = α P(Burglary, alarm)
  = α [ P(Burglary, alarm, earthquake) + P(Burglary, alarm, ¬earthquake) ]
  = α [ (.001, .01) + (.008, .09) ]
  = α [ (.009, .1) ]
Since P(burglary | alarm) + P(¬burglary | alarm) = 1, α = 1/(.009 + .1) = 9.173
(i.e., P(alarm) = 1/α = .109 – quizlet: how can you verify this?)
P(burglary | alarm) = .009 * 9.173 = .08255
P(¬burglary | alarm) = .1 * 9.173 = .9173

Independence
• When two sets of propositions do not affect each other's probabilities, we call them independent, and can easily compute their joint and conditional probabilities:
  – Independent(A, B) → P(A ∧ B) = P(A) P(B), P(A | B) = P(A)
• For example, {moon-phase, light-level} might be independent of {burglary, alarm, earthquake}
  – Then again, it might not: burglars might be more likely to burglarize houses when there's a new moon (and hence little light)
  – But if we know the light level, the moon phase doesn't affect whether we are burglarized
  – Once we're burglarized, the light level doesn't affect whether the alarm goes off
• We need a more complex notion of independence, and methods for reasoning about these kinds of relationships

Conditional independence
• Absolute independence:
  – A and B are independent if P(A ∧ B) = P(A) P(B); equivalently, P(A) = P(A | B) and P(B) = P(B | A)
• A and B are conditionally independent given C if
  – P(A ∧ B | C) = P(A | C) P(B | C)
• This lets us decompose the joint distribution:
  – P(A ∧ B ∧ C) = P(A | C) P(B | C) P(C)
• Moon-Phase and Burglary are conditionally independent given Light-Level
• Conditional independence is weaker than absolute independence, but still useful in decomposing the full joint probability distribution

Bayes' rule
• Bayes' rule is derived from the product rule:
  – P(Y | X) = P(X | Y) P(Y) / P(X)
• Often useful for diagnosis:
  – If X are (observed) effects and Y are (hidden) causes,
  – we may have a model for how causes lead to effects (P(X | Y)),
  – we may also have prior beliefs (based on experience) about the frequency of occurrence of the causes (P(Y)),
  – which allows us to reason abductively from effects to causes (P(Y | X)).
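Below is a minimal Python sketch of the inference-from-the-joint computation above. It assumes the full joint table for (Burglary, Alarm, Earthquake) exactly as given on the slide; the dictionary encoding and function name are just illustrative, not part of the original material.

```python
# Inference from the full joint distribution: P(Burglary | alarm) by
# marginalizing out Earthquake and normalizing (the slide's alpha).

# Full joint, keyed by (burglary, alarm, earthquake) truth values.
joint = {
    (True,  True,  True):  0.001,  (True,  True,  False): 0.008,
    (True,  False, True):  0.0001, (True,  False, False): 0.0009,
    (False, True,  True):  0.01,   (False, True,  False): 0.09,
    (False, False, True):  0.001,  (False, False, False): 0.79,
}

def posterior_burglary_given_alarm(joint):
    """P(Burglary | alarm): sum out earthquake with alarm fixed, then normalize."""
    unnormalized = {
        b: sum(joint[(b, True, e)] for e in (True, False))  # alarm = true
        for b in (True, False)
    }
    alpha = 1.0 / sum(unnormalized.values())  # alpha = 1 / P(alarm)
    return {b: alpha * p for b, p in unnormalized.items()}

print(posterior_burglary_given_alarm(joint))
# -> approximately {True: 0.0826, False: 0.9174}, matching the slide's numbers
```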
Bayesian inference
• In the setting of diagnostic/evidential reasoning:
  [Diagram: a hypothesis node Hi with prior P(Hi), linked to evidence/manifestation nodes E1, …, Ej, …, Em via conditional probabilities P(Ej | Hi)]
  – Know the prior probability of the hypothesis, P(Hi), and the conditional probability P(Ej | Hi)
  – Want to compute the posterior probability P(Hi | Ej)
• Bayes' theorem (formula 1): P(Hi | Ej) = P(Hi) P(Ej | Hi) / P(Ej)

Simple Bayesian diagnostic reasoning
• Knowledge base:
  – Evidence / manifestations: E1, …, Em
  – Hypotheses / disorders: H1, …, Hn
    • Ej and Hi are binary; hypotheses are mutually exclusive (non-overlapping) and exhaustive (cover all possible cases)
  – Conditional probabilities: P(Ej | Hi), i = 1, …, n; j = 1, …, m
• Cases (evidence for a particular instance): E1, …, El
• Goal: Find the hypothesis Hi with the highest posterior
  – maxi P(Hi | E1, …, El)

Bayesian diagnostic reasoning II
• Bayes' rule says that
  – P(Hi | E1, …, El) = P(E1, …, El | Hi) P(Hi) / P(E1, …, El)
• Assume each piece of evidence Ej is conditionally independent of the others, given a hypothesis Hi; then:
  – P(E1, …, El | Hi) = Πj=1..l P(Ej | Hi)
• If we only care about relative probabilities for the Hi, then we have:
  – P(Hi | E1, …, El) = α P(Hi) Πj=1..l P(Ej | Hi)

Limitations of simple Bayesian inference
• Cannot easily handle multi-fault situations, nor cases where intermediate (hidden) causes exist:
  – Disease D causes syndrome S, which causes correlated manifestations M1 and M2
• Consider a composite hypothesis H1 ∧ H2, where H1 and H2 are independent. What is the relative posterior?
  – P(H1 ∧ H2 | E1, …, El) = α P(E1, …, El | H1 ∧ H2) P(H1 ∧ H2)
    = α P(E1, …, El | H1 ∧ H2) P(H1) P(H2)
    = α Πj=1..l P(Ej | H1 ∧ H2) P(H1) P(H2)
• How do we compute P(Ej | H1 ∧ H2)?

Limitations of simple Bayesian inference II
• Assume H1 and H2 are independent, given E1, …, El?
  – P(H1 ∧ H2 | E1, …, El) = P(H1 | E1, …, El) P(H2 | E1, …, El)
• This is a very unreasonable assumption
  – Earthquake and Burglar are independent, but not given Alarm:
    • P(burglar | alarm, earthquake) << P(burglar | alarm)
• Another limitation is that simple application of Bayes' rule doesn't allow us to handle causal chaining:
  – A: this year's weather; B: cotton production; C: next year's cotton price
  – A influences C indirectly: A → B → C
  – P(C | B, A) = P(C | B)
• Need a richer representation to model interacting hypotheses, conditional independence, and causal chaining
• Next time: conditional independence and Bayesian networks!
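A minimal Python sketch of the simple Bayesian diagnostic reasoning above, i.e. P(Hi | E1, …, El) = α P(Hi) Πj P(Ej | Hi) under the conditional-independence assumption. The hypotheses, priors, and conditional probabilities used here are made-up illustrative numbers, not from the slides.

```python
# Simple Bayesian diagnostic reasoning: score each hypothesis by
# P(Hi) * prod_j P(Ej | Hi), then normalize over the (mutually exclusive) Hi.

def diagnose(priors, likelihoods, observed_evidence):
    """Return posteriors P(Hi | E1..El), assuming the Ej are conditionally
    independent given each hypothesis Hi."""
    scores = {}
    for h, prior in priors.items():
        score = prior
        for e in observed_evidence:
            score *= likelihoods[h][e]   # P(Ej | Hi) from the knowledge base
        scores[h] = score
    alpha = 1.0 / sum(scores.values())   # normalization constant
    return {h: alpha * s for h, s in scores.items()}

# Hypothetical two-disorder, two-manifestation knowledge base (illustrative only).
priors = {"flu": 0.05, "cold": 0.20}
likelihoods = {
    "flu":  {"fever": 0.90, "cough": 0.80},
    "cold": {"fever": 0.20, "cough": 0.70},
}

print(diagnose(priors, likelihoods, ["fever", "cough"]))
# The hypothesis with the highest posterior is taken as the diagnosis.
```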
Bayesian Belief Networks (BNs)
• Definition: BN = (DAG, CPD)
  – DAG: directed acyclic graph (the BN's structure)
    • Nodes: random variables (typically binary or discrete, but methods also exist to handle continuous variables)
    • Arcs: indicate probabilistic dependencies between nodes (lack of a link signifies conditional independence)
  – CPD: conditional probability distribution (the BN's parameters)
    • Conditional probabilities at each node, usually stored as a table (conditional probability table, or CPT): P(xi | πi), where πi is the set of all parent nodes of xi
  – Root nodes are a special case – no parents, so we just use priors in the CPD: πi = ∅, so P(xi | πi) = P(xi)

Example BN
Structure: a → b, a → c, b → d, c → d, c → e
  P(A) = 0.001
  P(B|A) = 0.3      P(B|~A) = 0.001
  P(C|A) = 0.2      P(C|~A) = 0.005
  P(D|B,C) = 0.1    P(D|B,~C) = 0.01    P(D|~B,C) = 0.01    P(D|~B,~C) = 0.00001
  P(E|C) = 0.4      P(E|~C) = 0.002
Note that we only specify P(A) etc., not P(¬A), since they have to add to one

Topological semantics
• A node is conditionally independent of its non-descendants given its parents
• A node is conditionally independent of all other nodes in the network given its parents, children, and children's parents (also known as its Markov blanket)
• The method called d-separation can be applied to decide whether a set of nodes X is independent of another set Y, given a third set Z

Independence and chaining
• Independence assumption
  – P(xi | πi, q) = P(xi | πi), where q is any set of variables (nodes) other than xi and its successors
  – πi blocks the influence of other nodes on xi and its successors (q influences xi only through variables in πi)
  – With this assumption, the complete joint probability distribution of all variables in the network can be represented by (recovered from) the local CPDs by chaining them:
    P(x1, …, xn) = Πi=1..n P(xi | πi)
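A minimal Python sketch of this chaining formula, using the CPTs of the example BN above (a → b, a → c, b → d, c → d, c → e). The function names and the way the CPTs are encoded are just one possible representation, not from the slides.

```python
# Chaining local CPTs into the full joint:
# P(a, b, c, d, e) = P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)

def p(prob_true, value):
    """P(X = value), given P(X = true) = prob_true for a binary variable."""
    return prob_true if value else 1.0 - prob_true

def joint(a, b, c, d, e):
    """Full joint probability of one assignment, built from the example BN's CPTs."""
    return (p(0.001, a)                               # P(A)
            * p(0.3 if a else 0.001, b)               # P(B | A)
            * p(0.2 if a else 0.005, c)               # P(C | A)
            * p({(True, True): 0.1, (True, False): 0.01,
                 (False, True): 0.01, (False, False): 0.00001}[(b, c)], d)  # P(D | B, C)
            * p(0.4 if c else 0.002, e))              # P(E | C)

print(joint(True, True, True, True, True))   # P(a, b, c, d, e) with all variables true
```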
Chaining: Example
(Same network as above: a → b, a → c, b → d, c → d, c → e)
Computing the joint probability for all variables is easy:
  P(a, b, c, d, e)
    = P(e | a, b, c, d) P(a, b, c, d)        by the product rule
    = P(e | c) P(a, b, c, d)                 by the independence assumption
    = P(e | c) P(d | a, b, c) P(a, b, c)
    = P(e | c) P(d | b, c) P(c | a, b) P(a, b)
    = P(e | c) P(d | b, c) P(c | a) P(b | a) P(a)

Direct inference with BNs
• Now suppose we just want the probability of one variable
• Belief update method
• Original belief (no variables are instantiated): use the prior probability P(xi)
• If xi is a root, then P(xi) is given directly in the BN (the CPT at Xi)
• Otherwise,
  – P(xi) = Σπi P(xi | πi) P(πi)
• In this equation, P(xi | πi) is given in the CPT, but computing P(πi) is complicated

Computing P(πi): Example
(Same example network as above)
• P(d) = Σb,c P(d | b, c) P(b, c)
• P(b, c) = P(a, b, c) + P(¬a, b, c)   (marginalizing)
          = P(b | a, c) P(a, c) + P(b | ¬a, c) P(¬a, c)   (product rule)
          = P(b | a) P(c | a) P(a) + P(b | ¬a) P(c | ¬a) P(¬a)
• If some variables are instantiated, we can "plug that in" and reduce the amount of marginalization
• Still have to marginalize over all values of the uninstantiated parents – not computationally feasible with large networks

Representational extensions
• Compactly representing CPTs
  – Noisy-OR
  – Noisy-MAX
• Adding continuous variables
  – Discretization
  – Use density functions (usually mixtures of Gaussians) to build hybrid Bayesian networks (with discrete and continuous variables)

Inference tasks
• Simple queries: compute the posterior marginal P(Xi | E=e)
  – E.g., P(NoGas | Gauge=empty, Lights=on, Starts=false)
• Conjunctive queries:
  – P(Xi, Xj | E=e) = P(Xi | E=e) P(Xj | Xi, E=e)
• Optimal decisions: decision networks include utility information; probabilistic inference is required to find P(outcome | action, evidence)
• Value of information: Which evidence should we seek next?
• Sensitivity analysis: Which probability values are most critical?
• Explanation: Why do I need a new starter motor?

Approaches to inference
• Exact inference
  – Enumeration
  – Variable elimination
  – Clustering / join tree algorithms
• Approximate inference
  – Stochastic simulation / sampling methods
  – Markov chain Monte Carlo methods
  – Genetic algorithms
  – Neural networks
  – Simulated annealing
  – Mean field theory
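The marginalization in the "Computing P(πi)" example above can be sketched in a few lines of Python, assuming the example BN's CPTs from the earlier slide; the variable names and dictionary encoding are illustrative, not from the slides.

```python
# Computing the marginal P(d) by summing over the uninstantiated parents:
#   P(d) = sum_{b,c} P(d | b, c) P(b, c)
#   P(b, c) = P(b | a) P(c | a) P(a) + P(b | ~a) P(c | ~a) P(~a)

from itertools import product

P_A = 0.001
P_B_GIVEN_A = {True: 0.3, False: 0.001}        # P(B | A), P(B | ~A)
P_C_GIVEN_A = {True: 0.2, False: 0.005}
P_D_GIVEN_BC = {(True, True): 0.1, (True, False): 0.01,
                (False, True): 0.01, (False, False): 0.00001}

def bernoulli(p_true, value):
    """P(X = value) given P(X = true) for a binary variable."""
    return p_true if value else 1.0 - p_true

def p_bc(b, c):
    """P(b, c) = sum_a P(b | a) P(c | a) P(a), using B and C independent given A."""
    return sum(bernoulli(P_B_GIVEN_A[a], b)
               * bernoulli(P_C_GIVEN_A[a], c)
               * bernoulli(P_A, a)
               for a in (True, False))

# P(d): marginalize over all values of the parents b and c.
p_d = sum(P_D_GIVEN_BC[(b, c)] * p_bc(b, c)
          for b, c in product((True, False), repeat=2))
print(p_d)
```

With larger networks this kind of marginalization has to range over every combination of uninstantiated variables, which is exactly the blow-up the slide warns about and the motivation for the exact and approximate inference methods listed above.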