Bayesian Statistics and Belief Networks

Overview
• Book: Ch. 8.3
• Refresher on Bayesian statistics
• Bayesian classifiers
• Belief networks / Bayesian networks

Why Should We Care?
• Theoretical framework for machine learning, classification, knowledge representation, and analysis
• Bayesian methods are capable of handling noisy, incomplete data sets
• Bayesian methods are commonly in use today

Bayesian Approach To Probability and Statistics
• Classical probability: a physical property of the world (e.g., the 50% chance of heads on a flip of a fair coin). "True" probability.
• Bayesian probability: a person's degree of belief in event X. Personal probability.
• Unlike classical probabilities, Bayesian probabilities benefit from, but do not require, repeated trials; they can focus on just the next event, e.g., the probability that the Seawolves win their next game.

Bayes Rule
Product rule: P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
Equating the two right-hand sides:
  P(B|A) = P(A|B) P(B) / P(A)
i.e., for classification:
  P(Class|evidence) = P(evidence|Class) P(Class) / P(evidence)
All classification methods can be seen as estimates of Bayes' rule, with different techniques to estimate P(evidence|Class).

Simple Bayes Rule Example
The probability your computer has a virus, V, is 1/1000. If it is virused, the probability of a crash that day, C, is 4/5. The probability your computer crashes in one day is 1/10. So:
  P(C|V) = 0.8,  P(V) = 0.001,  P(C) = 0.1
  P(V|C) = P(C|V) P(V) / P(C) = (0.8)(0.001) / (0.1) = 0.008
Even though a crash is a strong indicator of a virus, we expect only 8 out of 1000 crashes to be caused by viruses. Why not compute P(V|C) from direct evidence? Causal vs. diagnostic knowledge; consider what happens if P(C) suddenly drops.

Bayesian Classifiers
  P(Class|evidence) = P(evidence|Class) P(Class) / P(evidence)
If we are selecting the single most likely class, we only need to find the class that maximizes P(evidence|Class) P(Class). The hard part is estimating P(evidence|Class). The evidence typically consists of a set of observations:
  E = (e_1, e_2, ..., e_n)
The usual simplifying assumption is conditional independence:
  P(e|C) = ∏_i P(e_i|C),  so that  P(C|e) = P(C) ∏_i P(e_i|C) / P(e)

Bayesian Classifier Example

                 P(C)   P(crashes|C)   P(diskfull|C)
  C = Virus      0.4    0.1            0.6
  C = Bad Disk   0.6    0.2            0.1

Given a case where the disk is full and the computer crashes, the classifier chooses Virus as most likely, since (0.4)(0.1)(0.6) = 0.024 > (0.6)(0.2)(0.1) = 0.012.

Beyond Conditional Independence
[Figure: linear classifier separating classes C1 and C2]
• Include second-order dependencies, i.e., pairwise combinations of variables via joint probabilities, applied as a correction factor to the first-order (independence-based) estimate P1(e|c) to obtain P2(e|c).
• Difficult to compute: the number of pairwise joint probabilities to consider grows on the order of n².

Belief Networks
• A DAG that represents the dependencies between variables and specifies the joint probability distribution
• Random variables make up the nodes
• Directed links represent direct causal influences
• Each node has a conditional probability table quantifying the effects of its parents
• No directed cycles

Burglary Alarm Example
[Figure: Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls]
  P(B) = 0.001        P(E) = 0.002
  P(A | B, E):   B=T, E=T: 0.95    B=T, E=F: 0.94    B=F, E=T: 0.29    B=F, E=F: 0.001
  P(J | A):      A=T: 0.90    A=F: 0.05
  P(M | A):      A=T: 0.70    A=F: 0.01

Using The Belief Network
The network specifies the full joint distribution:
  P(x_1, x_2, ..., x_n) = ∏_i P(x_i | Parents(X_i))
Probability that the alarm sounds, there is no burglary or earthquake, and both John and Mary call:
  P(J|A) P(M|A) P(A|¬B ∧ ¬E) P(¬B) P(¬E) = (0.9)(0.7)(0.001)(0.999)(0.998) ≈ 0.00063
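As a concrete check on the factorization above, here is a minimal Python sketch (not part of the original slides) that stores the burglary network's CPTs in plain dictionaries and multiplies the relevant entries. The names P_B, P_A, joint, etc. are illustrative choices, not a standard API.

```python
# Minimal sketch: the burglary network's CPTs as plain dictionaries.
# True/False stand for the T/F values in the tables above.

P_B = {True: 0.001, False: 0.999}                     # P(Burglary)
P_E = {True: 0.002, False: 0.998}                     # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,       # P(Alarm=T | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                       # P(JohnCalls=T | A)
P_M = {True: 0.70, False: 0.01}                       # P(MaryCalls=T | A)


def p_alarm(a, b, e):
    """P(A=a | B=b, E=e) from the CPT."""
    p_true = P_A[(b, e)]
    return p_true if a else 1.0 - p_true


def p_call(table, call, a):
    """P(call | A=a) for either John or Mary."""
    p_true = table[a]
    return p_true if call else 1.0 - p_true


def joint(b, e, a, j, m):
    """Full joint via the chain rule: product of P(x_i | Parents(X_i))."""
    return (P_B[b] * P_E[e] * p_alarm(a, b, e)
            * p_call(P_J, j, a) * p_call(P_M, m, a))


# Alarm sounds, no burglary or earthquake, both John and Mary call:
print(joint(b=False, e=False, a=True, j=True, m=True))   # ~0.00063
```

Any of the 2⁵ = 32 full assignments can be evaluated the same way, which is what the enumeration queries later in these notes rely on.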
Belief Computations
• Two types; both are NP-hard
• Belief revision
  – Models explanatory/diagnostic tasks
  – Given evidence, what is the most likely hypothesis to explain the evidence?
  – Also called abductive reasoning
• Belief updating
  – Queries
  – Given evidence, what is the probability of some other random variable occurring?

Belief Revision
• Given some evidence variables, find the state of all other variables that maximizes the probability.
• E.g., we know that John calls but Mary does not. What is the most likely state? Consider only assignments where J=T and M=F, and maximize. The best assignment is B=F, E=F, A=F:
  P(¬B) P(¬E) P(¬A|¬B ∧ ¬E) P(J|¬A) P(¬M|¬A) = (0.999)(0.998)(0.999)(0.05)(0.99) ≈ 0.049

Belief Updating
• Causal inferences: evidence at a cause, query an effect
• Diagnostic inferences: evidence at an effect, query a cause
• Intercausal inferences: evidence and query are causes of a common effect
• Mixed inferences: combinations of the above

Causal Inferences
Inference from cause to effect, e.g., given a burglary, what is P(J|B)?
  P(A|B) = P(E)(0.95) + P(¬E)(0.94) = (0.002)(0.95) + (0.998)(0.94) ≈ 0.94
  P(J|B) = P(A|B)(0.9) + P(¬A|B)(0.05) = (0.94)(0.9) + (0.06)(0.05) ≈ 0.85
  P(M|B) ≈ 0.66 via similar calculations

Diagnostic Inferences
Inference from effect to cause, e.g., given that John calls, what is P(B|J)?
  P(B|J) = P(J|B) P(B) / P(J)
What is P(J)? We need P(A) first:
  P(A) = P(B)P(E)(0.95) + P(¬B)P(E)(0.29) + P(B)P(¬E)(0.94) + P(¬B)P(¬E)(0.001)
       = (0.001)(0.002)(0.95) + (0.999)(0.002)(0.29) + (0.001)(0.998)(0.94) + (0.999)(0.998)(0.001)
       ≈ 0.0025
  P(J) = P(A)(0.9) + P(¬A)(0.05) = (0.0025)(0.9) + (0.9975)(0.05) ≈ 0.052
  P(B|J) = (0.85)(0.001) / (0.052) ≈ 0.016
Many false positives.

Intercausal Inferences
"Explaining away" inferences. Given an alarm, P(B|A) ≈ 0.37; but if we add the evidence that an earthquake occurred, P(B|A ∧ E) ≈ 0.003. Even though B and E are a priori independent, once their common effect is observed the presence of one can make the other more or less likely.

Mixed Inferences
Simultaneous intercausal and diagnostic inference, e.g., if John calls and Earthquake is false:
  P(A | J ∧ ¬E) ≈ 0.03
  P(B | J ∧ ¬E) ≈ 0.017
Computing these values exactly is somewhat complicated.
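The hand calculations above (P(J|B) ≈ 0.85, P(B|J) ≈ 0.016, P(B|A) ≈ 0.37, P(B|A ∧ E) ≈ 0.003) can be reproduced by brute-force enumeration over the full joint. The sketch below assumes the joint function and CPT dictionaries from the earlier snippet; enumerate_query is an illustrative helper, and summing over every assignment is only feasible for a network this small — it is not the polytree algorithm described next.

```python
from itertools import product

def enumerate_query(query_var, evidence):
    """P(query_var=True | evidence) by summing the full joint over all
    assignments consistent with the evidence (feasible only for tiny nets)."""
    names = ("b", "e", "a", "j", "m")
    totals = {True: 0.0, False: 0.0}
    for values in product([True, False], repeat=5):
        assignment = dict(zip(names, values))
        if all(assignment[var] == val for var, val in evidence.items()):
            totals[assignment[query_var]] += joint(**assignment)
    return totals[True] / (totals[True] + totals[False])

print(enumerate_query("j", {"b": True}))               # ~0.85  (causal)
print(enumerate_query("b", {"j": True}))               # ~0.016 (diagnostic)
print(enumerate_query("b", {"a": True}))               # ~0.37  (intercausal)
print(enumerate_query("b", {"a": True, "e": True}))    # ~0.003 (explaining away)
```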
Exact Computation: Polytree Algorithm
• Judea Pearl, 1982
• Only works on singly connected networks: at most one undirected path between any two nodes
• A backward-chaining, message-passing algorithm for computing posterior probabilities for a query node X
  – Compute causal support for X: evidence variables "above" X
  – Compute evidential support for X: evidence variables "below" X

Polytree Computation
[Figure: query node X with parents U(1) ... U(m) above and children Y(1) ... Y(n) below; Z(i,j) are the other parents of child Y(i); E⁺_X is the evidence connected to X through its parents, E⁻_X the evidence connected through its children]
  P(X | E) ∝ P(X | E⁺_X) P(E⁻_X | X)
  P(X | E⁺_X) = Σ_u P(X | u) ∏_i P(u_i | E_{U_i \ X})
  P(E⁻_X | X) ∝ ∏_i Σ_{y_i} P(E⁻_{Y_i} | y_i) Σ_{z_i} P(y_i | X, z_i) ∏_j P(z_ij | E_{Z_ij \ Y_i})
The algorithm is recursive, implemented as a chain of passed messages.

Other Query Methods
• Exact algorithms
  – Clustering: cluster nodes to make a single cluster; message-pass along that cluster
  – Symbolic Probabilistic Inference: uses d-separation to find expressions to combine
• Approximate algorithms
  – Select a sampling distribution, conduct trials sampling from the root nodes down to the evidence nodes, accumulating a weight for each node; still tractable for dense networks
  – Forward simulation
  – Stochastic simulation

Summary
• Bayesian methods provide a sound theory and framework for implementing classifiers
• Bayesian networks are a natural way to represent conditional independence information: qualitative information in the links, quantitative information in the tables
• Computing exact values is NP-hard, so it is typical to make simplifying assumptions or to use approximate methods
• Many Bayesian tools and systems exist

References
• Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice Hall.
• Weiss, S. and Kulikowski, C. (1991). Computer Systems That Learn. Morgan Kaufmann.
• Heckerman, D. (1996). A Tutorial on Learning with Bayesian Networks. Microsoft Research Technical Report MSR-TR-95-06.
• Internet resources on Bayesian networks and machine learning: http://www.cs.orst.edu/~wangxi/resource.html