Bayesian Statistics and Belief Networks

Overview
• Book: Ch 13, 14
• Refresher on Probability
• Bayesian classifiers
• Belief Networks / Bayesian Networks
Why Should We Care?
• Theoretical framework for machine learning, classification, knowledge representation, analysis
• Bayesian methods are capable of handling noisy, incomplete data sets
• Bayesian methods are commonly in use today
Bayesian Approach To Probability and Statistics
• Classical Probability: a physical property of the world (e.g., the 50% chance of heads on a fair coin flip). "True" probability.
• Bayesian Probability: a person's degree of belief in event X. Personal probability.
• Unlike classical probabilities, Bayesian probabilities benefit from, but do not require, repeated trials; they can focus on a single next event, e.g., the probability that the Seawolves win their next game.
Uncertainty
Methods for Handling Uncertainty
Probability
Making Decisions Under Uncertainty
Probability Basics
Random Variables
Prior Probability
Conditional Probability
Inference by Enumeration
Bayes Rule
Product Rule:

    P(A ∧ B) = P(A|B) P(B)
    P(A ∧ B) = P(B|A) P(A)

Equating sides:

    P(B|A) = P(A|B) P(B) / P(A)

i.e.

    P(Class|evidence) = P(evidence|Class) P(Class) / P(evidence)

All classification methods can be seen as estimates of Bayes' Rule, with different techniques to estimate P(evidence|Class).
Simple Bayes Rule Example
Probability your computer has a virus: P(V) = 1/1000.
If virused, the probability of a crash that day: P(C|V) = 4/5 = 0.8.
Probability your computer crashes in one day: P(C) = 1/10.

    P(V|C) = P(C|V) P(V) / P(C) = (0.8)(0.001) / (0.1) = 0.008

Even though a crash is a strong indicator of a virus, we expect only 8/1000 crashes to be caused by viruses.
Why not compute P(V|C) from direct evidence? Causal vs. diagnostic knowledge; (consider what happens if P(C) suddenly drops).
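The arithmetic above can be checked in a few lines of Python (a minimal sketch; the variable names are ours, not from the slides):

```python
# Bayes' rule for the virus/crash example: P(V|C) = P(C|V) P(V) / P(C).
p_v = 0.001          # P(V): prior probability the computer has a virus
p_c_given_v = 0.8    # P(C|V): probability of a crash given a virus
p_c = 0.1            # P(C): overall probability of a crash in one day

p_v_given_c = p_c_given_v * p_v / p_c
print(round(p_v_given_c, 3))  # 0.008
```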
Bayesian Classifiers

    P(Class|evidence) = P(evidence|Class) P(Class) / P(evidence)

If we're selecting the single most likely class, we only need to find the class that maximizes P(e|Class) P(Class). The hard part is estimating P(e|Class).
Evidence e typically consists of a set of observations:

    E = (e1, e2, ..., en)

The usual simplifying assumption is conditional independence:

    P(e|C) = Π(i=1..n) P(ei|C),  so  P(C|e) = P(C) Π(i=1..n) P(ei|C) / P(e)
Bayesian Classifier Example

    Probability      C=Virus   C=Bad Disk
    P(C)             0.4       0.6
    P(crashes|C)     0.1       0.2
    P(diskfull|C)    0.6       0.1

Given a case where the disk is full and the computer crashes, the classifier chooses Virus as most likely, since (0.4)(0.1)(0.6) = 0.024 > (0.6)(0.2)(0.1) = 0.012.
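Under the conditional-independence assumption, class selection reduces to maximizing P(C)·Π P(ei|C). A minimal Python sketch of the example above (table values from the slide; all helper names are ours):

```python
# Naive Bayesian classifier for the virus vs. bad-disk example.
# The score for each class is P(C) * prod_i P(e_i|C); P(e) cancels out.
priors = {"Virus": 0.4, "Bad Disk": 0.6}
likelihoods = {
    "Virus":    {"crashes": 0.1, "diskfull": 0.6},
    "Bad Disk": {"crashes": 0.2, "diskfull": 0.1},
}

def classify(evidence):
    """Return the maximum-score class and all class scores."""
    scores = {}
    for c in priors:
        score = priors[c]
        for e in evidence:
            score *= likelihoods[c][e]
        scores[c] = score
    return max(scores, key=scores.get), scores

best, scores = classify(["crashes", "diskfull"])
print(best)  # Virus  (0.024 vs. 0.012)
```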
Beyond Conditional Independence
(Figure: a linear classifier separating classes C1 and C2.)
• Include second-order dependencies, i.e., pairwise combinations of variables via joint probabilities:

    P2(e|c) = P1(e|c) [1 + Δ·P1(e|c)]    (Δ: correction factor)

• Difficult to compute: there are C(n,2) = n(n−1)/2 pairwise joint probabilities to consider
Belief Networks
• DAG that represents the dependencies between variables and specifies the joint probability distribution
• Random variables make up the nodes
• Directed links represent direct causal influences
• Each node has a conditional probability table quantifying the effects from its parents
• No directed cycles
Burglary Alarm Example

Network structure: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls.

    P(B) = 0.001        P(E) = 0.002

    Alarm CPT:          JohnCalls CPT:    MaryCalls CPT:
    B  E  P(A)          A  P(J)           A  P(M)
    T  T  0.95          T  0.90           T  0.70
    T  F  0.94          F  0.05           F  0.01
    F  T  0.29
    F  F  0.001

Sample Bayesian Network
Using The Belief Network
(Network and CPTs as in the Burglary Alarm Example.)

    P(x1, x2, ..., xn) = Π(i=1..n) P(xi | Parents(Xi))

Probability of alarm, no burglary or earthquake, and both John and Mary calling:

    P(J|A) P(M|A) P(A|¬B ∧ ¬E) P(¬B) P(¬E)
    = (0.9)(0.7)(0.001)(0.999)(0.998) ≈ 0.00062
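The chain-rule product above can be computed directly from the CPTs (a sketch; the dictionary encoding is ours):

```python
# Joint probability of a full assignment in the burglary network,
# P(x1,...,xn) = prod_i P(xi | Parents(Xi)), using the CPTs from the slides.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,     # P(A=T | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                     # P(J=T | A)
P_M = {True: 0.70, False: 0.01}                     # P(M=T | A)

def joint(b, e, a, j, m):
    pa = P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
    pj = P_J[a] if j else 1.0 - P_J[a]
    pm = P_M[a] if m else 1.0 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# Alarm, no burglary, no earthquake, both John and Mary call:
p = joint(False, False, True, True, True)
print(round(p, 5))  # 0.00063
```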
Belief Computations
• Two types; both are NP-hard
• Belief Revision
  – Models explanatory/diagnostic tasks
  – Given evidence, what is the most likely hypothesis to explain the evidence?
  – Also called abductive reasoning
• Belief Updating
  – Queries
  – Given evidence, what is the probability of some other random variable occurring?
Belief Revision
• Given some evidence variables, find the state of all other variables that maximizes the probability.
• E.g.: We know John calls but Mary does not. What is the most likely state? Only consider assignments where J=T and M=F, and maximize. Best: everything else false:

    P(¬B) P(¬E) P(¬A|¬B ∧ ¬E) P(J|¬A) P(¬M|¬A)
    = (0.999)(0.998)(0.999)(0.05)(0.99) ≈ 0.049
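Since the network has only five Boolean variables, belief revision can be done by brute force: enumerate the 2³ assignments to the unobserved variables and keep the one with the highest joint probability (a sketch; helper names are ours):

```python
from itertools import product

# Brute-force belief revision in the burglary network: evidence J=T, M=F.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

def joint(b, e, a, j, m):
    pa = P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
    pj = P_J[a] if j else 1.0 - P_J[a]
    pm = P_M[a] if m else 1.0 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# Enumerate all assignments to (B, E, A) with J=T, M=F fixed.
best_p, best_state = max(
    (joint(b, e, a, True, False), (b, e, a))
    for b, e, a in product([True, False], repeat=3))
print(best_state, round(best_p, 3))  # (False, False, False) 0.049
```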
Belief Updating
• Causal Inferences: from evidence E (a cause) down to query Q (an effect)
• Diagnostic Inferences: from evidence E (an effect) up to query Q (a cause)
• Intercausal Inferences: between causes Q and E of a common effect
• Mixed Inferences: combinations of the above
Causal Inferences
Inference from cause to effect. E.g., given a burglary, what is P(J|B)?
(Network and CPTs as in the Burglary Alarm Example.)

    P(A|B) = P(E)(0.95) + P(¬E)(0.94)
           = (0.002)(0.95) + (0.998)(0.94) ≈ 0.94

    P(J|B) = P(A|B)(0.9) + P(¬A|B)(0.05)
           = (0.94)(0.9) + (0.06)(0.05) ≈ 0.85

P(M|B) ≈ 0.66 by a similar calculation.
Diagnostic Inferences
From effect to cause. E.g., given that John calls, what is P(B|J)?

    P(B|J) = P(J|B) P(B) / P(J)

What is P(J)? We need P(A) first:

    P(A) = P(B)P(E)(0.95) + P(¬B)P(E)(0.29) + P(B)P(¬E)(0.94) + P(¬B)P(¬E)(0.001)
         = (0.001)(0.002)(0.95) + (0.999)(0.002)(0.29)
           + (0.001)(0.998)(0.94) + (0.999)(0.998)(0.001)
         ≈ 0.002517

    P(J) = P(A)(0.9) + P(¬A)(0.05)
         = (0.002517)(0.9) + (0.997483)(0.05) ≈ 0.052

    P(B|J) = (0.85)(0.001) / (0.052) ≈ 0.016

Many false positives.
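The same enumeration machinery gives the diagnostic query exactly, without the intermediate rounding of the hand calculation above (helper names are ours):

```python
from itertools import product

# Diagnostic inference P(B=T | J=T) by enumeration in the burglary network.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

def joint(b, e, a, j, m):
    pa = P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
    pj = P_J[a] if j else 1.0 - P_J[a]
    pm = P_M[a] if m else 1.0 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

num = sum(joint(True, e, a, True, m)      # P(B=T, J=T)
          for e, a, m in product([True, False], repeat=3))
den = sum(joint(b, e, a, True, m)         # P(J=T)
          for b, e, a, m in product([True, False], repeat=4))
print(round(num / den, 3))  # 0.016
```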
Intercausal Inferences
"Explaining away" inferences. Given an alarm, P(B|A) ≈ 0.37. But if we add the evidence that an earthquake occurred, then P(B|A ∧ E) ≈ 0.003.
Even though B and E are independent a priori, once their common effect is observed, the presence of one can make the other more or less likely.
Mixed Inferences
Simultaneous intercausal and diagnostic inference. E.g., if John calls and Earthquake is false:

    P(A | J ∧ ¬E) ≈ 0.03
    P(B | J ∧ ¬E) ≈ 0.017

Computing these values exactly is somewhat complicated.
Exact Computation: Polytree Algorithm
• Judea Pearl, 1982
• Only works on singly connected networks: at most one undirected path between any two nodes.
• Backward-chaining, message-passing algorithm for computing posterior probabilities for a query node X:
  – Compute causal support for X: evidence variables "above" X
  – Compute evidential support for X: evidence variables "below" X
Polytree Computation
(Figure: query node X with parents U(1), ..., U(m) above and children Y(1), ..., Y(n) below; each child Y(i) has additional parents Z(i,j). E⁺_X denotes the evidence above X, E⁻_X the evidence below X.)

    P(X|E) = α P(X | E⁺_X) P(E⁻_X | X)

    P(X | E⁺_X) = Σ_u P(X|u) Π_i P(u_i | E_{U_i \ X})

    P(E⁻_X | X) = β Π_i Σ_{y_i} P(E⁻_{Y_i} | y_i) Σ_{z_ij} P(y_i | X, z_ij) Π_j P(z_ij | E_{Z_ij \ Y_i})

The algorithm is recursive: a message-passing chain.
Other Query Methods
• Exact Algorithms
  – Clustering
    • Cluster nodes into a single cluster; message-pass along that cluster
  – Symbolic Probabilistic Inference
    • Uses d-separation to find expressions to combine
• Approximate Algorithms
  – Select a sampling distribution, conduct trials sampling from the roots down to the evidence nodes, accumulating a weight for each node. Still tractable for dense networks.
    • Forward Simulation
    • Stochastic Simulation
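One concrete instance of the sampling idea above is likelihood weighting: sample non-evidence variables from the roots forward, and weight each sample by the probability of the evidence values it fixed. A minimal sketch on the burglary network (all helper names are ours; the estimate is stochastic and only approximates the exact value):

```python
import random

# Likelihood weighting on the burglary network: estimate P(B=T | J=T).
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

def weighted_sample(evidence):
    """Sample top-down; evidence variables are fixed and weighted instead."""
    w = 1.0
    b = random.random() < P_B[True]
    e = random.random() < P_E[True]
    a = random.random() < P_A[(b, e)]
    sample = {"B": b, "E": e, "A": a}
    for name, cpt in (("J", P_J), ("M", P_M)):
        if name in evidence:
            sample[name] = evidence[name]
            w *= cpt[a] if evidence[name] else 1.0 - cpt[a]
        else:
            sample[name] = random.random() < cpt[a]
    return sample, w

def estimate(query, evidence, trials=200_000):
    num = den = 0.0
    for _ in range(trials):
        s, w = weighted_sample(evidence)
        den += w
        if s[query]:
            num += w
    return num / den

random.seed(0)
print(estimate("B", {"J": True}))  # close to the exact value of about 0.016
```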
Summary
• Bayesian methods provide a sound theory and framework for implementing classifiers
• Bayesian networks are a natural way to represent conditional independence information. Qualitative information lives in the links, quantitative information in the tables.
• Computing exact values is NP-hard; it is typical to make simplifying assumptions or use approximate methods.
• Many Bayesian tools and systems exist
References
• Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice Hall.
• Weiss, S. and Kulikowski, C. (1991). Computer Systems That Learn. Morgan Kaufmann.
• Heckerman, D. (1996). A Tutorial on Learning with Bayesian Networks. Microsoft Technical Report MSR-TR-95-06.
• Internet resources on Bayesian networks and machine learning: http://www.cs.orst.edu/~wangxi/resource.html