Tutorial on Bayesian Networks
Jack Breese & Daphne Koller
First given as an AAAI'97 tutorial.

Overview
■ Decision-theoretic techniques
  ◆ Explicit management of uncertainty and tradeoffs
  ◆ Probability theory
  ◆ Maximization of expected utility
■ Applications to AI problems
  ◆ Diagnosis
  ◆ Expert systems
  ◆ Planning
  ◆ Learning
Applications: Science - AAAI-97
■ Model Minimization in Markov Decision Processes
■ Effective Bayesian Inference for Stochastic Programs
■ Learning Bayesian Networks from Incomplete Data
■ Summarizing CSP Hardness With Continuous Probability Distributions
■ Speeding Safely: Multi-criteria Optimization in Probabilistic Planning
■ Structured Solution Methods for Non-Markovian Decision Processes

Microsoft's cost-cutting helps users (04/21/97)
A Microsoft Corp. strategy to cut its support costs by letting users solve their own problems using electronic means is paying off for users. In March, the company began rolling out a series of Troubleshooting Wizards on its World Wide Web site.
Troubleshooting Wizards save time and money for users who don't have Windows NT specialists on hand at all times, said Paul Soares, vice president and general manager of Alden Buick Pontiac, a General Motors Corp. car dealership in Fairhaven, Mass.
Teenage Bayes
Microsoft Researchers Exchange Brainpower with Eighth-grader
Teenager Designs Award-Winning Science Project
... For her science project, which she called "Dr. Sigmund Microchip," Tovar wanted to create a computer program to diagnose the probability of certain personality types. With only answers from a few questions, the program was able to accurately diagnose the correct personality type 90 percent of the time.

Course Contents
» Concepts in Probability
  ◆ Probability
  ◆ Random variables
  ◆ Basic properties (Bayes rule)
■ Bayesian Networks
■ Inference
■ Decision making
■ Learning networks from data
■ Reasoning over time
■ Applications
Probabilities
■ Probability distribution P(X | ξ)
  ◆ X is a random variable
    ■ Discrete
    ■ Continuous
  ◆ ξ is background state of information

Discrete Random Variables
■ Finite set of possible outcomes
  X ∈ {x1, x2, x3, ..., xn}
  P(xi) ≥ 0
  Σi P(xi) = 1
■ X binary: P(x) + P(¬x) = 1
[Bar chart: an example distribution over X1, X2, X3, X4]
Continuous Random Variable
■ Probability distribution (density function) over continuous values
  X ∈ [0, 10]
  P(x) ≥ 0
  ∫ from 0 to 10 of P(x) dx = 1
  P(5 ≤ X ≤ 7) = ∫ from 5 to 7 of P(x) dx
[Plot: density P(x) with the area between x = 5 and x = 7 shaded]

More Probabilities
■ Joint
  P(x, y) ≡ P(X = x ∧ Y = y)
  ◆ Probability that both X = x and Y = y
■ Conditional
  P(x | y) ≡ P(X = x | Y = y)
  ◆ Probability that X = x given we know that Y = y
Rules of Probability
■ Product Rule
  P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X)
  P(H, E) = P(H | E) P(E) = P(E | H) P(H)
■ Marginalization
  P(Y) = Σi P(Y, xi)
  X binary: P(Y) = P(Y, x) + P(Y, ¬x)
■ Bayes Rule
  P(H | E) = P(E | H) P(H) / P(E)
Course Contents
■ Concepts in Probability
» Bayesian Networks
■ Inference
■ Decision making
■ Learning networks from data
■ Reasoning over time
■ Applications

Bayesian networks
■ Basics
  ◆ Structured representation
  ◆ Conditional independence
  ◆ Naïve Bayes model
  ◆ Independence facts
■ Additional structure
■ Knowledge acquisition
Bayesian Networks
Smoking:  S ∈ {no, light, heavy}
  P(S=no)    0.80
  P(S=light) 0.15
  P(S=heavy) 0.05
Cancer:  C ∈ {none, benign, malignant}
  Smoking=   P(C=none)  P(C=benign)  P(C=malig)
  no         0.96       0.03         0.01
  light      0.88       0.08         0.04
  heavy      0.60       0.25         0.15

Product Rule
■ P(C, S) = P(C | S) P(S)
  S⇓ C⇒   none    benign   malignant
  no      0.768   0.024    0.008
  light   0.132   0.012    0.006
  heavy   0.035   0.010    0.005
Marginalization
  S⇓ C⇒   none    benign  malig   total
  no      0.768   0.024   0.008   .80
  light   0.132   0.012   0.006   .15
  heavy   0.035   0.010   0.005   .05
  total   0.935   0.046   0.019
  Row totals give P(Smoke); column totals give P(Cancer).

Bayes Rule Revisited
  P(S | C) = P(C | S) P(S) / P(C) = P(C, S) / P(C)

  S⇓ C⇒   none        benign      malig
  no      0.768/.935  0.024/.046  0.008/.019
  light   0.132/.935  0.012/.046  0.006/.019
  heavy   0.035/.935  0.010/.046  0.005/.019

  Cancer=     P(S=no)  P(S=light)  P(S=heavy)
  none        0.821    0.141       0.037
  benign      0.522    0.261       0.217
  malignant   0.421    0.316       0.263
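To make the arithmetic on these slides easy to check, here is a small Python sketch (the dictionary representation is mine, not the tutorial's) that builds the joint P(C,S) from the two tables above, marginalizes, and applies Bayes rule:

```python
# Smoking/Cancer example: product rule, marginalization, Bayes rule.
p_s = {"no": 0.80, "light": 0.15, "heavy": 0.05}                  # P(S)
p_c_given_s = {                                                   # P(C | S)
    "no":    {"none": 0.96, "benign": 0.03, "malignant": 0.01},
    "light": {"none": 0.88, "benign": 0.08, "malignant": 0.04},
    "heavy": {"none": 0.60, "benign": 0.25, "malignant": 0.15},
}

# Product rule: P(C, S) = P(C | S) P(S)
joint = {(c, s): p_c_given_s[s][c] * p_s[s]
         for s in p_s for c in p_c_given_s[s]}

# Marginalization: P(C) = sum over s of P(C, S=s)
p_c = {c: sum(joint[(c, s)] for s in p_s)
       for c in ("none", "benign", "malignant")}

# Bayes rule: P(S | C) = P(C, S) / P(C)
p_s_given_c = {(s, c): joint[(c, s)] / p_c[c] for (c, s) in joint}

print({c: round(v, 3) for c, v in p_c.items()})
print({k: round(v, 3) for k, v in p_s_given_c.items()})
```

Compare the printed values against the marginalization and Bayes-rule tables above.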
A Bayesian Network
[Graph: Age → Exposure to Toxics; Age, Gender → Smoking; Exposure to Toxics, Smoking → Cancer; Cancer → Serum Calcium, Lung Tumor]

Independence
Age and Gender are independent.
  P(A, G) = P(G) P(A)
  P(A | G) = P(A)    A ⊥ G
  P(G | A) = P(G)    G ⊥ A
  P(A, G) = P(G | A) P(A) = P(G) P(A)
  P(A, G) = P(A | G) P(G) = P(A) P(G)
Conditional Independence
Cancer is independent of Age and Gender given Smoking.
  P(C | A, G, S) = P(C | S)      C ⊥ A, G | S

More Conditional Independence: Naïve Bayes
■ Serum Calcium and Lung Tumor are dependent.
■ Serum Calcium is independent of Lung Tumor, given Cancer.
  P(L | SC, C) = P(L | C)
Naïve Bayes in general
  H → E1, E2, E3, ..., En
2n + 1 parameters:
  P(h),  P(ei | h),  P(ei | ¬h),  i = 1, ..., n

More Conditional Independence: Explaining Away
■ Exposure to Toxics and Smoking are independent:  E ⊥ S
■ Exposure to Toxics is dependent on Smoking, given Cancer:
  P(E = heavy | C = malignant) > P(E = heavy | C = malignant, S = heavy)
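As a concrete reading of the 2n+1 parameter count, here is a minimal Python sketch of naïve Bayes inference over binary evidence variables; the names and numbers are illustrative, not from the tutorial:

```python
# Naive Bayes over a binary hypothesis H and binary evidence variables E1..En.
p_h = 0.1                                   # P(h): 1 parameter
p_e_given_h    = [0.8, 0.7, 0.9]            # P(ei | h):     n parameters
p_e_given_noth = [0.3, 0.4, 0.2]            # P(ei | not h): n parameters

def posterior(evidence):
    """Return P(h | e1..en) for a tuple of observed booleans."""
    like_h, like_noth = p_h, 1.0 - p_h
    for ei, ph, pn in zip(evidence, p_e_given_h, p_e_given_noth):
        like_h    *= ph if ei else (1.0 - ph)
        like_noth *= pn if ei else (1.0 - pn)
    return like_h / (like_h + like_noth)

print(posterior((True, True, False)))
```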
Put it all together
  P(A, G, E, S, C, L, SC) =
    P(A) · P(G) · P(E | A) · P(S | A, G) · P(C | E, S) · P(SC | C) · P(L | C)

General Product (Chain) Rule for Bayesian Networks
  P(X1, X2, ..., Xn) = Π over i = 1..n of P(Xi | Pai)
  Pai = parents(Xi)
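A sketch of how the chain rule turns a network structure into a full joint probability; the generic function and the tiny worked example are mine, and only the factorization structure comes from the slides:

```python
# Generic chain rule: P(X1,...,Xn) = prod_i P(Xi | parents(Xi)).
def joint_probability(assignment, parents, cpt):
    """assignment: {var: value}; parents: {var: (parent vars)};
    cpt[var] maps (value, tuple of parent values) -> probability."""
    p = 1.0
    for var, value in assignment.items():
        pa_vals = tuple(assignment[pa] for pa in parents[var])
        p *= cpt[var][(value, pa_vals)]
    return p

# Structure of the cancer network from the slide (CPT entries omitted):
cancer_parents = {"A": (), "G": (), "E": ("A",), "S": ("A", "G"),
                  "C": ("E", "S"), "SC": ("C",), "L": ("C",)}

# Worked two-node example A -> B with made-up numbers:
parents = {"A": (), "B": ("A",)}
cpt = {"A": {(True, ()): 0.3, (False, ()): 0.7},
       "B": {(True, (True,)): 0.9, (False, (True,)): 0.1,
             (True, (False,)): 0.2, (False, (False,)): 0.8}}
print(joint_probability({"A": True, "B": False}, parents, cpt))  # 0.3 * 0.1
```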
Conditional Independence
A variable (node) is conditionally independent of its non-descendants given its parents.
[Cancer network: Parents of Cancer = Exposure to Toxics, Smoking; Non-Descendants = Age, Gender; Descendants = Serum Calcium, Lung Tumor]
Cancer is independent of Age and Gender given Exposure to Toxics and Smoking.

Another non-descendant
[Same network with an additional non-descendant node, Diet]
Cancer is independent of Diet given Exposure to Toxics and Smoking.
Independence and Graph Separation
■ Given a set of observations, is one set of variables dependent on another set?
■ Observing effects can induce dependencies.
■ d-separation (Pearl 1988) allows us to check conditional independence graphically.

Bayesian networks
■ Additional structure
  ◆ Nodes as functions
  ◆ Causal independence
  ◆ Context specific dependencies
  ◆ Continuous variables
  ◆ Hierarchy and model construction
Nodes as functions
■ A BN node is a conditional distribution function
  ◆ its parent values are the inputs
  ◆ its output is a distribution over its values
Example: node X with parents A and B
         a b    a ¬b   ¬a b   ¬a ¬b
  lo     0.1    0.4    0.7    0.5
  med    0.3    0.2    0.1    0.3
  hi     0.6    0.4    0.2    0.2
  e.g. for one parent instantiation, X ~ (lo: 0.7, med: 0.1, hi: 0.2)
■ Any type of function from Val(A,B) to distributions over Val(X)
Causal Independence
  Burglary, Earthquake → Alarm
■ Burglary causes Alarm iff motion sensor clear
■ Earthquake causes Alarm iff wire loose
■ Enabling factors are independent of each other

Fine-grained model
  Burglary → Motion sensed;  Earthquake → Wire Move;  Motion sensed, Wire Move → Alarm (deterministic or)
  P(Wire Move | Earthquake):      e: 1-rE    ¬e: 0
  P(Motion sensed | Burglary):    b: 1-rB    ¬b: 0
Noisy-Or model
■ Alarm false only if all mechanisms independently inhibited
  P(a) = 1 - Π over active parents X of rX
■ # of parameters is linear in the # of parents

CPCS Network
[Figure: the CPCS medical diagnosis network]
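A small Python sketch of the noisy-or combination rule above; the inhibition probabilities rX are made-up values:

```python
# Noisy-or: the alarm fails only if every active cause is independently inhibited.
def noisy_or(inhibitors, active):
    """inhibitors: {cause: r_X}; active: set of causes that are 'on'."""
    p_all_inhibited = 1.0
    for cause in active:
        p_all_inhibited *= inhibitors[cause]
    return 1.0 - p_all_inhibited

r = {"burglary": 0.05, "earthquake": 0.4}        # illustrative r_X values
print(noisy_or(r, {"burglary"}))                  # 0.95
print(noisy_or(r, {"burglary", "earthquake"}))    # 1 - 0.05*0.4 = 0.98
```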
Context-specific Dependencies
  Alarm-Set, Burglary, Cat → Alarm
■ Alarm can go off only if it is Set
■ A burglar and the cat can both set off the alarm
■ If a burglar comes in, the cat hides and does not set off the alarm

Asymmetric dependencies
■ Node function represented as a tree:
  Set?
    ¬s → (a: 0, ¬a: 1)
    s  → Burglary?
           b  → (a: 0.9, ¬a: 0.1)
           ¬b → Cat?
                  c  → (a: 0.6, ¬a: 0.4)
                  ¬c → (a: 0.01, ¬a: 0.99)
■ Alarm independent of
  ◆ Burglary, Cat given ¬s
  ◆ Cat given s and b

Asymmetric Assessment
[Figure: printer troubleshooting network with nodes Print Data, Net OK, Local OK, Net Transport, Local Transport, Location, Printer Output]

Continuous variables
  Outdoor Temperature, A/C Setting → Indoor Temperature  (e.g. 97°, setting hi)
■ Function from Val(A,B) to density functions over Val(X)
[Plot: density P(x) over Indoor Temperature]

Gaussian (normal) distributions
  X ~ N(µ, σ²)
  P(x) = (1 / (√(2π) σ)) · exp( -(x - µ)² / (2σ²) )
[Plots: normal densities with different mean and different variance]

Gaussian networks
  X ~ N(µ, σX²)      Y ~ N(ax + b, σY²)
■ Each variable is a linear function of its parents, with Gaussian noise
[Plots: joint probability density functions over X and Y]
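A quick sketch of the linear Gaussian parent-child relationship described above, with illustrative coefficients:

```python
import random

# Linear Gaussian network: X ~ N(mu, sigma_x^2), Y ~ N(a*x + b, sigma_y^2).
mu, sigma_x = 0.0, 1.0           # parameters for X (illustrative)
a, b, sigma_y = 2.0, 1.0, 0.5    # Y is a linear function of X plus Gaussian noise

def sample_xy():
    x = random.gauss(mu, sigma_x)
    y = random.gauss(a * x + b, sigma_y)
    return x, y

print([sample_xy() for _ in range(5)])
```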
Composing functions
■ Recall: a BN node is a function.
■ We can compose functions to get more complex functions.
■ The result: a hierarchically structured BN.
■ Since functions can be called more than once, we can reuse a BN model fragment in multiple contexts.
[Figure: hierarchical car model. Inputs Owner (Income, Age), Original-value, Age, Mileage, Maintenance feed submodels Car: Engine (Engine Power), Brakes (Braking-power), Tires (LF-Tire, RF-Tire, Pressure, Traction), with outputs such as Fuel-efficiency and Braking-power]
Bayesian Networks
■ Knowledge acquisition
  ◆ Variables
  ◆ Structure
  ◆ Numbers

What is a variable?
■ Collectively exhaustive, mutually exclusive values
  x1 ∨ x2 ∨ x3 ∨ x4
  ¬(xi ∧ xj)  i ≠ j
■ Values versus Probabilities
  e.g. "Error Occurred" vs. "No Error";  "Risk of Smoking" vs. "Smoking"
Clarity Test: Knowable in Principle
■ Weather {Sunny, Cloudy, Rain, Snow}
■ Gasoline: Cents per gallon
■ Temperature {≥ 100F, < 100F}
■ User needs help on Excel Charting {Yes, No}
■ User's personality {dominant, submissive}

Structuring
[Figure: Age, Gender → Exposure to Toxics, Smoking; Exposure to Toxics, Smoking, Genetic Damage → Cancer → Lung Tumor]
■ Network structure corresponding to "causality" is usually good.
■ Extending the conversation.
Do the numbers really matter?
■ Second decimal usually does not matter
■ Relative Probabilities
■ Zeros and Ones
■ Order of Magnitude: 10^-9 vs 10^-6
■ Sensitivity Analysis

Local Structure
■ Causal independence: from 2^n to n+1 parameters
■ Asymmetric assessment: similar savings in practice.
■ Typical savings (#params):
  ◆ 145 to 55 for a small hardware network;
  ◆ 133,931,430 to 8254 for CPCS!!
Course Contents
■ Concepts in Probability
■ Bayesian Networks
» Inference
■ Decision making
■ Learning networks from data
■ Reasoning over time
■ Applications

Inference
■ Patterns of reasoning
■ Basic inference
■ Exact inference
■ Exploiting structure
■ Approximate inference
Predictive Inference
[Cancer network: Age, Gender → Exposure to Toxics, Smoking → Cancer → Serum Calcium, Lung Tumor]
How likely are elderly males to get malignant cancer?
  P(C = malignant | Age > 60, Gender = male)

Combined
How likely is an elderly male patient with high Serum Calcium to have malignant cancer?
  P(C = malignant | Age > 60, Gender = male, Serum Calcium = high)
Explaining away
■ If we see a lung tumor, the probability of heavy smoking and of exposure to toxics both go up.
■ If we then observe heavy smoking, the probability of exposure to toxics goes back down.

Inference in Belief Networks
■ Find P(Q=q | E=e)
  ◆ Q the query variable
  ◆ E set of evidence variables
  P(q | e) = P(q, e) / P(e)
■ X1, …, Xn are the network variables except Q, E
  P(q, e) = Σ over x1,…,xn of P(q, e, x1, …, xn)
Basic Inference
  A → B
  P(b) = ?

Product Rule
■ P(C, S) = P(C | S) P(S)
  S⇓ C⇒   none    benign   malignant
  no      0.768   0.024    0.008
  light   0.132   0.012    0.006
  heavy   0.035   0.010    0.005
Basic Inference
  A → B → C
  P(b) = Σa P(a, b) = Σa P(b | a) P(a)
  P(c) = Σb P(c | b) P(b)
  P(c) = Σb,a P(a, b, c) = Σb,a P(c | b) P(b | a) P(a)
       = Σb P(c | b) Σa P(b | a) P(a)      (the inner sum is P(b))

Marginalization
  S⇓ C⇒   none    benign  malig   total
  no      0.768   0.024   0.008   .80
  light   0.132   0.012   0.006   .15
  heavy   0.035   0.010   0.005   .05
  total   0.935   0.046   0.019
  Row totals give P(Smoke); column totals give P(Cancer).
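A small Python sketch of this chained marginalization for a three-node chain A → B → C, with made-up CPTs:

```python
# Chain A -> B -> C: compute P(B) and P(C) by summing out upstream variables.
p_a = {0: 0.6, 1: 0.4}                                     # P(A) (illustrative)
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}   # P(B | A), indexed [a][b]
p_c_given_b = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}   # P(C | B), indexed [b][c]

p_b = {b: sum(p_b_given_a[a][b] * p_a[a] for a in p_a) for b in (0, 1)}
p_c = {c: sum(p_c_given_b[b][c] * p_b[b] for b in p_b) for c in (0, 1)}

print(p_b, p_c)
```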
Inference in trees
  Y1, Y2 → X
  P(x) = Σ over y1, y2 of P(x | y1, y2) P(y1, y2)
  because of independence of Y1, Y2:
       = Σ over y1, y2 of P(x | y1, y2) P(y1) P(y2)

Polytrees
■ A network is singly connected (a polytree) if it contains no undirected loops.
■ Theorem: Inference in a singly connected network can be done in linear time*.
  * in network size, including table sizes.
■ Main idea: in variable elimination, need only maintain distributions over single nodes.

The problem with loops
  Cloudy → Rain, Sprinkler → Grass-wet (deterministic or)
  P(c) = 0.5
  P(r | c) = 0.99    P(r | ¬c) = 0.01
  P(s | c) = 0.01    P(s | ¬c) = 0.99
  The grass is dry only if no rain and no sprinklers:
  P(¬g) = P(¬r, ¬s) ≈ 0

The problem with loops contd.
  P(g) = P(g | r, s) P(r, s) + P(g | r, ¬s) P(r, ¬s)
       + P(g | ¬r, s) P(¬r, s) + P(g | ¬r, ¬s) P(¬r, ¬s)
  With the deterministic or, every term but the last has value 1, so P(¬g) = P(¬r, ¬s).
  Treating Rain and Sprinkler as independent gives P(¬r) P(¬s) ≈ 0.5 · 0.5 = 0.25,
  but in fact P(¬g) = P(¬r, ¬s) ≈ 0.
Variable elimination
  A → B → C
  P(c) = Σb P(c | b) Σa P(b | a) P(a)
  P(A) × P(B | A)  →  P(B, A)  → (ΣA) →  P(B)
  P(B) × P(C | B)  →  P(C, B)  → (ΣB) →  P(C)

Inference as variable elimination
■ A factor over X is a function from val(X) to numbers in [0,1]:
  ◆ a CPT is a factor
  ◆ a joint distribution is also a factor
■ The BN inference problem:
  ◆ factors are multiplied to give new ones
  ◆ variables in factors are summed out
■ A variable can be summed out as soon as all factors mentioning it have been multiplied.
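A minimal sketch of the factor operations described above (multiply, then sum out), using dictionaries keyed by variable assignments; the representation is mine, not the tutorial's:

```python
from itertools import product

# A factor is (variables, table) where table maps a tuple of values to a number.
def multiply(f1, f2, domains):
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = list(dict.fromkeys(vars1 + vars2))
    table = {}
    for vals in product(*(domains[v] for v in out_vars)):
        assign = dict(zip(out_vars, vals))
        k1 = tuple(assign[v] for v in vars1)
        k2 = tuple(assign[v] for v in vars2)
        table[vals] = t1[k1] * t2[k2]
    return (out_vars, table)

def sum_out(f, var, domains):
    vars_, t = f
    out_vars = [v for v in vars_ if v != var]
    table = {}
    for vals, p in t.items():
        key = tuple(v for v, name in zip(vals, vars_) if name != var)
        table[key] = table.get(key, 0.0) + p
    return (out_vars, table)

# Example: P(B) = sum_A P(B | A) P(A) on binary variables.
domains = {"A": (0, 1), "B": (0, 1)}
fA = (["A"], {(0,): 0.6, (1,): 0.4})
fBA = (["B", "A"], {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.3, (1, 1): 0.7})
print(sum_out(multiply(fA, fBA, domains), "A", domains))
```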
Variable Elimination with loops
  P(A) × P(G) × P(S | A,G)  → (ΣG) →  P(A,S)
  P(A,S) × P(E | A)  →  P(A,E,S)  → (ΣA) →  P(E,S)
  P(E,S) × P(C | E,S)  →  P(E,S,C)  → (ΣE,S) →  P(C)
  P(C) × P(L | C)  →  P(C,L)  → (ΣC) →  P(L)
  Complexity is exponential in the size of the factors.

Join trees*
A join tree is a partially precompiled factorization.
  Clusters for the cancer network: (A,G,S) - (A,E,S) - (E,S,C) - (C,SC), (C,L)
  * aka junction trees, Lauritzen-Spiegelhalter, Hugin algorithm, …
Exploiting Structure
Noisy-or decomposition
■ Idea: explicitly decompose nodes
  Earthquake, Burglary, Truck, Wind → Alarm (noisy or)
  becomes E → E', B → B', T → T', W → W', with the primed nodes combined pairwise by
  deterministic "or" nodes into Alarm.
■ Smaller families
■ Smaller factors
■ Faster inference

Inference with continuous variables
■ Gaussian networks: polynomial time inference, regardless of network structure
■ Conditional Gaussians:
  ◆ discrete variables cannot depend on continuous
  Example: Fire, Wind Speed → Smoke Concentration,  S ~ N(aF·w + bF, σF²)
■ These techniques do not work for general hybrid networks.

Computational complexity
■ Theorem: Inference in a multi-connected Bayesian network is NP-hard.
  Boolean 3CNF formula φ = (u ∨ v ∨ w) ∧ (¬u ∨ w ∨ y)
  [Reduction: variables U, V, W, Y, each with prior probability 1/2, feed "or" nodes for the clauses, which feed an "and" node]
  Probability (output) = 1/2ⁿ · # satisfying assignments of φ
Stochastic simulation
  Burglary: P(b) = 0.03        Earthquake: P(e) = 0.001
  Alarm:    P(a | b,e) = 0.98   P(a | b,¬e) = 0.7   P(a | ¬b,e) = 0.4   P(a | ¬b,¬e) = 0.01
  Call:     P(c | a) = 0.8      P(c | ¬a) = 0.05
  Newscast: P(n | e) = 0.3      P(n | ¬e) = 0.001
  Evidence: Call = c
  Samples: assignments to (B, E, A, C, N); those inconsistent with the evidence are discarded.
  P(b | c) ≈ (# of live samples with B=b) / (total # of live samples)

Likelihood weighting
  Samples: assignments to (B, E, A, C, N), each carrying a weight
  (e.g. one sample with weight 0.8, another with weight 0.95)
  P(b | c) = (weight of samples with B=b) / (total weight of samples)
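A rough Python sketch of likelihood weighting on the burglary network above; the CPT values come from the slide, but the sampler structure and sample count are mine:

```python
import random

def estimate_p_b_given_c(n_samples=100000):
    """Estimate P(b | Call=c) by likelihood weighting."""
    num = den = 0.0
    for _ in range(n_samples):
        b = random.random() < 0.03
        e = random.random() < 0.001
        p_a = {(True, True): 0.98, (True, False): 0.7,
               (False, True): 0.4, (False, False): 0.01}[(b, e)]
        a = random.random() < p_a
        w = 0.8 if a else 0.05   # weight = P(Call=c | a), the evidence likelihood
        den += w
        if b:
            num += w
    return num / den

print(estimate_p_b_given_c())
```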
Other approaches
■ Search based techniques
  ◆ search for high-probability instantiations
  ◆ use instantiations to approximate probabilities
■ Structural approximation
  ◆ simplify network
    ■ eliminate edges, nodes
    ■ abstract node values
    ■ simplify CPTs
  ◆ do inference in simplified network

CPCS Network
[Figure: the CPCS network]
Course Contents
■ Concepts in Probability
■ Bayesian Networks
■ Inference
» Decision making
■ Learning networks from data
■ Reasoning over time
■ Applications

Decision making
■ Decisions, Preferences, and Utility functions
■ Influence diagrams
■ Value of information
Decision making
■ Decision - an irrevocable allocation of domain resources
■ Decisions should be made so as to maximize expected utility.
■ View decision making in terms of
  ◆ Beliefs/Uncertainties
  ◆ Alternatives/Decisions
  ◆ Objectives/Utilities

A Decision Problem
Should I have my party inside or outside?
  in:  dry → Regret      wet → Relieved
  out: dry → Perfect!    wet → Disaster
Value Function
■ A numerical score over all possible states of the world.
  Location?  Weather?  Value
  in         dry       $50
  in         wet       $60
  out        dry       $100
  out        wet       $0

Preference for Lotteries
  [0.25: $30,000, 0.75: $0]   ≈   [0.2: $40,000, 0.8: $0]
Desired Properties for Preferences over Lotteries
If you prefer $100 to $0 and p < q then
  [p: $100, 1-p: $0]  ≺  [q: $100, 1-q: $0]   (always)

Expected Utility
Properties of preference ⇒ existence of a function U that satisfies:
  [p1: x1, ..., pn: xn]  ≺  [q1: y1, ..., qn: yn]
  iff
  Σi pi U(xi)  <  Σi qi U(yi)
Attitudes towards risk
  Lottery l: [0.5: $1,000, 0.5: $0]
  U(l) = 0.5 · U($1,000) + 0.5 · U($0)
[Plot: utility curve over $ reward (0 ... 400, 500 ... 1,000); the certain equivalent of l (about $400) is below its expected value ($500); the gap is the insurance/risk premium, and U($500) > U(l)]
  U concave  ⇒  risk averse
  U convex   ⇒  risk seeking
  U linear   ⇒  risk neutral

Some properties of U
  [0.8: $40,000, 0.2: $0]   ≈   [$30,000 for certain]
  ⇒  U ≠ monetary payoff
Are people rational?
  Choice 1:  [0.2: $40,000, 0.8: $0]   vs   [0.25: $30,000, 0.75: $0]
    Preferring the first means 0.2 · U($40k) > 0.25 · U($30k), i.e. 0.8 · U($40k) > U($30k).
  Choice 2:  [0.8: $40,000, 0.2: $0]   vs   [1: $30,000]
    Preferring the sure $30,000 means 0.8 · U($40k) < U($30k).
  People commonly make both choices, which no utility function can satisfy.

Maximizing Expected Utility
  choose the action that maximizes expected utility
  in:  dry (0.7) → U(Regret)   = .632
       wet (0.3) → U(Relieved) = .699
  out: dry (0.7) → U(Perfect)  = .865
       wet (0.3) → U(Disaster) = 0
  EU(in)  = 0.7 · .632 + 0.3 · .699 = .652
  EU(out) = 0.7 · .865 + 0.3 · 0    = .605
  Choose in.
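The expected utility calculation on this slide in a few lines of Python:

```python
# Party problem: expected utility of each action, using the utilities from the slide.
p_weather = {"dry": 0.7, "wet": 0.3}
utility = {("in", "dry"): 0.632, ("in", "wet"): 0.699,
           ("out", "dry"): 0.865, ("out", "wet"): 0.0}

def expected_utility(action):
    return sum(p_weather[w] * utility[(action, w)] for w in p_weather)

best = max(["in", "out"], key=expected_utility)
print({a: round(expected_utility(a), 3) for a in ["in", "out"]}, "->", best)
# approximately {'in': 0.652, 'out': 0.605} -> in
```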
Multi-attribute utilities
(or: Money isn't everything)
■ Many aspects of an outcome combine to determine our preferences.
  ◆ vacation planning: cost, flying time, beach quality, food quality, …
  ◆ medical decision making: risk of death (micromort), quality of life (QALY), cost of treatment, …
■ For rational decision making, must combine all relevant factors into a single utility function.

Influence Diagrams
[Diagram: Burglary, Earthquake → Alarm → Call; Earthquake → Newscast; decision node Go Home?; chance nodes Miss Meeting, Goods Recovered, Big Sale feed a Utility node]
Decision Making with Influence Diagrams
[Same influence diagram; the decision Go Home? is made after observing Call]
  Policy:  Call?             Go Home?
           Neighbor Phoned   Yes
           No Phone Call     No
  Expected Utility of this policy is 100.

Value-of-Information
■ What is it worth to get another piece of information?
■ What is the increase in (maximized) expected utility if I make a decision with an additional piece of information?
■ Additional information (if free) cannot make you worse off.
■ There is no value-of-information if you will not change your decision.
Value-of-Information in an Influence Diagram
[Same influence diagram, with an added information arc from Newscast to Go Home?]
How much better can we do when this arc is here?

Value-of-Information is the increase in Expected Utility
  Policy:  Phonecall?  Newscast?   Go Home?
           Yes         Quake       No
           Yes         No Quake    Yes
           No          Quake       No
           No          No Quake    No
  Expected Utility of this policy is 112.5.
  Value of the Newscast information = 112.5 - 100 = 12.5.
Course Contents
■ Concepts in Probability
■ Bayesian Networks
■ Inference
■ Decision making
» Learning networks from data
■ Reasoning over time
■ Applications

Learning networks from data
■ The learning task
■ Parameter learning
  ◆ Fully observable
  ◆ Partially observable
■ Structure learning
■ Hidden variables
The learning task
  Input: training data (cases over B, E, A, C, N)      Output: BN modeling the data
■ Input: fully or partially observable data cases?
■ Output: parameters or also structure?

Parameter learning: one variable
■ Unfamiliar coin:
  ◆ Let θ = bias of coin (long-run fraction of heads)
■ If θ known (given), then
  ◆ P(X = heads | θ) = θ
■ Different coin tosses independent given θ ⇒
  P(X1, …, Xn | θ) = θ^h (1-θ)^t      (h heads, t tails)
Maximum likelihood
■ Input: a set of previous coin tosses
  ◆ X1, …, Xn = {H, T, H, H, H, T, T, H, …, H}      (h heads, t tails)
■ Goal: estimate θ
■ The likelihood P(X1, …, Xn | θ) = θ^h (1-θ)^t
■ The maximum likelihood solution is:
  θ* = h / (h + t)

Bayesian approach
■ Uncertainty about θ ⇒ a distribution P(θ) over its values
  P(X = heads) = ∫ P(X = heads | θ) P(θ) dθ = ∫ θ P(θ) dθ

Conditioning on data
  P(θ | D) ∝ P(θ) P(D | θ) = P(θ) θ^h (1-θ)^t
[Plot: prior P(θ) and the posterior after observing 1 head and 1 tail]

Good parameter distribution:
  Beta(αh, αt) ∝ θ^(αh-1) (1-θ)^(αt-1)
  * The Dirichlet distribution generalizes the Beta to non-binary variables.
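A small sketch of the Beta-Bernoulli update described above, using a hypothetical uniform prior Beta(1,1):

```python
# Beta-Bernoulli updating: the posterior after h heads and t tails is
# Beta(alpha_h + h, alpha_t + t); its mean is a smoothed estimate of theta.
alpha_h, alpha_t = 1.0, 1.0        # prior pseudo-counts (uniform prior, an assumption)

tosses = "HTHHHTTH"
h, t = tosses.count("H"), tosses.count("T")

theta_ml = h / (h + t)                                        # maximum likelihood
theta_mean = (alpha_h + h) / (alpha_h + h + alpha_t + t)      # posterior mean

print(theta_ml, theta_mean)
```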
General parameter learning
■ A multi-variable BN is composed of several independent parameters ("coins").
  Example A → B: three parameters θA, θB|a, θB|¬a
■ Can use the same techniques as the one-variable case to learn each one separately.
  Max likelihood estimate of θB|a would be:
  θ*B|a = (#data cases with b, a) / (#data cases with a)

Partially observable data
  [Data cases over B, E, A, C, N with missing entries, e.g. (b, ?, a, c, ?)]
■ Fill in missing data with "expected" value
  ◆ expected = distribution over possible values
  ◆ use "best guess" BN to estimate the distribution
Intuition
■ In the fully observable case:
  θ*n|e = (#data cases with n, e) / (#data cases with e) = Σj I(n,e | dj) / Σj I(e | dj)
  where I(e | dj) = 1 if E=e in data case dj, 0 otherwise.
■ In the partially observable case I is unknown. Best estimate for I is:
  Î(n,e | dj) = Pθ*(n,e | dj)
■ Problem: θ* unknown.

Expectation Maximization (EM)
Repeat:
■ Expectation (E) step
  ◆ Use current parameters θ to estimate the filled-in data:
    Î(n,e | dj) = Pθ(n,e | dj)
■ Maximization (M) step
  ◆ Use the filled-in data to do max likelihood estimation:
    θ̃n|e = Σj Î(n,e | dj) / Σj Î(e | dj)
Set θ := θ̃
until convergence.
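A toy EM loop in Python for a single edge E → N where E is sometimes missing; the data cases and starting parameters are made up:

```python
# EM sketch for a two-node network E -> N; e may be None (unobserved), n is always observed.
data = [(True, True), (None, True), (False, False),
        (None, False), (True, True), (None, True)]

p_e, p_n_e, p_n_note = 0.5, 0.5, 0.5        # initial parameters

for _ in range(50):
    # E step: expected indicator that E=True in each case, given the observed N.
    w = []
    for e, n in data:
        if e is not None:
            w.append(1.0 if e else 0.0)
        else:
            like_e    = p_e * (p_n_e if n else 1 - p_n_e)
            like_note = (1 - p_e) * (p_n_note if n else 1 - p_n_note)
            w.append(like_e / (like_e + like_note))
    # M step: maximum likelihood estimates from the expected counts.
    p_e = sum(w) / len(data)
    p_n_e = sum(wi for wi, (e, n) in zip(w, data) if n) / sum(w)
    p_n_note = sum(1 - wi for wi, (e, n) in zip(w, data) if n) / sum(1 - wi for wi in w)

print(round(p_e, 3), round(p_n_e, 3), round(p_n_note, 3))
```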
Structure learning
Goal: find a "good" BN structure (relative to data).
Solution: do heuristic search over the space of network structures.

Search space
■ Space = network structures
■ Operators = add/reverse/delete edges
Heuristic search
■ Use a scoring function to do heuristic search (any algorithm).
■ Greedy hill-climbing with randomness works pretty well.

Scoring
■ Fill in parameters using previous techniques & score completed networks.
■ One possibility for score:
  likelihood function: Score(B) = P(data | B)
Example: X, Y independent coin tosses
  typical data = (27 h-h, 22 h-t, 25 t-h, 26 t-t)
  Maximum likelihood network structure: X → Y
  The max. likelihood network is typically fully connected.
  This is not surprising: maximum likelihood always overfits…
Better scoring functions
■ MDL formulation: balance fit to data and model complexity (# of parameters)
  Score(B) = P(data | B) - model complexity
■ Full Bayesian formulation
  ◆ prior on network structures & parameters
  ◆ more parameters ⇒ higher dimensional space
  ◆ get the balance effect as a byproduct*
  * with a Dirichlet parameter prior, MDL is an approximation to the full Bayesian score.

Hidden variables
■ There may be interesting variables that we never get to observe:
  ◆ topic of a document in information retrieval;
  ◆ user's current task in an online help system.
■ Our learning algorithm should
  ◆ hypothesize the existence of such variables;
  ◆ learn an appropriate state space for them.
[Plot: data over attributes E1, E2, E3 - the actual data falls into clusters rather than being randomly scattered]

Bayesian clustering (Autoclass)
  naïve Bayes model: Class → E1, E2, …, En
■ (hypothetical) class variable never observed
■ if we know that there are k classes, just run EM
■ learned classes = clusters
■ Bayesian analysis allows us to choose k, trading off fit to data with model complexity
[Plot: resulting cluster distributions over E1, E2, E3]
Detecting hidden variables
■ Unexpected correlations ⇒ hidden variables.
  Hypothesized model: Cholesterolemia → Test1, Test2, Test3
  Data model: Cholesterolemia → Test1, Test2, Test3, with extra dependencies among the tests
  "Correct" model: Cholesterolemia and a hidden Hypothyroid variable → Test1, Test2, Test3

Course Contents
■ Concepts in Probability
■ Bayesian Networks
■ Inference
■ Decision making
■ Learning networks from data
» Reasoning over time
■ Applications
Reasoning over time
■ Dynamic Bayesian networks
■ Hidden Markov models
■ Decision-theoretic planning
  ◆ Markov decision problems
  ◆ Structured representation of actions
  ◆ The qualification problem & the frame problem
  ◆ Causality (and the frame problem revisited)

Dynamic environments
  State(t) → State(t+1) → State(t+2) → …
■ Markov property:
  ◆ past independent of future given current state;
  ◆ a conditional independence assumption;
  ◆ implied by the fact that there are no arcs t → t+2.
Dynamic Bayesian networks
■ State described via random variables.
■ Each variable depends only on a few others.
  [Example: variables Drunk, Velocity, Position, Weather at times t, t+1, t+2, each depending on a few variables in the previous time slice]

Hidden Markov model
■ An HMM is a simple model for a partially observable stochastic domain.
  State(t) → State(t+1) → …      (state transition model)
  State(t) → Obs(t),  State(t+1) → Obs(t+1)      (observation model)
Hidden Markov models (HMMs)
Partially observable stochastic environment:
■ Mobile robots:
  ◆ states = location
  ◆ observations = sensor input
■ Speech recognition:
  ◆ states = phonemes
  ◆ observations = acoustic signal
■ Biological sequencing:
  ◆ states = protein structure
  ◆ observations = amino acids
[Diagram: state transition graph with transition probabilities such as 0.8, 0.15, 0.05]

HMMs and DBNs
■ HMMs are just very simple DBNs.
■ Standard inference & learning algorithms for HMMs are instances of DBN algorithms:
  ◆ Forward-backward = polytree inference
  ◆ Baum-Welch = EM
  ◆ Viterbi = most probable explanation
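For concreteness, a sketch of the HMM forward pass (filtering); the transition and observation numbers below are illustrative only:

```python
# HMM forward pass: P(state_t | obs_1..t), computed left to right.
states = ["rain", "dry"]
init = {"rain": 0.5, "dry": 0.5}
trans = {"rain": {"rain": 0.7, "dry": 0.3}, "dry": {"rain": 0.3, "dry": 0.7}}
emit = {"rain": {"umbrella": 0.9, "none": 0.1}, "dry": {"umbrella": 0.2, "none": 0.8}}

def forward(observations):
    belief = dict(init)
    for obs in observations:
        # predict: push the belief through the transition model
        predicted = {s: sum(belief[p] * trans[p][s] for p in states) for s in states}
        # update: weight by the observation likelihood and renormalize
        unnorm = {s: predicted[s] * emit[s][obs] for s in states}
        z = sum(unnorm.values())
        belief = {s: unnorm[s] / z for s in states}
    return belief

print(forward(["umbrella", "umbrella", "none"]))
```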
Acting under uncertainty
Markov Decision Problem (MDP)
■ agent observes the state
  Action(t), State(t) → State(t+1) (action model);  State(t) → Reward(t)
■ Overall utility = sum of momentary rewards.
■ Allows a rich preference model, e.g.:
  rewards corresponding to "get to goal asap":  goal states +100, other states -1

Partially observable MDPs
■ agent observes Obs, not the state; Obs depends on the state
■ The optimal action at time t depends on the entire history of previous observations.
■ Instead, a distribution over State(t) suffices.
Structured representation
  Move: Position(t), Direction(t), Holding(t) → Position(t+1), Direction(t+1), Holding(t+1)
  Turn: Position(t), Direction(t), Holding(t) → Position(t+1), Direction(t+1), Holding(t+1)
  (Preconditions on the left, Effects on the right)
■ Probabilistic action model
  ◆ allows for exceptions & qualifications;
  ◆ persistence arcs: a solution to the frame problem.

Causality
■ Modeling the effects of interventions
■ Observing vs. "setting" a variable
■ A form of persistence modeling
Causal Theory
  Temperature → Distributor Cap → Car Starts
■ Cold temperatures can cause the distributor cap to become cracked.
■ If the distributor cap is cracked, then the car is less likely to start.

Setting vs. Observing
  The car does not start. Will it start if we replace the distributor?
Predicting the effects of interventions
  Temperature → Distributor Cap → Car Starts
  The car does not start. Will it start if we replace the distributor?
  What is the probability that the car will start if I replace the distributor cap?

Mechanism Nodes
  Distributor, Mstart → Start
  Mstart          Distributor   Starts?
  Always Starts   Cracked       Yes
  Always Starts   Normal        Yes
  Never Starts    Cracked       No
  Never Starts    Normal        No
  Normal          Cracked       No
  Normal          Normal        Yes
  Inverse         Cracked       Yes
  Inverse         Normal        No
Persistence
  Pre-action network: Temperature → Dist;  Dist, Mstart → Start  (Start observed Abnormal)
  Post-action network: Dist set to Normal; a persistence arc carries Mstart over from the pre-action network
  Assumption: The mechanism relating Dist to Start is unchanged by replacing the Distributor.

Course Contents
■ Concepts in Probability
■ Bayesian Networks
■ Inference
■ Decision making
■ Learning networks from data
■ Reasoning over time
» Applications
Applications
■ Medical expert systems
  ◆ Pathfinder
  ◆ Parenting MSN
■ Fault diagnosis
  ◆ Ricoh FIXIT
  ◆ Decision-theoretic troubleshooting
■ Vista
■ Collaborative filtering

Why use Bayesian Networks?
■ Explicit management of uncertainty/tradeoffs
■ Modularity implies maintainability
■ Better, flexible, and robust recommendation strategies
Pathfinder
■ Pathfinder is one of the first BN systems.
■ It performs diagnosis of lymph-node diseases.
■ It deals with over 60 diseases and 100 findings.
■ Commercialized by Intellipath and Chapman Hall publishing and applied to about 20 tissue types.

Studies of Pathfinder Diagnostic Performance
■ Naïve Bayes performed considerably better than certainty factors and Dempster-Shafer Belief Functions.
■ Incorrect zero probabilities caused 10% of cases to be misdiagnosed.
■ The full Bayesian network model with feature dependencies did best.
Commercial system: Integration
■ Expert System with advanced diagnostic capabilities
  ◆ uses key features to form the differential diagnosis
  ◆ recommends additional features to narrow the differential diagnosis
  ◆ recommends features needed to confirm the diagnosis
  ◆ explains correct and incorrect decisions
■ Video atlases and text organized by organ system
■ "Carousel Mode" to build customized lectures
■ Anatomic Pathology Information System

On Parenting: MSN
■ Diagnostic indexing for the Home Health site on the Microsoft Network
■ Enter symptoms for pediatric complaints
■ Recommends multimedia content

Original Multiple Fault Model
[Network diagram]

Single Fault approximation
[Network diagram]

On Parenting: Selecting problem
[Screenshot]

Performing diagnosis/indexing
[Screenshot]
FIXIT: Ricoh copy machine
■ RICOH Fixit: diagnostics and information retrieval

Online Troubleshooters
[Screenshots: Define Problem, Gather Information, Get Recommendations]
Vista Project: NASA Mission Control
■ Decision-theoretic methods for display for high-stakes aerospace decisions

Costs & Benefits of Viewing Information
[Plot: decision quality as a function of the quantity of relevant information displayed]

Status Quo at Mission Control
[Screenshot of telemetry displays]

Time-Critical Decision Making
■ Consideration of time delay in a temporal process
[Influence diagram: Action A at time t; Duration of Process; State of System H at t0 with evidence E1, …, En at t0; State of System H at t' with evidence E1, …, En at t'; Utility]
Simplification: Highlighting Decisions
■ Variable threshold to control the amount of highlighted information
[Screens shown at several threshold settings, with two telemetry panels:
  Oxygen 15.6/10.5, Fuel Pres 14.2/11.8, Chamb Pres 5.4/4.8, He Pres 17.7/14.7, Delta v 33.3/63.3;
  Oxygen 10.2/10.6, Fuel Pres 12.8/12.5, Chamb Pres 0.0/0.0, He Pres 15.8/15.7, Delta v 32.3/63.3]

What is Collaborative Filtering?
■ A way to find cool websites, news stories, music artists, etc.
■ Uses data on the preferences of many users, not descriptions of the content.
■ Firefly, Net Perceptions (GroupLens), and others offer this technology.
Bayesian Clustering for Collaborative Filtering
■ Probabilistic summary of the data
■ Reduces the number of parameters to represent a set of preferences
■ Provides insight into usage patterns.
■ Inference:
  P(Like title i | Like title j, Like title k)

Applying Bayesian clustering
  user classes → title1, title2, …, title n
            title1         title2         title3         …
  class1    p(like)=0.2    p(like)=0.7    p(like)=0.99   …
  class2    p(like)=0.8    p(like)=0.1    p(like)=0.01   …
  …
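A sketch of the inference named above under the class-based (naïve Bayes mixture) model; the class priors and like-probabilities are invented for illustration:

```python
from math import prod

# P(Like target | Like liked titles) under a two-class mixture model.
p_class = {"class1": 0.6, "class2": 0.4}                  # illustrative class priors
p_like = {"class1": {"t1": 0.2, "t2": 0.7, "t3": 0.99},
          "class2": {"t1": 0.8, "t2": 0.1, "t3": 0.01}}

def p_like_given_likes(target, liked):
    # Posterior over user classes given the liked titles, then predict the target title.
    post = {c: p_class[c] * prod(p_like[c][t] for t in liked) for c in p_class}
    z = sum(post.values())
    return sum((post[c] / z) * p_like[c][target] for c in p_class)

print(p_like_given_likes("t1", ["t2", "t3"]))
```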
MSNBC Story clusters
■ Readers of commerce and technology stories (36%):
  ◆ E-mail delivery isn't exactly guaranteed
  ◆ Should you buy a DVD player?
  ◆ Price low, demand high for Nintendo
■ Sports Readers (19%):
  ◆ Umps refusing to work is the right thing
  ◆ Cowboys are reborn in win over eagles
  ◆ Did Orioles spend money wisely?
■ Readers of top promoted stories (29%):
  ◆ 757 Crashes At Sea
  ◆ Israel, Palestinians Agree To Direct Talks
  ◆ Fuhrman Pleads Innocent To Perjury
■ Readers of "Softer" News (12%):
  ◆ The truth about what things cost
  ◆ Fuhrman Pleads Innocent To Perjury
  ◆ Real Astrology

Top 5 shows by user class
  Class 1: Power rangers, Animaniacs, X-men, Tazmania, Spider man
  Class 2: Young and restless, Bold and the beautiful, As the world turns, Price is right, CBS eve news
  Class 3: Tonight show, Conan O'Brien, NBC nightly news, Later with Kinnear, Seinfeld
  Class 4: 60 minutes, NBC nightly news, CBS eve news, Murder she wrote, Matlock
  Class 5: Seinfeld, Friends, Mad about you, ER, Frasier
Richer model
[Network: Age, Gender → Likes soaps, User class → Watches Seinfeld, Watches NYPD Blue, Watches Power Rangers]

What's old?
Decision theory & probability theory provide:
■ principled models of belief and preference;
■ techniques for:
  ◆ integrating evidence (conditioning);
  ◆ optimal decision making (max. expected utility);
  ◆ targeted information gathering (value of info.);
  ◆ parameter estimation from data.
What's new?
Bayesian networks exploit domain structure to allow compact representations of complex models.
[Diagram: Structured Representation at the center, supporting Knowledge Acquisition, Learning, and Inference]

Some Important AI Contributions
■ Key technology for diagnosis.
■ Better, more coherent expert systems.
■ New approach to planning & action modeling:
  ◆ planning using Markov decision problems;
  ◆ new framework for reinforcement learning;
  ◆ probabilistic solution to frame & qualification problems.
■ New techniques for learning models from data.
What's in our future?
■ Better models for:
  ◆ preferences & utilities;
  ◆ not-so-precise numerical probabilities.
■ Inferring causality from data.
■ More expressive representation languages:
  ◆ structured domains with multiple objects;
  ◆ levels of abstraction;
  ◆ reasoning about time;
  ◆ hybrid (continuous/discrete) models.