Survey							
                            
		                
		                * Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Artificial Intelligence
Chapter 20
Learning and Acting with
Bayes Nets
Biointelligence Lab
School of Computer Sci. & Eng.
Seoul National University
A Network and a Training Data
Figure 20.1 A Network and Some Sample Values
(C) 2000-2002 SNU CSE Biointelligence Lab
2
Learning Bayes Nets
The problem of learning a Bayes network is the problem
of finding a network that best matches (according to
some scoring metric) a training set of data, .
 By “finding network”, we mean finding both the structure of the
DAG and the conditional probability tables (CPTs) associated
with each node in the DAG.
Known network structure
 No missing data
 Missing data
Learning network structure
 The scoring metric
 Searching network space
(C) 2000-2002 SNU CSE Biointelligence Lab
3
Known Network Structure
If we knew the structure of the network, we have only to
find the CPTs.
No missing data
 Easy
 Each member of the training set  has a value for every variable
represented in the network.
Missing data
 More difficult
 The values of some of the variables are missing for some of the
training records.
(C) 2000-2002 SNU CSE Biointelligence Lab
4
No Missing Data
If we have an ample number of training samples, we
have only to compute sample statistics for each node and
its parents.
CPT for some node Vi given its parents P(Vi)
 There are as many tables for the node Vi as there are different
values for Vi (less one).
 In Boolean case, just one CPT for a Vi.
 If Vi have ki parent nodes, then there are 2ki entries (rows) in the
table.
^
 The sample statistics for vi and pi p(Vi  vi | Pi  p i )
 Given
by the number of samples in  having Vi = vi and Pi = pi
divided by the number of samples having Pi = pi
(C) 2000-2002 SNU CSE Biointelligence Lab
5
An Example for No Missing Data
^
p ( B  True)  0.94
^
p ( L  True)  0.68
^
p ( M  True | B  True, L  False )
1
 0.03
30
(C) 2000-2002 SNU CSE Biointelligence Lab
6
Some Notices
Some of the sample statistics in this example are based on
very small samples.
 This can lead to possibly inaccurate estimates of the corresponding
underlying probabilities.
 In general, the exponentially large number of parameters of a CPT
may overwhelm the ability of the training set to produce good
estimates of these parameters.
 Mitigating this problem is the possibility that many of the
parameters will have the same (or close to the same) value.
It is possible that before samples are observed, we may
have prior probabilities for the entries in the CPTs.
 Bayesian updating of the CPTs, given a training set, gives
appropriate weight to the prior probabilities.
(C) 2000-2002 SNU CSE Biointelligence Lab
7
Missing Data
In gathering training data to be used by a learning
process, it frequently happens that some data are missing.
 Sometimes, data are inadvertently missing.
 Sometimes, the fact that data are missing is important in itself.
The latter case is more difficult to deal with than the
former.
 In this lecture, we only deal with the former case.
(C) 2000-2002 SNU CSE Biointelligence Lab
8
An Example of Missing Data
(C) 2000-2002 SNU CSE Biointelligence Lab
9
The Weighted Sample
For the three cases (G, M, B, L) = (False, True, *, True)
 p(B|-G,M,L) could be computed with the CPTs of the network.
(Of course, there are no CPTs yet.)
 Then, each of these three examples could be replaced by two
weighted samples.
 One
in which B = True, weighted by p(B|-G,M,L)
 The other in which B = False, weighted by p(-B|-G,M,L) = 1 –
p(B|-G,M,L)
Each of the seven cases (G, M, B, L) = (*, *, True, True)
could be replaced by for weighted samples.
Now, the estimates of the CPTs could be computed with
the weighted samples and the rest of the samples.
(C) 2000-2002 SNU CSE Biointelligence Lab
10
The Expectation-Maximization (EM)
Algorithm
First, random values are selected for the parameters in
the CPTs for the entire network.
Secondly, the needed weights are computed.
Thirdly, these weights are in turn used to estimate new
CPTs.
Then, the second step and the third step are iterated until
the CPTs converge.
(C) 2000-2002 SNU CSE Biointelligence Lab
11
Learning Network Structure
If the network structure is not known, we must then
attempt to find that structure, as well as its associated
CPTs, that best fits the training data.
The scoring metric
 To score candidate networks
Searching among possible structures
(C) 2000-2002 SNU CSE Biointelligence Lab
12
The Scoring Metric
Several measures can be used to score competing
networks.
 One is based on a description length.
The idea based on the description length
 Suppose we wanted to transmit the training set, , to someone.
 To do so, we encode the values of the variables into a string of
bits, and send the bits.
Efficient codes take advantage of the statistical properties
of the data to be sent, and it is these statistical properties
that we are attempting to model in the Bayes network.
The best encoding requires L(,B) bits
L(, B)   log p[]
(C) 2000-2002 SNU CSE Biointelligence Lab
13
Minimum Description Length
Given some particular data, , we might to try to find the
network B0 that minimizes L(,B).
log p[] ( consists of m samples v1, …, vm.)
p()  i 1 p( vi ),
m
 log p()  i 1 log p( vi )
m
 Given a network structure and a training set, the CPTs that
minimize L(,B) are just those that are obtained from the
sample statistics computed from .
L(,B) alone favors large networks with many arcs.
 In order to transmit , we must also transmit a description of B
so that the receiver will be able to decode the message.
L' (, B)  i 1 log p( v i ) 
m
| B | log m
2
(C) 2000-2002 SNU CSE Biointelligence Lab
14
An Example for the Network Score
p ( firstentry)
 p(G | B) p( M | B, L) p( B) p( L)
 0.569
...
L(, B)  196.68
| B | log 100 8  log 100
 26.58
2
2
L' (, B)  196.68  26.58  223.26
(C) 2000-2002 SNU CSE Biointelligence Lab
15
Searching Network Space
The set of all possible Bayes Nets is so large that we
could not even contemplate any kind of exhaustive
search.
Hill-descending or greedy search
 We start with a given initial network, evaluate L’(,B), and then
make small changes to it to see if these changes produce
networks that decrease L’(,B).
The computation of description length is decomposable
into the computations over each CPT in the network.
(C) 2000-2002 SNU CSE Biointelligence Lab
16
An Example of Structural Learning
(1/2)
Target network generates
training data.
(C) 2000-2002 SNU CSE Biointelligence Lab
17
An Example of Structural Learning
(2/2)
Induced network
learned from
prior network
and training data
(C) 2000-2002 SNU CSE Biointelligence Lab
18
Hidden Nodes
The description-length score of the network on the right
will be better if this one also does as well or better at
fitting the data.
Hidden nodes can be added in the search process and the
values of the corresponding hidden variables are missing,
so the EM algorithm is used.
(C) 2000-2002 SNU CSE Biointelligence Lab
19
Probabilistic Inference and Action
The general setting
 An agent that uses a sense/plan/act cycle
 A goal
 A schedule
of rewards that are given in certain environmental
states.
 The rewards induce a value for each state in terms of the total
discounted future reward that would be realized by an agent that
acted so as to maximize its reward.
 Our new agent knows only the probabilities that it is in various
states.
 An action taken in a given state might lead to any one of a set of
new states-with a probability associated with each.
 Through
planning and sensing, an agent selects the action that
maximizes its expected utility.
(C) 2000-2002 SNU CSE Biointelligence Lab
20
An Extended Example
 E: a state variable {-2, -1, 0, 1, 2}
 Each location has a utility U.
 E0 = 0
 Ai: the action at the i-th time step {L, R}
A successful move 0.5; no effect 0.25; an opposite move 0.25
 Si: the sensory signal at the i-th time step
 The same value with Ei 0.9; Each of the other values 0.025
(C) 2000-2002 SNU CSE Biointelligence Lab
21
Dynamic Decision Networks (1/2)
(C) 2000-2002 SNU CSE Biointelligence Lab
22
Dynamic Decision Networks (2/2)
A special type of belief network
After given the values E0 = 0, A0 = R, and S1 = 1, we can
use ordinary probabilistic inference to calculate the
expected utility value, U2, that would result first from A1
= R, and then from A1 = L.
Box-shaped nodes (): decision nodes
Diamond-shaped nodes (): utility variables
(C) 2000-2002 SNU CSE Biointelligence Lab
23
Computation of Ex[U2] (1/2)
The environment is Markovian by this network structure.
Ex[U2|E0 = 0, A0 = R, S1 = 1, A1 = R]
Ex[U2|E0 = 0, A0 = R, S1 = 1, A1 = L]
Using the polytree algorithm
p ( E2 | E0  0, A0  R, S1  1, A1  R)
  p ( E2 | E0 0, A0  R, S1  1, A1  R, E1 ) p ( E1 | E0  0, A0  R, S1  1, A1  R)
E1
  p ( E2 | A1  R, E1 ) p( E1 | E0  0, A0  R, S1  1)
E1
(C) 2000-2002 SNU CSE Biointelligence Lab
24
Computation of Ex[U2] (2/2)
p( E1 | E0  0, A0  R, S1  1)  kp( S1  1 | E0  0, A0  R, E1 ) p( E1 | E0  0, A0  R)
 kp( S1  1 | E1 ) p( E1 | E0  0, A0  R)
p( E2 | E0  0, A0  R, S1  1, A1  R)
 k  p( E2 | A1  R, E1 ) p( S1  1 | E1 ) p( E1 | E0  0, A0  R)
E1
•With this probability, the Ex[U2] given A1=R can be calculated.
• Similarly, Ex[U2] given A1=L can be calculated.
•Then the action that yields the larger value is selected.
(C) 2000-2002 SNU CSE Biointelligence Lab
25
Generalizing the Example
p( E2 | values before t  1 , S1  1, A1 )
 k  p( E2 | A1 , E1 ) p( S1  1 | E1 ) p( E1 | values before t  1 )
E1
p( Ei 1 | values before t  i , Si  si , Ai )
 k  p( Ei 1 | Ai , Ei ) p( Si  si | Ei ) p( Ei | values before t  i )
Ei
(C) 2000-2002 SNU CSE Biointelligence Lab
26
Making Decisions about Actions (1/2)
1. From the last time step, (i - 1) (and after sensing Si – 1 = si - 1),
we have already calculated p(Ei|<values before t = i>) for all
values of Ei.
2. At time t = i, we sense Si = si and use the sensor model to
calculate p(Si = si|Ei) for all values of Ei.
3. From the action model, we calculate p(Ei + 1|Ai, Ei) for all
values of Ei and Ai.
4. For each value of Ai, and for a particular value of Ei + 1, we
sum the product p(Ei + 1|Ai, Ei)p(Si = si|Ei)p(Ei|<values before
t = i>) over all values Ei and multiply by a constant, k, to
yield values proportional to p(Ei + 1|<values before t = i>, Si =
si, Ai).
(C) 2000-2002 SNU CSE Biointelligence Lab
27
Making Decisions about Actions (2/2)
5. We repeat the preceding step for all the other values of Ei+1
and calculate the constant k to get the actual values of
p(Ei+1|<values before t = i>, Si = si, Ai) for each value of Ei+1
and Ai.
6. Using these probability values, we calculate the expected
value of Ui+1 for each value of Ai, and select that Ai that
maximizes that expected value.
7. We take the action selected in the previous step, advance i by
1, and iterate.
(C) 2000-2002 SNU CSE Biointelligence Lab
28
Additional Readings and Discussion
Learning Bayes nets is an active field of research with
important new papers appearing annually.
[Neal 1991] describes methods for learning Bayes nets
using neural networks.
[Friedman 1997] describes a technique for learning
Bayes nets when both the structure of the network is
unknown and when there is missing data.
The evaluation of utilities in stochastic situation
constitutes the subject matter of decision theory.
(C) 2000-2002 SNU CSE Biointelligence Lab
29