Master of Science in Artificial Intelligence, 2010-2012
Knowledge Representation
and Reasoning
University "Politehnica" of Bucharest
Department of Computer Science
Fall 2010
Adina Magda Florea
http://turing.cs.pub.ro/krr_10
curs.cs.pub.ro
Lecture 11
Uncertain representation of knowledge
Lecture outline
• Uncertain knowledge
• Belief networks
• Bayesian prediction
1. Uncertain knowledge
• Probability theory has 2 main interpretations:
• Statistical = a measure of the proportion of individuals (the long-range frequency of a set of events)
  Prob. of a bird flying = the proportion of birds that fly out of the set of all birds
• Personal, subjective, or Bayesian = an agent's measure of belief in some proposition, based on the agent's knowledge
  Prob. of a bird flying = the agent's measure of belief in the flying ability of an individual, based on the knowledge that the individual is a bird
• Can be viewed as a measure over all the worlds that are possible, given the agent's knowledge about a particular situation (in each possible world, the bird either flies or it does not)
Bayesian probability
• Both views have the same calculus
• We adopt the second view
• We assume uncertainty is epistemological (pertaining to the agent's knowledge about the world) rather than ontological (how the world is)

Semantics of (prior) probability
• Interpretations are given on possible worlds
• They specify not only the truth of formulas but also how likely each possible world is to be the real world
• Modal logics – possible worlds + an accessibility relation
• Probabilities – possible worlds + a measure on the possible worlds
Semantics of probability
• A possible world is an assignment of exactly one value to every random variable.
• Let W be the set of all possible worlds. If w ∈ W and f is a formula, "f is true in w" (written w |= f) is defined inductively on the structure of f:
  • w |= X=v iff w assigns value v to X
  • w |= ¬f iff w |≠ f
  • w |= f ∧ g iff w |= f and w |= g
  • w |= f ∨ g iff w |= f or w |= g
• Associated with each possible world is a measure. When there are only a finite number of worlds:
  0 ≤ p(w) for all w ∈ W
  Σw∈W p(w) = 1
Semantics of probability
• The probability of a formula f is the sum of the measures of the possible worlds in which f is true:
  P(f) = Σw |= f p(w)

Semantics of conditional probability
• A formula e representing the conjunction of all the agent's observations of the world is called evidence
• The measure of belief in a formula h based on the formula e is called the conditional probability of h given e, P(h|e)
• Evidence e rules out all possible worlds that are incompatible with e
Semantics of probability
• Evidence e induces a new measure pe over the possible worlds, in which all worlds where e is false have measure 0 and the remaining worlds are renormalized so that their measures sum to 1:
  pe(w) = p(w)/P(e)  if w |= e
  pe(w) = 0          if w |≠ e
• Then:
  P(h|e) = Σw |= h pe(w) = (Σw |= h∧e p(w)) / P(e) = P(h ∧ e) / P(e)
• We assume P(e) > 0. If P(e) = 0 then e is false in all possible worlds and thus cannot be observed.
• Chain rule
  P(f1 ∧ … ∧ fn) = P(f1) × P(f2|f1) × … × P(fn|f1 ∧ … ∧ fn−1)
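These definitions can be checked with a small possible-worlds model. The two Boolean variables below (Bird, Flies) and their world measures are invented for illustration; the functions are a minimal sketch of P(f) and P(h|e):

```python
# Invented example: worlds over two Boolean variables (Bird, Flies).
# p(w) is the measure of each possible world; the measures sum to 1.
p = {
    (True, True): 0.3,    # a bird that flies
    (True, False): 0.1,   # a bird that does not fly
    (False, True): 0.1,
    (False, False): 0.5,
}

def prob(f):
    """P(f) = sum of the measures of the worlds in which f is true."""
    return sum(pw for w, pw in p.items() if f(w))

def cond(h, e):
    """P(h|e) = P(h and e) / P(e), assuming P(e) > 0."""
    return prob(lambda w: h(w) and e(w)) / prob(e)

bird = lambda w: w[0]
flies = lambda w: w[1]

print(round(prob(bird), 2))         # 0.4
print(round(cond(flies, bird), 2))  # 0.75
# Chain rule: P(bird and flies) = P(bird) * P(flies|bird)
assert abs(prob(lambda w: bird(w) and flies(w)) - prob(bird) * cond(flies, bird)) < 1e-12
```

Conditioning on bird simply renormalizes the measures of the two worlds where the individual is a bird, exactly as in the pe definition above.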
Bayes theorem
• Given the current belief in a proposition H based on evidence K, P(H|K), we observe E.
  P(H|E ∧ K) = P(E|H ∧ K) × P(H|K) / P(E|K)
• If the background knowledge K is left implicit:
  P(H|E) = P(E|H) × P(H) / P(E)
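A quick numeric check of the implicit-background form; the prior and the two likelihoods are hypothetical values chosen for illustration:

```python
# Hypothetical numbers for illustration: prior P(H) = 0.01,
# likelihoods P(E|H) = 0.9 and P(E|not H) = 0.2.
p_h = 0.01
p_e_given_h = 0.9
p_e_given_not_h = 0.2

# P(E) by the rule of total probability, then Bayes' theorem
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
p_h_given_e = p_e_given_h * p_h / p_e

print(round(p_h_given_e, 4))  # 0.0435: observing E raises the belief in H from 0.01
```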
Independence assumptions
• Independence: knowledge of the truth of one proposition does not affect the belief in another
• A random variable X is independent of a random variable Y given a random variable Z if, for all values of the random variables (i.e., ai, bj, ck):
  P(X=ai | Y=bj ∧ Z=ck) = P(X=ai | Z=ck)
• Knowledge of Y's value does not affect the belief in the value of X, given the value of Z.
2. Belief networks
• A BN (Belief Network or Bayesian Network) is a graphical representation of conditional independence
• It is represented as a Directed Acyclic Graph (DAG)
• The nodes represent random variables.
• The edges represent direct dependence among the variables.
• X → Y: X has a direct influence on Y (represents a statistical dependence)
• X = Parent(Y) if X → Y
• X = Ancestor(Y) if there is a directed path from X to Y (X → … → Y)
• Z = Descendant(Y) if Z = Y or there is a directed path from Y to Z (Y → … → Z)
BN
The independence assumption embedded in a BN is:
• Each random variable is independent of its nondescendants, given its parents
  Y1,…,Yn – the parents of X
  P(X=a | Y1=v1 ∧ … ∧ Yn=vn ∧ R) = P(X=a | Y1=v1 ∧ … ∧ Yn=vn)
  if R does not involve X or any descendant of X
• The number of probabilities that needs to be specified for each variable is exponential in the number of parents of that variable
• The BN contains a set of conditional probability tables (CPTs) giving P(X=a | Y1=v1 ∧ … ∧ Yn=vn)
BN
• Therefore a BN defines a Joint Probability Distribution (JPD) over the variables in the network
• A value of the JPD can be computed as:
  P(X1=x1 ∧ … ∧ Xn=xn) = Πi=1,n P(Xi=xi | parents(Xi))
  where parents(Xi) represents the specific values of Parents(Xi)
• By the chain rule:
  P(X1=x1 ∧ … ∧ Xn=xn) = P(x1,…,xn) = P(xn | xn−1,…,x1) × P(xn−1,…,x1) = … = Πi=1,n P(xi | xi−1,…,x1)
• Order of the variables in the BN:
  P(Xi | Xi−1,…,X1) = P(Xi | Parents(Xi)), provided that Parents(Xi) ⊆ {Xi−1,…,X1}
BN
• P(Xi | Xi−1,…,X1) = P(Xi | Parents(Xi)), provided that Parents(Xi) ⊆ {Xi−1,…,X1}
• A BN is a correct representation of the domain, provided that each node is conditionally independent of its predecessors, given its parents
The fire-alarm belief network has variables Tampering, Fire, Alarm, Smoke, Leaving, and Report, with edges Tampering → Alarm, Fire → Alarm, Fire → Smoke, Alarm → Leaving, Leaving → Report, and the conditional probability tables:

  P(Tampering) = 0.02
  P(Fire) = 0.01

  Tampering  Fire  | P(Alarm)
  T          T     | 0.5
  T          F     | 0.85
  F          T     | 0.99
  F          F     | 0.0001

  Fire | P(Smoke)      Alarm | P(Leaving)      Leaving | P(Report)
  T    | 0.9           T     | 0.88            T       | 0.75
  F    | 0.01          F     | 0.001           F       | 0.01

Instead of computing the joint distribution of all the variables by the chain rule,
  P(T,F,A,S,L,R) = P(T) × P(F|T) × P(A|F,T) × P(S|A,F,T) × P(L|S,A,F,T) × P(R|L,S,A,F,T)
the BN defines a unique JPD in factored form, i.e.
  P(T,F,A,S,L,R) = P(T) × P(F) × P(A|T,F) × P(S|F) × P(L|A) × P(R|L)
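Taking P(Tampering) = 0.02 and P(Fire) = 0.01 (the priors quoted with the inference results below), the factored JPD can be sketched directly in code:

```python
from itertools import product

# CPTs of the fire-alarm network
P_T = 0.02                                      # P(Tampering=true)
P_F = 0.01                                      # P(Fire=true)
P_A = {(True, True): 0.5, (True, False): 0.85,  # P(Alarm=true | Tampering, Fire)
       (False, True): 0.99, (False, False): 0.0001}
P_S = {True: 0.9, False: 0.01}                  # P(Smoke=true | Fire)
P_L = {True: 0.88, False: 0.001}                # P(Leaving=true | Alarm)
P_R = {True: 0.75, False: 0.01}                 # P(Report=true | Leaving)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(t, f, a, s, l, r):
    """P(T,F,A,S,L,R) = P(T) P(F) P(A|T,F) P(S|F) P(L|A) P(R|L)."""
    return (bern(P_T, t) * bern(P_F, f) * bern(P_A[(t, f)], a)
            * bern(P_S[f], s) * bern(P_L[a], l) * bern(P_R[l], r))

# The factored JPD sums to 1 over all 2**6 possible worlds
total = sum(joint(*w) for w in product([True, False], repeat=6))
print(round(total, 10))  # 1.0
```

Only 1 + 1 + 4 + 2 + 2 + 2 = 12 numbers are stored, instead of the 2^6 − 1 = 63 needed for an unfactored joint table.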
Inferences
• The probability of a variable given its nondescendants can be computed using the "reasoning by cases" rule:
  P(L|S) = P(L|A,S) × P(A|S) + P(L|¬A,S) × (1 − P(A|S)) = P(L|A) × P(A|S) + P(L|¬A) × (1 − P(A|S))
  P(A|S) = P(A|F,T) × P(F,T|S) + P(A|F,¬T) × P(F,¬T|S) + P(A|¬F,T) × P(¬F,T|S) + P(A|¬F,¬T) × P(¬F,¬T|S)
• The right-hand side of each product can be computed using the multiplicative rule:
  P(F,T|S) = P(F|T,S) × P(T|S) = P(F|T,S) × P(T)
• For computing P(F|T,S) we cannot use the independence assumption, because S is a descendant of F; we can use Bayes' rule instead:
  P(F|T,S) = P(S|F,T) × P(F|T) / P(S|T) = P(S|F) × P(F) / P(S|T)
Inferences
• The prior probabilities (with no evidence) of each variable are:
  P(Tampering) = 0.02
  P(Fire) = 0.01
  P(Report) = 0.028
  P(Smoke) = 0.0189
• Observing Report gives:
  P(Tampering|Report) = 0.399
  P(Fire|Report) = 0.2305
  P(Smoke|Report) = 0.215
• The probabilities of both Tampering and Fire are increased by the Report
• Because Fire is more probable, so is Smoke
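These priors and posteriors can be reproduced by brute-force enumeration of all 2^6 possible worlds; the sketch below uses P(Tampering) = 0.02 and P(Fire) = 0.01 together with the other CPT values of the network:

```python
from itertools import product

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(t, f, a, s, l, r):
    """Factored JPD of the fire-alarm network (CPT values from the slides)."""
    p_alarm = {(True, True): 0.5, (True, False): 0.85,
               (False, True): 0.99, (False, False): 0.0001}[(t, f)]
    return (bern(0.02, t) * bern(0.01, f) * bern(p_alarm, a)
            * bern(0.9 if f else 0.01, s)
            * bern(0.88 if a else 0.001, l)
            * bern(0.75 if l else 0.01, r))

def prob(pred):
    """P(pred) = sum of joint probabilities of the worlds satisfying pred."""
    return sum(joint(*w) for w in product([True, False], repeat=6) if pred(*w))

def cond(h, e):
    """P(h|e) = P(h and e) / P(e)."""
    return prob(lambda *w: h(*w) and e(*w)) / prob(e)

tamper = lambda t, f, a, s, l, r: t
fire   = lambda t, f, a, s, l, r: f
smoke  = lambda t, f, a, s, l, r: s
report = lambda t, f, a, s, l, r: r

print(round(prob(report), 3))          # 0.028
print(round(cond(tamper, report), 3))  # 0.399
print(round(cond(fire, report), 4))    # 0.2305
print(round(cond(smoke, report), 3))   # 0.215
```

This is exponential in the number of variables, which is exactly why the structure-exploiting methods described later matter for larger networks.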
Inferences
• Suppose instead that Smoke is observed:
  P(Tampering|Smoke) = 0.02
  P(Fire|Smoke) = 0.476
  P(Report|Smoke) = 0.320
• Note that the probability of Tampering is not affected by observing Smoke; however, the probabilities of Report and Fire are increased
• Suppose that both Report and Smoke are observed:
  P(Tampering|Report, Smoke) = 0.0284
  P(Fire|Report, Smoke) = 0.964
• Thus, observing both makes Fire more likely
• However, in the context of Report, the presence of Smoke makes Tampering less likely.
Inferences
• Suppose instead that there is a Report but no Smoke:
  P(Tampering|Report, ¬Smoke) = 0.501
  P(Fire|Report, ¬Smoke) = 0.0294
• In the context of Report, Fire becomes much less likely, so the probability of Tampering increases to explain the Report.
Determining posterior distributions
• Problem = computing conditional probabilities given the evidence
• Estimating posterior probabilities in a BN to within an absolute error (of less than 0.5) is NP-hard
• There are 3 main approaches:
(1) Exploit the structure of the network
• Clique tree propagation method – the network is transformed into a tree whose nodes are labeled with sets of variables; reasoning is performed by passing messages between the nodes of the tree
• Time complexity is linear in the number of nodes of the tree, but each node of the tree stands for a set of variables, so the size of the tree may be exponential in the size of the belief network
Determining posterior distributions
(2) Search-based approaches
• Enumerate some of the possible worlds and estimate posterior probabilities from the worlds generated
(3) Stochastic simulation
• Random cases are generated according to a probability distribution. By treating these cases as a set of samples, one can estimate the marginal distribution of any combination of variables.
A structure-based method
• Based on the notion that a BN specifies a factorization of the JPD
• A factor is a representation of a function from a tuple of random variables into a number
• If f(X1,…,Xn) is a factor, then X1,…,Xn are the variables of the factor, and f is said to be a factor on X1,…,Xn
• If f(X1,…,Xn) is a factor and each vi is an element of the domain of Xi, then f(X1=v1,…,Xn=vn) is a number: the value of f when each Xi has the value vi
A structure-based method
• The product of two factors f1 and f2 is a factor on the union of their variables:
  (f1 × f2)(X1,…,Xi, Y1,…,Yj, Z1,…,Zk) = f1(X1,…,Xi, Y1,…,Yj) × f2(Y1,…,Yj, Z1,…,Zk)
• Given a factor f(X1,…,Xi), one can sum out a variable, say X1, and the result is a factor on X2,…,Xi:
  (ΣX1 f)(X2,…,Xi) = f(X1=v1,…,Xi) + … + f(X1=vk,…,Xi)
  where v1,…,vk are the values in the domain of X1
• A conditional probability distribution can be seen as a factor:
  f(X=u, Y1=v1, …, Yj=vj) = P(X=u | Y1=v1 ∧ … ∧ Yj=vj)
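A minimal sketch of these two factor operations, with a factor stored as a pair (variable list, table keyed by Boolean value tuples); the representation and the example CPTs are illustrative, not from the slides:

```python
from itertools import product

def factor_product(f1, f2):
    """(f1 x f2) is a factor on the union of the variables of f1 and f2."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    table = {}
    for assign in product([True, False], repeat=len(out_vars)):
        a = dict(zip(out_vars, assign))
        k1 = tuple(a[v] for v in vars1)
        k2 = tuple(a[v] for v in vars2)
        table[assign] = t1[k1] * t2[k2]
    return (out_vars, table)

def sum_out(f, var):
    """Sum out `var`, giving a factor on the remaining variables."""
    vars_, t = f
    i = vars_.index(var)
    out_vars = vars_[:i] + vars_[i + 1:]
    table = {}
    for k, val in t.items():
        nk = k[:i] + k[i + 1:]
        table[nk] = table.get(nk, 0.0) + val
    return (out_vars, table)

# P(A) and P(B|A) as factors; summing A out of their product gives P(B)
fA = (["A"], {(True,): 0.3, (False,): 0.7})
fBA = (["A", "B"], {(True, True): 0.9, (True, False): 0.1,
                    (False, True): 0.2, (False, False): 0.8})
fB = sum_out(factor_product(fA, fBA), "A")
print(round(fB[1][(True,)], 2))  # P(B=true) = 0.3*0.9 + 0.7*0.2 = 0.41
```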
A structure-based method
• The BN inference problem = computing the posterior distribution of a variable given some evidence
• It can be reduced to the problem of computing the probabilities of conjunctions
• Given the evidence Y1=v1, …, Yj=vj and the query variable Z:
  P(Z | v1,…,vj) = P(Z, v1,…,vj) / P(v1,…,vj) = P(Z, v1,…,vj) / Σz P(z, v1,…,vj)
• => compute the factor P(Z, v1,…,vj) and normalize
A structure-based method
• The variables of the BN are X1,…,Xn.
• To compute the factor P(Z, v1,…,vj) we must sum out the other variables from the JPD.
• Let Z1,…,Zk be an enumeration of the other variables in the BN:
  {Z1,…,Zk} = {X1,…,Xn} − {Z} − {Y1,…,Yj}
• The factor can be computed by summing out the Zi
• The order in which the Zi are summed out is called an elimination ordering
  P(Z, Y1=v1,…,Yj=vj) = ΣZk … ΣZ1 P(X1,…,Xn)Y1=v1,…,Yj=vj
A structure-based method
• P(Z, Y1=v1,…,Yj=vj) = ΣZk … ΣZ1 P(X1,…,Xn)Y1=v1,…,Yj=vj
• There is a possible world for each assignment of a value to each variable
• The JPD P(X1,…,Xn) gives the probability (measure) of each possible world
• The approach selects the worlds with the observed values for the Y's and sums over the possible worlds with the same value for Z => this is in fact the definition of conditional probability
A structure-based method
• By the chain rule for conjunctions and the definition of a BN:
  P(X1,…,Xn) = P(X1|Parents(X1)) × … × P(Xn|Parents(Xn))
• Now the BN inference problem is reduced to the problem of summing out a set of variables from a product of factors
• To compute the posterior distribution of a query variable given observations:
  1. Construct the JPD in terms of a product of factors
  2. Set the observed variables to their observed values
  3. Sum out each of the other variables (the Z1,…,Zk)
  4. Multiply the remaining factors and normalize
A structure-based method
• To sum out a variable Z from a product f1 × … × fk of factors:
• First partition the factors into those that do not contain Z, say f1,…,fi, and those that contain Z, say fi+1,…,fk
• Then:
  ΣZ f1 × … × fk = f1 × … × fi × (ΣZ fi+1 × … × fk)
• Then explicitly construct a representation (as a multidimensional array, a tree, or a set of rules) of the rightmost factor
• The size of a factor is exponential in the number of its variables
3. Bayesian prediction
There are 5 possible bags of candies:
  h1: 100% cherry
  h2: 75% cherry, 25% lime
  h3: 50% cherry, 50% lime
  h4: 25% cherry, 75% lime
  h5: 100% lime
H (the set of hypotheses) – the type of the bag, with values h1 … h5
Collect evidence (random variables): d1, d2, …, with possible values cherry or lime
Goal: predict the flavour of the next candy
Bayesian prediction
Let D be the data, with observed value d.
The probability of each hypothesis, based on Bayes' rule, is:
  P(hi|d) = α P(d|hi) P(hi)    (1)
The prediction for an unknown quantity X is:
  P(X|d) = Σi P(X|hi) P(hi|d)    (2)
• Key elements: the prior probabilities P(hi) and the probability of the evidence under each hypothesis, P(d|hi); assuming the observations are independent given the hypothesis:
  P(d|hi) = Πj P(dj|hi)    (3)
We assume the prior probabilities:
  h1: 0.1, h2: 0.2, h3: 0.4, h4: 0.2, h5: 0.1
Suppose the first candy drawn is lime. Applying (1) with the priors above:
  P(lime) = 0.1×0 + 0.2×0.25 + 0.4×0.5 + 0.2×0.75 + 0.1×1 = 0.5
  α = 1/0.5 = 2
  P(h1|lime) = α P(lime|h1) P(h1) = 2 × (0 × 0.1) = 0
  P(h2|lime) = α P(lime|h2) P(h2) = 2 × (0.25 × 0.2) = 0.1
  P(h3|lime) = α P(lime|h3) P(h3) = 2 × (0.5 × 0.4) = 0.4
  P(h4|lime) = α P(lime|h4) P(h4) = 2 × (0.75 × 0.2) = 0.3
  P(h5|lime) = α P(lime|h5) P(h5) = 2 × (1 × 0.1) = 0.2
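This update (and the two-candy case that follows) can be reproduced with a short script; the `posterior` helper is a sketch of equations (1) and (3):

```python
# Candy-bag example: priors P(hi) and per-draw lime probability P(lime|hi)
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def posterior(observations):
    """Equations (1) and (3): P(hi|d) = alpha * P(d|hi) * P(hi)."""
    unnorm = []
    for prior, pl in zip(priors, p_lime):
        likelihood = 1.0
        for obs in observations:
            likelihood *= pl if obs == "lime" else 1.0 - pl
        unnorm.append(likelihood * prior)
    alpha = 1.0 / sum(unnorm)
    return [alpha * u for u in unnorm]

print([round(x, 4) for x in posterior(["lime"])])
# approx [0, 0.1, 0.4, 0.3, 0.2]
print([round(x, 4) for x in posterior(["lime", "lime"])])
# approx [0, 0.0385, 0.3077, 0.3462, 0.3077]
```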
Suppose the second candy drawn is also lime. Applying (1) and (3):
  P(lime,lime) = 0.1×0 + 0.2×0.25×0.25 + 0.4×0.5×0.5 + 0.2×0.75×0.75 + 0.1×1×1 = 0.325
  α = 1/0.325 ≈ 3.0769
  P(h1|lime,lime) = α P(lime,lime|h1) P(h1) = 3.0769 × (0 × 0 × 0.1) = 0
  P(h2|lime,lime) = α P(lime,lime|h2) P(h2) = 3.0769 × (0.25 × 0.25 × 0.2) ≈ 0.0385
  P(h3|lime,lime) = α P(lime,lime|h3) P(h3) = 3.0769 × (0.5 × 0.5 × 0.4) ≈ 0.3077
  P(h4|lime,lime) = α P(lime,lime|h4) P(h4) = 3.0769 × (0.75 × 0.75 × 0.2) ≈ 0.3462
  P(h5|lime,lime) = α P(lime,lime|h5) P(h5) = 3.0769 × (1 × 1 × 0.1) ≈ 0.3077
[Figure: the posterior probabilities P(hi|d1,…,d10) from equation (1), plotted as successive candies are observed]
Bayesian prediction
Applying (2), the probability that the second candy is lime, given that the first one was lime:
  P(d2=lime|d1=lime) = P(d2|h1)×P(h1|d1) + P(d2|h2)×P(h2|d1) + P(d2|h3)×P(h3|d1) + P(d2|h4)×P(h4|d1) + P(d2|h5)×P(h5|d1)
  = 0×0 + 0.25×0.1 + 0.5×0.4 + 0.75×0.3 + 1×0.2 = 0.65
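Equation (2) and the convergence toward the true hypothesis can be sketched as follows, assuming for illustration that the true bag is h5, so every observed candy is lime:

```python
# Candy-bag hypotheses: priors P(hi) and lime probability P(lime|hi)
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def posterior_after_limes(n):
    """P(hi | n lime candies observed), by equations (1) and (3)."""
    unnorm = [prior * pl ** n for prior, pl in zip(priors, p_lime)]
    alpha = 1.0 / sum(unnorm)
    return [alpha * u for u in unnorm]

def predict_lime(n):
    """Equation (2): P(next = lime | n limes) = sum_i P(lime|hi) P(hi|d)."""
    return sum(pl * ph for pl, ph in zip(p_lime, posterior_after_limes(n)))

print(round(predict_lime(1), 4))  # 0.65, as computed above
for n in (0, 1, 3, 5, 10):
    print(n, round(posterior_after_limes(n)[4], 3))  # P(h5 | n limes) grows toward 1
```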
Remarks
• The true hypothesis will eventually dominate the prediction
• There are problems if the hypothesis space is large
• Approximation:
  • Prediction based only on the most probable hypothesis
  • MAP learning – maximum a posteriori
  • P(X|d) ≈ P(X|hMAP)
• In the example, hMAP = h5 after 3 observations, so the MAP prediction of lime is 1.0
• As more data is collected, the MAP and Bayesian predictions tend to become closer