Master of Science in Artificial Intelligence, 2010-2012
Knowledge Representation
and Reasoning
University "Politehnica" of Bucharest
Department of Computer Science
Fall 2010
Adina Magda Florea
http://turing.cs.pub.ro/krr_10
curs.cs.pub.ro
Lecture 11
Uncertain representation of knowledge
Lecture outline
• Uncertain knowledge
• Belief networks
• Bayesian prediction
1. Uncertain knowledge
• Probability theory has 2 main interpretations:
• Statistical = a measure of the proportion of individuals (the long-range frequency of a set of events)
  Prob. of a bird flying = the proportion of birds that fly out of the set of all birds
• Personal, subjective, or Bayesian = an agent's measure of belief in some proposition, based on the agent's knowledge
  Prob. of a bird flying = the agent's measure of belief in the flying ability of an individual, based on the knowledge that the individual is a bird
• Can be viewed as a measure over all the worlds that are possible, given the agent's knowledge about a particular situation (in each possible world, the bird either flies or it does not)
Bayesian probability
• Both views have the same calculus
• We adopt the second view
• We assume uncertainty is epistemological (pertaining to the agent's knowledge about the world) rather than ontological (how the world is)

Semantics of (prior) probability
• Interpretations are given on possible worlds
• They specify not only the truth of formulas but also how likely each possible world is to be the real world
• Modal logics – possible worlds + an accessibility relation
• Probabilities – possible worlds + a measure on the possible worlds
Semantics of probability
• A possible world is an assignment of exactly one value to every random variable.
• Let W be the set of all possible worlds. If w ∈ W and f is a formula, "f is true in w" (written w |= f) is defined inductively on the structure of f:
  • w |= X=v iff w assigns value v to X
  • w |= ¬f iff w |≠ f
  • w |= f ∧ g iff w |= f and w |= g
  • w |= f ∨ g iff w |= f or w |= g
• Associated with each possible world is a measure. When there are only a finite number of worlds:
  0 ≤ p(w) for all w ∈ W
  Σw∈W p(w) = 1
Semantics of probability
• The probability of a formula f is the sum of the measures of the possible worlds in which f is true:
  P(f) = Σw |= f p(w)

Semantics of conditional probability
• A formula e representing the conjunction of all the agent's observations of the world is called evidence
• The measure of belief in a formula h based on the formula e is called the conditional probability of h given e, P(h|e)
• Evidence e rules out all possible worlds that are incompatible with e
Semantics of probability
• Evidence e induces a new measure pe over the possible worlds, in which all worlds where e is false have measure 0 and the remaining worlds are renormalized so that their measures sum to 1:
  pe(w) = p(w)/P(e)  if w |= e
  pe(w) = 0          if w |≠ e
• Then:
  P(h|e) = Σw |= h pe(w) = (Σw |= h∧e p(w)) / P(e) = P(h ∧ e) / P(e)
• We assume P(e) > 0. If P(e) = 0 then e is false in all possible worlds and thus cannot be observed.
• Chain rule
  P(f1 ∧ … ∧ fn) = P(f1) × P(f2|f1) × … × P(fn|f1 ∧ … ∧ fn−1)
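These definitions can be checked with a small possible-worlds model. The two Boolean variables below (Bird, Flies) and their world measures are invented for illustration; the functions are a minimal sketch of P(f) and P(h|e):

```python
# Invented example: worlds over two Boolean variables (Bird, Flies).
# p(w) is the measure of each possible world; the measures sum to 1.
p = {
    (True, True): 0.3,    # a bird that flies
    (True, False): 0.1,   # a bird that does not fly
    (False, True): 0.1,
    (False, False): 0.5,
}

def prob(f):
    """P(f) = sum of the measures of the worlds in which f is true."""
    return sum(pw for w, pw in p.items() if f(w))

def cond(h, e):
    """P(h|e) = P(h and e) / P(e), assuming P(e) > 0."""
    return prob(lambda w: h(w) and e(w)) / prob(e)

bird = lambda w: w[0]
flies = lambda w: w[1]

print(round(prob(bird), 2))         # 0.4
print(round(cond(flies, bird), 2))  # 0.75
# Chain rule: P(bird and flies) = P(bird) * P(flies|bird)
assert abs(prob(lambda w: bird(w) and flies(w)) - prob(bird) * cond(flies, bird)) < 1e-12
```

Conditioning on bird simply renormalizes the measures of the two worlds where the individual is a bird, exactly as in the pe definition above.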
Bayes theorem
• Given the current belief in a proposition H based on evidence K, P(H|K), we observe E.
  P(H|E ∧ K) = P(E|H ∧ K) × P(H|K) / P(E|K)
• If the background knowledge K is left implicit:
  P(H|E) = P(E|H) × P(H) / P(E)
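A quick numeric check of the implicit-background form; the prior and the two likelihoods are hypothetical values chosen for illustration:

```python
# Hypothetical numbers for illustration: prior P(H) = 0.01,
# likelihoods P(E|H) = 0.9 and P(E|not H) = 0.2.
p_h = 0.01
p_e_given_h = 0.9
p_e_given_not_h = 0.2

# P(E) by the rule of total probability, then Bayes' theorem
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
p_h_given_e = p_e_given_h * p_h / p_e

print(round(p_h_given_e, 4))  # 0.0435: observing E raises the belief in H from 0.01
```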
Independence assumptions
• Independence: knowledge of the truth of one proposition does not affect the belief in another
• A random variable X is independent of a random variable Y given a random variable Z if, for all values of the random variables (i.e., ai, bj, ck):
  P(X=ai | Y=bj ∧ Z=ck) = P(X=ai | Z=ck)
• Knowledge of Y's value does not affect the belief in the value of X, given the value of Z.
2. Belief networks
• A BN (Belief Network or Bayesian Network) is a graphical representation of conditional independence
• It is represented as a Directed Acyclic Graph (DAG)
• The nodes represent random variables.
• The edges represent direct dependence among the variables.
• X → Y: X has a direct influence on Y (represents a statistical dependence)
• X = Parent(Y) if X → Y
• X = Ancestor(Y) if there is a directed path from X to Y (X → … → Y)
• Z = Descendant(Y) if Z = Y or there is a directed path from Y to Z (Y → … → Z)
BN
The independence assumption embedded in a BN is:
• Each random variable is independent of its nondescendants, given its parents
  Y1,…,Yn – the parents of X
  P(X=a | Y1=v1 ∧ … ∧ Yn=vn ∧ R) = P(X=a | Y1=v1 ∧ … ∧ Yn=vn)
  if R does not involve X or any descendant of X
• The number of probabilities that needs to be specified for each variable is exponential in the number of parents of that variable
• The BN contains a set of conditional probability tables (CPTs) giving P(X=a | Y1=v1 ∧ … ∧ Yn=vn)
BN
• Therefore a BN defines a Joint Probability Distribution (JPD) over the variables in the network
• A value of the JPD can be computed as:
  P(X1=x1 ∧ … ∧ Xn=xn) = Πi=1,n P(Xi=xi | parents(Xi))
  where parents(Xi) represents the specific values of Parents(Xi)
• By the chain rule:
  P(X1=x1 ∧ … ∧ Xn=xn) = P(x1,…,xn) = P(xn | xn−1,…,x1) × P(xn−1,…,x1) = … = Πi=1,n P(xi | xi−1,…,x1)
• Order of the variables in the BN:
  P(Xi | Xi−1,…,X1) = P(Xi | Parents(Xi)), provided that Parents(Xi) ⊆ {Xi−1,…,X1}
BN
• P(Xi | Xi−1,…,X1) = P(Xi | Parents(Xi)), provided that Parents(Xi) ⊆ {Xi−1,…,X1}
• A BN is a correct representation of the domain, provided that each node is conditionally independent of its predecessors, given its parents
The fire-alarm belief network has variables Tampering, Fire, Alarm, Smoke, Leaving, and Report, with edges Tampering → Alarm, Fire → Alarm, Fire → Smoke, Alarm → Leaving, Leaving → Report, and the conditional probability tables:

  P(Tampering) = 0.02
  P(Fire) = 0.01

  Tampering  Fire  | P(Alarm)
  T          T     | 0.5
  T          F     | 0.85
  F          T     | 0.99
  F          F     | 0.0001

  Fire | P(Smoke)      Alarm | P(Leaving)      Leaving | P(Report)
  T    | 0.9           T     | 0.88            T       | 0.75
  F    | 0.01          F     | 0.001           F       | 0.01

Instead of computing the joint distribution of all the variables by the chain rule,
  P(T,F,A,S,L,R) = P(T) × P(F|T) × P(A|F,T) × P(S|A,F,T) × P(L|S,A,F,T) × P(R|L,S,A,F,T)
the BN defines a unique JPD in factored form, i.e.
  P(T,F,A,S,L,R) = P(T) × P(F) × P(A|T,F) × P(S|F) × P(L|A) × P(R|L)
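Taking P(Tampering) = 0.02 and P(Fire) = 0.01 (the priors quoted with the inference results below), the factored JPD can be sketched directly in code:

```python
from itertools import product

# CPTs of the fire-alarm network
P_T = 0.02                                      # P(Tampering=true)
P_F = 0.01                                      # P(Fire=true)
P_A = {(True, True): 0.5, (True, False): 0.85,  # P(Alarm=true | Tampering, Fire)
       (False, True): 0.99, (False, False): 0.0001}
P_S = {True: 0.9, False: 0.01}                  # P(Smoke=true | Fire)
P_L = {True: 0.88, False: 0.001}                # P(Leaving=true | Alarm)
P_R = {True: 0.75, False: 0.01}                 # P(Report=true | Leaving)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(t, f, a, s, l, r):
    """P(T,F,A,S,L,R) = P(T) P(F) P(A|T,F) P(S|F) P(L|A) P(R|L)."""
    return (bern(P_T, t) * bern(P_F, f) * bern(P_A[(t, f)], a)
            * bern(P_S[f], s) * bern(P_L[a], l) * bern(P_R[l], r))

# The factored JPD sums to 1 over all 2**6 possible worlds
total = sum(joint(*w) for w in product([True, False], repeat=6))
print(round(total, 10))  # 1.0
```

Only 1 + 1 + 4 + 2 + 2 + 2 = 12 numbers are stored, instead of the 2^6 − 1 = 63 needed for an unfactored joint table.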
Inferences
• The probability of a variable given its nondescendants can be computed using the "reasoning by cases" rule:
  P(L|S) = P(L|A,S) × P(A|S) + P(L|¬A,S) × (1 − P(A|S)) = P(L|A) × P(A|S) + P(L|¬A) × (1 − P(A|S))
  P(A|S) = P(A|F,T) × P(F,T|S) + P(A|F,¬T) × P(F,¬T|S) + P(A|¬F,T) × P(¬F,T|S) + P(A|¬F,¬T) × P(¬F,¬T|S)
• The right-hand side of each product can be computed using the multiplicative rule:
  P(F,T|S) = P(F|T,S) × P(T|S) = P(F|T,S) × P(T)
• For computing P(F|T,S) we cannot use the independence assumption, because S is a descendant of F; we can use Bayes' rule instead:
  P(F|T,S) = P(S|F,T) × P(F|T) / P(S|T) = P(S|F) × P(F) / P(S|T)
Inferences
• The prior probabilities (with no evidence) of each variable are:
  P(Tampering) = 0.02
  P(Fire) = 0.01
  P(Report) = 0.028
  P(Smoke) = 0.0189
• Observing Report gives:
  P(Tampering|Report) = 0.399
  P(Fire|Report) = 0.2305
  P(Smoke|Report) = 0.215
• The probabilities of both Tampering and Fire are increased by the Report
• Because Fire is more probable, so is Smoke
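These priors and posteriors can be reproduced by brute-force enumeration of all 2^6 possible worlds; the sketch below uses P(Tampering) = 0.02 and P(Fire) = 0.01 together with the other CPT values of the network:

```python
from itertools import product

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(t, f, a, s, l, r):
    """Factored JPD of the fire-alarm network (CPT values from the slides)."""
    p_alarm = {(True, True): 0.5, (True, False): 0.85,
               (False, True): 0.99, (False, False): 0.0001}[(t, f)]
    return (bern(0.02, t) * bern(0.01, f) * bern(p_alarm, a)
            * bern(0.9 if f else 0.01, s)
            * bern(0.88 if a else 0.001, l)
            * bern(0.75 if l else 0.01, r))

def prob(pred):
    """P(pred) = sum of joint probabilities of the worlds satisfying pred."""
    return sum(joint(*w) for w in product([True, False], repeat=6) if pred(*w))

def cond(h, e):
    """P(h|e) = P(h and e) / P(e)."""
    return prob(lambda *w: h(*w) and e(*w)) / prob(e)

tamper = lambda t, f, a, s, l, r: t
fire   = lambda t, f, a, s, l, r: f
smoke  = lambda t, f, a, s, l, r: s
report = lambda t, f, a, s, l, r: r

print(round(prob(report), 3))          # 0.028
print(round(cond(tamper, report), 3))  # 0.399
print(round(cond(fire, report), 4))    # 0.2305
print(round(cond(smoke, report), 3))   # 0.215
```

This is exponential in the number of variables, which is exactly why the structure-exploiting methods described later matter for larger networks.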
Inferences
• Suppose instead that Smoke is observed:
  P(Tampering|Smoke) = 0.02
  P(Fire|Smoke) = 0.476
  P(Report|Smoke) = 0.320
• Note that the probability of Tampering is not affected by observing Smoke; however, the probabilities of Report and Fire are increased
• Suppose that both Report and Smoke are observed:
  P(Tampering|Report, Smoke) = 0.0284
  P(Fire|Report, Smoke) = 0.964
• Thus, observing both makes Fire more likely
• However, in the context of Report, the presence of Smoke makes Tampering less likely.
Inferences
• Suppose instead that there is a Report but no Smoke:
  P(Tampering|Report, ¬Smoke) = 0.501
  P(Fire|Report, ¬Smoke) = 0.0294
• In the context of Report, Fire becomes much less likely, so the probability of Tampering increases to explain the Report.
Determining posterior distributions
• Problem = computing conditional probabilities given the evidence
• Estimating posterior probabilities in a BN to within an absolute error (of less than 0.5) is NP-hard
• There are 3 main approaches:
(1) Exploit the structure of the network
• Clique tree propagation method – the network is transformed into a tree whose nodes are labeled with sets of variables; reasoning is performed by passing messages between the nodes of the tree
• Time complexity is linear in the number of nodes of the tree, but each node of the tree stands for a set of variables, so the size of the tree may be exponential in the size of the belief network
Determining posterior distributions
(2) Search-based approaches
• Enumerate some of the possible worlds and estimate posterior probabilities from the worlds generated
(3) Stochastic simulation
• Random cases are generated according to a probability distribution. By treating these cases as a set of samples, one can estimate the marginal distribution of any combination of variables.
A structure-based method
• Based on the notion that a BN specifies a factorization of the JPD
• A factor is a representation of a function from a tuple of random variables into a number
• If f(X1,…,Xn) is a factor, then X1,…,Xn are the variables of the factor, and f is said to be a factor on X1,…,Xn
• If f(X1,…,Xn) is a factor and each vi is an element of the domain of Xi, then f(X1=v1,…,Xn=vn) is a number: the value of f when each Xi has the value vi
A structure-based method
• The product of two factors f1 and f2 is a factor on the union of their variables:
  (f1 × f2)(X1,…,Xi, Y1,…,Yj, Z1,…,Zk) = f1(X1,…,Xi, Y1,…,Yj) × f2(Y1,…,Yj, Z1,…,Zk)
• Given a factor f(X1,…,Xi), one can sum out a variable, say X1, and the result is a factor on X2,…,Xi:
  (ΣX1 f)(X2,…,Xi) = f(X1=v1,…,Xi) + … + f(X1=vk,…,Xi)
  where v1,…,vk are the values in the domain of X1
• A conditional probability distribution can be seen as a factor:
  f(X=u, Y1=v1, …, Yj=vj) = P(X=u | Y1=v1 ∧ … ∧ Yj=vj)
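A minimal sketch of these two factor operations, with a factor stored as a pair (variable list, table keyed by Boolean value tuples); the representation and the example CPTs are illustrative, not from the slides:

```python
from itertools import product

def factor_product(f1, f2):
    """(f1 x f2) is a factor on the union of the variables of f1 and f2."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    table = {}
    for assign in product([True, False], repeat=len(out_vars)):
        a = dict(zip(out_vars, assign))
        k1 = tuple(a[v] for v in vars1)
        k2 = tuple(a[v] for v in vars2)
        table[assign] = t1[k1] * t2[k2]
    return (out_vars, table)

def sum_out(f, var):
    """Sum out `var`, giving a factor on the remaining variables."""
    vars_, t = f
    i = vars_.index(var)
    out_vars = vars_[:i] + vars_[i + 1:]
    table = {}
    for k, val in t.items():
        nk = k[:i] + k[i + 1:]
        table[nk] = table.get(nk, 0.0) + val
    return (out_vars, table)

# P(A) and P(B|A) as factors; summing A out of their product gives P(B)
fA = (["A"], {(True,): 0.3, (False,): 0.7})
fBA = (["A", "B"], {(True, True): 0.9, (True, False): 0.1,
                    (False, True): 0.2, (False, False): 0.8})
fB = sum_out(factor_product(fA, fBA), "A")
print(round(fB[1][(True,)], 2))  # P(B=true) = 0.3*0.9 + 0.7*0.2 = 0.41
```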
A structure-based method
• The BN inference problem = computing the posterior distribution of a variable given some evidence
• It can be reduced to the problem of computing the probabilities of conjunctions
• Given the evidence Y1=v1, …, Yj=vj and the query variable Z:
  P(Z | v1,…,vj) = P(Z, v1,…,vj) / P(v1,…,vj) = P(Z, v1,…,vj) / Σz P(z, v1,…,vj)
• => compute the factor P(Z, v1,…,vj) and normalize
A structure-based method
• The variables of the BN are X1,…,Xn.
• To compute the factor P(Z, v1,…,vj) we must sum out the other variables from the JPD.
• Let Z1,…,Zk be an enumeration of the other variables in the BN:
  {Z1,…,Zk} = {X1,…,Xn} − {Z} − {Y1,…,Yj}
• The factor can be computed by summing out the Zi
• The order in which the Zi are summed out is called an elimination ordering
  P(Z, Y1=v1,…,Yj=vj) = ΣZk … ΣZ1 P(X1,…,Xn)Y1=v1,…,Yj=vj
A structure-based method
• P(Z, Y1=v1,…,Yj=vj) = ΣZk … ΣZ1 P(X1,…,Xn)Y1=v1,…,Yj=vj
• There is a possible world for each assignment of a value to each variable
• The JPD P(X1,…,Xn) gives the probability (measure) of each possible world
• The approach selects the worlds with the observed values for the Y's and sums over the possible worlds with the same value for Z => this is in fact the definition of conditional probability
A structure-based method
• By the chain rule for conjunctions and the definition of a BN:
  P(X1,…,Xn) = P(X1|Parents(X1)) × … × P(Xn|Parents(Xn))
• Now the BN inference problem is reduced to the problem of summing out a set of variables from a product of factors
• To compute the posterior distribution of a query variable given observations:
  1. Construct the JPD in terms of a product of factors
  2. Set the observed variables to their observed values
  3. Sum out each of the other variables (the Z1,…,Zk)
  4. Multiply the remaining factors and normalize
A structure-based method
• To sum out a variable Z from a product f1 × … × fk of factors:
• First partition the factors into those that do not contain Z, say f1,…,fi, and those that contain Z, say fi+1,…,fk
• Then:
  ΣZ f1 × … × fk = f1 × … × fi × (ΣZ fi+1 × … × fk)
• Then explicitly construct a representation (as a multidimensional array, a tree, or a set of rules) of the rightmost factor
• The size of a factor is exponential in the number of its variables
3. Bayesian prediction
There are 5 possible bags of candies:
  h1: 100% cherry
  h2: 75% cherry, 25% lime
  h3: 50% cherry, 50% lime
  h4: 25% cherry, 75% lime
  h5: 100% lime
H (the set of hypotheses) – the type of the bag, with values h1 … h5
Collect evidence (random variables): d1, d2, …, with possible values cherry or lime
Goal: predict the flavour of the next candy
Bayesian prediction
Let D be the data, with observed value d.
The probability of each hypothesis, based on Bayes' rule, is:
  P(hi|d) = α P(d|hi) P(hi)    (1)
The prediction for an unknown quantity X is:
  P(X|d) = Σi P(X|hi) P(hi|d)    (2)
• Key elements: the prior probabilities P(hi) and the probability of the evidence under each hypothesis, P(d|hi); assuming the observations are independent given the hypothesis:
  P(d|hi) = Πj P(dj|hi)    (3)
We assume the prior probabilities:
  h1: 0.1, h2: 0.2, h3: 0.4, h4: 0.2, h5: 0.1
Suppose the first candy drawn is lime. Applying (1) with the priors above:
  P(lime) = 0.1×0 + 0.2×0.25 + 0.4×0.5 + 0.2×0.75 + 0.1×1 = 0.5
  α = 1/0.5 = 2
  P(h1|lime) = α P(lime|h1) P(h1) = 2 × (0 × 0.1) = 0
  P(h2|lime) = α P(lime|h2) P(h2) = 2 × (0.25 × 0.2) = 0.1
  P(h3|lime) = α P(lime|h3) P(h3) = 2 × (0.5 × 0.4) = 0.4
  P(h4|lime) = α P(lime|h4) P(h4) = 2 × (0.75 × 0.2) = 0.3
  P(h5|lime) = α P(lime|h5) P(h5) = 2 × (1 × 0.1) = 0.2
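This update (and the two-candy case that follows) can be reproduced with a short script; the `posterior` helper is a sketch of equations (1) and (3):

```python
# Candy-bag example: priors P(hi) and per-draw lime probability P(lime|hi)
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def posterior(observations):
    """Equations (1) and (3): P(hi|d) = alpha * P(d|hi) * P(hi)."""
    unnorm = []
    for prior, pl in zip(priors, p_lime):
        likelihood = 1.0
        for obs in observations:
            likelihood *= pl if obs == "lime" else 1.0 - pl
        unnorm.append(likelihood * prior)
    alpha = 1.0 / sum(unnorm)
    return [alpha * u for u in unnorm]

print([round(x, 4) for x in posterior(["lime"])])
# approx [0, 0.1, 0.4, 0.3, 0.2]
print([round(x, 4) for x in posterior(["lime", "lime"])])
# approx [0, 0.0385, 0.3077, 0.3462, 0.3077]
```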
Suppose the second candy drawn is also lime. Applying (1) and (3):
  P(lime,lime) = 0.1×0 + 0.2×0.25×0.25 + 0.4×0.5×0.5 + 0.2×0.75×0.75 + 0.1×1×1 = 0.325
  α = 1/0.325 ≈ 3.0769
  P(h1|lime,lime) = α P(lime,lime|h1) P(h1) = 3.0769 × (0 × 0 × 0.1) = 0
  P(h2|lime,lime) = α P(lime,lime|h2) P(h2) = 3.0769 × (0.25 × 0.25 × 0.2) ≈ 0.0385
  P(h3|lime,lime) = α P(lime,lime|h3) P(h3) = 3.0769 × (0.5 × 0.5 × 0.4) ≈ 0.3077
  P(h4|lime,lime) = α P(lime,lime|h4) P(h4) = 3.0769 × (0.75 × 0.75 × 0.2) ≈ 0.3462
  P(h5|lime,lime) = α P(lime,lime|h5) P(h5) = 3.0769 × (1 × 1 × 0.1) ≈ 0.3077
[Figure: the posterior probabilities P(hi|d1,…,d10) from equation (1), plotted as successive candies are observed]
Bayesian prediction
Applying (2), the probability that the second candy is lime, given that the first one was lime:
  P(d2=lime|d1=lime) = P(d2|h1)×P(h1|d1) + P(d2|h2)×P(h2|d1) + P(d2|h3)×P(h3|d1) + P(d2|h4)×P(h4|d1) + P(d2|h5)×P(h5|d1)
  = 0×0 + 0.25×0.1 + 0.5×0.4 + 0.75×0.3 + 1×0.2 = 0.65
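Equation (2) and the convergence toward the true hypothesis can be sketched as follows, assuming for illustration that the true bag is h5, so every observed candy is lime:

```python
# Candy-bag hypotheses: priors P(hi) and lime probability P(lime|hi)
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def posterior_after_limes(n):
    """P(hi | n lime candies observed), by equations (1) and (3)."""
    unnorm = [prior * pl ** n for prior, pl in zip(priors, p_lime)]
    alpha = 1.0 / sum(unnorm)
    return [alpha * u for u in unnorm]

def predict_lime(n):
    """Equation (2): P(next = lime | n limes) = sum_i P(lime|hi) P(hi|d)."""
    return sum(pl * ph for pl, ph in zip(p_lime, posterior_after_limes(n)))

print(round(predict_lime(1), 4))  # 0.65, as computed above
for n in (0, 1, 3, 5, 10):
    print(n, round(posterior_after_limes(n)[4], 3))  # P(h5 | n limes) grows toward 1
```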
Remarks
• The true hypothesis will eventually dominate the prediction
• There are problems if the hypothesis space is large
• Approximation:
  • Prediction based only on the most probable hypothesis
  • MAP learning – maximum a posteriori
  • P(X|d) ≈ P(X|hMAP)
• In the example, hMAP = h5 after 3 observations, so the MAP prediction of lime is 1.0
• As more data is collected, the MAP and Bayesian predictions tend to become closer