Ch.6 Uncertain Knowledge Representation
Hantao Zhang
http://www.cs.uiowa.edu/~hzhang/c145
The University of Iowa, Department of Computer Science

Logic and Uncertainty

One problem with logical approaches: agents almost never have access to the whole truth about their environments. Very often, even in simple worlds, there are important questions for which there is no boolean answer. In that case, an agent must reason under uncertainty. Uncertainty also arises because of an agent's incomplete or incorrect understanding of its environment.

Uncertainty

Let action Lt = "leave for airport t minutes before flight". Will Lt get me there on time?

Problems:
- partial observability (road state, other drivers' plans, etc.)
- noisy sensors (unreliable traffic reports)
- uncertainty in action outcomes (flat tire, etc.)
- immense complexity of modelling and predicting traffic

Hence a purely logical approach either
1. risks falsehood ("A25 will get me there on time"), or
2. leads to conclusions that are too weak for decision making ("A25 will get me there on time if there's no accident on the way, it doesn't rain, my tires remain intact, ...").

Handling Uncertain Knowledge

FOL-based approaches fail to cope with domains like medical diagnosis, for three reasons:
- Laziness: it is too much work to write complete axioms, or too hard to work with the enormous sentences that result.
- Theoretical ignorance: the available theoretical knowledge of the domain is incomplete.
- Practical ignorance: the theoretical knowledge of the domain is complete, but some evidential facts are missing.

Degrees of Belief

In several real-world domains the agent's knowledge can only provide a degree of belief in the relevant sentences. The agent cannot say whether a sentence is true, but only that it is true x% of the time. The main tool for handling degrees of belief is Probability Theory. The use of probability summarizes the uncertainty that stems from our laziness or ignorance about the domain.

Probability Theory

Probability Theory makes the same ontological commitments as FOL: every sentence ϕ is either true or false. The degree of belief that ϕ is true is a number P between 0 and 1.
- P(ϕ) = 1: ϕ is certainly true.
- P(ϕ) = 0: ϕ is certainly not true.
- P(ϕ) = 0.65: ϕ is true with a 65% chance.

Probability of Facts

Let A be a propositional variable, a symbol denoting a proposition that is either true or false. P(A) denotes the probability that A is true in the absence of any other information. Similarly:
- P(¬A) = probability that A is false
- P(A ∧ B) = probability that both A and B are true
- P(A ∨ B) = probability that either A or B (or both) are true

Examples: P(¬Blonde), P(Blonde ∧ BlueEyed), P(Blonde ∨ BlueEyed)

Joint Probability Distribution

If X1, ..., Xn are random variables, P(X1 = x1, ..., Xn = xn), or simply P(X1, ..., Xn), denotes their joint probability distribution (JPD), an n-dimensional matrix specifying the probability of every possible combination of values for X1, ..., Xn.

Example: Sky : {sunny, cloudy, rain, snow}, Wind : {true, false}

P(Wind, Sky):

               sunny   cloudy   rain   snow
  Wind=true     0.30     0.15   0.17   0.01
  Wind=false    0.30     0.05   0.01   0.01
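A JPD this small can be stored directly as a lookup table. Below is a minimal Python sketch (illustrative, not from the slides) holding the P(Wind, Sky) values above:

```python
# The JPD P(Wind, Sky) as a dict from value combinations to probabilities.
jpd = {
    (True,  "sunny"): 0.30, (True,  "cloudy"): 0.15,
    (True,  "rain"):  0.17, (True,  "snow"):  0.01,
    (False, "sunny"): 0.30, (False, "cloudy"): 0.05,
    (False, "rain"):  0.01, (False, "snow"):  0.01,
}

# A JPD assigns a probability to every value combination; they sum to 1.
assert abs(sum(jpd.values()) - 1.0) < 1e-9

# Any single probability is then a table lookup,
# e.g. P(Wind = true ∧ Sky = rain):
print(jpd[(True, "rain")])  # 0.17
```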
Joint Probability Distribution

All relevant probabilities about a vector ⟨X1, ..., Xn⟩ of random variables can be computed from P(X1, ..., Xn).

          S = sunny   S = cloudy   S = rain   S = snow   P(W)
  W          0.30        0.15        0.17       0.01     0.63
  ¬W         0.30        0.05        0.01       0.01     0.37
  P(S)       0.60        0.20        0.18       0.02     1.00

  P(S = rain ∧ W) = 0.17
  P(W) = 0.30 + 0.15 + 0.17 + 0.01 = 0.63
  P(S = rain) = 0.17 + 0.01 = 0.18
  P(S = rain | W) = P(S = rain ∧ W) / P(W) = 0.17/0.63 = 0.27

Joint Probability Distribution

A joint probability distribution P(X1, ..., Xn) provides complete information about the probabilities of its random variables. However, JPDs are often hard to create (again because of incomplete knowledge of the domain). Even when available, JPD tables are very expensive, or impossible, to store because of their size: a JPD table for n random variables, each ranging over k distinct values, has k^n entries! A better approach is to come up with conditional probabilities as needed and compute the others from them.

An Alternative to JPD: The Bayes Rule

Recall that for any facts A and B,

  P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)

From this we obtain the Bayes Rule:

  P(B|A) = P(A|B)P(B) / P(A)

The rule is useful in practice because it is often easier to figure out P(A|B), P(B), P(A) than to figure out P(B|A) directly.

Applying the Bayes Rule

What is the probability that Sam is infected by HIV (human immunodeficiency virus) if he tests positive in a routine blood test using ELISA? ELISA (Enzyme-Linked Immunosorbent Assay) is a highly accurate and relatively cheap test for HIV. Let H = HIV present, E = positive in ELISA:

  P(H|E) = P(E|H)P(H) / P(E)

P(E|H) is easier to estimate than P(H|E) because it refers to causal knowledge: people affected by HIV have a high rate of testing positive in ELISA. P(E|H) can be estimated from past medical cases: say 9990 of 10000 HIV patients tested positive in ELISA, so P(E|H) = 0.999. Similarly, P(H) can be estimated from past medical cases: if 1 in 100,000 men in Sam's state is affected with HIV, then P(H) = 0.00001.

How do we estimate P(E), the probability of testing positive in ELISA? If we know the false positive rate (say 0.002) for those who are not affected by HIV, then

  P(E) = P(E|H)P(H) + P(E|¬H)P(¬H)
       = 0.999 × 0.00001 + 0.002 × 0.99999 = 0.00200997

  P(H|E) = P(E|H)P(H) / P(E)
         = 0.999 × 0.00001 / 0.00200997 ≈ 0.004970
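The computation is easy to check mechanically. A short sketch (illustrative, not from the slides) using the estimates above:

```python
# Estimates from the slides.
p_e_given_h     = 0.999    # P(E|H): sensitivity, from past medical cases
p_h             = 0.00001  # P(H): prior, 1 in 100,000
p_e_given_not_h = 0.002    # P(E|¬H): false positive rate

# Total probability: P(E) = P(E|H)P(H) + P(E|¬H)P(¬H)
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Bayes rule: P(H|E) = P(E|H)P(H) / P(E)
p_h_given_e = p_e_given_h * p_h / p_e

print(p_e)          # 0.00200997
print(p_h_given_e)  # ≈ 0.004970
```

Despite the accurate test, the posterior is under 0.5%: the disease is so rare that false positives dominate.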
Applying the Bayes Rule

The Bayes rule is helpful even in the absence of (immediate) causal relationships. What is the probability that a blonde (B) is Swedish (S)?

  P(S|B) = P(B|S)P(S) / P(B)

All of P(B|S), P(S), P(B) are easily estimated from statistical information:

  P(B|S) ≈ (# of blonde Swedes) / (Swedish population) = 9/10
  P(S)   ≈ (Swedish population) / (world population)   = ...
  P(B)   ≈ (# of blondes) / (world population)         = ...

Conditional Probability

A JPD is a collection of probabilities:

  P(X1, ..., Xn) = { P(X1 = x1 ∧ ... ∧ Xn = xn) | xi ∈ Domain(Xi) }

Conditional probability:

  P(X1 = x1 | X2 = x2) = P(X1 = x1 ∧ X2 = x2) / P(X2 = x2)

or, equivalently,

  P(X1 = x1 ∧ X2 = x2) = P(X1 = x1 | X2 = x2) P(X2 = x2)

In general,

  P(X1 = x1 | X2 = x2, ..., Xn = xn) = P(X1 = x1 ∧ ... ∧ Xn = xn) / P(X2 = x2 ∧ ... ∧ Xn = xn)

or, equivalently,

  P(X1 = x1 ∧ ... ∧ Xn = xn)
    = P(X1 = x1 | X2 = x2, ..., Xn = xn) P(X2 = x2 ∧ ... ∧ Xn = xn)
    = ∏ i=1..n P(Xi = xi | X(i+1) = x(i+1), ..., Xn = xn)

Conditional Independence

If

  P(X1 = x1 | X2 = x2, ..., X(n−1) = x(n−1), Xn = xn) = P(X1 = x1 | X2 = x2, ..., X(n−1) = x(n−1))

then we say X1 = x1 is conditionally independent of Xn = xn given the evidence X2 = x2, ..., X(n−1) = x(n−1).

Bayesian Networks

A joint probability distribution (JPD) contains all the relevant information to reason about the various kinds of probabilities of a set {X1, ..., Xn} of random variables. Unfortunately, JPD tables are difficult to create and also very expensive to store. One possibility is to work with conditional probabilities and exploit the fact that many random variables are conditionally independent. Belief networks are a successful example of probabilistic systems that exploit conditional independence to reason effectively under uncertainty.

Belief Networks

Let X1, ..., Xn be random variables. A belief network (or Bayesian network) for X1, ..., Xn is a graph with n nodes such that:
- there is a node for each Xi,
- all the edges between two nodes are directed,
- there are no cycles,
- each node has a conditional probability table (CPT), given in terms of the node's parents.

The intuitive meaning of an edge from a node Xi to a node Xj is that Xi has a direct influence on Xj.

A Belief Network

A belief network is an acyclic graph G = (V, E), where V is a set of random variables and E expresses conditional dependencies among them.

Detecting Credit Card Fraud

We consider the following random variables:
- Fraud (F): whether the current purchase is fraudulent
- Gas (G): whether gas has been purchased in the past 24 hours
- Jewelry (J): whether jewelry has been purchased in the past 24 hours
- Age (A): age of the card holder
- Sex (S): gender of the card holder

These variables are all causally related.

Conditional Probability Tables

Each node Xi in a belief network has a CPT expressing the probability of Xi, given its parents as evidence.

Some Notations

Given a joint probability distribution P, IP(X, A|B) means X is conditionally independent of A given B, i.e., P(X|A, B) = P(X|B). A and B may be sets of random variables.

Example: let L mean the letter is A, S mean the shape is a square, and C mean the color is black. Then IP(L, S|C), i.e., P(L|S, C) = P(L|C) (and P(S|L, C) = P(S|C)).
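Conditional independence is a property of the underlying JPD, so it can be tested numerically. The sketch below (all numbers hypothetical, not from the slides) builds a small JPD over (L, S, C) in which L and S are independent given C, then verifies IP(L, S|C) for one instantiation:

```python
from itertools import product

# Hypothetical conditional probabilities; the joint is built so that
# L and S are independent given C (mirroring the letter/shape/color example).
p_c = {True: 0.5, False: 0.5}
p_l = {True: 0.6, False: 0.3}  # P(L | C = c), keyed by c
p_s = {True: 0.7, False: 0.2}  # P(S | C = c), keyed by c

jpd = {(l, s, c): p_c[c] * (p_l[c] if l else 1 - p_l[c])
                         * (p_s[c] if s else 1 - p_s[c])
       for l, s, c in product([True, False], repeat=3)}

def prob(pred):
    """P(pred), computed by summing the matching JPD entries."""
    return sum(p for (l, s, c), p in jpd.items() if pred(l, s, c))

# IP(L, S | C) requires P(L | S, C) = P(L | C); check it for L, S, C all
# true (a full check would range over every combination of values).
lhs = prob(lambda l, s, c: l and s and c) / prob(lambda l, s, c: s and c)
rhs = prob(lambda l, s, c: l and c) / prob(lambda l, s, c: c)
print(abs(lhs - rhs) < 1e-9)  # True
```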
Some Notations

Given a DAG G = (V, E), if (X, Y) is an edge in E, then X is a parent of Y and Y is a child of X. If there exists a path from X to Y, then X is an ancestor of Y and Y is a descendant of X.

Markov Condition

A DAG G = (V, E) can be a Bayesian network if G satisfies the Markov condition: for any node X, let PA_X and ND_X be the sets of parents and nondescendants of X, respectively; then IP(X, ND_X | PA_X).

Conditional Probability Tables (CPT)

Each node Xi in a belief network has a CPT expressing the probability of Xi, given its parents as evidence. For the letter/shape/color example (C is the parent of both L and S):

  P(S, L, C) = P(S|L, C) P(L, C) = P(S|C) P(L|C) P(C)

Note: as in the textbook, P(S, L, C) stands for the whole collection of values P(s1, l1, c1), ..., P(s2, l2, c2).

The Semantics of Belief Networks

There are two equivalent ways to interpret a belief network for the variables X1, ..., Xn:
1. The network is a representation of the JPD P(X1, ..., Xn).
2. The network is a collection of conditional independence statements about X1, ..., Xn.

Interpretation 1 is helpful in understanding how to construct belief networks. Interpretation 2 is helpful in designing inference procedures on top of them.

Belief Network as JPD

The whole joint P(X1, ..., Xn) can be computed from a belief network for X1, ..., Xn and its CPTs. For each tuple ⟨x1, ..., xn⟩ of possible values for ⟨X1, ..., Xn⟩,

  P(X1 = x1 ∧ ... ∧ Xn = xn) = ∏ i=1..n P(Xi = xi | PA_Xi)

where PA_Xi is the (values of the) parents of Xi in the DAG.

Belief Network as JPD

[Figure: a chain network X → Y → Z → W, with P(X) = 0.4, P(Y|X) = 0.9, P(Y|¬X) = 0.8, P(Z|Y) = 0.7, P(Z|¬Y) = 0.4, P(W|¬Z) = 0.6.]

  P(¬X, Y, ¬Z, W) = P(¬X) P(Y|¬X) P(¬Z|Y) P(W|¬Z)
                  = (1 − 0.4)(0.8)(1 − 0.7)(0.6) = 0.0864

  P(Y) = P(Y|X)P(X) + P(Y|¬X)P(¬X) = 0.9(0.4) + 0.8(0.6) = 0.84

Belief Network as JPD

  P(Z|X) = P(Z|Y, X)P(Y|X) + P(Z|¬Y, X)P(¬Y|X)
         = P(Z|Y)P(Y|X) + P(Z|¬Y)P(¬Y|X)
         = 0.7(0.9) + 0.4(0.1) = 0.67

  P(X|Z) = P(Z|X)P(X) / P(Z) = 0.67(0.4)/0.652 ≈ 0.411
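These numbers can be reproduced by brute-force enumeration of the joint. A sketch, assuming the chain CPT values listed above (recovered from the slide's arithmetic rather than from the original figure):

```python
from itertools import product

# CPTs of the chain X -> Y -> Z (-> W), as listed above.
p_x = {True: 0.4, False: 0.6}          # P(X = x)
p_y_given_x = {True: 0.9, False: 0.8}  # P(Y | X = x), keyed by x
p_z_given_y = {True: 0.7, False: 0.4}  # P(Z | Y = y), keyed by y

def joint(x, y, z):
    """P(X=x ∧ Y=y ∧ Z=z) as the product of CPT entries (W summed out)."""
    py = p_y_given_x[x] if y else 1 - p_y_given_x[x]
    pz = p_z_given_y[y] if z else 1 - p_z_given_y[y]
    return p_x[x] * py * pz

# P(¬X ∧ Y ∧ ¬Z ∧ W) = P(¬X, Y, ¬Z) · P(W|¬Z)  -> 0.0864
print(joint(False, True, False) * 0.6)

# P(Y): sum out X and Z  -> 0.84
print(sum(joint(x, True, z) for x, z in product([True, False], repeat=2)))

# P(Z|X) = P(Z ∧ X) / P(X)  -> 0.67
p_zx = sum(joint(True, y, True) for y in [True, False])
print(p_zx / p_x[True])

# P(X|Z) = P(Z ∧ X) / P(Z)  -> ≈ 0.411
p_z = sum(joint(x, y, True) for x, y in product([True, False], repeat=2))
print(p_zx / p_z)
```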
Another Example

How to compute P(Y|Z)? [Figure: network not reproduced.]

Constructing BN: General Procedure

1. Choose a set of random variables Xi that describe the domain.
2. Order the variables into a list L.
3. Start with an empty network.
4. While there are variables left in L:
   (a) Remove the first variable Xi from L and add a node for it to the network.
   (b) Choose a minimal set of nodes already in the network such that, given these nodes, Xi is conditionally independent of the other nodes already in the network.
   (c) Make these nodes the parents of Xi.
   (d) Fill in the CPT for Xi.

Ordering the Variables

The order in which we add the variables to the network is important: "wrong" orderings produce more complex networks.

Ordering the Variables Right

A general, effective heuristic for constructing simpler belief networks is to exploit causal links between random variables whenever possible. That is, add the variables to the network so that causes get added before effects.

Locally Structured Systems

In addition to being a complete and non-redundant representation of the domain, a belief network is often more compact than a full joint. The reason is that probabilistic domains are often representable as a locally structured system. In a locally structured system, each subcomponent interacts with only a bounded number of other components, regardless of the size of the system. The complexity of local structures generally grows linearly, instead of exponentially.

BN and Locally Structured Systems

In many real-world domains, each random variable is influenced by at most k others, for some fixed constant k. Assuming that we have n variables and, for simplicity, that they are all Boolean, a joint table will have 2^n entries. In a well-constructed belief network, each node will have at most k parents, so each node will have a CPT with at most 2^k entries, for a total of at most 2^k · n entries.

Example: n = 20, k = 5.
  entries in network CPTs ≤ 2^5 × 20 = 640
  entries in joint table = 2^20 = 1,048,576

Inference in Belief Networks

Main task of a belief network: compute the conditional probability of a set of query variables, given exact values for some evidence variables:

  P(Query | Evidence)

Belief networks are flexible enough that any node can serve as either a query or an evidence variable. In general, to decide what actions to take, an agent
1. first gets values for some variables from its percepts, or from its own reasoning,
2. then asks the network about the possible values of the other variables.

Probabilistic Inference with BN

Belief networks are a very flexible tool for probabilistic inference because they allow several kinds of inference:
- Diagnostic inference (from effects to causes). Example: infer P(Burglary | JohnCalls).
- Causal inference (from causes to effects). Example: infer P(JohnCalls | Burglary).
- Intercausal inference (between causes of a common effect). Example: infer P(Burglary | Alarm ∧ Earthquake).
- Mixed inference (a combination of the above). Example: infer P(Alarm | JohnCalls ∧ ¬Earthquake).

Types of Inference in Belief Networks

[Figure: Q = query variable, E = evidence variable.]
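To make the diagnostic case concrete, here is a sketch of inference by enumeration on the burglary network (B, E → A → J, M) named in the examples above. The CPT values are the textbook's standard numbers for this network, not given on these slides, so treat them as illustrative:

```python
from itertools import product

def joint(b, e, a, j, m):
    """P(B=b ∧ E=e ∧ A=a ∧ J=j ∧ M=m) from the network's CPTs."""
    pb = 0.001 if b else 0.999
    pe = 0.002 if e else 0.998
    pa_true = {(True, True): 0.95, (True, False): 0.94,
               (False, True): 0.29, (False, False): 0.001}[(b, e)]
    pa = pa_true if a else 1 - pa_true
    pj = (0.90 if a else 0.05) if j else (0.10 if a else 0.95)
    pm = (0.70 if a else 0.01) if m else (0.30 if a else 0.99)
    return pb * pe * pa * pj * pm

# Diagnostic query P(Burglary | JohnCalls) by enumeration: sum the joint
# over every assignment consistent with the evidence J = true.
num = sum(joint(True, e, a, True, m)
          for e, a, m in product([True, False], repeat=3))
den = sum(joint(b, e, a, True, m)
          for b, e, a, m in product([True, False], repeat=4))
print(num / den)  # ≈ 0.016 with these CPT values
```

Enumeration touches every row of the implicit joint, which is fine for this 5-node network but motivates the more efficient inference algorithms that exploit the network structure.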