Tutorial on Bayesian Networks
Jack Breese & Daphne Koller
First given as an AAAI'97 tutorial.

Overview
■ Decision-theoretic techniques
  ◆ Explicit management of uncertainty and tradeoffs
  ◆ Probability theory
  ◆ Maximization of expected utility
■ Applications to AI problems
  ◆ Diagnosis
  ◆ Expert systems
  ◆ Planning
  ◆ Learning

Applications in Science: AAAI-97
■ Model Minimization in Markov Decision Processes
■ Effective Bayesian Inference for Stochastic Programs
■ Learning Bayesian Networks from Incomplete Data
■ Summarizing CSP Hardness With Continuous Probability Distributions
■ Speeding Safely: Multi-criteria Optimization in Probabilistic Planning
■ Structured Solution Methods for Non-Markovian Decision Processes

"Microsoft's cost-cutting helps users" (04/21/97)
A Microsoft Corp. strategy to cut its support costs by letting users solve their own problems using electronic means is paying off for users. In March, the company began rolling out a series of Troubleshooting Wizards on its World Wide Web site. Troubleshooting Wizards save time and money for users who don't have Windows NT specialists on hand at all times, said Paul Soares, vice president and general manager of Alden Buick Pontiac, a General Motors Corp. car dealership in Fairhaven, Mass.

Teenage Bayes
"Microsoft Researchers Exchange Brainpower with Eighth-grader; Teenager Designs Award-Winning Science Project." For her science project, which she called "Dr. Sigmund Microchip," Tovar wanted to create a computer program to diagnose the probability of certain personality types. With only answers from a few questions, the program was able to accurately diagnose the correct personality type 90 percent of the time.

Course Contents
» Concepts in Probability
  ◆ Probability
  ◆ Random variables
  ◆ Basic properties (Bayes rule)
■ Bayesian Networks
■ Inference
■ Decision making
■ Learning networks from data
■ Reasoning over time
■ Applications
Probabilities
■ Probability distribution P(X|ξ)
  ◆ X is a random variable
    ■ Discrete
    ■ Continuous
  ◆ ξ is the background state of information

Discrete Random Variables
■ Finite set of possible outcomes: X ∈ {x1, x2, x3, ..., xn}
■ P(xi) ≥ 0
■ Σi=1..n P(xi) = 1
■ X binary: P(x) + P(x̄) = 1

Continuous Random Variable
■ Probability distribution (density function) over continuous values
■ e.g. X ∈ [0, 10]:  P(x) ≥ 0,  ∫0..10 P(x) dx = 1
■ P(5 ≤ X ≤ 7) = ∫5..7 P(x) dx

More Probabilities
■ Joint
  ◆ P(x, y) ≡ P(X = x ∧ Y = y)
  ◆ Probability that both X = x and Y = y
■ Conditional
  ◆ P(x | y) ≡ P(X = x | Y = y)
  ◆ Probability that X = x given we know that Y = y

Rules of Probability
■ Product Rule:  P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X)
■ Marginalization:  P(Y) = Σi=1..n P(Y, xi)
  ◆ X binary: P(Y) = P(Y, x) + P(Y, x̄)

Bayes Rule
P(H, E) = P(H | E) P(E) = P(E | H) P(H)
⇒  P(H | E) = P(E | H) P(H) / P(E)

Course Contents
■ Concepts in Probability
» Bayesian Networks
  ◆ Basics
    ■ Structured representation
    ■ Conditional independence
    ■ Naïve Bayes model
    ■ Independence facts
  ◆ Additional structure
  ◆ Knowledge acquisition
■ Inference
■ Decision making
■ Learning networks from data
■ Reasoning over time
■ Applications

Bayesian Networks
Smoking → Cancer
S ∈ {no, light, heavy};  C ∈ {none, benign, malignant}

P(S=no) = 0.80,  P(S=light) = 0.15,  P(S=heavy) = 0.05

P(C | S):
Smoking=   P(C=none)   P(C=benign)   P(C=malig)
no         0.96        0.03          0.01
light      0.88        0.08          0.04
heavy      0.60        0.25          0.15

Product Rule
P(C, S) = P(C | S) P(S)

S⇓ C⇒    none    benign   malignant
no       0.768   0.024    0.008
light    0.132   0.012    0.006
heavy    0.035   0.010    0.005

Marginalization
S⇓ C⇒    none    benign   malig    total = P(Smoke)
no       0.768   0.024    0.008    .80
light    0.132   0.012    0.006    .15
heavy    0.035   0.010    0.005    .05
total    0.935   0.046    0.019    ← P(Cancer)

Bayes Rule Revisited
P(S | C) = P(C | S) P(S) / P(C) = P(C, S) / P(C)
(A small code sketch of these computations appears below.)

S⇓ C⇒    none         benign       malig
no       0.768/.935   0.024/.046   0.008/.019
light    0.132/.935   0.012/.046   0.006/.019
heavy    0.035/.935   0.010/.046   0.005/.019

Cancer=     P(S=no)   P(S=light)   P(S=heavy)
none        0.821     0.141        0.037
benign      0.522     0.261        0.217
malignant   0.421     0.316        0.263

A Bayesian Network
Age → Exposure to Toxics;  Age, Gender → Smoking;
Exposure to Toxics, Smoking → Cancer;
Cancer → Serum Calcium;  Cancer → Lung Tumor

Independence
Age and Gender are independent.
P(A, G) = P(A) P(G)
P(A | G) = P(A)    A ⊥ G
P(G | A) = P(G)    G ⊥ A
P(A, G) = P(G | A) P(A) = P(G) P(A)
P(A, G) = P(A | G) P(G) = P(A) P(G)
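To make the product rule, marginalization, and Bayes rule concrete, here is a minimal sketch in plain Python (not part of the original slides) that rebuilds the smoking/cancer tables above from the CPTs. Note that recomputing the heavy-smoker row of the joint from the printed CPT gives values that differ in the third decimal from the printed joint table, which appears to use slightly rounded numbers.

```python
# Build the joint P(C,S) via the product rule, marginalize to get P(C),
# and apply Bayes rule to get P(S | C). Numbers are from the slides.

P_S = {"no": 0.80, "light": 0.15, "heavy": 0.05}

P_C_given_S = {
    "no":    {"none": 0.96, "benign": 0.03, "malignant": 0.01},
    "light": {"none": 0.88, "benign": 0.08, "malignant": 0.04},
    "heavy": {"none": 0.60, "benign": 0.25, "malignant": 0.15},
}

# Product rule: P(C, S) = P(C | S) P(S)
joint = {(s, c): P_C_given_S[s][c] * P_S[s]
         for s in P_S for c in P_C_given_S[s]}

# Marginalization: P(C) = sum over s of P(C, s)
P_C = {}
for (s, c), p in joint.items():
    P_C[c] = P_C.get(c, 0.0) + p

# Bayes rule: P(S | C) = P(C, S) / P(C)
P_S_given_C = {(s, c): joint[(s, c)] / P_C[c] for (s, c) in joint}

print(P_C["malignant"])                     # 0.008 + 0.006 + 0.0075 = 0.0215
print(P_S_given_C[("heavy", "malignant")])  # 0.0075 / 0.0215 ≈ 0.35
```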
Conditional Independence
Cancer is independent of Age and Gender given Smoking.
P(C | A, G, S) = P(C | S)    C ⊥ A, G | S

More Conditional Independence: Naïve Bayes
Serum Calcium and Lung Tumor are dependent,
but Serum Calcium is independent of Lung Tumor given Cancer.
P(L | SC, C) = P(L | C)

Naïve Bayes in general
H → E1, E2, E3, ..., En
2n + 1 parameters: P(h);  P(ei | h), P(ei | h̄),  i = 1, ..., n

More Conditional Independence: Explaining Away
Exposure to Toxics and Smoking are independent:  E ⊥ S
Exposure to Toxics is dependent on Smoking, given Cancer:
P(E = heavy | C = malignant) > P(E = heavy | C = malignant, S = heavy)

Put it all together
P(A, G, E, S, C, L, SC) =
  P(A) · P(G) · P(E | A) · P(S | A, G) · P(C | E, S) · P(SC | C) · P(L | C)

General Product (Chain) Rule for Bayesian Networks
P(X1, X2, ..., Xn) = Πi=1..n P(Xi | Pai),  where Pai = parents(Xi)

Conditional Independence
A variable (node) is conditionally independent of its non-descendants given its parents.
■ Cancer is independent of Age and Gender given Exposure to Toxics and Smoking.

Another non-descendant
■ Cancer is independent of Diet given Exposure to Toxics and Smoking.

Independence and Graph Separation
■ Given a set of observations, is one set of variables dependent on another set?
■ Observing effects can induce dependencies.
■ d-separation (Pearl 1988) allows us to check conditional independence graphically.

Bayesian networks: Additional structure
◆ Nodes as functions
◆ Causal independence
◆ Context-specific dependencies
◆ Continuous variables
◆ Hierarchy and model construction

Nodes as functions
■ A BN node is a conditional distribution function:
  ◆ its parent values are the inputs
  ◆ its output is a distribution over its values

X | A, B    ab     ab̄     āb     āb̄
lo          0.1    0.4    0.7    0.5
med         0.3    0.2    0.1    0.3
hi          0.6    0.4    0.2    0.2

Any type of function from Val(A, B) to distributions over Val(X) will do.

Causal Independence
Burglary → Alarm ← Earthquake
■ Burglary causes Alarm iff the motion sensor is clear
■ Earthquake causes Alarm iff the wire is loose
■ Enabling factors are independent of each other

Fine-grained model
Burglary → Motion sensed;  Earthquake → Wire Move;
Alarm = Motion sensed ∨ Wire Move (deterministic or)

            e       ē                    b       b̄
w         1 − rE    0          m       1 − rB    0
w̄           rE      1          m̄          rB     1

Noisy-Or model
The alarm is false only if all mechanisms are independently inhibited:
P(a) = 1 − Π(parent X active) rX
The # of parameters is linear in the # of parents.
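A minimal sketch of the noisy-or node function just described, in plain Python (not from the original slides). Each active parent's mechanism is independently inhibited with probability rX, so the alarm is false only if every active parent's mechanism is inhibited; the inhibition probabilities below are assumed for illustration.

```python
def noisy_or(r, active):
    """P(alarm) given inhibition probabilities r and the set of active parents."""
    p_all_inhibited = 1.0
    for parent in active:
        p_all_inhibited *= r[parent]   # each mechanism inhibited independently
    return 1.0 - p_all_inhibited

# Assumed inhibition probabilities, one per parent (n parameters, not 2^n).
r = {"burglary": 0.1, "earthquake": 0.3}

print(noisy_or(r, {"burglary"}))                 # 1 - 0.1 = 0.9
print(noisy_or(r, {"burglary", "earthquake"}))   # 1 - 0.1*0.3 = 0.97
```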
CPCS Network
[Figure: the CPCS medical-diagnosis network, a large multiply connected BN.]

Context-specific Dependencies
Alarm-Set, Burglary, Cat → Alarm
■ Asymmetric dependencies: the node function is represented as a tree.
■ The alarm can go off only if it is Set.
■ A burglar and the cat can both set off the alarm.
■ If a burglar comes in, the cat hides and does not set off the alarm.

Tree-structured node function for Alarm:
Set?
  s̄ → (a: 0, ā: 1)
  s → Burglary?
        b → (a: 0.9, ā: 0.1)
        b̄ → Cat?
              c → (a: 0.6, ā: 0.4)
              c̄ → (a: 0.01, ā: 0.99)

Alarm is independent of:
◆ Burglary and Cat, given s̄
◆ Cat, given s and b

Asymmetric Assessment
Print Data, Net OK, Local OK, Net Transport, Local Transport, Location → Printer Output

Continuous variables
Outdoor Temperature, A/C Setting (e.g. hi, 97°) → Indoor Temperature
The node function maps Val(A, B) to density functions over Val(X).

Gaussian (normal) distributions
P(x) = (1 / (√(2π) σ)) exp(−(x − µ)² / 2σ²)     X ~ N(µ, σ²)
[Figure: normal densities with different means and different variances.]

Gaussian networks
X ~ N(µ, σX²);   Y ~ N(ax + b, σY²)
Each variable is a linear function of its parents, with Gaussian noise.
[Figure: the resulting joint probability density functions for X → Y.]

Composing functions
■ Recall: a BN node is a function.
■ We can compose functions to get more complex functions.
■ The result: a hierarchically structured BN.
■ Since functions can be called more than once, we can reuse a BN model fragment in multiple contexts.
[Figure: hierarchical car model. Owner (Income, Age) → Maintenance; Original-value, Age, Mileage → Car, composed of Brakes (Power), Engine (Power), and Tires (RF-Tire, LF-Tire: Pressure, Traction); outputs Fuel-efficiency and Braking-power.]

Bayesian Networks: Knowledge acquisition
◆ Variables
◆ Structure
◆ Numbers

What is a variable?
■ Collectively exhaustive, mutually exclusive values:
  x1 ∨ x2 ∨ x3 ∨ x4;   ¬(xi ∧ xj) for i ≠ j
■ e.g. Error Occurred versus No Error
■ Values versus probabilities: "Risk of Smoking" is not a value; "Smoking" is.

Clarity Test: Knowable in Principle
■ Weather {Sunny, Cloudy, Rain, Snow}
■ Gasoline: cents per gallon
■ Temperature {≥ 100°F, < 100°F}
■ User needs help on Excel Charting {Yes, No}
■ User's personality {dominant, submissive}

Structuring
Network structure corresponding to "causality" is usually good.
Extending the conversation: Age, Gender → Exposure to Toxics, Smoking → Cancer (with Genetic Damage) → Lung Tumor

Do the numbers really matter?
■ The second decimal usually does not matter
■ Relative probabilities
■ Zeros and ones
■ Order of magnitude: 10⁻⁹ vs 10⁻⁶
■ Sensitivity analysis

Local Structure
■ Causal independence: from 2ⁿ to n+1 parameters
■ Asymmetric assessment: similar savings in practice
■ Typical savings (# params):
  ◆ 145 to 55 for a small hardware network;
  ◆ 133,931,430 to 8,254 for CPCS!!

Course Contents
■ Concepts in Probability
■ Bayesian Networks
» Inference
■ Decision making
■ Learning networks from data
■ Reasoning over time
■ Applications

Inference
■ Patterns of reasoning
■ Basic inference
■ Exact inference
■ Exploiting structure
■ Approximate inference

Predictive Inference
How likely are elderly males to get malignant cancer?
P(C = malignant | Age > 60, Gender = male)

Combined
How likely is an elderly male patient with high Serum Calcium to have malignant cancer?
P(C = malignant | Age > 60, Gender = male, Serum Calcium = high)

Explaining away
■ If we see a lung tumor, the probability of heavy smoking and of exposure to toxics both go up.
■ If we then observe heavy smoking, the probability of exposure to toxics goes back down.

Inference in Belief Networks
Find P(Q = q | E = e)
◆ Q is the query variable
◆ E is the set of evidence variables

P(q | e) = P(q, e) / P(e)
P(q, e) = Σx1,...,xn P(q, e, x1, ..., xn)
where X1, ..., Xn are the network variables other than Q and E.

Basic Inference
A → B:  P(b) = Σa P(a, b) = Σa P(b | a) P(a)
(For example, summing the smoking/cancer joint P(C, S) above over S gives P(Cancer), exactly as in the marginalization table.)

Chains
A → B → C:
P(c) = Σb,a P(a, b, c) = Σb,a P(c | b) P(b | a) P(a)
     = Σb P(c | b) P(b),  where P(b) = Σa P(b | a) P(a)
(A worked sketch of this computation follows below.)
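Here is a minimal sketch in plain Python (not from the original slides) of the chain computation above: the sum over A is pushed inward to give P(B) first, then P(C). The numeric CPTs are assumed for illustration.

```python
P_A = {True: 0.6, False: 0.4}                 # assumed prior on A
P_B_given_A = {True: {True: 0.9, False: 0.1},  # keyed [a][b]
               False: {True: 0.2, False: 0.8}}
P_C_given_B = {True: {True: 0.7, False: 0.3},  # keyed [b][c]
               False: {True: 0.5, False: 0.5}}

# Inner sum: P(b) = sum over a of P(b | a) P(a)
P_B = {b: sum(P_B_given_A[a][b] * P_A[a] for a in P_A)
       for b in (True, False)}

# Outer sum: P(c) = sum over b of P(c | b) P(b)
P_C = {c: sum(P_C_given_B[b][c] * P_B[b] for b in P_B)
       for c in (True, False)}

print(P_B[True])   # 0.9*0.6 + 0.2*0.4 = 0.62
print(P_C[True])   # 0.7*0.62 + 0.5*0.38 = 0.624
```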
Inference in trees
X with parents Y1, Y2:
P(x) = Σy1,y2 P(x | y1, y2) P(y1, y2)
     = Σy1,y2 P(x | y1, y2) P(y1) P(y2)      (because of the independence of Y1, Y2)

Polytrees
■ A network is singly connected (a polytree) if it contains no undirected loops.
Theorem: Inference in a singly connected network can be done in linear time*.
Main idea: in variable elimination, we need only maintain distributions over single nodes.
* in network size, including table sizes.

The problem with loops
Cloudy → Rain;  Cloudy → Sprinkler;  Rain, Sprinkler → Grass-wet (deterministic or)
P(c) = 0.5
P(r | c) = 0.99,  P(r | c̄) = 0
P(s | c) = 0.01,  P(s | c̄) = 0.99

The grass is dry only if there is no rain and no sprinklers:
P(ḡ) = P(r̄, s̄) ≈ 0
Treating R and S as independent gives the wrong answer:
P(r̄) P(s̄) ≈ 0.5 · 0.5 = 0.25

The problem with loops contd.
P(g) = P(g | r, s) P(r, s) + P(g | r, s̄) P(r, s̄) + P(g | r̄, s) P(r̄, s) + P(g | r̄, s̄) P(r̄, s̄)
where P(g | r̄, s̄) = 0 (deterministic or) and the other three conditionals equal 1.

Variable elimination
A → B → C:  P(c) = Σb P(c | b) Σa P(b | a) P(a)
P(A) × P(B | A) → P(B, A);  ΣA → P(B);  × P(C | B) → P(C, B);  ΣB → P(C)

■ A factor over X is a function from Val(X) to numbers in [0, 1]:
  ◆ a CPT is a factor
  ◆ a joint distribution is also a factor
■ BN inference:
  ◆ factors are multiplied to give new ones
  ◆ variables in factors are summed out
■ A variable can be summed out as soon as all factors mentioning it have been multiplied.

Variable Elimination with loops
P(A) × P(G) × P(S | A, G) → P(A, G, S);  ΣG → P(A, S);
× P(E | A) → P(A, E, S);  ΣA → P(E, S);
× P(C | E, S) → P(E, S, C);  ΣE,S → P(C);
× P(L | C) → P(C, L);  ΣC → P(L)
Complexity is exponential in the size of the factors.

Join trees*
A join tree is a partially precompiled factorization:
(A, G, S) — (A, E, S) — (E, S, C) — (C, S-C), (C, L)
* aka junction trees; Lauritzen-Spiegelhalter, Hugin alg., ...

Exploiting Structure: Noisy-or decomposition
Idea: explicitly decompose noisy-or nodes into a cascade of binary deterministic ors
(Earthquake, Burglary, Truck, Wind → E', B', T', W' → chained ors → Alarm).
Smaller families → smaller factors → faster inference.

Inference with continuous variables
■ Gaussian networks: polynomial-time inference regardless of network structure
■ Conditional Gaussians: discrete variables cannot depend on continuous ones
  e.g. Fire, Wind Speed → Smoke Concentration:  S ~ N(aF w + bF, σF²)
■ These techniques do not work for general hybrid networks.

Computational complexity
Theorem: Inference in a multiply connected Bayesian network is NP-hard.
Reduction from a Boolean 3CNF formula, e.g. φ = (u ∨ v ∨ w) ∧ (ū ∨ w ∨ y):
each variable has prior probability 1/2, with or-nodes per clause and an and-node at the root.
P(output) = 1/2ⁿ · (# satisfying assignments of φ)

Stochastic simulation
Burglary (P(b) = 0.03), Earthquake (P(e) = 0.001) → Alarm;  Alarm → Call;  Earthquake → Newscast
P(a | b, e) = 0.98,  P(a | b, ē) = 0.7,  P(a | b̄, e) = 0.4,  P(a | b̄, ē) = 0.01
P(c | a) = 0.8,  P(c | ā) = 0.05;   P(n | e) = 0.3,  P(n | ē) = 0.001

Evidence: Call = c. Draw full samples (b, e, a, c, n) from the network and keep only
the "live" samples that agree with the evidence:
P(b | c) ≈ (# of live samples with B = b) / (total # of live samples)

Likelihood weighting
Evidence variables are not sampled; instead each sample is weighted by the likelihood of the evidence:
P(b | c) = (weight of samples with B = b) / (total weight of samples)
With evidence Call = c, a sample with a gets weight P(c | a) = 0.8, and one with ā gets weight P(c | ā) = 0.05.
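A minimal likelihood-weighting sketch in plain Python (not from the original slides) for the burglary network above, estimating P(b | c) with the CPTs from the slide. Newscast is omitted since it is neither observed nor queried here.

```python
import random

P_B, P_E = 0.03, 0.001
P_A = {(True, True): 0.98, (True, False): 0.7,
       (False, True): 0.4, (False, False): 0.01}   # P(a | b, e)
P_C = {True: 0.8, False: 0.05}                     # P(c | a)

def weighted_sample():
    """Sample non-evidence variables; weight by P(evidence | parents)."""
    b = random.random() < P_B
    e = random.random() < P_E
    a = random.random() < P_A[(b, e)]
    weight = P_C[a]            # evidence: Call = true, not sampled
    return b, weight

def estimate_p_b_given_c(n=100_000):
    num = den = 0.0
    for _ in range(n):
        b, w = weighted_sample()
        den += w
        if b:
            num += w
    return num / den

print(estimate_p_b_given_c())  # ≈ 0.235 (exact value with these CPTs)
```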
Other approaches
■ Search-based techniques
  ◆ search for high-probability instantiations
  ◆ use the instantiations to approximate probabilities
■ Structural approximation
  ◆ simplify the network:
    ■ eliminate edges or nodes
    ■ abstract node values
    ■ simplify CPTs
  ◆ do inference in the simplified network

CPCS Network
[Figure: the CPCS network again, as a case where such approximations are needed.]

Course Contents
■ Concepts in Probability
■ Bayesian Networks
■ Inference
» Decision making
■ Learning networks from data
■ Reasoning over time
■ Applications

Decision making
■ Decisions, preferences, and utility functions
■ Influence diagrams
■ Value of information

Decision making
■ Decision: an irrevocable allocation of domain resources.
■ Decisions should be made so as to maximize expected utility.
■ View decision making in terms of:
  ◆ Beliefs/Uncertainties
  ◆ Alternatives/Decisions
  ◆ Objectives/Utilities

A Decision Problem
Should I have my party inside or outside?
  in,  dry → Regret
  in,  wet → Relieved
  out, dry → Perfect!
  out, wet → Disaster

Value Function
A numerical score over all possible states of the world.
Location?   Weather?   Value
in          dry        $50
in          wet        $60
out         dry        $100
out         wet        $0

Preference for Lotteries
Which do you prefer:
(0.25 → $30,000; 0.75 → $0)   or   (0.2 → $40,000; 0.8 → $0)?

Desired Properties for Preferences over Lotteries
If you prefer $100 to $0 and p < q, then
(p → $100; 1−p → $0)  ≺  (q → $100; 1−q → $0)

Expected Utility
Properties of preference ⇒ existence of a function U that satisfies:
(p1 → x1; ...; pn → xn)  ≺  (q1 → y1; ...; qn → yn)
iff (always)  Σi pi U(xi) < Σi qi U(yi)

Some properties of U
U ≠ monetary payoff. For example, if $30,000 for certain is preferred to the lottery
(0.8 → $40,000; 0.2 → $0), then U($30k) > 0.8 · U($40k).

Attitudes towards risk
Lottery l: (0.5 → $1,000; 0.5 → $0).
The certain equivalent is the sure amount (here $400) whose utility equals U(l);
the gap between it and the expected value ($500) is the insurance/risk premium.
◆ U concave: risk averse
◆ U convex: risk seeking
◆ U linear: risk neutral

Are people rational?
People typically prefer (0.2 → $40,000; 0.8 → $0) over (0.25 → $30,000; 0.75 → $0),
which implies 0.2 · U($40k) > 0.25 · U($30k), i.e. 0.8 · U($40k) > U($30k).
Yet they also prefer $30,000 for certain over (0.8 → $40,000; 0.2 → $0),
which implies 0.8 · U($40k) < U($30k) — a contradiction.

Maximizing Expected Utility
P(dry) = 0.7, P(wet) = 0.3
U(Regret) = .632, U(Relieved) = .699, U(Perfect) = .865, U(Disaster) = 0
Choose the action that maximizes expected utility:
EU(in)  = 0.7 · .632 + 0.3 · .699 = .652
EU(out) = 0.7 · .865 + 0.3 · 0    = .605
Choose in.
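A minimal sketch in plain Python (not from the original slides) of "choose the action that maximizes expected utility" for the party problem, using the probabilities and utilities above.

```python
P_weather = {"dry": 0.7, "wet": 0.3}

utility = {("in", "dry"): 0.632,    # Regret
           ("in", "wet"): 0.699,    # Relieved
           ("out", "dry"): 0.865,   # Perfect!
           ("out", "wet"): 0.0}     # Disaster

def expected_utility(action):
    """EU(action) = sum over weather states of P(state) * U(action, state)."""
    return sum(P_weather[w] * utility[(action, w)] for w in P_weather)

best = max(("in", "out"), key=expected_utility)
print(expected_utility("in"), expected_utility("out"), best)
# 0.652, 0.6055 -> choose "in"
```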
Multi-attribute utilities (or: Money isn't everything)
■ Many aspects of an outcome combine to determine our preferences.
  ◆ vacation planning: cost, flying time, beach quality, food quality, ...
  ◆ medical decision making: risk of death (micromort), quality of life (QALY), cost of treatment, ...
■ For rational decision making, all relevant factors must be combined into a single utility function.

Influence Diagrams
Burglary, Earthquake → Alarm → Call;  Earthquake → Newscast;
decision node: Go Home?;  Miss Meeting, Goods Recovered, Big Sale → Utility

Decision Making with Influence Diagrams
Policy:  Phone Call → Go Home? Yes;  No Phone Call → Go Home? No
The expected utility of this policy is 100.

Value-of-Information
■ What is it worth to get another piece of information?
■ What is the increase in (maximized) expected utility if I make a decision with an additional piece of information?
■ Additional information (if free) cannot make you worse off.
■ There is no value-of-information if you will not change your decision.

Value-of-Information in an Influence Diagram
How much better can we do when the arc Newscast → Go Home? is added?

Value-of-Information is the increase in Expected Utility
Policy with the Newscast observed:
Phone call?   Newscast?   Go Home?
Yes           Quake       No
Yes           No Quake    Yes
No            Quake       No
No            No Quake    No
The expected utility of this policy is 112.5, so the value of information is 12.5.

Course Contents
■ Concepts in Probability
■ Bayesian Networks
■ Inference
■ Decision making
» Learning networks from data
■ Reasoning over time
■ Applications

Learning networks from data
■ The learning task
■ Parameter learning
  ◆ Fully observable
  ◆ Partially observable
■ Structure learning
■ Hidden variables

The learning task
■ Input: training data (cases over B, E, A, C, N)
■ Output: a BN modeling the data
◆ Input: fully or partially observable data cases?
◆ Output: parameters, or also structure?

Parameter learning: one variable
■ Unfamiliar coin: let θ = bias of the coin (long-run fraction of heads).
■ If θ is known (given), then P(X = heads | θ) = θ.
■ Different coin tosses are independent given θ:
  P(X1, ..., Xn | θ) = θ^h (1 − θ)^t     (h heads, t tails)

Maximum likelihood
■ Input: a set of coin tosses X1, ..., Xn = {H, T, H, H, H, T, T, H, ..., H} with h heads, t tails
■ Goal: estimate θ
■ The likelihood is P(X1, ..., Xn | θ) = θ^h (1 − θ)^t
■ The maximum likelihood solution is θ* = h / (h + t)

Bayesian approach
Uncertainty about θ ⇒ a distribution P(θ) over its values:
P(X = heads) = ∫ P(X = heads | θ) P(θ) dθ = ∫ θ P(θ) dθ

Conditioning on data
P(θ | D) ∝ P(θ) P(D | θ) = P(θ) θ^h (1 − θ)^t
[Figure: a prior P(θ) updated after 1 head, then 1 tail.]

A good parameter distribution:
Beta(αh, αt) ∝ θ^(αh − 1) (1 − θ)^(αt − 1)
* The Dirichlet distribution generalizes the Beta to non-binary variables.
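A minimal sketch in plain Python (not from the original slides) contrasting the maximum likelihood estimate with the Beta-prior Bayesian update for the coin example above.

```python
h, t = 7, 3          # observed heads and tails (assumed counts)

# Maximum likelihood: theta* = h / (h + t)
theta_ml = h / (h + t)

# Bayesian: with a Beta(ah, at) prior, conditioning on the data gives a
# Beta(ah + h, at + t) posterior, and P(next toss = heads) is its mean.
ah, at = 1, 1        # uniform prior Beta(1, 1)
p_heads = (ah + h) / (ah + at + h + t)

print(theta_ml)      # 0.7
print(p_heads)       # 8/12 ≈ 0.667 -- the prior pulls the estimate inward
```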
General parameter learning
■ A multi-variable BN is composed of several independent parameters ("coins").
  ◆ e.g. A → B has three parameters: θA, θB|a, θB|ā
■ We can use the same techniques as in the one-variable case to learn each one separately.
  The max likelihood estimate of θB|a is:
  θ*B|a = (# data cases with b, a) / (# data cases with a)

Partially observable data
■ Data cases may have missing entries (e.g.  b ? a c ?;  b ? ā ? n; ...)
■ Fill in missing data with "expected" values:
  ◆ expected = a distribution over the possible values
  ◆ use a "best guess" BN to estimate that distribution

Intuition
■ In the fully observable case:
  θ*n|e = (# data cases with n, e) / (# data cases with e) = Σj I(n, e | dj) / Σj I(e | dj)
  where I(e | dj) = 1 if E = e in data case dj, and 0 otherwise.
■ In the partially observable case I is unknown; the best estimate for it is
  Î(n, e | dj) = Pθ*(n, e | dj).
■ Problem: θ* is unknown.

Expectation Maximization (EM)
Repeat:
■ Expectation (E) step: use the current parameters θ to estimate the filled-in data:
  Î(n, e | dj) = Pθ(n, e | dj)
■ Maximization (M) step: use the filled-in data to do max likelihood estimation:
  θ̃n|e = Σj Î(n, e | dj) / Σj Î(e | dj)
Set θ := θ̃, until convergence.

Structure learning
Goal: find a "good" BN structure (relative to the data).
Solution: do heuristic search over the space of network structures.

Search space
Space = network structures
Operators = add/reverse/delete edges

Heuristic search
Use a scoring function to do heuristic search (any algorithm).
Greedy hill-climbing with randomness works pretty well.

Scoring
Fill in parameters using the previous techniques and score the completed networks.
One possibility for the score is the likelihood function: Score(B) = P(data | B).
Example: X, Y are independent coin tosses; typical data = (27 h-h, 22 h-t, 25 t-h, 26 t-t).
The maximum likelihood network structure is typically fully connected.
This is not surprising: maximum likelihood always overfits...

Better scoring functions
■ MDL formulation: balance fit to data against model complexity (# of parameters)
  Score(B) = P(data | B) − model complexity
■ Full Bayesian formulation
  ◆ prior on network structures & parameters
  ◆ more parameters ⇒ higher-dimensional space
  ◆ the balancing effect comes out as a byproduct*
* With a Dirichlet parameter prior, MDL is an approximation to the full Bayesian score.

Hidden variables
■ There may be interesting variables that we never get to observe:
  ◆ the topic of a document in information retrieval;
  ◆ the user's current task in an online help system.
■ Our learning algorithm should:
  ◆ hypothesize the existence of such variables;
  ◆ learn an appropriate state space for them.
[Figure: E1/E2/E3 scatter plots — randomly scattered data vs. actual (clustered) data.]

Bayesian clustering (Autoclass)
Naïve Bayes model: Class → E1, E2, ..., En
■ The (hypothetical) class variable is never observed.
■ If we know that there are k classes, just run EM.
■ Learned classes = clusters.
■ Bayesian analysis allows us to choose k, trading off fit to data against model complexity.
[Figure: the resulting cluster distributions.]

Detecting hidden variables
Unexpected correlations suggest hidden variables.
Hypothesized model: Cholesterolemia → Test1, Test2, Test3
Data: the tests remain correlated given Cholesterolemia.
"Correct" model: Cholesterolemia and Hypothyroid → Test1, Test2, Test3

Course Contents
■ Concepts in Probability
■ Bayesian Networks
■ Inference
■ Decision making
■ Learning networks from data
» Reasoning over time
■ Applications

Reasoning over time
■ Dynamic Bayesian networks
■ Hidden Markov models
■ Decision-theoretic planning
  ◆ Markov decision problems
  ◆ Structured representation of actions
  ◆ The qualification problem & the frame problem
  ◆ Causality (and the frame problem revisited)

Dynamic environments
State(t) → State(t+1) → State(t+2)
■ Markov property:
  ◆ the past is independent of the future given the current state;
  ◆ a conditional independence assumption;
  ◆ implied by the fact that there are no arcs t → t+2.

Dynamic Bayesian networks
■ The state is described via random variables.
■ Each variable depends on only a few others, e.g.:
  Drunk(t) → Drunk(t+1);  Drunk(t), Velocity(t) → Velocity(t+1);
  Velocity(t), Position(t) → Position(t+1);  Weather(t) → Weather(t+1)

Hidden Markov model
An HMM is a simple model for a partially observable stochastic domain.
State(t) → State(t+1): state transition model
State(t) → Obs(t): observation model

Hidden Markov models (HMMs)
Partially observable stochastic environments:
■ Mobile robots: states = locations; observations = sensor input
■ Speech recognition: states = phonemes; observations = acoustic signal
■ Biological sequencing: states = protein structure; observations = amino acids

HMMs and DBNs
■ HMMs are just very simple DBNs.
■ Standard inference & learning algorithms for HMMs are instances of DBN algorithms:
  ◆ Forward-backward = polytree inference
  ◆ Baum-Welch = EM
  ◆ Viterbi = most probable explanation
(A small filtering sketch follows below.)
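A minimal sketch in plain Python (not from the original slides) of one step of HMM filtering, the building block of the forward algorithm mentioned above. The two-state transition and observation numbers are assumed for illustration.

```python
states = ("rain", "sun")
T = {"rain": {"rain": 0.7, "sun": 0.3},       # P(state' | state), assumed
     "sun":  {"rain": 0.2, "sun": 0.8}}
O = {"rain": {"umbrella": 0.9, "none": 0.1},  # P(obs | state), assumed
     "sun":  {"umbrella": 0.2, "none": 0.8}}

def forward_step(belief, obs):
    """One predict-then-update step: P(state_t | obs_1..t)."""
    # Predict: push the belief through the transition model.
    predicted = {s2: sum(belief[s1] * T[s1][s2] for s1 in states)
                 for s2 in states}
    # Update: weight by the observation likelihood and renormalize.
    updated = {s: O[s][obs] * predicted[s] for s in states}
    z = sum(updated.values())
    return {s: p / z for s, p in updated.items()}

belief = {"rain": 0.5, "sun": 0.5}
for obs in ("umbrella", "umbrella", "none"):
    belief = forward_step(belief, obs)
print(belief)
```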
Acting under uncertainty
Markov Decision Problem (MDP):
■ the agent observes the state
■ action model: Action(t), State(t) → State(t+1);  State(t) → Reward(t)
■ Overall utility = sum of momentary rewards.
■ This allows a rich preference model, e.g. rewards corresponding to "get to the goal asap":
  +100 for goal states, −1 for other states.

Partially observable MDPs
■ The agent observes Obs, not the state; Obs depends on the state.
■ The optimal action at time t depends on the entire history of previous observations.
■ Instead, a distribution over State(t) suffices.

Structured representation
Actions as preconditions and effects over state variables, e.g.:
Move: Position(t), Direction(t), Holding(t) → Position(t+1)
Turn: Direction(t) → Direction(t+1)
A probabilistic action model:
• allows for exceptions & qualifications;
• persistence arcs: a solution to the frame problem.

Causality
■ Modeling the effects of interventions
■ Observing vs. "setting" a variable
■ A form of persistence modeling

Causal Theory
Temperature → Distributor Cap → Car Starts
Cold temperatures can cause the distributor cap to become cracked.
If the distributor cap is cracked, then the car is less likely to start.

Setting vs. Observing
The car does not start. Will it start if we replace the distributor?

Predicting the effects of interventions
The car does not start. What is the probability that the car will start if I replace the distributor cap?

Mechanism Nodes
Distributor, Mstart → Start

Mstart          Distributor   Starts?
Always Starts   Cracked       Yes
Always Starts   Normal        Yes
Never Starts    Cracked       No
Never Starts    Normal        No
Normal          Cracked       No
Normal          Normal        Yes
Inverse         Cracked       Yes
Inverse         Normal        No

Persistence
Pre-action: Temperature → Dist (observed Abnormal);  Mstart, Dist → Start.
Post-action: Dist is set to Normal; a persistence arc carries Mstart over unchanged.
Assumption: the mechanism relating Dist to Start is unchanged by replacing the distributor.
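A minimal sketch in plain Python (not from the original slides) of the mechanism-node idea: infer Mstart from the observation "the car does not start", then predict whether it starts after *setting* Dist to normal, with Mstart persisting. The mechanism table is the one on the slide; the priors over Dist and Mstart are assumed for illustration.

```python
from itertools import product

P_dist = {"cracked": 0.3, "normal": 0.7}     # assumed prior over Dist
P_mech = {"always": 0.01, "never": 0.05,     # assumed prior over Mstart
          "normal": 0.90, "inverse": 0.04}

def starts(mech, dist):
    """Deterministic Starts? table from the slide."""
    return {("always", "cracked"): True,  ("always", "normal"): True,
            ("never",  "cracked"): False, ("never",  "normal"): False,
            ("normal", "cracked"): False, ("normal", "normal"): True,
            ("inverse", "cracked"): True, ("inverse", "normal"): False,
            }[(mech, dist)]

# Observe: the car did not start. Condition the joint over (Mstart, Dist).
posterior = {(m, d): P_mech[m] * P_dist[d]
             for m, d in product(P_mech, P_dist) if not starts(m, d)}
z = sum(posterior.values())
posterior = {k: v / z for k, v in posterior.items()}

# Intervene: set Dist = "normal"; the mechanism Mstart persists unchanged.
p_start_after = sum(p for (m, d), p in posterior.items()
                    if starts(m, "normal"))
print(p_start_after)   # ≈ 0.78 with these assumed priors
```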
Course Contents
■ Concepts in Probability
■ Bayesian Networks
■ Inference
■ Decision making
■ Learning networks from data
■ Reasoning over time
» Applications

Applications
■ Medical expert systems
  ◆ Pathfinder
  ◆ Parenting MSN
■ Fault diagnosis
  ◆ Ricoh FIXIT
  ◆ Decision-theoretic troubleshooting
■ Vista
■ Collaborative filtering

Why use Bayesian Networks?
■ Explicit management of uncertainty/tradeoffs
■ Modularity implies maintainability
■ Better, flexible, and robust recommendation strategies

Pathfinder
■ Pathfinder is one of the first BN systems.
■ It performs diagnosis of lymph-node diseases.
■ It deals with over 60 diseases and 100 findings.
■ Commercialized by Intellipath and Chapman Hall publishing, and applied to about 20 tissue types.

Studies of Pathfinder Diagnostic Performance
■ Naïve Bayes performed considerably better than certainty factors and Dempster-Shafer belief functions.
■ Incorrect zero probabilities caused 10% of cases to be misdiagnosed.
■ The full Bayesian network model with feature dependencies did best.

Commercial system: Integration
■ Expert system with advanced diagnostic capabilities:
  ◆ uses key features to form the differential diagnosis
  ◆ recommends additional features to narrow the differential diagnosis
  ◆ recommends features needed to confirm the diagnosis
  ◆ explains correct and incorrect decisions
■ Video atlases and text organized by organ system
■ "Carousel Mode" to build customized lectures
■ Anatomic Pathology Information System

On Parenting: MSN
■ Diagnostic indexing for the Home Health site on the Microsoft Network
■ Enter symptoms for pediatric complaints
■ Recommends multimedia content
[Screenshots: selecting a problem; the original multiple-fault model vs. the single-fault approximation; performing diagnosis/indexing.]

FIXIT: Ricoh copy machine
RICOH FIXIT: diagnostics and information retrieval.
[Screenshots: online troubleshooters — define the problem, gather information, get recommendations.]

Vista Project: NASA Mission Control
Decision-theoretic methods for displaying information for high-stakes aerospace decisions.

Costs & Benefits of Viewing Information
[Figure: decision quality vs. quantity of relevant information; status quo at Mission Control.]

Time-Critical Decision Making
Consideration of time delay in a temporal process:
Action (A, t); State of System (H, t0) → (H, t'); evidence E1, ..., En at t0 and t'; Utility; duration of the process.

Simplification: Highlighting Decisions
■ A variable threshold controls the amount of highlighted information.
[Screenshots: telemetry panels (Oxygen, Fuel Pres, Chamb Pres, He Pres, Delta v) at three threshold settings.]

What is Collaborative Filtering?
■ A way to find cool websites, news stories, music artists, etc.
■ Uses data on the preferences of many users, not descriptions of the content.
■ Firefly, Net Perceptions (GroupLens), and others offer this technology.

Bayesian Clustering for Collaborative Filtering
■ Probabilistic summary of the data
■ Reduces the number of parameters needed to represent a set of preferences
■ Provides insight into usage patterns
■ Inference: P(Like title i | Like title j, Like title k)

Applying Bayesian clustering
User classes over titles title1, ..., title n:
          title1          title2          title3      ...
class1    p(like) = 0.2   p(like) = 0.7   p(like) = 0.99
class2    p(like) = 0.8   p(like) = 0.1   p(like) = 0.01
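A minimal sketch in plain Python (not from the original slides) of the collaborative-filtering inference above: P(Like title1 | Like title2, Like title3) under the two-class model, using the p(like) numbers from the table and an assumed uniform class prior.

```python
P_class = {"class1": 0.5, "class2": 0.5}   # assumed uniform prior
P_like = {"class1": {"title1": 0.2, "title2": 0.7, "title3": 0.99},
          "class2": {"title1": 0.8, "title2": 0.1, "title3": 0.01}}

# Posterior over classes given the user likes title2 and title3.
post = {c: P_class[c] * P_like[c]["title2"] * P_like[c]["title3"]
        for c in P_class}
z = sum(post.values())
post = {c: p / z for c, p in post.items()}

# Predict: P(like title1 | evidence) = sum over c of P(c | evidence) P(like | c)
p_title1 = sum(post[c] * P_like[c]["title1"] for c in post)
print(p_title1)   # ≈ 0.20: the evidence points almost entirely to class1
```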
MSNBC Story clusters
Readers of commerce and technology stories (36%):
■ E-mail delivery isn't exactly guaranteed
■ Should you buy a DVD player?
■ Price low, demand high for Nintendo
Sports readers (19%):
■ Umps refusing to work is the right thing
■ Cowboys are reborn in win over eagles
■ Did Orioles spend money wisely?
Readers of top promoted stories (29%):
■ 757 Crashes At Sea
■ Israel, Palestinians Agree To Direct Talks
■ Fuhrman Pleads Innocent To Perjury
Readers of "softer" news (12%):
■ The truth about what things cost
■ Fuhrman Pleads Innocent To Perjury
■ Real Astrology

Top 5 shows by user class
Class 1: Power Rangers, Animaniacs, X-Men, Tazmania, Spider-Man
Class 2: Young and the Restless, Bold and the Beautiful, As the World Turns, Price is Right, CBS Evening News
Class 3: Tonight Show, Conan O'Brien, NBC Nightly News, Later with Kinnear, Seinfeld
Class 4: 60 Minutes, NBC Nightly News, CBS Evening News, Murder She Wrote, Matlock
Class 5: Seinfeld, Friends, Mad About You, ER, Frasier

Richer model
Age, Gender → User class → Likes soaps, Watches Seinfeld, Watches NYPD Blue, Watches Power Rangers

What's old?
Decision theory & probability theory provide:
■ principled models of belief and preference;
■ techniques for:
  ◆ integrating evidence (conditioning);
  ◆ optimal decision making (maximizing expected utility);
  ◆ targeted information gathering (value of information);
  ◆ parameter estimation from data.

What's new?
Bayesian networks exploit domain structure to allow compact representations of complex models.
The structured representation supports knowledge acquisition, learning, and inference.

Some Important AI Contributions
■ Key technology for diagnosis.
■ Better, more coherent expert systems.
■ New approach to planning & action modeling:
  ◆ planning using Markov decision problems;
  ◆ a new framework for reinforcement learning;
  ◆ a probabilistic solution to the frame & qualification problems.
■ New techniques for learning models from data.

What's in our future?
■ Better models for:
  ◆ preferences & utilities;
  ◆ not-so-precise numerical probabilities.
■ Inferring causality from data.
■ More expressive representation languages:
  ◆ structured domains with multiple objects;
  ◆ levels of abstraction;
  ◆ reasoning about time;
  ◆ hybrid (continuous/discrete) models.