Graphical Models - Learning - Parameter Estimation
Advanced I, WS 06/07
Wolfram Burgard, Luc De Raedt, Kristian Kersting, Bernhard Nebel
Albert-Ludwigs University Freiburg, Germany

Based on: J. A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models", TR-97-021, U.C. Berkeley, April 1998; G. J. McLachlan and T. Krishnan, "The EM Algorithm and Extensions", John Wiley & Sons, Inc., 1997; D. Koller, course CS-228 handouts, Stanford University, 2001; N. Friedman and D. Koller's NIPS'99 tutorial.

[Title-slide background: the ALARM patient-monitoring network (nodes MINVOLSET, KINKEDTUBE, PULMEMBOLUS, INTUBATION, VENTLUNG, CATECHOL, HRBP, BP, ...)]

Outline
• Introduction
• Reminder: probability theory
• Basics of Bayesian networks
• Modeling Bayesian networks
• Inference (VE, junction tree)
• [Excursus: Markov networks]
• Learning Bayesian networks
• Relational models

What is Learning?
"The problem of understanding intelligence is said to be the greatest problem in science today and 'the' problem for this century – as deciphering the genetic code was for the second half of the last one ... the problem of learning represents a gateway to understanding intelligence in man and machines."
-- Tomaso Poggio and Steve Smale, 2003
• Agents are said to learn if they improve their performance over time based on experience.

Why bother with learning?
• Knowledge acquisition is the bottleneck
  – Expensive, difficult
  – Usually no expert is around
• Data is cheap!
  – Huge amounts of data are available, e.g.
    • clinical tests
    • web mining, e.g. log files
    • ...

Why learn Bayesian networks?
• Conditional independencies and the graphical language capture the structure of many real-world distributions
• The graph structure provides much insight into the domain
  – Allows "knowledge discovery"
• Supports all the features of probabilistic learning
  – Model selection criteria
  – Dealing with missing data and hidden variables
• The learned model can be used for many tasks

Learning with Bayesian networks
• Input: data + prior information
• A learning algorithm outputs the network, e.g. for the burglary/earthquake/alarm network B → A ← E the CPT

    E  B   P(a | E,B)   P(¬a | E,B)
    e  b      .9            .1
    e ¬b      .7            .3
   ¬e  b      .8            .2
   ¬e ¬b      .99           .01

What does the data look like?
• A complete data set consists of M data cases X1, ..., XM, each assigning a value to every attribute/variable:

    case  A1     A2     A3     A4     A5     A6
    X1    true   true   false  true   false  false
    X2    false  true   true   true   false  false
    ...   ...    ...    ...    ...    ...    ...
    XM    true   false  false  false  true   true

• Real-world data is usually an incomplete data set: the states of some random variables are missing ("?" marks a missing value):

    A1     A2     A3     A4     A5     A6
    true   true   ?      true   false  false
    ?      true   ?      ?      false  false
    ...    ...    ...    ...    ...    ...
    true   false  ?      false  true   ?

  – E.g. in medical diagnosis, not every patient is subjected to every test.
  – Variables may also be hidden by design, e.g. for parameter reduction or clustering (examples on the next slides).
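To make the two data regimes concrete, here is a minimal sketch (a hypothetical representation, not from the slides) of data cases over six boolean attributes, with None playing the role of "?":

```python
# Minimal sketch: data cases over A1..A6; None marks a missing value ("?").
data = [
    # A1     A2     A3     A4     A5     A6
    (True,  True,  None,  True,  False, False),
    (None,  True,  None,  None,  False, False),
    (True,  False, None,  False, True,  None),
]

def complete_cases(cases):
    """Keep only the cases in which every variable is observed."""
    return [c for c in cases if None not in c]

def missing_per_variable(cases):
    """Number of missing entries per variable (column)."""
    return [sum(c[i] is None for c in cases) for i in range(len(cases[0]))]

print(len(complete_cases(data)))   # 0: no case is fully observed here
print(missing_per_variable(data))  # [1, 0, 3, 1, 0, 1]
```

The estimators discussed below only ever touch such tables: with complete cases they count, with incomplete cases they first have to reason about the missing entries.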
Hidden variable – Examples
1. Parameter reduction:
   • Binary variables X1, X2, X3 and Y1, Y2, Y3. Without a hidden variable the CPTs need 2 + 2 + 2 + 16 + 32 + 64 = 118 entries.
   • Introducing a hidden variable H between the X's and the Y's (X1, X2, X3 → H → Y1, Y2, Y3) reduces this to 2 + 2 + 2 + 16 + 4 + 4 + 4 = 34 entries.
2. Clustering:
   • A hidden variable Cluster holds the assignment to a cluster; X1, ..., Xn are the observed attributes.
   • Autoclass model:
     – the hidden variable assigns class labels
     – the observed attributes are independent given the class
   • So hidden variables also appear in clustering.

[Slides due to Eamonn Keogh: k-means demo on 2-D points. The cluster means are randomly assigned; the plots show their positions after iterations 1, 2, 5 and 25, by which point they have moved to the centers of the clusters.]

What is a natural grouping among these objects?
• Clustering is subjective: the same set of people can be grouped as Simpson's family vs. school employees, or as females vs. males.
• A minimal k-means sketch of the iterative mean updates shown in the demo follows below.
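The demo above is essentially the k-means procedure. The following is a minimal sketch on synthetic 2-D data (three clusters assumed); it is not the original demo code:

```python
import numpy as np

# k-means sketch: assign points to the nearest mean, move each mean to the
# center of its points, repeat until the means stop moving.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                    for c in [(0, 0), (4, 0), (2, 3)]])   # synthetic clusters

k = 3
means = points[rng.choice(len(points), size=k, replace=False)]  # random initial means

for iteration in range(25):
    # assignment step: label of the nearest cluster mean for every point
    dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # update step: move each mean to the center of its assigned points
    new_means = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                          else means[j] for j in range(k)])
    if np.allclose(new_means, means):
        break
    means = new_means

print(means)   # approximate cluster centers after convergence
```

The EM algorithm discussed later generalizes this picture: instead of hard assignments it uses expected (soft) assignments computed from the current parameters.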
Learning with Bayesian networks: the spectrum of problems
• Fixed structure (A → B), only parameters unknown
  – fully observed: easiest problem, just counting
  – partially observed: numerical, non-linear optimization; multiple calls to BN inference; difficult for large networks
• Fixed variables, arcs unknown (A ?→ B)
  – fully observed: selection of arcs; new domain with no domain expert; data mining
  – partially observed: encompasses the difficult subproblems above; "only" structural EM is known
• Hidden variables (A ? B ? H): scientific discovery, the hardest setting

Parameter estimation
• Let D = {x_1, ..., x_M} be a set of data cases over the m random variables X_1, ..., X_m.
• iid assumption: all data cases are independently sampled from identical distributions.
• Find: the parameters θ of the CPDs which match the data best.

What does "best matching" mean?
1. MAP parameters: θ* = argmax_θ P(θ | D) = argmax_θ P(D | θ) · P(θ) / P(D)
2. P(D) is the same for all parameters, and
3. if all parameters are a priori equally likely (uniform prior P(θ)),
then the MAP parameters coincide with the ML parameters: find the parameters which most likely have produced the data,
    θ* = argmax_θ L(θ | D),
where L(θ | D) = P(D | θ) is the likelihood of the parameters given the data, and LL(θ | D) = log P(D | θ) is the log-likelihood of the parameters given the data.

Maximum likelihood
• One of the most commonly used estimators in statistics, and intuitively appealing
• Consistent: the estimate converges to the best possible value as the number of examples grows
• Asymptotically efficient: the estimate is as close to the true value as possible given a particular training set

Known structure, complete data
• Data cases over E, B, A such as <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>; the structure E → A ← B is given, and the CPT entries P(A | E,B) are the unknowns the learning algorithm has to fill in.
• Network structure is specified; the learner only needs to estimate parameters.
• Data does not contain missing values.

ML parameter estimation from complete data
• (iid) L(θ | D) = ∏_l P(x_l | θ)
• (BN semantics) each case factorizes over the families: P(x_l | θ) = ∏_j P(a_j[l] | pa_j[l], θ_j), where a_j[l] and pa_j[l] are the states of A_j and of its parents in case l
• Exchanging the two products: L(θ | D) = ∏_j [ ∏_l P(a_j[l] | pa_j[l], θ_j) ]
• Only the local parameters θ_j of the family of A_j are involved in the j-th factor, so each factor can be maximized individually.
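With complete data this decomposition turns ML estimation into per-family counting. A minimal sketch (assumed structure E → A ← B, boolean variables, invented data cases) could look like this:

```python
from collections import Counter

# ML parameter estimation by counting, exploiting decomposability:
# each family is estimated on its own from complete data.
data = [  # complete data cases (E, B, A)
    (True, False, False), (True, False, True), (False, False, True),
    (False, True, True), (False, True, True), (True, True, True),
]

def estimate_cpt(cases, child_idx, parent_idx):
    """ML estimate of P(child | parents): theta = N(child, parents) / N(parents)."""
    joint, parents = Counter(), Counter()
    for c in cases:
        pa = tuple(c[i] for i in parent_idx)
        joint[(c[child_idx], pa)] += 1
        parents[pa] += 1
    return {(v, pa): joint[(v, pa)] / parents[pa]
            for pa in parents for v in (True, False)}

p_a = estimate_cpt(data, child_idx=2, parent_idx=(0, 1))   # P(A | E, B)
p_e = estimate_cpt(data, child_idx=0, parent_idx=())       # P(E)
for (v, pa), theta in sorted(p_a.items()):
    print(f"P(A={v} | E,B={pa}) = {theta:.2f}")
```

Each call to estimate_cpt touches only one family, exactly as the factorized likelihood suggests; the counts N(child, parents) and N(parents) are the sufficient statistics.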
Decomposability of the likelihood
• If the data set is complete (no question marks), we can maximize each local likelihood function independently and then combine the solutions to get an MLE solution.
• This decomposition of the global problem into independent, local sub-problems allows efficient solutions to the MLE problem.

Likelihood for multinomials
• Random variable V with values 1, ..., K and parameters θ_k = P(V = k):
    L(θ | X) = ∏_k θ_k^{N_k},
  where N_k is the count of state k in the data X.
• The constraint Σ_k θ_k = 1 implies that the choice of θ_i influences the choice of θ_j (i ≠ j).

Likelihood for binomials (2 states only)
• Compute the partial derivative of the log-likelihood, set it to zero, and use θ_1 + θ_2 = 1: the MLE is θ_k = N_k / (N_1 + N_2).
• In general, for multinomials (> 2 states), the MLE is θ_k = N_k / Σ_j N_j (the general derivation is spelled out below).

Likelihood for conditional multinomials
• One multinomial for each joint state pa of the parents of V, with parameters θ_{k|pa} = P(V = k | Pa(V) = pa).
• MLE: θ_{k|pa} = N(V = k, pa) / Σ_j N(V = j, pa).
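The slides only sketch the derivative argument for two states. As a worked step (a standard derivation, not shown on the slides), the general multinomial MLE follows from a Lagrange multiplier for the normalization constraint:

```latex
% Multinomial MLE: maximize the log-likelihood subject to the parameters
% summing to one (notation as above: theta_k = P(V = k), N_k = count of k).
\[
\begin{aligned}
\ell(\theta) &= \sum_{k=1}^{K} N_k \log \theta_k ,
  \qquad \text{maximize s.t. } \sum_{k} \theta_k = 1 \\
\Lambda(\theta, \lambda) &= \sum_{k} N_k \log \theta_k
  + \lambda \Bigl( 1 - \sum_{k} \theta_k \Bigr) \\
\frac{\partial \Lambda}{\partial \theta_k} &= \frac{N_k}{\theta_k} - \lambda = 0
  \;\;\Rightarrow\;\; \theta_k = \frac{N_k}{\lambda} \\
\sum_{k} \theta_k = 1 &\;\;\Rightarrow\;\; \lambda = \sum_{j} N_j
  \;\;\Rightarrow\;\; \hat{\theta}_k = \frac{N_k}{\sum_{j} N_j}
\end{aligned}
\]
```

For conditional multinomials the same argument applies per parent configuration pa, giving the MLE stated above.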
Known structure, incomplete data
• Data cases over E, B, A such as <Y,?,N>, <Y,N,?>, <N,N,Y>, <N,Y,Y>, ..., <?,Y,Y>; the structure E → A ← B is given and the learning algorithm still has to fill in P(A | E,B).
• Network structure is specified, but the data contains missing values; we need to consider assignments to the missing values.

EM idea
• In the case of complete data, ML parameter estimation is easy: simply counting (1 iteration).
• With incomplete data:
  1. Complete the data (imputation: most probable value? average? ...)
  2. Count
  3. Iterate

EM idea: complete the data (network A → B)
• Incomplete data over A, B: (true, true), (true, ?), (false, true), (true, false), (false, ?).
• Completing the data: with uniform initial parameters, each case with a missing B contributes 0.5 to each of its two completions, e.g. (true, ?=true) gets weight 0.5 and (true, ?=false) gets weight 0.5. Together with the fully observed cases (weight 1.0 each) this gives the expected counts

    A      B      EN
    true   true   1.5
    true   false  1.5
    false  true   1.5
    false  false  0.5

• Maximize: set the parameters to the ML estimates on these expected counts; then iterate (re-complete the data with the new parameters, re-count, maximize again). Over the iterations the expected counts for A = false become 1.5 / 0.5, then 1.75 / 0.25, then 1.875 / 0.125, ...

Complete-data likelihood
• The incomplete-data likelihood P(X | θ) is hard to maximize directly (it does not decompose).
• Assume the data were completed: for an assignment Z to the missing values there exists a complete-data likelihood P(X, Z | θ) with P(X | θ) = Σ_Z P(X, Z | θ), and the complete-data likelihood decomposes again.

EM algorithm (abstract)
• Expectation step: complete the data using the current parameters θ^k, i.e. compute expected counts.
• Maximization step: maximize the resulting completed-data likelihood to obtain θ^(k+1).

EM algorithm: principle
• Expectation Maximization (EM): construct a new function Q(θ, θ^k) based on the "current point" θ^k which "behaves well".
• Property: the maximum θ^(k+1) of the new function has a better score than the current point, i.e. P(Y | θ^(k+1)) ≥ P(Y | θ^k).
[Figure: the incomplete-data likelihood P(Y | θ), the current point θ^k, the auxiliary function Q(θ, θ^k), and its maximum θ^(k+1).]

EM for multinomials
• Random variable V with values 1, ..., K; EN_k is the expected count of state k in the data X, computed by inference with the current parameters.
• "MLE" step: θ_k = EN_k / Σ_j EN_j.
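A minimal sketch of these two steps for the A → B example above (A always observed, B sometimes missing); it reproduces the expected-count tables 1.5/1.5/1.5/0.5, then 1.75/0.25, then 1.875/0.125, and uses the conditional form of the "MLE" step shown on the next slide:

```python
# EM for the toy network A -> B with B partially observed.
data = [(True, True), (True, None), (False, True), (True, False), (False, None)]

# initial guess: P(B = true | A = a) = 0.5 for both values of a
p_b_given_a = {True: 0.5, False: 0.5}

for iteration in range(3):
    # E-step: expected counts EN(a, b); a missing B is split between its two
    # completions according to the current parameters
    counts = {a: {True: 0.0, False: 0.0} for a in (True, False)}
    for a, b in data:
        if b is None:
            counts[a][True] += p_b_given_a[a]
            counts[a][False] += 1.0 - p_b_given_a[a]
        else:
            counts[a][b] += 1.0
    # M-step: "MLE" on the completed (expected) counts
    p_b_given_a = {a: counts[a][True] / (counts[a][True] + counts[a][False])
                   for a in (True, False)}
    print(iteration + 1, counts, p_b_given_a)
```

In a real network the per-case posteriors are not read off a single CPT but computed by inference (e.g. with the junction tree algorithm), which is exactly how the procedure on the following slides uses them.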
EM for conditional multinomials
• One multinomial for each joint state pa of the parents of V; the expected counts EN(V = k, pa) are computed by inference with the current parameters.
• "MLE" step: θ_{k|pa} = EN(V = k, pa) / Σ_j EN(V = j, pa).

Learning parameters from incomplete data
• The likelihood is non-decomposable (missing values, hidden nodes).
• Data over S, X, D, C, B with missing values, e.g. <?, 0, 1, 0, 1>, <1, 1, ?, 0, 1>, <0, 0, 0, ?, ?>, <?, ?, 0, ?, 1>, ...
• Start from initial parameters (the current model).
• Expectation: complete the data by inference in the current model, e.g. P(S | X=0, D=1, C=0, B=1), which yields expected counts.
• Maximization: update the parameters (ML or MAP) from the expected counts.
• EM algorithm: iterate until convergence.

In short:
1. Initialize the parameters.
2. Compute pseudo counts for each variable (inference, e.g. with the junction tree algorithm).
3. Set the parameters to the (completed-data) ML estimates.
4. If not converged, go to 2.

Monotonicity
• (Dempster, Laird, Rubin '77): the incomplete-data likelihood function is not decreased after an EM iteration.
• (Discrete) Bayesian networks: for any initial, non-uniform value the EM algorithm converges to a (local or global) maximum.

[Figures: log-likelihood on the training set and a parameter value over EM iterations on the Alarm network; experiments by Bauer, Koller and Singer [UAI 97].]

EM in practice
• Initial parameters: random parameter setting, or a "best" guess from another source
• Stopping criteria: small change in the likelihood of the data, or small change in the parameter values
• Avoiding bad local maxima: multiple restarts; early "pruning" of unpromising ones
• Speed-up: various methods to speed convergence

Gradient ascent
• Main result: the gradient of the log-likelihood requires the same BN inference computations as EM (a reconstruction of the standard expression is given after the summary below).
• Pros: flexible; closely related to methods in neural network training
• Cons: the gradient needs to be projected onto the space of legal parameters; to get reasonable convergence it must be combined with "smart" optimization techniques

Parameter estimation: summary
• Parameter estimation is a basic task for learning with Bayesian networks.
• Missing values turn it into a non-linear optimization problem (EM, gradient ascent, ...).
• EM for multinomial random variables:
  – fully observed data: counting
  – partially observed data: pseudo (expected) counts
• The junction tree algorithm is used to answer the many inference queries needed for the expected counts.
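The "main result" on the gradient-ascent slide did not survive extraction. The standard expression for the gradient of the log-likelihood with respect to a table entry (our reconstruction; notation θ_ijk = P(X_i = k | Pa_i = j), data cases d_m) is:

```latex
% Gradient of the log-likelihood w.r.t. a CPT entry; each summand is a family
% posterior, i.e. the same quantity inference computes for the EM expected counts.
\[
\frac{\partial}{\partial \theta_{ijk}} \sum_{m} \log P(d_m \mid \theta)
  \;=\; \sum_{m} \frac{P(X_i = k,\; \mathrm{Pa}_i = j \mid d_m, \theta)}{\theta_{ijk}}
\]
```

This is why gradient ascent needs the same inference machinery as EM; after each step the parameters θ_ij· have to be renormalized (projected back onto legal CPTs), as noted in the cons above.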