Learning Bayesian Networks from Data
Nir Friedman (Hebrew University)    Daphne Koller (Stanford)

Overview
* Introduction
* Parameter Estimation
* Model Selection
* Structure Discovery
* Incomplete Data
* Learning from Structured Data

Bayesian Networks
* Compact representation of probability distributions via conditional independence
* Qualitative part: a directed acyclic graph (DAG)
  - Nodes are random variables
  - Edges denote direct influence
* Quantitative part: a set of conditional probability distributions
* Together they define a unique distribution in factored form
* Example ("family of Alarm" network over Earthquake, Burglary, Radio, Alarm, Call):
  P(B, E, A, C, R) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)
  CPT for P(A | E, B):
    E = e,  B = b:   P(a) = 0.9,  P(¬a) = 0.1
    E = e,  B = ¬b:  P(a) = 0.2,  P(¬a) = 0.8
    E = ¬e, B = b:   P(a) = 0.9,  P(¬a) = 0.1
    E = ¬e, B = ¬b:  P(a) = 0.01, P(¬a) = 0.99

Example: the "ICU Alarm" network
* Domain: monitoring intensive-care patients
* 37 variables, 509 parameters, instead of roughly 2^54 for the full joint distribution
* [Figure: the ICU Alarm network over variables such as MINVOLSET, PULMEMBOLUS, PAP, KINKEDTUBE, INTUBATION, SHUNT, VENTMACH, VENTLUNG, DISCONNECT, VENTTUBE, PRESS, MINVOL, ANAPHYLAXIS, SAO2, TPR, HYPOVOLEMIA, LVEDVOLUME, CVP, PCWP, LVFAILURE, STROKEVOLUME, FIO2, VENTALV, PVSAT, ARTCO2, EXPCO2, INSUFFANESTH, CATECHOL, HISTORY, ERRLOWOUTPUT, CO, HR, HREKG, ERRCAUTER, HRSAT, HRBP, BP]

Inference
* Posterior probability of any event given any evidence
* Most likely explanation: the scenario that best explains the evidence
* Rational decision making: maximize expected utility; value of information
* Effect of intervention

Why learning?
* Knowledge acquisition bottleneck: knowledge acquisition is an expensive process, and often we don't have an expert
* Data is cheap: the amount of available information is growing rapidly, and learning allows us to construct models from raw data

Why learn Bayesian networks?
* Conditional independencies and the graphical language capture the structure of many real-world distributions
* The graph structure provides much insight into the domain and allows "knowledge discovery"
* The learned model can be used for many tasks
* Supports all the features of probabilistic learning: model selection criteria, dealing with missing data and hidden variables

Learning Bayesian networks
* Data + prior information -> Learner -> network structure and CPTs
  (e.g., the E, B, A network with P(A | E, B) = 0.9/0.1, 0.7/0.3, 0.8/0.2, 0.99/0.01)

Known structure, complete data
* Network structure is specified; the learner only needs to estimate the parameters
* Data contains no missing values (e.g., records of <E, B, A> such as <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, ...)

Unknown structure, complete data
* Network structure is not specified; the learner needs to select the arcs and estimate the parameters
* Data contains no missing values

Known structure, incomplete data
* Network structure is specified
* Data contains missing values (e.g., <Y,?,Y>, <N,Y,?>, <?,Y,Y>); we need to consider assignments to the missing values

Unknown structure, incomplete data
* Network structure is not specified
* Data contains missing values; we need to consider assignments to the missing values
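To make the factored form above concrete, here is a minimal sketch in plain Python that evaluates P(B, E, A, C, R) = P(B) P(E) P(A | B, E) P(R | E) P(C | A) for the Earthquake/Burglary example. Only the P(A | E, B) numbers come from the slide (read as rows (e,b), (e,¬b), (¬e,b), (¬e,¬b)); the priors and the other CPTs are made-up illustrative values, and all names are hypothetical.

```python
# Minimal sketch: evaluating the factored joint of the Earthquake/Burglary network.
# Only P(A | E, B) is taken from the slide; other numbers are illustrative only.

P_B = {True: 0.01, False: 0.99}           # assumed prior over Burglary
P_E = {True: 0.001, False: 0.999}         # assumed prior over Earthquake
P_A_given_EB = {                          # P(A = True | E, B) from the slide's CPT
    (True, True): 0.9, (True, False): 0.2,
    (False, True): 0.9, (False, False): 0.01,
}
P_R_given_E = {True: 0.8, False: 0.05}    # assumed values for P(R = True | E)
P_C_given_A = {True: 0.7, False: 0.01}    # assumed values for P(C = True | A)

def joint(b, e, a, c, r):
    """P(B, E, A, C, R) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)."""
    p_a = P_A_given_EB[(e, b)] if a else 1 - P_A_given_EB[(e, b)]
    p_r = P_R_given_E[e] if r else 1 - P_R_given_E[e]
    p_c = P_C_given_A[a] if c else 1 - P_C_given_A[a]
    return P_B[b] * P_E[e] * p_a * p_r * p_c

print(joint(b=True, e=False, a=True, c=True, r=False))
```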
Overview
* Introduction
* Parameter estimation: the likelihood function; Bayesian estimation
* Model Selection
* Structure Discovery
* Incomplete Data
* Learning from Structured Data

Learning parameters
* Training data has the form D = { <E[1], B[1], A[1], C[1]>, ..., <E[M], B[M], A[M], C[M]> }

Likelihood function
* Assume i.i.d. samples; the likelihood function is
  L(Θ : D) = Π_m P(E[m], B[m], A[m], C[m] : Θ)
* By the definition of the network, this factors as
  L(Θ : D) = Π_m P(E[m] : Θ) P(B[m] : Θ) P(A[m] | B[m], E[m] : Θ) P(C[m] | A[m] : Θ)
* Rewriting the terms, we get
  L(Θ : D) = [Π_m P(E[m] : Θ)] [Π_m P(B[m] : Θ)] [Π_m P(A[m] | B[m], E[m] : Θ)] [Π_m P(C[m] | A[m] : Θ)]

General Bayesian networks
* Generalizing to any Bayesian network:
  L(Θ : D) = Π_m P(x_1[m], ..., x_n[m] : Θ) = Π_i Π_m P(x_i[m] | Pa_i[m] : Θ_i) = Π_i L_i(Θ_i : D)
* Decomposition: the likelihood splits into independent estimation problems, one per family

Likelihood function: multinomials
* L(θ : D) = P(D | θ) = Π_m P(x[m] | θ)
* The likelihood for the sequence H, T, T, H, H is L(θ : D) = θ (1 - θ)(1 - θ) θ θ
* General case: L(Θ : D) = Π_k θ_k^{N_k}, where N_k is the count of the k-th outcome in D and θ_k is the probability of the k-th outcome

Bayesian inference
* Represent uncertainty about the parameters using a probability distribution over parameters and data
* Learning uses Bayes' rule:
  P(θ | x[1], ..., x[M]) = P(x[1], ..., x[M] | θ) P(θ) / P(x[1], ..., x[M])
  (posterior = likelihood × prior / probability of the data)
* Represent the Bayesian distribution itself as a Bayes net: θ is a parent of X[1], X[2], ..., X[M]
  - The observed values of X are independent given θ, with P(x[m] | θ) = θ
  - Bayesian prediction is inference in this network

Bayesian nets and Bayesian prediction
* In plate notation, the priors θ_X and θ_Y|X for each parameter group are independent, and the data instances X[m], Y[m] are independent given the unknown parameters
* We can also "read" from the network: with complete data, the posteriors on the parameters are independent, so we can compute the posterior over each parameter group separately

Learning parameters: summary
* Estimation relies on sufficient statistics; for multinomials these are the counts N(x_i, pa_i)
* MLE: θ̂_{x_i | pa_i} = N(x_i, pa_i) / N(pa_i)
* Bayesian (Dirichlet): θ̃_{x_i | pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))
* Both are asymptotically equivalent and consistent
* Both can be implemented in an on-line manner by accumulating sufficient statistics
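A minimal sketch of the two estimators just summarized, assuming complete data over discrete variables; the toy dataset and the function name are hypothetical, and a symmetric Dirichlet hyperparameter is used for the Bayesian estimate.

```python
from collections import Counter

# Toy complete dataset over (E, B, A); values are illustrative only.
data = [("y","n","n"), ("y","n","y"), ("n","n","y"), ("n","y","y"), ("n","y","y")]

def cpt_estimates(data, child_idx, parent_idxs, alpha=1.0):
    """Estimate P(child | parents) from complete data.

    Returns both the MLE  N(x, pa) / N(pa)  and the Bayesian (symmetric
    Dirichlet) estimate  (alpha + N(x, pa)) / (alpha * |Val(child)| + N(pa)).
    """
    child_vals = sorted({row[child_idx] for row in data})
    joint = Counter((tuple(row[i] for i in parent_idxs), row[child_idx]) for row in data)
    parent = Counter(tuple(row[i] for i in parent_idxs) for row in data)

    mle, bayes = {}, {}
    for pa in parent:
        for x in child_vals:
            n_x_pa, n_pa = joint[(pa, x)], parent[pa]
            mle[(x, pa)] = n_x_pa / n_pa
            bayes[(x, pa)] = (alpha + n_x_pa) / (alpha * len(child_vals) + n_pa)
    return mle, bayes

# P(A | E, B): A is column 2, parents are columns 0 (E) and 1 (B).
mle, bayes = cpt_estimates(data, child_idx=2, parent_idxs=(0, 1))
print(mle, bayes, sep="\n")
```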
Overview
* Introduction
* Parameter Estimation
* Model selection: scoring functions; structure search
* Structure Discovery
* Incomplete Data
* Learning from Structured Data

Why struggle for an accurate structure?
* Missing an arc (e.g., dropping an edge in an Earthquake / Alarm Set / Burglary / Sound network)
  - Cannot be compensated for by fitting parameters
  - Wrong assumptions about the domain structure
* Adding an arc
  - Increases the number of parameters to be estimated
  - Wrong assumptions about the domain structure

Score-based learning
* Define a scoring function that evaluates how well a structure matches the data (e.g., records of <E, B, A>)
* Search for a structure that maximizes the score

Likelihood score for structure
* ℓ(G : D) = log L(G : D) = M Σ_i [ I(X_i ; Pa_i^G) - H(X_i) ],
  where I(X_i ; Pa_i^G) is the mutual information between X_i and its parents
* Larger dependence of X_i on Pa_i gives a higher score
* Adding arcs always helps: I(X ; Y) ≤ I(X ; {Y, Z})
  - The maximum score is attained by the fully connected network
  - Overfitting: a bad idea

Bayesian score
* Likelihood score: L(G : D) = P(D | G, θ̂_G), using the maximum-likelihood parameters
* Bayesian approach: deal with uncertainty by assigning probability to all possibilities
  P(D | G) = ∫ P(D | G, Θ) P(Θ | G) dΘ   (marginal likelihood = likelihood averaged under the prior over parameters)
  P(G | D) = P(D | G) P(G) / P(D)

Heuristic search
* Define a search space: the search states are possible structures, and the operators make small changes to a structure (typically adding, deleting, or reversing an edge)
* Traverse the space looking for high-scoring structures
* Search techniques: greedy hill-climbing, best-first search, simulated annealing, ...

Local search
* Start with a given network: the empty network, the best tree, or a random network
* At each iteration, evaluate all possible changes and apply a change based on the score
* Stop when no modification improves the score
* To update the score after a local change, only re-score the families that changed; e.g., adding C as a parent of D changes only
  Δscore = S({C, E} -> D) - S({E} -> D)

Learning in practice: the Alarm domain
* [Figure: KL divergence to the true distribution vs. number of samples (0 to 5000), comparing "structure known, fit parameters" with "learn both structure and parameters"]

Local search: possible pitfalls
* Local search can get stuck in:
  - Local maxima: all one-edge changes reduce the score
  - Plateaux: some one-edge changes leave the score unchanged
* Standard heuristics can escape both: random restarts, TABU search, simulated annealing

Improved search: weight annealing
* Standard annealing process: take bad steps with probability proportional to exp(Δscore / t); the probability increases with temperature
* Weight annealing: take uphill steps relative to a perturbed score; the perturbation increases with temperature

Perturbing the score
* Perturb the score by reweighting instances
* Each weight is sampled from a distribution with mean 1 and variance proportional to the temperature
* The instances are still sampled from the "original" distribution, but the perturbation changes the emphasis
* Benefit: allows global moves in the search space

Weight annealing: ICU Alarm network
* [Figure: cumulative performance of 100 runs of annealed structure search, comparing the true structure with learned parameters, greedy hill-climbing, and annealed search]

Structure search: summary
* A discrete optimization problem
* In some cases the optimization problem is easy (example: learning trees)
* In general it is NP-hard, so we need to resort to heuristic search
* In practice, search is relatively fast (~100 variables in ~2-5 minutes), thanks to decomposability and sufficient statistics
* Adding randomness to the search is critical
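A minimal sketch of the greedy hill-climbing loop with a decomposable score, as described above. The per-family score here is a BIC-style stand-in (any decomposable score could be substituted), only edge additions with a cycle check are implemented (no deletions or reversals), and the dataset and function names are hypothetical.

```python
import itertools
import math
from collections import Counter

def family_score(child, parents, data, names):
    """Placeholder per-family score: maximized log-likelihood minus a BIC-style
    complexity penalty, computed from counts.  Any decomposable score works."""
    idx = {n: i for i, n in enumerate(names)}
    M = len(data)
    joint = Counter((tuple(r[idx[p]] for p in parents), r[idx[child]]) for r in data)
    parent = Counter(tuple(r[idx[p]] for p in parents) for r in data)
    ll = sum(n * math.log(n / parent[pa]) for (pa, x), n in joint.items())
    child_vals = {r[idx[child]] for r in data}
    dim = (len(child_vals) - 1) * max(len(parent), 1)
    return ll - 0.5 * math.log(M) * dim

def creates_cycle(u, v, parents):
    """Adding edge u -> v creates a cycle iff v is already an ancestor of u."""
    stack, seen = [u], set()
    while stack:
        node = stack.pop()
        if node == v:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def greedy_hill_climb(names, data):
    """Greedy edge additions, exploiting decomposability: each candidate change
    only requires re-scoring the single family it modifies."""
    parents = {n: [] for n in names}
    scores = {n: family_score(n, [], data, names) for n in names}
    improved = True
    while improved:
        improved, best = False, None
        for u, v in itertools.permutations(names, 2):
            if u in parents[v] or creates_cycle(u, v, parents):
                continue
            delta = family_score(v, parents[v] + [u], data, names) - scores[v]
            if delta > 1e-9 and (best is None or delta > best[0]):
                best = (delta, u, v)
        if best:
            delta, u, v = best
            parents[v].append(u)
            scores[v] += delta
            improved = True
    return parents

names = ["E", "B", "A"]
data = [("y","n","n"), ("y","n","y"), ("n","n","y"), ("n","y","y"), ("n","y","y")]
print(greedy_hill_climb(names, data))
```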
Overview
* Introduction
* Parameter Estimation
* Model Selection
* Structure Discovery
* Incomplete Data
* Learning from Structured Data

Structure discovery
* Task: discover structural properties
  - Is there a direct connection between X and Y?
  - Does X separate two "subsystems"?
  - Does X causally affect Y?
* Example: scientific data mining
  - Disease properties and symptoms
  - Interactions between the expression of genes

Discovering structure
* Current practice: model selection
  - Pick a single high-scoring model
  - Use that model to infer the domain structure
* Problem: with a small sample size there are many high-scoring models
  - An answer based on one model is often useless
  - We want features that are common to many models

Bayesian approach
* Maintain a posterior distribution over structures and estimate the probability of features
  - e.g., the edge X -> Y, or a path X -> ... -> Y
* P(f | D) = Σ_G f(G) P(G | D), where f(G) is the indicator function for the feature and P(G | D) is the Bayesian score of G

MCMC over networks
* We cannot enumerate structures, so we sample them:
  P(f(G) | D) ≈ (1/n) Σ_{i=1..n} f(G_i)
* MCMC sampling: define a Markov chain over Bayesian networks; run the chain to get samples from the posterior P(G | D)
* Possible pitfalls:
  - Huge (super-exponential) number of networks
  - The time for the chain to converge to the posterior is unknown
  - Islands of high posterior connected by low bridges

ICU Alarm BN: no mixing
* 500 instances
* [Figure: score of the current sample vs. MCMC iteration, for runs started from the empty network and from a greedy solution; the runs clearly do not mix]

Effects of non-mixing
* MCMC runs over the same 500 instances
* [Figure: probability estimates for edges for two runs, one started from the true BN and one from a random start]
* The probability estimates are highly variable and non-robust

Fixed ordering
* Suppose we know the ordering of the variables, say X1 > X2 > X3 > X4 > ... > Xn, so the parents of Xi must come from X1, ..., Xi-1
* Limiting the number of parents per node to k leaves roughly 2^{k n log n} networks
* Intuition: the order decouples the choices of parents; the choice of Pa(X7) does not restrict the choice of Pa(X12)
* Upshot: for a fixed ordering ≺ we can compute efficiently, in closed form, the likelihood P(D | ≺) and the feature probability P(f | D, ≺)

Our approach: sample orderings
* We can write P(f | D) = Σ_≺ P(f | ≺, D) P(≺ | D)
* Sample orderings and approximate P(f | D) ≈ (1/n) Σ_{i=1..n} P(f | ≺_i, D)
* MCMC sampling: define a Markov chain over orderings; run the chain to get samples from the posterior P(≺ | D)

Mixing with MCMC over orderings
* 4 runs on ICU-Alarm with 500 instances: fewer iterations than MCMC over networks, approximately the same amount of computation
* [Figure: score of the current sample vs. MCMC iteration, for random and greedy starts]
* The process appears to be mixing

Mixing of MCMC runs
* Two MCMC runs over the same instances
* [Figure: probability estimates for edges, with 500 and with 1000 instances]
* The probability estimates are very robust
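A minimal sketch of the feature-averaging estimate P(f | D) ≈ (1/n) Σ_i f(G_i), assuming a list of sampled DAGs is already available from one of the MCMC schemes above; `sampled_graphs` and the feature helpers are hypothetical placeholders, and the sampler itself is not shown.

```python
# Estimate P(f | D) ~= (1/n) * sum_i f(G_i) for simple structural features,
# given DAG samples represented as sets of directed edges (parent, child).
sampled_graphs = [
    {("E", "A"), ("B", "A"), ("A", "C")},
    {("E", "A"), ("A", "C"), ("E", "R")},
    {("B", "A"), ("A", "C"), ("E", "R")},
]

def edge_feature(x, y):
    """Indicator for the directed edge x -> y."""
    return lambda g: (x, y) in g

def path_feature(x, y):
    """Indicator for a directed path x -> ... -> y."""
    def has_path(g):
        children = {}
        for u, v in g:
            children.setdefault(u, []).append(v)
        stack, seen = [x], set()
        while stack:
            node = stack.pop()
            if node == y:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(children.get(node, []))
        return False
    return has_path

def feature_probability(feature, graphs):
    return sum(feature(g) for g in graphs) / len(graphs)

print(feature_probability(edge_feature("E", "A"), sampled_graphs))  # 2/3
print(feature_probability(path_feature("E", "C"), sampled_graphs))  # 2/3
```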
Application to gene array analysis
* [Figure: DNA microarray chips and the features extracted from them]
* See www.cs.tau.ac.il/~nin/BioInfo04
* Bayesian inference: http://www.cs.huji.ac.il/~nir/Papers/FLNP1Full.pdf

Application: gene expression
* Input: measurements of gene expression under different conditions
  - Thousands of genes
  - Hundreds of experiments
* Output: models of gene interaction; uncover pathways

Map of feature confidence
* Yeast data [Hughes et al. 2000]: 600 genes, 300 experiments

"Mating response" substructure
* Automatically constructed sub-network of high-confidence edges
* Almost an exact reconstruction of the yeast mating pathway
* [Figure: sub-network over the genes KAR4, SST2, TEC1, NDJ1, KSS1, YLR343W, YLR334C, MFA1, STE6, FUS1, PRM1, AGA1, AGA2, TOM6, FIG1, FUS3, YEL059W]

Summary of the course
* Bayesian learning: the idea of treating model parameters as variables with a prior distribution
* PAC learning: assigning a confidence and an accepted error to the learning problem and analyzing the learning time polynomially
* Boosting and bagging: using the data to estimate multiple models and fusing them
* Hidden Markov models: the Markov property is widely assumed; the hidden Markov model is very powerful and easily estimated
* Model selection and validation: crucial for any type of modeling!
* (Artificial) neural networks: the brain performs computations radically differently from modern computers, and often much better; we need to learn how. ANNs are a powerful modeling tool (BP, RBF)
* Evolutionary learning: a very different learning rule; it has its merits
* VC dimensionality: a powerful theoretical tool for defining solvable problems, but difficult to use in practice
* Support vector machines: clean theory, different from classical statistics in that it looks for simple estimators in high dimension rather than reducing the dimension
* Bayesian networks: a compact way to represent conditional dependencies between variables

Final project

Probabilistic relational models (PRMs)
* Key ideas:
  - Universals: probabilistic patterns hold for all objects in a class
  - Locality: represent direct probabilistic dependencies
  - Links give us the potential interactions
* Example schema: Professor (Teaching-Ability), Student (Intelligence), Course (Difficulty), Registration (Grade, Satisfaction)
  [Figure: grade distributions (A/B/C) for the combinations easy/weak, easy/smart, hard/weak, hard/smart]

PRM semantics
* An instantiated PRM defines a Bayesian network:
  - Variables: the attributes of all objects (e.g., the teaching ability of Prof. Jones and Prof. Smith, the difficulty of Geo101 and CS101, the intelligence of students such as Forrest Gump and Jane Doe, and the grade and satisfaction of each registration)
  - Dependencies: determined by the links and the PRM (e.g., Grade | Intelligence, Difficulty)

The web of influence
* The objects are all correlated, so we need to perform inference over the entire model
* For large databases, use approximate inference, e.g., loopy belief propagation

PRM learning: complete data
* Introduce a prior over the parameters; the entire database is a single "instance"
* Update the prior with sufficient statistics; the parameters are used many times within the instance, e.g.,
  Count(Reg.Grade = A, Reg.Course.Diff = lo, Reg.Student.Intel = hi)

PRM learning: incomplete data
* Use expected sufficient statistics
* But everything is correlated: the E-step uses (approximate) inference over the entire model

Example: binomial data
* Prior: uniform for θ in [0, 1], so P(θ | D) ∝ the likelihood L(θ : D)
* With (N_H, N_T) = (4, 1):
  - The MLE for P(X = H) is 4/5 = 0.8
  - The Bayesian prediction is P(x[M+1] = H | D) = ∫ θ P(θ | D) dθ = 5/7 ≈ 0.714

Dirichlet priors
* Recall that the likelihood function is L(Θ : D) = Π_k θ_k^{N_k}
* A Dirichlet prior with hyperparameters α_1, ..., α_K has P(Θ) ∝ Π_k θ_k^{α_k - 1}
* The posterior has the same form, with hyperparameters α_1 + N_1, ..., α_K + N_K:
  P(Θ | D) ∝ P(Θ) P(D | Θ) ∝ Π_k θ_k^{α_k - 1} Π_k θ_k^{N_k} = Π_k θ_k^{α_k + N_k - 1}
* [Figure: Dirichlet(heads, tails) densities P(heads) for Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(2, 2), and Dirichlet(5, 5)]
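A minimal sketch of the Dirichlet (here Beta) posterior-predictive computation from the binomial example above; with a uniform Beta(1, 1) prior and counts (4, 1) it reproduces the 4/5 maximum-likelihood estimate and the 5/7 Bayesian prediction. The function names are hypothetical.

```python
def dirichlet_predict(alphas, counts):
    """Posterior-predictive probabilities under a Dirichlet prior:
    P(X[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l)."""
    post = [a + n for a, n in zip(alphas, counts)]
    total = sum(post)
    return [p / total for p in post]

def mle(counts):
    total = sum(counts)
    return [n / total for n in counts]

# Binomial example from the slides: uniform prior Beta(1, 1), 4 heads and 1 tail.
counts = [4, 1]                            # (N_H, N_T)
print(mle(counts))                         # [0.8, 0.2] -> MLE for P(X = H) is 4/5
print(dirichlet_predict([1, 1], counts))   # [0.714..., 0.285...] -> 5/7 for heads
```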
Dirichlet priors (cont.)
* If P(Θ) is Dirichlet with hyperparameters α_1, ..., α_K, then
  P(X[1] = k) = ∫ θ_k P(Θ) dΘ = α_k / Σ_ℓ α_ℓ
* Since the posterior is also Dirichlet, we get
  P(X[M+1] = k | D) = ∫ θ_k P(Θ | D) dΘ = (α_k + N_k) / Σ_ℓ (α_ℓ + N_ℓ)

Learning parameters: case study
* Instances sampled from the ICU Alarm network; M' is the strength of the prior
* [Figure: KL divergence to the true distribution vs. number of instances (0 to 5000), comparing the MLE with Bayesian estimation at M' = 5, 20, and 50]

Marginal likelihood: multinomials
* Fortunately, in many cases the integral has a closed form
* If P(Θ) is Dirichlet with hyperparameters α_1, ..., α_K and D is a dataset with sufficient statistics N_1, ..., N_K, then
  P(D) = [ Γ(Σ_k α_k) / Γ(Σ_k (α_k + N_k)) ] Π_k Γ(α_k + N_k) / Γ(α_k)

Marginal likelihood: Bayesian networks
* The network structure determines the form of the marginal likelihood
* Example data (7 instances): X = H, T, T, H, T, H, H and Y = H, T, H, H, T, T, H
* Network 1 (X and Y independent): two Dirichlet marginal likelihoods, one integral over θ_X and one over θ_Y
* Network 2 (X -> Y): three Dirichlet marginal likelihoods, with integrals over θ_X, θ_{Y|X=H}, and θ_{Y|X=T}

Marginal likelihood for networks
* The marginal likelihood has the form
  P(D | G) = Π_i Π_{pa_i^G} [ Γ(α(pa_i^G)) / Γ(α(pa_i^G) + N(pa_i^G)) ] Π_{x_i} [ Γ(α(x_i, pa_i^G) + N(x_i, pa_i^G)) / Γ(α(x_i, pa_i^G)) ]
  i.e., a Dirichlet marginal likelihood for each multinomial P(X_i | pa_i)
* N(..) are counts from the data; α(..) are hyperparameters for each family given G

Bayesian score: asymptotic behavior
* log P(D | G) = ℓ(G : D) - (log M / 2) dim(G) + O(1)
               = M Σ_i [ I(X_i ; Pa_i^G) - H(X_i) ] - (log M / 2) dim(G) + O(1)
  (fit of the dependencies in the empirical distribution, minus a complexity penalty)
* As M (the amount of data) grows, there is increasing pressure to fit the dependencies in the distribution; the complexity term avoids fitting noise
* Asymptotically equivalent to the MDL score
* The Bayesian score is consistent: the observed data eventually overrides the prior

Structure search as optimization
* Input: training data, a scoring function, and a set of possible structures
* Output: a network that maximizes the score
* Key computational property, decomposability: score(G) = Σ_X score(family of X in G)

Tree-structured networks
* Trees: at most one parent per variable
* Why trees?
  - Elegant math: we can solve the optimization problem exactly
  - Sparse parameterization: avoids overfitting
* [Figure: a tree-structured version of the ICU Alarm network]

Learning trees
* Let p(i) denote the parent of X_i; we can write the Bayesian score as
  Score(G : D) = Σ_i Score(X_i : Pa_i) = Σ_i [ Score(X_i : X_{p(i)}) - Score(X_i) ] + Σ_i Score(X_i)
  (the improvement over the "empty" network, plus the score of the "empty" network)
* Score = sum of edge scores + constant
* Set w(j -> i) = Score(X_j -> X_i) - Score(X_i) and find the tree (or forest) with maximal weight
* A standard maximum spanning tree algorithm runs in O(n^2 log n)
* Theorem: this procedure finds the tree with the maximal score

Beyond trees
* When we consider more complex networks, the problem is not as easy
* Suppose we allow at most two parents per node:
  - A greedy algorithm is no longer guaranteed to find the optimal network
  - In fact, no efficient algorithm exists
* Theorem: finding the maximal-scoring structure with at most k parents per node is NP-hard for k > 1
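A minimal sketch of the tree-learning procedure above: score every candidate edge (here with the empirical mutual information, which is what the likelihood-based edge weight is proportional to) and take a maximum-weight spanning tree with Kruskal's algorithm. The toy dataset is hypothetical and no external libraries are assumed.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical mutual information I(X_i ; X_j) computed from complete data."""
    M = len(data)
    pij = Counter((r[i], r[j]) for r in data)
    pi = Counter(r[i] for r in data)
    pj = Counter(r[j] for r in data)
    return sum((n / M) * math.log((n * M) / (pi[a] * pj[b]))
               for (a, b), n in pij.items())

def learn_tree(names, data):
    """Maximum-weight spanning tree over mutual-information edge weights
    (Kruskal's algorithm with a simple union-find)."""
    weights = {(i, j): mutual_information(data, i, j)
               for i, j in combinations(range(len(names)), 2)}
    parent = list(range(len(names)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                      # adding this edge keeps the graph a tree
            parent[ri] = rj
            tree.append((names[i], names[j], w))
    return tree

# Hypothetical toy dataset over (E, B, A), for illustration only.
names = ["E", "B", "A"]
data = [("y","n","n"), ("y","n","y"), ("n","n","y"), ("n","y","y"), ("n","y","y")]
print(learn_tree(names, data))
```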