CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Structure Learning

AGENDA
- Learning probability distributions from example data
- To what extent can Bayes net structure be learned?
- Constraint methods (inferring conditional independence)
- Scoring methods (learning => optimization)

BASIC QUESTION
Given examples drawn from a distribution P* with independence relations given by the Bayesian network structure G*, can we recover G*? More precisely: can we construct a network that encodes the same independence relations as G*?
[Figure: the true network G* and two candidate networks G1, G2]

LEARNING IN THE FACE OF NOISY DATA
Example: flip two independent coins X and Y. Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT.
- Model 1: X and Y disconnected. ML parameters: P(X=H) = 9/20, P(Y=H) = 8/20.
- Model 2: X -> Y. ML parameters: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11.
Model 2's estimation errors are likely to be larger, since each conditional parameter is estimated from fewer samples!

PRINCIPLE
Learning structure must trade off fit to the data vs. complexity of the network. Complex networks have:
- More parameters to learn
- More data fragmentation = greater sensitivity to noise

APPROACH #1: CONSTRAINT-BASED LEARNING
First, identify an undirected skeleton of edges in G*:
- If an edge X-Y is in G*, then no subset of evidence variables can make X and Y independent.
- If X-Y is not in G*, then we can find evidence variables that make X and Y independent.
Then, assign directionality to the edges so as to preserve independences.

BUILD-SKELETON ALGORITHM
Given X = {X1, ..., Xn} and an independence query Independent?(Xi, Xj, U):
- Let H = the complete graph over X.
- For all pairs Xi, Xj, test separation as follows: enumerate all possible separating sets U; if Independent?(Xi, Xj, U), then remove Xi-Xj from H.
In practice:
- We must restrict the search to bounded-size subsets |U| <= d (i.e., assume G* has bounded degree), giving O(n^2 (n-2)^d) independence tests.
- Independence can't be tested exactly.
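As a concrete illustration, Build-Skeleton translates almost directly into code. The sketch below is a minimal Python version under stated assumptions: the independence oracle `independent(i, j, cond)` is supplied by the caller (in practice a statistical test, as discussed below), and all names are illustrative rather than the lecture's.

```python
# A minimal sketch of Build-Skeleton. The independence oracle
# independent(i, j, cond) is an assumed, caller-supplied callable
# (in practice, a statistical independence test).
from itertools import combinations

def build_skeleton(variables, independent, d):
    """Return the undirected skeleton and the separating sets found.

    variables   -- list of variable indices
    independent -- independent(i, j, cond): True if Xi and Xj are
                   judged independent given the variables in cond
    d           -- maximum size of candidate separating sets
                   (assumes the true graph has bounded degree)
    """
    # Start from the complete undirected graph over the variables
    edges = {frozenset(p) for p in combinations(variables, 2)}
    sepset = {}
    for i, j in combinations(variables, 2):
        others = [v for v in variables if v != i and v != j]
        # Enumerate candidate separating sets U with |U| <= d:
        # O((n-2)^d) subsets for each of the O(n^2) pairs
        for size in range(d + 1):
            found = next((set(c) for c in combinations(others, size)
                          if independent(i, j, set(c))), None)
            if found is not None:
                edges.discard(frozenset((i, j)))
                sepset[frozenset((i, j))] = found
                break
    return edges, sepset
```

The separating sets returned here are exactly what the directionality-assignment step below consumes.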
ASSIGNING DIRECTIONALITY
Note that v-structures X -> Y <- Z introduce a dependency between X and Z given Y. In the structures X -> Y -> Z, X <- Y <- Z, and X <- Y -> Z, by contrast, X and Z are independent given Y; in fact, Y must be given for X and Z to be independent.
Idea: examine the separating sets found for all triples X-Y-Z in the skeleton:
- X, Y, Z form a triangle (edge X-Z present): directionality is irrelevant.
- No edge X-Z, and the separating set of X and Z contains Y: not a v-structure.
- No edge X-Z, and the separating set U of X and Z does not contain Y: a v-structure X -> Y <- Z.

STATISTICAL INDEPENDENCE TESTING
Question: are X and Y independent?
- Null hypothesis H0: X and Y are independent.
- Alternative hypothesis HA: X and Y are not independent.
The chi-square test uses the statistic

  \chi^2 = \sum_{x,y} \frac{(M[X=x, Y=y] - M \hat{P}(x) \hat{P}(y))^2}{M \hat{P}(x) \hat{P}(y)}

where M is the number of samples, M[.] denotes a count in the data, and \hat{P}(x) = M[X=x]/M is the empirical probability of X. We can then compute (by table lookup) the probability of getting a value at least this extreme if H0 is true (the p-value). If p < some threshold, e.g., 1 - 0.95 = 0.05, H0 is rejected.
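To make the test concrete, here is a hedged Python sketch using scipy; the function name and signature are my own, and `scipy.stats.chi2.sf` supplies the tail probability used as the p-value.

```python
# A sketch of the chi-square independence test above, operating on a
# table of pairwise counts. Naming is illustrative.
import numpy as np
from scipy.stats import chi2

def chi_square_independent(counts, alpha=0.05):
    """counts[x, y] = M[X=x, Y=y]. Returns True if H0 (independence)
    is NOT rejected at significance level alpha."""
    counts = np.asarray(counts, dtype=float)
    M = counts.sum()
    p_x = counts.sum(axis=1) / M       # empirical P(x)
    p_y = counts.sum(axis=0) / M       # empirical P(y)
    expected = M * np.outer(p_x, p_y)  # M * P(x) * P(y) under H0
    stat = ((counts - expected) ** 2 / expected).sum()
    dof = (counts.shape[0] - 1) * (counts.shape[1] - 1)
    p_value = chi2.sf(stat, dof)  # P(statistic at least this extreme | H0)
    return p_value >= alpha

# On the coin data set (3 HH, 6 HT, 5 TH, 6 TT), the statistic is about
# 0.30 and p is about 0.58, so independence is (correctly) not rejected:
print(chi_square_independent([[3, 6], [5, 6]]))  # True
```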
APPROACH #2: SCORE-BASED METHODS
Learning => optimization: define a scoring function Score(G; D) that evaluates the quality of structure G on data D, and optimize it. This is a combinatorial optimization problem. Issues:
- Choice of scoring function: maximum likelihood score, Bayesian score
- Efficient optimization techniques

MAXIMUM-LIKELIHOOD SCORES
Score_L(G; D) is the likelihood of the BN with the most likely parameter settings under structure G. Let L(\theta_G, G; D) be the likelihood of the data using parameters \theta_G with structure G, and let \theta_G^* = \arg\max_\theta L(\theta, G; D), as described in the last lecture. Then Score_L(G; D) = L(\theta_G^*, G; D).

ISSUE WITH ML SCORE
Return to the independent-coin example with G1 (X and Y disconnected) and G2 (X -> Y), and the ML parameters above. The log-likelihood scores are

  \log L(\theta_{G1}^*, G1; D) = 9 \log(9/20) + 11 \log(11/20) + 8 \log(8/20) + 12 \log(12/20)
  \log L(\theta_{G2}^*, G2; D) = 9 \log(9/20) + 11 \log(11/20) + 3 \log(3/9) + 6 \log(6/9) + 5 \log(5/11) + 6 \log(6/11)

The X terms cancel in the difference, which can be rewritten in terms of the empirical distribution \hat{P}:

  \log L(\theta_{G1}^*, G1; D) - \log L(\theta_{G2}^*, G2; D)
    = 8 \log(8/20) + 12 \log(12/20) - [3 \log(3/9) + 6 \log(6/9) + 5 \log(5/11) + 6 \log(6/11)]
    = 8 [\log(8/20) - 3/8 \log(3/9) - 5/8 \log(5/11)] + 12 [\log(12/20) - 6/12 \log(6/9) - 6/12 \log(6/11)]
    = 20 \sum_y \hat{P}(y) [\log \hat{P}(y) - \sum_x \hat{P}(x|y) \log \hat{P}(y|x)]
    = 20 \sum_{x,y} \hat{P}(x,y) [\log \hat{P}(y) - \log \hat{P}(y|x)]
    = -20 I_{\hat{P}}(X; Y)

MUTUAL INFORMATION PROPERTIES
I_P(X; Y), the mutual information between X and Y under distribution P, satisfies

  I_P(X; Y) = \sum_{x,y} P(x,y) \log [P(y|x) / P(y)]
            = \sum_{x,y} P(x,y) \log [P(x,y) / (P(x) P(y))]
            = D(P || Q) with Q(x,y) = P(x) P(y)
            >= 0 by the nonnegativity of KL divergence

Implication: the ML score never decreases as graphs become more connected => overfitting to the data!
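The algebra above can be checked numerically. The sketch below (illustrative naming, assuming numpy) computes both the likelihood-score gap and -M * I(X;Y) under the empirical distribution for the coin data, and confirms they match:

```python
# A small numeric check of the derivation above on the coin data set
# (3 HH, 6 HT, 5 TH, 6 TT); structure and naming are illustrative.
import numpy as np

counts = np.array([[3.0, 6.0],   # X=H: (Y=H, Y=T)
                   [5.0, 6.0]])  # X=T
M = counts.sum()                 # 20 flips
p_xy = counts / M                # empirical P(x, y)
p_x = p_xy.sum(axis=1)           # empirical P(x)
p_y = p_xy.sum(axis=0)           # empirical P(y)

# Log-likelihood of Y's CPD under G1 (Y independent) and G2 (X -> Y);
# the X terms are identical in both models, so they are omitted.
loglik_g1 = (counts.sum(axis=0) * np.log(p_y)).sum()
p_y_given_x = counts / counts.sum(axis=1, keepdims=True)
loglik_g2 = (counts * np.log(p_y_given_x)).sum()

# Mutual information I(X;Y) under the empirical distribution
mi = (p_xy * np.log(p_xy / np.outer(p_x, p_y))).sum()

print(loglik_g1 - loglik_g2, -M * mi)  # the two numbers should match
```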
POSSIBLE SOLUTIONS
- Fix the complexity of graphs (e.g., bounded in-degree); see HW7.
- Penalize complex graphs: Bayesian scores.

IDEA OF BAYESIAN SCORING
Note that the parameters are uncertain. The Bayesian approach puts a prior on parameter values and marginalizes them out:

  P(D|G) = \int P(D | \theta_G, G) P(\theta_G | G) d\theta_G

For example, with Beta/Dirichlet priors the marginal is manageable to compute; e.g., use a uniform hyperparameter over the network, setting the virtual counts to 2^{-|Pa(Xi)|}.

LARGE SAMPLE APPROXIMATION

  \log P(D|G) = \log L(\theta_G^*; D) - (1/2) (\log M) Dim[G] + O(1)

with M the number of samples and Dim[G] the number of free parameters of G. Dropping the O(1) term gives the Bayesian Information Criterion (BIC) score:

  Score_BIC(G; D) = \log L(\theta_G^*; D) - (1/2) (\log M) Dim[G]

The first term rewards fitting the data set; the second prefers simple models.

STRUCTURE OPTIMIZATION, GIVEN A SCORE...
The problem is well-defined, but combinatorially complex: the number of structures is superexponential in the number of variables. Idea: search locally through the space of graphs using graph operators:
- Add edge
- Delete edge
- Reverse edge

SEARCH STRATEGIES
Greedy: pick the operator that leads to the greatest score. Watch out for local minima and plateaux. Overcoming plateaux:
- Search with basin flooding
- Tabu search
- Perturbation methods (similar to simulated annealing, except on data weighting)
Implementation detail: evaluate the score deltas between structures quickly (local decomposability). A minimal sketch of greedy score-based search appears after the recap.

RECAP
- Bayes net structure learning can recover only an equivalence class of networks, i.e., those encoding the same conditional independences.
- Constraint-based methods: statistical independence tests.
- Score-based methods: learning => optimization.
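Finally, the sketch promised above: steepest-ascent greedy search over the add/delete/reverse operators, with the scoring function (e.g., Score_BIC) passed in as a callable. All names are illustrative assumptions; a practical implementation would exploit local decomposability to evaluate score deltas rather than rescoring whole graphs.

```python
# A minimal sketch of greedy score-based structure search. The scoring
# function score(graph) is an assumed, caller-supplied callable (e.g.,
# a BIC score); graphs are frozensets of directed edges (u, v).
from itertools import permutations

def is_acyclic(n, edges):
    """True if the directed graph on n nodes has no cycle (Kahn's algorithm)."""
    indeg = [0] * n
    for _, v in edges:
        indeg[v] += 1
    queue = [v for v in range(n) if indeg[v] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == n

def neighbors(n, edges):
    """All graphs reachable by one add, delete, or reverse operation."""
    for u, v in permutations(range(n), 2):
        if (u, v) in edges:
            yield edges - {(u, v)}                # delete edge
            yield (edges - {(u, v)}) | {(v, u)}   # reverse edge
        elif (v, u) not in edges:
            yield edges | {(u, v)}                # add edge

def greedy_search(n, score):
    """Steepest-ascent hill climbing from the empty graph."""
    current = frozenset()
    best = score(current)
    while True:
        # Pick the operator that leads to the greatest score
        scored = [(score(g), g)
                  for g in neighbors(n, current) if is_acyclic(n, g)]
        top_score, top = max(scored, key=lambda t: t[0])
        if top_score <= best:
            return current, best  # local optimum (or plateau) reached
        current, best = top, top_score
```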