Download Learning I: Introduction, Parameter Estimation

Learning Bayesian Networks from Data Nir Friedman Hebrew U. . Daphne Koller Stanford Overview  Introduction  Parameter Estimation  Model Selection  Structure Discovery  Incomplete Data  Learning from Structured Data 2 Bayesian Networks Compact representation of probability distributions via conditional independence Family of Alarm Qualitative part: Earthquake Directed acyclic graph (DAG)  Nodes - random variables Radio  Edges - direct influence Burglary Alarm E B P(A | E,B) e b 0.9 0.1 e b 0.2 0.8 e b 0.9 0.1 e b 0.01 0.99 Call Together: Define a unique distribution in a factored form Quantitative part: Set of conditional probability distributions P (B , E , A,C , R )  P (B )P (E )P (A | B , E )P (R | E )P (C | A) 3 Example: “ICU Alarm” network Domain: Monitoring Intensive-Care Patients  37 variables  509 parameters …instead of 254 MINVOLSET PULMEMBOLUS PAP KINKEDTUBE INTUBATION SHUNT VENTMACH VENTLUNG DISCONNECT VENITUBE PRESS MINOVL ANAPHYLAXIS SAO2 TPR HYPOVOLEMIA LVEDVOLUME CVP PCWP LVFAILURE STROEVOLUME FIO2 VENTALV PVSAT ARTCO2 EXPCO2 INSUFFANESTH CATECHOL HISTORY ERRBLOWOUTPUT CO HR HREKG ERRCAUTER HRSAT HRBP BP 4 Inference Posterior  Probability of any event given any evidence Most  likely explanation Scenario that explains evidence Rational   probabilities decision making Maximize expected utility Value of Information Effect of intervention Earthquake Radio Burglary Alarm Call 5 Why learning? Knowledge acquisition bottleneck  Knowledge acquisition is an expensive process  Often we don’t have an expert Data is cheap  Amount of available information growing rapidly  Learning allows us to construct models from raw data 6 Why Learn Bayesian Networks?  Conditional independencies & graphical language capture structure of many real-world distributions  Graph  structure provides much insight into domain Allows “knowledge discovery”  Learned model can be used for many tasks  Supports   all the features of probabilistic learning Model selection criteria Dealing with missing data & hidden variables 7 Learning Bayesian networks B E Data + Prior Information Learner R A C E B P(A | E,B) e b .9 .1 e b .7 .3 e b .8 .2 e b .99 .01 8 Known Structure, Complete Data E, B, A <Y,N,N> <Y,N,Y> <N,N,Y> <N,Y,Y> . . <N,Y,Y> E B P(A | E,B) e b ? ? e b ? ? e b ? ? e b ? ?  Learner B E A A E B P(A | E,B) e b .9 .1 e b .7 .3 e b .8 .2 e b .99 .01 Network structure is specified   B E Inducer needs to estimate parameters Data does not contain missing values 9 Unknown Structure, Complete Data E, B, A <Y,N,N> <Y,N,Y> <N,N,Y> <N,Y,Y> . . <N,Y,Y> E B P(A | E,B) e b ? ? e b ? ? e b ? ? e b ? ?  Learner A B E A E B P(A | E,B) e b .9 .1 e b .7 .3 e b .8 .2 e b .99 .01 Network structure is not specified   B E Inducer needs to select arcs & estimate parameters Data does not contain missing values 10 Known Structure, Incomplete Data E, B, A <Y,N,N> <Y,?,Y> <N,N,Y> <N,Y,?> . . <?,Y,Y> E B P(A | E,B) e b ? ? e b ? ? e b ? ? e b ? ? B E Learner B E A A E B P(A | E,B) e b .9 .1 e b .7 .3 e b .8 .2 e b .99 .01 Network structure is specified  Data contains missing values   Need to consider assignments to missing values 11 Unknown Structure, Incomplete Data E, B, A <Y,N,N> <Y,?,Y> <N,N,Y> <N,Y,?> . . <?,Y,Y> E B P(A | E,B) e b ? ? e b ? ? e b ? ? e b ? ? B E Learner B E A A E B P(A | E,B) e b .9 .1 e b .7 .3 e b .8 .2 e b .99 .01 Network structure is not specified  Data contains missing values   Need to consider assignments to missing values 12 Overview  Introduction  Parameter Estimation  Likelihood function  Bayesian estimation  Model Selection  Structure Discovery  Incomplete Data  Learning from Structured Data 13 Learning Parameters  Training data has the form: B E A  E [1] B [1] A[1] C [1]         D         E [ M ] B [ M ] A [ M ] C [ M ]   C 14 Likelihood Function  Assume i.i.d. samples  Likelihood function is L( : D )   P (E [m], B [m], A[m],C [m] : ) B E A C m 15 Likelihood Function  By B E definition of network, we get A L( : D )   P (E [m], B [m], A[m], C [m] : ) C m  P (E [m] : )     P (B [m] : )     m  P (A[m ] | B [m ], E [m ] : )   P (C [m] | A[m] : )     E [1] B [1] A[1] C [1]                  E [ M ] B [ M ] A [ M ] C [ M ]   16 Likelihood Function  Rewriting B E terms, we get A L( : D )   P (E [m], B [m], A[m ], C [m] : ) C m  P (E [m] : ) m  P (B [m] : ) = m  P (A[m] | B [m], E [m] : ) m  P (C [m] | A[m] : ) m  E [1] B [1] A[1] C [1]                  E [ M ] B [ M ] A [ M ] C [ M ]   17 General Bayesian Networks Generalizing for any Bayesian network: L( : D )   P (x1 [m ], , xn [m ] : ) m   P (xi [m ] | Pai [m ] : i ) i m   Li (i : D ) i Decomposition  Independent estimation problems 18 Likelihood Function: Multinomials L( : D )  P (D | )   P (x [m ] | ) m likelihood for the sequence H,T, T, H, H is L( :D)  The L( : D )    (1  )  (1  )     0 0.2 0.4 General case:  0.6 0.8 Count of kth outcome in D 1 K L (  : D )    k Nk k 1 Probability of kth outcome 19 Bayesian Inference  Represent uncertainty about parameters using a probability distribution over parameters, data  Learning using Bayes rule Likelihood Prior P (x [1],  x [M ] | )P () P ( | x [1],  x [M ])  P (x [1],  x [M ]) Posterior Probability of data 20 Bayesian Inference  Represent Bayesian distribution as Bayes net  X[1] X[2] X[m] Observed data values of X are independent given   P(x[m] |  ) =   Bayesian prediction is inference in this network  The 21 Example: Binomial Data uniform for  in [0,1]  P( |D)  the likelihood L( :D)  Prior: P ( | x [1],x [M])  P (x [1],x [M] |  )  P ( ) (NH,NT ) = (4,1)  MLE for P(X = H ) is 4/5 = 0.8  Bayesian prediction is 0 0.2 0.4 0.6 0.8 1 5 P (x [M  1]  H | D )    P ( | D )d   0.7142  7 22 Dirichlet Priors  Recall that the likelihood function is K L (  : D )    k Nk k 1  Dirichlet prior with hyperparameters 1,…,K K P ()   k k 1 k 1  the posterior has the same form, with hyperparameters 1+N 1,…,K +N K K K K k 1 k 1 k 1 P ( | D )  P ()P (D | )   k k 1  k Nk   k k Nk 1 23 Dirichlet Priors - Example 5 4.5 Dirichlet(heads, tails) 4 P(heads) 3.5 3 2.5 Dirichlet(0.5,0.5) Dirichlet(5,5) 2 Dirichlet(2,2) 1.5 Dirichlet(1,1) 1 0.5 0 0 0.2 0.4 0.6 0.8 1 heads 24 Dirichlet Priors (cont.)  If P() is Dirichlet with hyperparameters 1,…,K k P (X [1]  k )   k P ()d      Since the posterior is also Dirichlet, we get  k  Nk P (X [M  1]  k | D )   k P ( | D )d   (   N  )  25 Bayesian Nets & Bayesian Prediction Y|X X X m X[1] X[2] X[M] X[M+1] Y[1] Y[2] Y[M] Y[M+1] X[m] Y|X Y[m] Plate notation Observed data Query  Priors for each parameter group are independent  Data instances are independent given the unknown parameters 26 Bayesian Nets & Bayesian Prediction Y|X X X[1] X[2] X[M] Y[1] Y[2] Y[M] Observed data  We can also “read” from the network: Complete data  posteriors on parameters are independent  Can compute posterior over parameters separately! 27 Learning Parameters: Summary  Estimation   relies on sufficient statistics For multinomials: counts N(xi,pai) Parameter estimation N (xi , pai ) ˆ x |pa  i i N ( pai ) MLE (xi , pai )  N (xi , pai ) ~ x |pa  i i ( pai )  N ( pai ) Bayesian (Dirichlet)  Both are asymptotically equivalent and consistent  Both can be implemented in an on-line manner by accumulating sufficient statistics 28 Learning Parameters: Case Study 1.4 Instances sampled from ICU Alarm network KL Divergence to true distribution 1.2 M’ — strength of prior 1 0.8 MLE 0.6 Bayes; M'=20 0.4 Bayes; M'=50 0.2 Bayes; M'=5 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 # instances 29 Overview  Introduction  Parameter Learning  Model Selection  Scoring function  Structure search  Structure Discovery  Incomplete Data  Learning from Structured Data 30 Why Struggle for Accurate Structure? Earthquake Alarm Set Burglary Sound Missing an arc Earthquake Alarm Set Adding an arc Burglary Earthquake Sound Cannot be compensated for by fitting parameters  Wrong assumptions about domain structure  Alarm Set Burglary Sound Increases the number of parameters to be estimated  Wrong assumptions about domain structure  31 Scorebased Learning Define scoring function that evaluates how well a structure matches the data E, B, A <Y,N,N> <Y,Y,Y> <N,N,Y> <N,Y,Y> . . <N,Y,Y> B E E E A A B B A Search for a structure that maximizes the score 32 Likelihood Score for Structure (G : D)  log L(G : D)  M  I(Xi ; PaiG )  H(Xi )  i Mutual information between Xi and its parents  Larger dependence of Xi on Pai  higher score  Adding    arcs always helps I(X; Y)  I(X; {Y,Z}) Max score attained by fully connected network Overfitting: A bad idea… 33 Bayesian Score Likelihood score: L(G : D)  P(D | G, θˆG ) Max likelihood params Bayesian approach:  Deal with uncertainty by assigning probability to all possibilities P (D | G )   P (D | G , )P ( | G ) d Marginal Likelihood Likelihood Prior over parameters P(D |G)P(G) P(G |D)  P(D) 34 Marginal Likelihood: Multinomials Fortunately, in many cases integral has closed form  P() is Dirichlet with hyperparameters 1,…,K  D is a dataset with sufficient statistics N1,…,NK Then           (   N ) P (D )   (  )      (   N )     35 Marginal Likelihood: Bayesian Networks  Network structure determines form of marginal likelihood 1 2 3 4 5 6 7 X H T T H T H H Y H T H H T T H Network 1: Two Dirichlet marginal likelihoods X P( ) Integral over X P( ) Integral over Y Y 36 Marginal Likelihood: Bayesian Networks  Network structure determines form of marginal likelihood 1 2 3 4 5 6 7 X H T T H T H H Y H T H H T T H Network 2: Three Dirichlet marginal likelihoods X Y P( ) Integral over X P( ) Integral over Y|X=H P( ) Integral over Y|X=T 37 Marginal Likelihood for Networks The marginal likelihood has the form: P (D | G )   Dirichlet marginal likelihood i   paiG   ( pa iG ) for multinomial P(Xi | pai)    ( paiG )  N ( pa iG )  xi ( (xi , paiG )  N (xi , paiG )) ( (xi , pa iG )) N(..) are counts from the data (..) are hyperparameters for each family given G 38 Bayesian Score: Asymptotic Behavior log M log P(D | G)  (G : D)  dim(G)  O(1) 2 log M  M  I(Xi ; PaiG )  H(Xi )   dim(G)  O(1) 2 i Fit dependencies in empirical distribution  As   Complexity penalty M (amount of data) grows, Increasing pressure to fit dependencies in distribution Complexity term avoids fitting noise  Asymptotic equivalence to MDL score  Bayesian score is consistent  Observed data eventually overrides prior 39 Structure Search as Optimization Input:  Training data  Scoring function  Set of possible structures Output:  A network that maximizes the score Key Computational Property: Decomposability: score(G) =  score ( family of X in G ) 40 Tree-Structured Networks MINVOLSET PULMEMBOLUS Trees:  At most one parent per variable PAP KINKEDTUBE INTUBATION SHUNT VENTLUNG SAO2 TPR HYPOVOLEMIA LVEDVOLUME CVP PCWP DISCONNECT VENITUBE PRESS MINOVL VENTALV PVSAT Why trees?  Elegant math  we can solve the optimization problem  Sparse parameterization  avoid overfitting VENTMACH LVFAILURE STROEVOLUME ARTCO2 EXPCO2 INSUFFANESTH CATECHOL HISTORY ERRBLOWOUTPUT CO HR HREKG ERRCAUTER HRSAT HRBP BP 41 Learning Trees p(i) denote parent of Xi  We can write the Bayesian score as  Let Score (G : D )  Score (Xi : Pai ) i     Score (Xi : X p (i ) )  Score (Xi )  Score (Xi ) i Improvement over “empty” network i Score of “empty” network Score = sum of edge scores + constant 42 Learning Trees w(ji) =Score( Xj  Xi ) - Score(Xi)  Find tree (or forest) with maximal weight  Set  Standard max spanning tree algorithm — O(n2 log n) Theorem: This procedure finds tree with max score 43 Beyond Trees When we consider more complex network, the problem is not as easy  Suppose we allow at most two parents per node  A greedy algorithm is no longer guaranteed to find the optimal network  In fact, no efficient algorithm exists Theorem: Finding maximal scoring structure with at most k parents per node is NP-hard for k > 1 44 Heuristic Search  Define a search space:  search states are possible structures  operators make small changes to structure  Traverse space looking for high-scoring structures  Search techniques:  Greedy hill-climbing  Best first search  Simulated Annealing  ... 45 Local Search  Start    with a given network empty network best tree a random network  At each iteration  Evaluate all possible changes  Apply change based on score  Stop when no modification improves score 46 Heuristic Search  Typical S operations: C E S C To update score after local change,D E only re-score families that changed D S C score = S({C,E} D) - S({E} D) S C E E D D 47 Learning in Practice: Alarm domain 2 true distribution KL Divergence to 1.5 Structure known, fit params 1 Learn both structure & params 0.5 0 0 500 1000 1500 2000 2500 3000 #samples 3500 4000 4500 5000 48 Local Search: Possible Pitfalls  Local  search can get stuck in: Local Maxima:  All  one-edge changes reduce the score Plateaux:  Some  Standard    one-edge changes leave the score unchanged heuristics can escape both Random restarts TABU search Simulated annealing 49 Improved Search: Weight Annealing Standard annealing process:  Take bad steps with probability  exp(score/t)  Probability increases with temperature  Weight annealing:  Take uphill steps relative to perturbed score  Perturbation increases with temperature Score(G|D)  G 50 Perturbing the Score  Perturb the score by reweighting instances  Each weight sampled from distribution:  Mean = 1  Variance  temperature  Instances sampled from “original” distribution  … but perturbation changes emphasis Benefit:  Allows global moves in the search space 51 Weight Annealing: ICU Alarm network Cumulative performance of 100 runs of annealed structure search True structure Learned params Greedy hill-climbing Annealed search 52 Structure Search: Summary  Discrete optimization problem  In some cases, optimization problem is easy  Example: learning trees  In general, NP-Hard  Need to resort to heuristic search  In practice, search is relatively fast (~100 vars in ~2-5 min):  Decomposability  Sufficient  statistics Adding randomness to search is critical 53 Overview  Introduction  Parameter Estimation  Model Selection  Structure Discovery  Incomplete Data  Learning from Structured Data 54 Structure Discovery Task: Discover structural properties  Is there a direct connection between X & Y  Does X separate between two “subsystems”  Does X causally effect Y Example: scientific data mining  Disease properties and symptoms  Interactions between the expression of genes 55 Discovering Structure P(G|D) E R B A C  Current   practice: model selection Pick a single high-scoring model Use that model to infer domain structure 56 Discovering Structure P(G|D) E B R A C R E B E A C R B A C E R B A C E R B A C Problem    Small sample size  many high scoring models Answer based on one model often useless Want features common to many models 57 Bayesian Approach  Posterior distribution over structures  Estimate probability of features  Edge XY  Path X…  Y  … Bayesian score for G P (f | D )  f (G )P (G | D ) Feature of G, e.g., XY G Indicator function for feature f 58 MCMC over Networks   Cannot enumerate structures, so sample structures 1 P (f (G ) | D )  n n f (G ) i 1 i MCMC Sampling  Define Markov chain over BNs  Run chain to get samples from posterior P(G | D) Possible pitfalls:  Huge (superexponential) number of networks  Time for chain to converge to posterior is unknown  Islands of high posterior, connected by low bridges 59 ICU Alarm BN: No Mixing  500 instances: Score of cuurent sample score -8400 -8600 -8800 -9000 -9200 empty greedy -9400 0 100000 200000 300000 400000 500000 600000 iteration MCMC Iteration  The runs clearly do not mix 60 Effects of Non-Mixing MCMC runs over same 500 instances  Probability estimates for edges for two runs True BN Random start  Two True BN True BN Probability estimates highly variable, nonrobust 61 Fixed Ordering Suppose that  We know the ordering of variables  say, X1 > X2 > X3 > X4 > … > Xn parents for Xi must be in X1,…,Xi-1  Limit number of parents per nodes to k 2k•n•log n networks Intuition: Order decouples choice of parents  Choice of Pa(X7) does not restrict choice of Pa(X12) Upshot: Can compute efficiently in closed form  Likelihood P(D | )  Feature probability P(f | D, ) 62 Our Approach: Sample Orderings We can write P (f | D )   P (f |,D )P ( | D )  Sample orderings and approximate n P (f | D )   P (f |  i , D ) i 1  MCMC Sampling  Define Markov chain over orderings  Run chain to get samples from posterior P( | D) 63 Mixing with MCMC-Orderings 4 runs on ICU-Alarm with 500 instances   fewer iterations than MCMC-Nets approximately same amount of computation -8400 Score of cuurent sample score -8405 -8410 -8415 -8420 -8425 -8430 -8435 -8440 -8445 random greedy -8450 0 10000 20000 30000 40000 50000 60000 iteration MCMC Iteration Process appears to be mixing! 64 Mixing of MCMC runs  Two MCMC runs over same instances  Probability estimates for edges 500 instances 1000 instances Probability estimates very robust 65 Application: Gene expression Input: Measurement of gene expression under different conditions  Thousands of genes  Hundreds of experiments Output: Models of gene interaction  Uncover pathways 66 Map of Feature Confidence Yeast data [Hughes et al 2000]  600 genes  300 experiments 67 “Mating response” Substructure KAR4 SST2 TEC1 NDJ1 KSS1 YLR343W YLR334C MFA1 STE6 FUS1 PRM1 AGA1 AGA2 TOM6 FIG1 FUS3 YEL059W Automatically constructed sub-network of highconfidence edges  Almost exact reconstruction of yeast mating pathway  68 Overview  Introduction  Parameter Estimation  Model Selection  Structure Discovery  Incomplete Data  Parameter estimation  Structure search  Learning from Structured Data 69 Incomplete Data Data is often incomplete  Some variables of interest are not assigned values This phenomenon happens when we have  Missing values:  Some variables unobserved in some instances  Hidden variables:  Some variables are never observed  We might not even know they exist 70 Hidden (Latent) Variables Why should we care about unobserved variables? X1 X2 X3 X1 X2 X3 Y3 Y1 Y2 Y3 H Y1 Y2 17 parameters 59 parameters 71  P(X) Example X Y Y|X=H assumed to be known  Likelihood function of: Y|X=T, Y|X=H  Contour plots of log likelihood for different number of missing values of X (M = 8): Y|X=T Y|X=T no missing values 2 missing value Y|X=T 3 missing values In general: likelihood function has multiple modes 72 Incomplete Data  In the presence of incomplete data, the likelihood can have multiple maxima H Y Example:  We can rename the values of hidden variable H  If H has two values, likelihood has two maxima  In practice, many local maxima 73 L(|D) EM: MLE from Incomplete Data   Use current point to construct “nice” alternative function  Max of new function scores ≥ than current point 74 Expectation Maximization (EM) A general purpose method for learning from incomplete data Intuition:  If we had true counts, we could estimate parameters  But with missing values, counts are unknown  We “complete” counts using probabilistic inference based on current parameter assignment  We use completed counts as if real to re-estimate parameters 75 Expectation Maximization (EM) P(Y=H|X=H,Z=T,) = 0.3 Current model Data X Y Z H T H H T T ? ? T H ? ? H T T Expected Counts N (X,Y ) X Y # H T H T H H T T 1.3 0.4 1.7 1.6 P(Y=H|X=T,) = 0.4 76 Expectation Maximization (EM) Reiterate Initial network (G,0) X1 X2 H Y1 Y2  Expected Counts X3 Computation Y3 (E-Step) N(X1) N(X2) N(X3) N(H, X1, X1, X3) N(Y1, H) N(Y2, H) N(Y3, H) Updated network (G,1) X1 Reparameterize (M-Step) X2 X3 H Y1 Y2 Y3 Training Data 77 Expectation Maximization (EM) Formal Guarantees:  L(1:D)  L(0:D)  Each iteration improves the likelihood  If 1 = 0 , then 0 is a stationary point of L(:D)  Usually, this means a local maximum 78 Expectation Maximization (EM) Computational bottleneck:  Computation of expected counts in E-Step  Need to compute posterior for each unobserved variable in each instance of training set  All posteriors for an instance can be derived from one pass of standard BN inference 79 Summary: Parameter Learning with Incomplete Data  Incomplete data makes parameter estimation hard  Likelihood function  Does not have closed form  Is multimodal  Finding max likelihood parameters: EM  Gradient ascent  Both exploit inference procedures for Bayesian networks to compute expected sufficient statistics  80 Incomplete Data: Structure Scores Recall, Bayesian score: P (G | D )  P (G )P (D | G )  P (G )  P (D | G , )P ( | G )d With incomplete data:  Cannot evaluate marginal likelihood in closed form  We have to resort to approximations:   Evaluate score around MAP parameters Need to find MAP parameters (e.g., EM) 81 Naive Approach  Perform G3 EM for each candidate graph G2 G1 Parameter space Parametric optimization (EM) Local Maximum   Computationally expensive: Gn G4  Parameter optimization via EM — non-trivial  Need to perform EM for all candidate structures  Spend time even on poor candidates In practice, considers only a few candidates 82 Structural EM Recall, in complete data we had  Decomposition  efficient search Idea:  Instead of optimizing the real score…  Find decomposable alternative score  Such that maximizing new score  improvement in real score 83 Structural EM Idea:  Use current model to help evaluate new structures Outline:  Perform search in (Structure, Parameters) space  At each iteration, use current model for finding either:  Better scoring parameters: “parametric” EM step or  Better scoring structure: “structural” EM step 84 Reiterate Computation X1 X2 X3 H Y1 Y2  Y3 Training Data Expected Counts N(X1) N(X2) N(X3) N(H, X1, X1, X3) N(Y1, H) N(Y2, H) N(Y3, H) N(X2,X1) N(H, X1, X3) N(Y1, X2) N(Y2, Y1, H) Score & Parameterize X1 X2 X3 H Y1 X1 Y2 X2 Y3 X3 H Y1 Y2 Y3 85 Example: Phylogenetic Reconstruction Input: Biological sequences Human CGTTGC… Chimp CCTAGG… Orang CGAACG… An “instance” of evolutionary process Assumption: positions are independent …. Output: a phylogeny leaf 86 Phylogenetic Model 8 leaf 2  Topology: 12 internal node 11 10 1 9 branch (8,9) 3 4 5 6 7 bifurcating Observed species – 1…N  Ancestral species – N+1…2N-2  Lengths t = {ti,j} for each branch (i,j)  Evolutionary model:   P(A changes to T| 10 billion yrs ) 87 Phylogenetic Tree as a Bayes Net  Variables: Letter at each position for each species  Current day species – observed  Ancestral species - hidden  BN Structure: Tree topology  BN Parameters: Branch lengths (time spans) Main problem: Learn topology If ancestral were observed  easy learning problem (learning trees) 88 Algorithm Outline Compute expected pairwise stats Weights: Branch scores Original Tree (T0,t0) 89 Algorithm Outline Compute expected pairwise stats Weights: Branch scores Find: T ' arg maxT w (i , j )T i,j Pairwise weights O(N2) pairwise statistics suffice to evaluate all trees 90 Algorithm Outline Compute expected pairwise stats Weights: Branch scores Find: T ' arg maxT w (i , j )T i,j Construct bifurcation T1 Max. Spanning Tree 91 Algorithm Outline Compute expected pairwise stats Weights: Branch scores Find: T ' arg maxT w (i , j )T i,j Construct bifurcation T1 Theorem: L(T1,t1)  L(T0,t0) New Tree Repeat until convergence… 92 Real Life Data Lysozyme c Mitochondrial genomes 34 # sequences 43 # pos 122 3,578 -2,916.2 -74,227.9 -2,892.1 -70,533.5 0.19 1.03 Traditional approach LogStructural EM likelihood Approach Difference per position Each position twice as likely 93 Overview  Introduction  Parameter Estimation  Model Selection  Structure Discovery  Incomplete Data  Learning from Structured Data 94 Bayesian Networks: Problem Bayesian nets use propositional representation  Real world has objects, related to each other  Intelligence Difficulty Grade 95 Bayesian Networks: Problem Bayesian nets use propositional representation  Real world has objects, related to each other  These “instances” are not independent! Intell_J.Doe Diffic_CS101 Grade_JDoe_CS101 A Intell_FGump Intell_FGump Diffic_CS101 Grade_FGump_CS101 C Diffic_Geo101 Grade_FGump_Geo101 96 St. Nordaf University Teaching-ability Teaches Teaches Teaching-ability In-courseGrade Registered Satisfac Intelligence Welcome to Forrest Gump Geo101 Grade Welcome to Difficulty Registered In-courseSatisfac CS101 Intelligence Grade Difficulty In-courseSatisfac Registered Jane Doe 97 Relational Schema  Specifies types of objects in domain, attributes of each type of object, & types of links between objects Professor Classes Student Intelligence Teaching-Ability Take Teach Links Attributes Course Difficulty In Registration Grade Satisfaction 98 Representing the Distribution  Many  possible worlds for a given university All possible assignments of all attributes of all objects  Infinitely  many potential universities Each associated with a very different set of worlds Need to represent infinite set of complex distributions 99 Possible Worlds Prof. Jones High Low Teaching-ability Prof. Smith High Teaching-ability  World: assignment to all attributes of all objects in domain B Grade C Hate Satisfac Like Weak Intelligence Welcome to Forrest Gump Geo101 C B Grade Welcome to Easy Difficulty Hate Satisfac CS101 A Grade Hard Easy Difficulty Hate Like Satisfac Smart Intelligence Jane Doe 100 Probabilistic Relational Models Key ideas: Universals: Probabilistic patterns hold for all objects in class  Locality: Represent direct probabilistic dependencies   Links give us potential interactions! Professor Teaching-Ability Student Intelligence Course Difficulty A B C easy,weak Reg Grade Satisfaction easy,smart hard,weak hard,smart 0% 20% 40% 60% 80% 100% 101 PRM Semantics Prof. Jones Teaching-ability Prof. Smith Teaching-ability Instantiated PRM BN  variables: attributes of all objects  dependencies: determined by Grade|Intell,Diffic links & PRM Grade Welcome to Intelligence Satisfac Forrest Gump Geo101 Grade Welcome to Difficulty Satisfac CS101 Grade Difficulty Satisfac Intelligence Jane Doe 102 The Web of Influence Welcome to  Objects are all correlated CS101  Need to perform inference over entire model  For large databases, use approximate inference:  Loopy beliefCpropagation 0% 0% 50% 50% 100% 100% Welcome to Geo101 easy / hard A weak smart weak / smart 103 PRM Learning: Complete Data Prof. Jones Low Teaching-ability Prof. Smith High Teaching-ability Grade|Intell,Diffic Grade C Satisfac Like Weak Intelligence Welcome to Introduce Geo101 prior over parameters  Entire database is single “instance” B  Update prior with sufficient Grade statistics:  Parameters Easyused many times in instance Difficulty  Hate Satisfac Count(Reg.Grade=A,Reg.Course.Diff=lo,Reg.Student.Intel=hi) Welcome to CS101 A Grade Easy Difficulty Smart Intelligence Like Satisfac 104 PRM Learning: Incomplete Data ??? ??? C Hi ??? Welcome to Use expected sufficient statistics Geo101  But, everything is correlated: A  E-step uses (approx) inference over entire model ???  Welcome to Low CS101 ??? B Hi ??? 105 A Web of Data Tom Mitchell Professor Project-of WebKB Project Works-on Advisee-of Sean Slattery Student Contains CMU CS Faculty [Craven et al.] 106 Standard Approach Page Category Word1 ... WordN Professor department extract information computer science machine learning … 0.52 0.54 0.56 0.58 0.6 0.62 0.64 0.66 0.68 107 What’s in a Link To- Page From-Page Category Category Word1 ... WordN Word1 ... WordN Exists Link 0.52 0.54 0.56 0.58 0.6 0.62 0.64 0.66 0.68 108 Discovering Hidden Concepts Internet Movie Database http://www.imdb.com 109 Discovering Hidden Concepts Actor Type Director Type Gender Appeared Movie Genre Type Credit-Order Year MPAA Rating Rating #Votes Internet Movie Database http://www.imdb.com 110 Web of Influence, Yet Again Movies Actors Directors Wizard of Oz Cinderella Sound of Music The Love Bug Pollyanna The Parent Trap Mary Poppins Swiss Family Robinson Sylvester Stallone Bruce Willis Harrison Ford Steven Seagal Kurt Russell Kevin Costner Jean-Claude Van Damme Arnold Schwarzenegger Alfred Hitchcock Stanley Kubrick David Lean Milos Forman Terry Gilliam Francis Coppola Terminator 2 Batman Batman Forever Mission: Impossible GoldenEye Starship Troopers Hunt for Red October Anthony Hopkins Robert De Niro Tommy Lee Jones Harvey Keitel Morgan Freeman Gary Oldman Steven Spielberg Tim Burton Tony Scott James Cameron John McTiernan Joel Schumacher … … … 111 Conclusion  Many distributions have combinatorial dependency structure  Utilizing this structure is good  Discovering this structure has implications:  To density estimation  To knowledge discovery  Many applications  Medicine  Biology  Web 112 The END Thanks to Gal Elidan  Lise Getoor  Moises Goldszmidt  Matan Ninio  Dana Pe’er  Eran Segal  Ben Taskar  Slides will be available from: http://www.cs.huji.ac.il/~nir/ http://robotics.stanford.edu/~koller/ 113

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Learning I: Introduction, Parameter Estimation