Learning Bayesian Networks from Data
Nir Friedman (Hebrew U.) and Daphne Koller (Stanford)

Overview
- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Bayesian Networks
- Compact representation of probability distributions via conditional independence
- Qualitative part: a directed acyclic graph (DAG)
  - Nodes: random variables
  - Edges: direct influence
  (Example graph: Earthquake and Burglary are parents of Alarm; Earthquake is a parent of Radio; Alarm is a parent of Call.)
- Quantitative part: a set of conditional probability distributions, e.g. for the Alarm family:

    E    B    P(a | E,B)   P(¬a | E,B)
    e    b    0.9          0.1
    e    ¬b   0.2          0.8
    ¬e   b    0.9          0.1
    ¬e   ¬b   0.01         0.99

- Together they define a unique distribution in a factored form (see the sketch below):
  P(B,E,A,C,R) = P(B) P(E) P(A | B,E) P(R | E) P(C | A)

Example: the "ICU Alarm" network
- Domain: monitoring intensive-care patients
- 37 variables, 509 parameters ... instead of 2^54
  (Figure: the full ICU Alarm network, with variables such as MINVOLSET, PULMEMBOLUS, HYPOVOLEMIA, LVFAILURE, VENTLUNG, ARTCO2, CATECHOL, HR, BP, and others.)
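To make the factored form concrete, here is a minimal Python sketch (not part of the original tutorial) of evaluating P(B,E,A,C,R) = P(B) P(E) P(A | B,E) P(R | E) P(C | A) for the Earthquake example. The P(A | E,B) entries follow the table above; the priors on Burglary and Earthquake and the CPDs for Radio and Call are made-up illustrative numbers.

```python
# Minimal sketch of the factored joint distribution for the Earthquake example.
# Only P(A | E,B) comes from the slide's table; all other numbers are assumed.
from itertools import product

P_B_TRUE = 0.01                      # assumed P(Burglary)
P_E_TRUE = 0.02                      # assumed P(Earthquake)
P_A_TRUE = {                         # P(Alarm | E, B), keyed by (e, b), from the table above
    (True, True): 0.9, (True, False): 0.2,
    (False, True): 0.9, (False, False): 0.01,
}
P_R_TRUE = {True: 0.9, False: 0.01}  # assumed P(Radio | E)
P_C_TRUE = {True: 0.8, False: 0.05}  # assumed P(Call | A)

def p(prob_true, value):
    """P(X = value) for a binary variable with P(X = True) = prob_true."""
    return prob_true if value else 1.0 - prob_true

def joint(b, e, a, c, r):
    """One term of the factored joint distribution."""
    return (p(P_B_TRUE, b) * p(P_E_TRUE, e) * p(P_A_TRUE[(e, b)], a)
            * p(P_R_TRUE[e], r) * p(P_C_TRUE[a], c))

# The factors sum to one over all 2^5 assignments, so they define a distribution.
assert abs(sum(joint(*v) for v in product([True, False], repeat=5)) - 1.0) < 1e-9
```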
Inference
- Posterior probabilities: probability of any event given any evidence
- Most likely explanation: the scenario that explains the evidence
- Rational decision making: maximize expected utility, value of information
- Effect of intervention

Why learning?
- Knowledge acquisition bottleneck: knowledge acquisition is an expensive process, and often we don't have an expert
- Data is cheap: the amount of available information is growing rapidly
- Learning allows us to construct models from raw data

Why Learn Bayesian Networks?
- Conditional independencies and the graphical language capture the structure of many real-world distributions
- The graph structure provides much insight into the domain, and allows "knowledge discovery"
- The learned model can be used for many tasks
- Supports all the features of probabilistic learning: model selection criteria, dealing with missing data and hidden variables

Learning Bayesian networks
- Data and prior information are fed to a learner, which outputs a network: a structure (e.g., E and B as parents of A) together with its conditional probability tables (e.g., P(A | E,B))

Known Structure, Complete Data
- The network structure is specified; the learner needs to estimate the parameters
- The data (records of E, B, A such as <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, ...) does not contain missing values

Unknown Structure, Complete Data
- The network structure is not specified; the learner needs to select arcs and estimate parameters
- The data does not contain missing values

Known Structure, Incomplete Data
- The network structure is specified
- The data contains missing values (e.g., <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, <?,Y,Y>); we need to consider assignments to the missing values

Unknown Structure, Incomplete Data
- The network structure is not specified
- The data contains missing values; we need to consider assignments to the missing values

Overview - Parameter Estimation
- Likelihood function
- Bayesian estimation

Learning Parameters
- Training data has the form

    D = | E[1]  B[1]  A[1]  C[1] |
        |  .     .     .     .   |
        | E[M]  B[M]  A[M]  C[M] |

Likelihood Function
- Assume i.i.d. samples; the likelihood function is
  L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)
- By definition of the network, we get
  L(Θ : D) = ∏_m P(E[m] : Θ) P(B[m] : Θ) P(A[m] | B[m], E[m] : Θ) P(C[m] | A[m] : Θ)
- Rewriting terms, we get
  L(Θ : D) = [∏_m P(E[m] : Θ)] [∏_m P(B[m] : Θ)] [∏_m P(A[m] | B[m], E[m] : Θ)] [∏_m P(C[m] | A[m] : Θ)]

General Bayesian Networks
- Generalizing to any Bayesian network:
  L(Θ : D) = ∏_m P(x_1[m], ..., x_n[m] : Θ)
           = ∏_m ∏_i P(x_i[m] | Pa_i[m] : Θ_i)
           = ∏_i L_i(Θ_i : D)
- Decomposition ⇒ independent estimation problems

Likelihood Function: Multinomials
- L(θ : D) = P(D | θ) = ∏_m P(x[m] | θ)
- The likelihood for the sequence H, T, T, H, H is
  L(θ : D) = θ · (1 − θ) · (1 − θ) · θ · θ
- General case: L(Θ : D) = ∏_{k=1..K} θ_k^{N_k}, where θ_k is the probability of the k-th outcome and N_k is the count of the k-th outcome in D

Bayesian Inference
- Represent uncertainty about the parameters using a probability distribution over parameters and data
- Learning using Bayes rule:
  P(θ | x[1], ..., x[M]) = P(x[1], ..., x[M] | θ) P(θ) / P(x[1], ..., x[M])
  (posterior = likelihood × prior / probability of the data)
- Represent the Bayesian distribution as a Bayes net: θ is a parent of X[1], X[2], ..., X[m]; the values of X are independent given θ, with P(x[m] | θ) = θ
- Bayesian prediction is inference in this network

Example: Binomial Data
- Prior: uniform for θ in [0,1] ⇒ P(θ | D) ∝ the likelihood L(θ : D)
- Observed data: (N_H, N_T) = (4, 1)
- MLE for P(X = H) is 4/5 = 0.8
- Bayesian prediction is
  P(x[M+1] = H | D) = ∫ θ · P(θ | D) dθ = 5/7 ≈ 0.7142
  (see the sketch below)

Dirichlet Priors
- Recall that the likelihood function is L(Θ : D) = ∏_{k=1..K} θ_k^{N_k}
- A Dirichlet prior with hyperparameters α_1, ..., α_K is
  P(Θ) ∝ ∏_{k=1..K} θ_k^{α_k − 1}
- The posterior then has the same form, with hyperparameters α_1 + N_1, ..., α_K + N_K:
  P(Θ | D) ∝ P(Θ) P(D | Θ) ∝ ∏_k θ_k^{α_k − 1} ∏_k θ_k^{N_k} = ∏_k θ_k^{α_k + N_k − 1}

Dirichlet Priors - Example
- (Plot: the density P(θ_heads) of Dirichlet(α_heads, α_tails) priors over θ_heads, for Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(2,2), and Dirichlet(5,5).)
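A minimal sketch of the binomial example above: maximum-likelihood versus Bayesian (Dirichlet) estimation, with data counts (N_H, N_T) = (4, 1) and the uniform prior, i.e. Dirichlet(1, 1). The function names are illustrative, not from the tutorial.

```python
# MLE vs. Bayesian (Dirichlet) prediction for a multinomial, reproducing the
# slide's numbers: MLE 0.8, Bayesian prediction 5/7.

def mle(counts):
    """Maximum-likelihood estimate: theta_k = N_k / sum_l N_l."""
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}

def dirichlet_prediction(counts, alphas):
    """Bayesian prediction P(X[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l)."""
    total = sum(alphas[k] + counts[k] for k in alphas)
    return {k: (alphas[k] + counts[k]) / total for k in alphas}

counts = {"H": 4, "T": 1}
print(mle(counts)["H"])                                     # 0.8
print(dirichlet_prediction(counts, {"H": 1, "T": 1})["H"])  # 5/7 ≈ 0.714
```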
Dirichlet Priors (cont.)
- If P(Θ) is Dirichlet with hyperparameters α_1, ..., α_K, then
  P(X[1] = k) = ∫ θ_k · P(Θ) dΘ = α_k / Σ_l α_l
- Since the posterior is also Dirichlet, we get
  P(X[M+1] = k | D) = ∫ θ_k · P(Θ | D) dΘ = (α_k + N_k) / Σ_l (α_l + N_l)

Bayesian Nets & Bayesian Prediction
- (Plate notation: parameter nodes θ_X and θ_Y|X are parents of the observed instances X[1..M], Y[1..M] and of the query X[M+1], Y[M+1].)
- Priors for each parameter group are independent
- Data instances are independent given the unknown parameters
- With complete data, the posteriors on the parameters are also independent ⇒ we can compute the posterior over each parameter group separately!

Learning Parameters: Summary
- Estimation relies on sufficient statistics; for multinomials these are the counts N(x_i, pa_i)
- MLE:                  θ̂_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i)
- Bayesian (Dirichlet): θ̃_{x_i|pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))
  (we can "read" these estimates directly off the network's families; see the sketch below)
- Both are asymptotically equivalent and consistent
- Both can be implemented in an on-line manner by accumulating sufficient statistics

Learning Parameters: Case Study
- Instances sampled from the ICU Alarm network; M' is the strength of the prior
- (Plot: KL divergence to the true distribution vs. number of instances, for MLE and for Bayesian estimation with M' = 5, 20, 50.)
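A minimal sketch of estimating one family's CPT from complete data using the counts N(x_i, pa_i), covering both the MLE and the Bayesian (Dirichlet) estimate from the summary above. The data format (a list of dicts) and the uniform per-cell hyperparameter are illustrative assumptions, not from the tutorial.

```python
# Estimate P(child | parents) from counts: MLE (alpha = 0) or Dirichlet-smoothed.
from collections import Counter

def estimate_cpt(data, child, parents, child_values, alpha=0.0):
    """Return {parent configuration: {child value: probability}}.

    alpha = 0 gives the MLE  N(x, pa) / N(pa);
    alpha > 0 gives the Bayesian estimate (alpha + N(x, pa)) / (K*alpha + N(pa)),
    i.e. a uniform Dirichlet prior per family.
    """
    joint, marg = Counter(), Counter()
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(pa, row[child])] += 1
        marg[pa] += 1
    K = len(child_values)
    return {
        pa: {x: (alpha + joint[(pa, x)]) / (K * alpha + n) for x in child_values}
        for pa, n in marg.items()
    }

data = [{"E": "n", "B": "y", "A": "y"}, {"E": "n", "B": "n", "A": "n"},
        {"E": "n", "B": "y", "A": "y"}, {"E": "y", "B": "n", "A": "y"}]
print(estimate_cpt(data, "A", ["E", "B"], ["y", "n"]))           # MLE
print(estimate_cpt(data, "A", ["E", "B"], ["y", "n"], alpha=1))  # Dirichlet smoothing
```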
Overview - Model Selection
- Scoring function
- Structure search

Why Struggle for Accurate Structure?
- (Example: the true network Earthquake → Alarm Set ← Burglary, with Alarm Set → Sound.)
- Missing an arc:
  - Cannot be compensated for by fitting parameters
  - Wrong assumptions about the domain structure
- Adding an arc:
  - Increases the number of parameters to be estimated
  - Wrong assumptions about the domain structure

Score-based Learning
- Define a scoring function that evaluates how well a structure matches the data (records of E, B, A such as <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ...)
- Search for a structure that maximizes the score

Likelihood Score for Structure
- L(G : D) = P(D | G, θ̂_G), evaluated at the maximum-likelihood parameters for G
- l(G : D) = log L(G : D) = M Σ_i ( I(X_i ; Pa_i^G) − H(X_i) )
  where I(X_i ; Pa_i^G) is the mutual information between X_i and its parents
- Larger dependence of X_i on Pa_i ⇒ higher score
- Adding arcs always helps: I(X ; Y) ≤ I(X ; {Y,Z})
  - The maximum score is attained by the fully connected network
  - Overfitting: a bad idea...

Bayesian Score
- Bayesian approach: deal with uncertainty by assigning probability to all possibilities
  P(G | D) = P(D | G) P(G) / P(D)
- Marginal likelihood (the likelihood integrated against the prior over parameters):
  P(D | G) = ∫ P(D | G, θ) P(θ | G) dθ

Marginal Likelihood: Multinomials
- Fortunately, in many cases the integral has a closed form
- If P(Θ) is Dirichlet with hyperparameters α_1, ..., α_K and D is a dataset with sufficient statistics N_1, ..., N_K, then
  P(D) = [Γ(Σ_l α_l) / Γ(Σ_l (α_l + N_l))] · ∏_l [Γ(α_l + N_l) / Γ(α_l)]

Marginal Likelihood: Bayesian Networks
- The network structure determines the form of the marginal likelihood
- Example data (7 instances):

    m:  1  2  3  4  5  6  7
    X:  H  T  T  H  T  H  H
    Y:  H  T  H  H  T  T  H

- Network 1 (X and Y disconnected): two Dirichlet marginal likelihoods, an integral over θ_X for the X column and an integral over θ_Y for the Y column
- Network 2 (X → Y): three Dirichlet marginal likelihoods, integrals over θ_X, θ_Y|X=H, and θ_Y|X=T

Marginal Likelihood for Networks
- In general the marginal likelihood has the form
  P(D | G) = ∏_i ∏_{pa_i^G} [ Γ(α(pa_i^G)) / Γ(α(pa_i^G) + N(pa_i^G)) ] · ∏_{x_i} [ Γ(α(x_i, pa_i^G) + N(x_i, pa_i^G)) / Γ(α(x_i, pa_i^G)) ]
  i.e., a Dirichlet marginal likelihood for each multinomial P(X_i | pa_i)
- N(..) are counts from the data; α(..) are hyperparameters for each family given G
  (see the sketch below for one family)

Bayesian Score: Asymptotic Behavior
- log P(D | G) = l(G : D) − (log M / 2) · dim(G) + O(1)
               = M Σ_i ( I(X_i ; Pa_i) − H(X_i) ) − (log M / 2) · dim(G) + O(1)
  (fit of the dependencies in the empirical distribution, minus a complexity penalty)
- As M (the amount of data) grows:
  - Increasing pressure to fit the dependencies in the distribution
  - The complexity term avoids fitting noise
- Asymptotic equivalence to the MDL score
- The Bayesian score is consistent: the observed data eventually overrides the prior

Structure Search as Optimization
- Input: training data, a scoring function, a set of possible structures
- Output: a network that maximizes the score
- Key computational property, decomposability:
  score(G) = Σ score( family of X in G )
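A minimal sketch of the (log) Dirichlet marginal likelihood of a single family P(X_i | Pa_i), following the closed form above. The uniform per-cell hyperparameter alpha and the data format are illustrative choices; the 7-instance data is the X/Y example from the slide.

```python
# Log marginal likelihood of one family under a uniform Dirichlet prior.
from collections import Counter
from itertools import product
from math import lgamma

def family_log_marginal_likelihood(data, child, parents, values, alpha=1.0):
    """log of the Dirichlet marginal likelihood for the family P(child | parents)."""
    joint, marg = Counter(), Counter()
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(pa, row[child])] += 1
        marg[pa] += 1
    K = len(values[child])
    log_ml = 0.0
    for pa in product(*(values[p] for p in parents)):
        alpha_pa = K * alpha                       # alpha(pa) = sum_x alpha(x, pa)
        log_ml += lgamma(alpha_pa) - lgamma(alpha_pa + marg[pa])
        for x in values[child]:
            log_ml += lgamma(alpha + joint[(pa, x)]) - lgamma(alpha)
    return log_ml

# Decomposability: a structure's score is the sum of its family scores.
data = [{"X": "H", "Y": "H"}, {"X": "T", "Y": "T"}, {"X": "T", "Y": "H"},
        {"X": "H", "Y": "H"}, {"X": "T", "Y": "T"}, {"X": "H", "Y": "T"},
        {"X": "H", "Y": "H"}]
values = {"X": ["H", "T"], "Y": ["H", "T"]}
score_net1 = (family_log_marginal_likelihood(data, "X", [], values)
              + family_log_marginal_likelihood(data, "Y", [], values))    # X, Y disconnected
score_net2 = (family_log_marginal_likelihood(data, "X", [], values)
              + family_log_marginal_likelihood(data, "Y", ["X"], values))  # X -> Y
```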
Tree-Structured Networks
- Trees: at most one parent per variable
- Why trees?
  - Elegant math ⇒ we can solve the optimization problem
  - Sparse parameterization ⇒ avoid overfitting
- (Figure: a tree-structured version of the ICU Alarm network.)

Learning Trees
- Let p(i) denote the parent of X_i; we can write the Bayesian score as
  Score(G : D) = Σ_i Score(X_i : Pa_i)
               = Σ_i ( Score(X_i : X_p(i)) − Score(X_i) ) + Σ_i Score(X_i)
  i.e., the improvement over the "empty" network plus the score of the "empty" network
- Score = sum of edge scores + constant
- Algorithm:
  - Set w(j→i) = Score(X_j → X_i) − Score(X_i)
  - Find the tree (or forest) with maximal weight; a standard maximum spanning tree algorithm runs in O(n² log n)
- Theorem: this procedure finds the tree with maximal score

Beyond Trees
- When we consider more complex networks, the problem is not as easy
- Suppose we allow at most two parents per node: a greedy algorithm is no longer guaranteed to find the optimal network; in fact, no efficient algorithm exists
- Theorem: finding the maximal scoring structure with at most k parents per node is NP-hard for k > 1

Heuristic Search
- Define a search space:
  - Search states are possible structures
  - Operators make small changes to a structure
- Traverse the space looking for high-scoring structures
- Search techniques: greedy hill-climbing, best-first search, simulated annealing, ...

Local Search
- Start with a given network: the empty network, the best tree, or a random network
- At each iteration: evaluate all possible changes, and apply a change based on the score
- Stop when no modification improves the score

Heuristic Search: Operators
- Typical operations on the current structure: add C→D, delete C→E, reverse C→E
- To update the score after a change, only re-score the families that changed, e.g.
  Δscore = S({C,E} → D) − S({E} → D)
  (see the hill-climbing sketch below)

Learning in Practice: Alarm Domain
- (Plot: KL divergence to the true distribution vs. number of samples, comparing "structure known, fit parameters" with "learn both structure and parameters".)

Local Search: Possible Pitfalls
- Local search can get stuck in:
  - Local maxima: all one-edge changes reduce the score
  - Plateaux: some one-edge changes leave the score unchanged
- Standard heuristics can escape both: random restarts, TABU search, simulated annealing

Improved Search: Weight Annealing
- Standard annealing process: take bad steps with probability ∝ exp(Δscore / t); the probability increases with the temperature
- Weight annealing: take uphill steps relative to a perturbed score; the perturbation increases with the temperature

Weight Annealing: ICU Alarm Network
- (Plot: cumulative performance of 100 runs of annealed structure search, compared with greedy hill-climbing and with the true structure with learned parameters.)

Perturbing the Score
- Perturb the score by reweighting the instances
- Each weight is sampled from a distribution with mean 1 and variance ∝ temperature
- The instances are still sampled from the "original" distribution ... but the perturbation changes the emphasis
- Benefit: allows global moves in the search space

Structure Search: Summary
- A discrete optimization problem
- In some cases the optimization problem is easy, for example learning trees
- In general it is NP-hard, and we need to resort to heuristic search
- In practice, search is relatively fast (~100 variables in ~2-5 minutes), thanks to decomposability and sufficient statistics
- Adding randomness to the search is critical
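A minimal sketch of greedy hill-climbing structure search with the add/delete/reverse operators and a decomposable score, as summarized above. `family_score(child, parents)` stands for any family score (for instance the log marginal likelihood sketched earlier); only the families touched by an operator are re-scored. The move set and data structures are illustrative simplifications.

```python
# Greedy hill-climbing over structures, exploiting score decomposability.

def has_cycle(parents):
    """Detect a directed cycle in a {child: set(parents)} graph by DFS."""
    state = {}                                     # 1 = in progress, 2 = done
    def visit(v):
        if state.get(v) == 1:
            return True
        if state.get(v) == 2:
            return False
        state[v] = 1
        if any(visit(u) for u in parents[v]):
            return True
        state[v] = 2
        return False
    return any(visit(v) for v in parents)

def hill_climb(variables, family_score, max_parents=2):
    parents = {v: set() for v in variables}        # start from the empty network
    scores = {v: family_score(v, parents[v]) for v in variables}
    while True:
        best = None                                # (delta, new_parents, touched families)
        for x in variables:
            for y in variables:
                if x == y:
                    continue
                if x in parents[y]:                # delete x->y, or reverse it
                    moves = [({y: parents[y] - {x}}, [y]),
                             ({y: parents[y] - {x}, x: parents[x] | {y}}, [x, y])]
                else:                              # add x->y
                    moves = [({y: parents[y] | {x}}, [y])]
                for changed, touched in moves:
                    trial = {**parents, **changed}
                    if any(len(trial[v]) > max_parents for v in touched) or has_cycle(trial):
                        continue
                    # decomposability: only the changed families are re-scored
                    delta = sum(family_score(v, trial[v]) - scores[v] for v in touched)
                    if delta > 1e-12 and (best is None or delta > best[0]):
                        best = (delta, trial, touched)
        if best is None:                           # no single-edge change improves the score
            return parents, sum(scores.values())
        _, parents, touched = best
        for v in touched:
            scores[v] = family_score(v, parents[v])

# e.g.: hill_climb(["E", "B", "A"],
#                  lambda c, pa: family_log_marginal_likelihood(data, c, list(pa), values))
```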
Overview - Structure Discovery

Structure Discovery
- Task: discover structural properties
  - Is there a direct connection between X and Y?
  - Does X separate two "subsystems"?
  - Does X causally affect Y?
- Example: scientific data mining
  - Disease properties and symptoms
  - Interactions between the expression of genes

Discovering Structure
- Current practice: model selection
  - Pick a single high-scoring model
  - Use that model to infer the domain structure
- Problem: with a small sample size there are many high-scoring models
  - An answer based on one model is often useless
  - We want features that are common to many models

Bayesian Approach
- Maintain the posterior distribution over structures and estimate the probability of features: an edge X→Y, a path X→...→Y, ...
  P(f | D) = Σ_G f(G) P(G | D)
  where f(G) is the indicator function for feature f and P(G | D) is the Bayesian score of G

MCMC over Networks
- Cannot enumerate structures, so sample structures
- MCMC sampling: define a Markov chain over BNs and run the chain to get samples from the posterior P(G | D); then
  P(f(G) | D) ≈ (1/n) Σ_{i=1..n} f(G_i)
  (see the sketch below)
- Possible pitfalls:
  - Huge (superexponential) number of networks
  - The time for the chain to converge to the posterior is unknown
  - Islands of high posterior connected by low bridges

ICU Alarm BN: No Mixing
- 500 instances
- (Plot: score of the current sample over MCMC iterations, for chains started from the empty network and from the greedy solution.)
- The runs clearly do not mix

Effects of Non-Mixing
- Two MCMC runs over the same 500 instances, one started from the true BN and one from a random start
- (Plot: probability estimates for edges, compared between the two runs.)
- The probability estimates are highly variable and non-robust

Fixed Ordering
- Suppose we know the ordering of the variables, say X1 > X2 > X3 > X4 > ... > Xn (the parents of Xi must be in X1, ..., Xi-1), and limit the number of parents per node to k
  ⇒ 2^(k·n·log n) networks
- Intuition: the order decouples the choice of parents; the choice of Pa(X7) does not restrict the choice of Pa(X12)
- Upshot: we can compute efficiently in closed form both the likelihood P(D | p) and the feature probability P(f | D, p)

Our Approach: Sample Orderings
- We can write
  P(f | D) = Σ_p P(f | p, D) P(p | D)
- Sample orderings and approximate
  P(f | D) ≈ (1/n) Σ_{i=1..n} P(f | p_i, D)
- MCMC sampling: define a Markov chain over orderings and run the chain to get samples from the posterior P(p | D)

Mixing with MCMC-Orderings
- 4 runs on ICU-Alarm with 500 instances; fewer iterations than MCMC over networks, approximately the same amount of computation
- (Plot: score of the current sample over MCMC iterations for the 4 runs.)
- The process appears to be mixing!
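A minimal Metropolis-Hastings sketch of MCMC over network structures, used to estimate an edge feature as P(f(G) | D) ≈ (1/n) Σ_i f(G_i). `log_score(G)` stands for any structure score proportional to log P(G | D) (for instance a sum of family scores); the single-edge-flip proposal, the symmetric-proposal assumption, and the rejection of cyclic proposals are illustrative simplifications, not the tutorial's algorithm.

```python
# Sketch: MCMC over structures, estimating the posterior probability of one edge.
import random
from math import exp

def _has_cycle(parents):
    """DFS cycle check on a {child: set(parents)} graph."""
    state = {}
    def visit(v):
        if state.get(v) == 1:
            return True
        if state.get(v) == 2:
            return False
        state[v] = 1
        if any(visit(u) for u in parents[v]):
            return True
        state[v] = 2
        return False
    return any(visit(v) for v in parents)

def mcmc_edge_probability(variables, log_score, edge, n_steps=10000, seed=0):
    rng = random.Random(seed)
    parents = {v: set() for v in variables}        # start from the empty network
    current = log_score(parents)
    hits = 0
    for _ in range(n_steps):
        x, y = rng.sample(variables, 2)            # propose flipping the edge x -> y
        proposal = {v: set(ps) for v, ps in parents.items()}
        if x in proposal[y]:
            proposal[y].discard(x)
        else:
            proposal[y].add(x)
        if not _has_cycle(proposal):
            new = log_score(proposal)
            if new >= current or rng.random() < exp(new - current):
                parents, current = proposal, new   # accept the move
        if edge[0] in parents[edge[1]]:
            hits += 1                              # f(G_i) = 1 when the edge is present
    return hits / n_steps

# e.g.: mcmc_edge_probability(["E", "B", "A"], my_log_score, ("E", "A"))
```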
Mixing of MCMC runs
- Two MCMC runs over orderings on the same instances
- (Plots: probability estimates for edges compared between the two runs, for 500 and for 1000 instances.)
- The probability estimates are very robust

Application: Gene Expression
- Input: measurements of gene expression under different conditions
  - Thousands of genes
  - Hundreds of experiments
- Output: models of gene interaction
  - Uncover pathways

Map of Feature Confidence
- Yeast data [Hughes et al. 2000]: 600 genes, 300 experiments

"Mating Response" Substructure
- Automatically constructed sub-network of high-confidence edges (genes such as KAR4, SST2, TEC1, KSS1, NDJ1, MFA1, STE6, FUS1, PRM1, AGA1, AGA2, TOM6, FIG1, FUS3, ...)
- Almost exact reconstruction of the yeast mating pathway

Overview - Incomplete Data
- Parameter estimation
- Structure search

Incomplete Data
- Data is often incomplete: some variables of interest are not assigned values
- This phenomenon happens when we have
  - Missing values: some variables are unobserved in some instances
  - Hidden variables: some variables are never observed; we might not even know they exist

Hidden (Latent) Variables
- Why should we care about unobserved variables?
- (Figure: a network X1, X2, X3 → H → Y1, Y2, Y3 with a hidden variable H has 17 parameters, while the corresponding network without H, in which each Yj depends directly on X1, X2, X3, has 59 parameters.)

Example
- X → Y, with P(X) assumed to be known; the likelihood is a function of θ_Y|X=T and θ_Y|X=H
- (Contour plots of the log-likelihood over these two parameters for different numbers of missing values of X, with M = 8: no missing values, 2 missing values, 3 missing values.)
- In general: the likelihood function has multiple modes (see the sketch below for how a missing value enters the likelihood)

Incomplete Data: Multiple Maxima
- In the presence of incomplete data, the likelihood can have multiple maxima
- Example: we can rename the values of a hidden variable H; if H has two values, the likelihood has two maxima
- In practice, many local maxima
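A minimal sketch of the X → Y example above: when X is unobserved in an instance, its contribution to the likelihood is obtained by summing over both values of X. The numbers below are illustrative.

```python
# Likelihood with missing values: marginalize over the unobserved X.

P_X = {"H": 0.5, "T": 0.5}          # P(X), assumed known in the example

def likelihood(data, theta_y_given_x):
    """L(theta : D) with rows like ("H", "T"), or (None, "T") when X is missing.

    theta_y_given_x[x] is P(Y = H | X = x).
    """
    L = 1.0
    for x, y in data:
        if x is not None:
            p_y = theta_y_given_x[x] if y == "H" else 1.0 - theta_y_given_x[x]
            L *= P_X[x] * p_y
        else:
            # Marginalize: P(y) = sum_x P(x) P(y | x). This sum couples the
            # parameters, which is what makes the likelihood multimodal.
            L *= sum(P_X[x] * (theta_y_given_x[x] if y == "H" else 1.0 - theta_y_given_x[x])
                     for x in P_X)
    return L

data = [("H", "H"), ("T", "T"), (None, "H"), (None, "T")]
print(likelihood(data, {"H": 0.8, "T": 0.3}))
```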
EM: MLE from Incomplete Data
- (Plot: the likelihood L(Θ | D) as a function of Θ, with a lower-bounding alternative function constructed at the current point.)
- Use the current point to construct a "nice" alternative function
- The maximum of the new function scores ≥ the current point

Expectation Maximization (EM)
- A general-purpose method for learning from incomplete data
- Intuition:
  - If we had true counts, we could estimate parameters
  - But with missing values, the counts are unknown
  - We "complete" the counts using probabilistic inference, based on the current parameter assignment
  - We then use the completed counts as if they were real to re-estimate the parameters
- Example: with data over X, Y, Z containing missing values, e.g.

    X  Y  Z
    H  ?  T
    T  ?  ?
    H  H  ?
    .  .  .
    H  T  T
    T  T  H

  and a current model giving, e.g., P(Y=H | X=H, Z=T, Θ) = 0.3 and P(Y=H | X=T, Θ) = 0.4, we obtain expected counts such as

    N(X,Y)   X  Y    #
             H  H   1.3
             T  H   0.4
             H  T   1.7
             T  T   1.6

Expectation Maximization (EM): The Iteration
- Reiterate:
  - Start from an initial network (G, Θ0)
  - E-step (computation): compute expected counts from the training data, e.g. N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H)
  - M-step (reparameterization): re-estimate the parameters from the expected counts, giving the updated network (G, Θ1)

Expectation Maximization (EM): Formal Guarantees
- L(Θ1 : D) ≥ L(Θ0 : D): each iteration improves the likelihood
- If Θ1 = Θ0, then Θ0 is a stationary point of L(Θ : D); usually this means a local maximum

Expectation Maximization (EM): Computational Bottleneck
- The bottleneck is the computation of the expected counts in the E-step
- We need to compute a posterior for each unobserved variable in each instance of the training set
- All the posteriors for an instance can be derived from one pass of standard BN inference

Summary: Parameter Learning with Incomplete Data
- Incomplete data makes parameter estimation hard
- The likelihood function does not have a closed form and is multimodal
- Finding maximum-likelihood parameters: EM, gradient ascent (see the EM sketch below)
- Both exploit inference procedures for Bayesian networks to compute expected sufficient statistics

Incomplete Data: Structure Scores
- Recall the Bayesian score:
  P(G | D) ∝ P(G) P(D | G) = P(G) ∫ P(D | G, Θ) P(Θ | G) dΘ
- With incomplete data we cannot evaluate the marginal likelihood in closed form
- We have to resort to approximations:
  - Evaluate the score around the MAP parameters
  - Need to find the MAP parameters (e.g., by EM)

Naive Approach
- Perform EM for each candidate graph G1, G2, G3, ... (parametric optimization to a local maximum in each graph's parameter space)
- Computationally expensive:
  - Parameter optimization via EM is non-trivial
  - Need to perform EM for all candidate structures
  - Time is spent even on poor candidates
  ⇒ In practice, only a few candidates are considered

Structural EM
- Recall that with complete data, decomposition gave us efficient search
- Idea: instead of optimizing the real score, find a decomposable alternative score such that maximizing the new score ⇒ an improvement in the real score
- Idea: use the current model to help evaluate new structures
- Outline:
  - Perform search in (structure, parameters) space
  - At each iteration, use the current model to find either
    - better scoring parameters: a "parametric" EM step, or
    - a better scoring structure: a "structural" EM step

Structural EM: The Iteration
- Reiterate:
  - Computation (E-step): compute expected counts, both for the current structure's families (e.g., N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H)) and for families of candidate structures (e.g., N(X2, X1), N(H, X1, X3), N(Y1, X2), N(Y2, Y1, H))
  - Score & parameterize: choose a better structure and/or better parameters using these expected counts
  - Repeat with the new model against the training data
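A minimal sketch of the parametric EM loop described above, for the simple X → Y model in which X is missing in some instances: the E-step "completes" the counts with P(X | y) under the current parameters, and the M-step re-estimates the parameters from the expected counts. The data and starting values are illustrative.

```python
# Parametric EM for X -> Y with some X values missing.

def em(data, n_iters=50):
    p_x = 0.5                                   # current P(X = H)
    p_y = {"H": 0.6, "T": 0.5}                  # current P(Y = H | X = x)
    for _ in range(n_iters):
        # E-step: expected counts N(X = x) and N(X = x, Y = H)
        n_x = {"H": 0.0, "T": 0.0}
        n_xy = {"H": 0.0, "T": 0.0}
        for x, y in data:
            if x is not None:
                weights = {x: 1.0}
            else:                               # posterior over the missing X
                joint = {"H": p_x * (p_y["H"] if y == "H" else 1 - p_y["H"]),
                         "T": (1 - p_x) * (p_y["T"] if y == "H" else 1 - p_y["T"])}
                z = joint["H"] + joint["T"]
                weights = {k: v / z for k, v in joint.items()}
            for xv, w in weights.items():
                n_x[xv] += w
                if y == "H":
                    n_xy[xv] += w
        # M-step: re-estimate the parameters from the expected counts
        p_x = n_x["H"] / (n_x["H"] + n_x["T"])
        p_y = {xv: n_xy[xv] / n_x[xv] for xv in n_x}
    return p_x, p_y

data = [("H", "H"), ("T", "T"), ("H", "H"), (None, "H"), (None, "T"), ("T", "H")]
print(em(data))
```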
Example: Phylogenetic Reconstruction
- Input: biological sequences, e.g.

    Human  CGTTGC...
    Chimp  CCTAGG...
    Orang  CGAACG...
    ...

- Output: a phylogeny (spanning, e.g., 10 billion years of evolution)
- Assumption: positions are independent; each position is an "instance" of the evolutionary process

Phylogenetic Model
- Variables: the letter at each position for each species
  - Current-day species: observed
  - Ancestral species: hidden
- BN structure: the tree topology (bifurcating; leaves are the observed species, internal nodes the ancestral ones)
- BN parameters: the branch lengths (time spans)
- Evolutionary model: e.g., P(A changes to T | 10 billion yrs)

Phylogenetic Tree as a Bayes Net
- Observed species: 1 ... N; ancestral species: N+1 ... 2N-2
- Lengths t = {t_{i,j}} for each branch (i,j)
- Main problem: learn the topology
- If the ancestral species were observed, this would be an easy learning problem (learning trees)

Algorithm Outline
- Start from an original tree (T0, t0)
- Compute expected pairwise statistics; the weights w_{i,j} are branch scores
  (O(N²) pairwise statistics suffice to evaluate all trees)
- Find T' = argmax_T Σ_{(i,j)∈T} w_{i,j} using a maximum spanning tree algorithm
- Construct a bifurcation T1 (the new tree)
- Theorem: L(T1, t1) ≥ L(T0, t0)
- Repeat until convergence...

Real Life Data

                                     Lysozyme c    Mitochondrial genomes
    # sequences                      43            34
    # positions                      122           3,578
    Log-likelihood, traditional      -2,916.2      -74,227.9
    Log-likelihood, structural EM    -2,892.1      -70,533.5
    Difference per position          0.19          1.03

- Each position is (roughly) twice as likely under the structural EM reconstruction

Overview - Learning from Structured Data

Bayesian Networks: Problem
- Bayesian nets use a propositional representation
- The real world has objects, related to each other (e.g., Intelligence and Difficulty as parents of Grade)
- These "instances" are not independent! For example, Intell_JDoe and Diffic_CS101 determine Grade_JDoe_CS101 = A, while the same Diffic_CS101 also interacts with Intell_FGump for Grade_FGump_CS101 = C, and Intell_FGump again with Diffic_Geo101 for Grade_FGump_Geo101

St. Nordaf University
- (Figure: an example university with professors (Teaching-Ability), courses Geo101 and CS101 (Difficulty), students Forrest Gump and Jane Doe (Intelligence), and registrations (Grade, Satisfaction), connected by Teaches, In-course, and Registered links.)

Relational Schema
- Specifies the types of objects in the domain, the attributes of each type, and the types of links between objects
- Classes and attributes: Professor (Teaching-Ability), Student (Intelligence), Course (Difficulty), Registration (Grade, Satisfaction)
- Links: Teach, In, Take / Registered, In-course
Possible Worlds
- Many possible worlds for a given university
- A world is an assignment to all attributes of all objects in the domain (e.g., each professor's teaching ability is High or Low, each student is Weak or Smart, each course is Easy or Hard, and each registration has a grade A/B/C and a satisfaction of Like or Hate)

Representing the Distribution
- We need a distribution over all possible assignments of all attributes of all objects
- There are infinitely many potential universities, each associated with a very different set of worlds
- Need to represent an infinite set of complex distributions

Probabilistic Relational Models
- Key ideas:
  - Universals: probabilistic patterns hold for all objects in a class
  - Locality: represent direct probabilistic dependencies; links give us the potential interactions!
- (Example: a PRM over Professor.Teaching-Ability, Student.Intelligence, Course.Difficulty, and Registration.Grade and Satisfaction, with a CPD θ_Grade|Intelligence,Difficulty over the parent configurations easy/weak, easy/smart, hard/weak, hard/smart.)

PRM Semantics
- Instantiating a PRM on a set of objects yields a BN
  - Variables: the attributes of all objects
  - Dependencies: determined by the links and the PRM
- The CPD θ_Grade|Intelligence,Difficulty is shared by every registration object

The Web of Influence
- The objects are all correlated, so we need to perform inference over the entire model
- For large databases, use approximate inference, e.g. loopy belief propagation
- (Example: observing the grades in CS101 and Geo101 shifts the posteriors over easy/hard for the courses and weak/smart for the students.)

PRM Learning: Complete Data
- Introduce a prior over the parameters, and update it with sufficient statistics such as
  Count(Reg.Grade = A, Reg.Course.Diff = lo, Reg.Student.Intel = hi)
  (see the sketch below)
- The entire database is a single "instance"; the parameters are used many times within that instance

PRM Learning: Incomplete Data
- Use expected sufficient statistics
- But everything is correlated ⇒ the E-step uses (approximate) inference over the entire model

A Web of Data
- (Example, WebKB: pages for Tom Mitchell (Professor), Sean Slattery (Student), the CMU CS Faculty page, and a WebKB Project page, related by links such as Advisee-of, Works-on, Project-of, and Contains.)

Standard Approach [Craven et al.]
- Model each page in isolation: Page.Category with the page's words Word1 ... WordN as evidence (e.g., "professor", "department", "extract", "information", "computer", "science", "machine", "learning", ...)
- (Bar chart of classification accuracy, axis from 0.52 to 0.68.)

What's in a Link
- Model links as well: From-Page (Category, Word1 ... WordN), To-Page (Category, Word1 ... WordN), and an Exists attribute on the Link
- (Bar chart of classification accuracy, axis from 0.52 to 0.68.)
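A minimal sketch of collecting the sufficient statistics for the shared PRM parameter θ_Grade|Intelligence,Difficulty: every Registration row in the database contributes to the same counts, e.g. Count(Reg.Grade = A, Reg.Course.Diff = lo, Reg.Student.Intel = hi). The toy schema (dicts for students and courses, a list of registrations) and the specific rows are illustrative, not data from the tutorial.

```python
# Pooled sufficient statistics for a shared PRM CPD over Registration objects.
from collections import Counter

students = {"JaneDoe": {"Intel": "hi"}, "FGump": {"Intel": "lo"}}
courses = {"CS101": {"Diff": "lo"}, "Geo101": {"Diff": "lo"}}
registrations = [
    {"Student": "JaneDoe", "Course": "CS101", "Grade": "A"},
    {"Student": "FGump", "Course": "CS101", "Grade": "C"},
    {"Student": "FGump", "Course": "Geo101", "Grade": "B"},
]

def grade_sufficient_statistics(regs, students, courses):
    """Pooled counts N(Grade, Intel, Diff) over all Registration objects."""
    counts = Counter()
    for r in regs:
        key = (r["Grade"], students[r["Student"]]["Intel"], courses[r["Course"]]["Diff"])
        counts[key] += 1
    return counts

counts = grade_sufficient_statistics(registrations, students, courses)

# Dirichlet-style update of the shared CPD: every registration in the single
# database "instance" contributes to the same parameter group.
alpha, grades = 1.0, ["A", "B", "C"]
theta = {(i, d): {g: (alpha + counts[(g, i, d)])
                     / (len(grades) * alpha + sum(counts[(g2, i, d)] for g2 in grades))
                  for g in grades}
         for i in ["hi", "lo"] for d in ["lo", "hi"]}
```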
Discovering Hidden Concepts
- Data: the Internet Movie Database (http://www.imdb.com)
- Schema: Actor (Gender, Type), Director (Type), Movie (Genre, Year, MPAA Rating, Rating, #Votes), and the Appeared relation (Credit-Order)
- The hidden Type attributes are learned from the data; example discovered clusters:
  - Movies: {Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson} vs. {Terminator 2, Batman, Batman Forever, Mission: Impossible, GoldenEye, Starship Troopers, Hunt for Red October}
  - Actors: {Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger} vs. {Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman}
  - Directors: {Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola} vs. {Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher}
  - ...

Web of Influence, Yet Again

Conclusion
- Many distributions have combinatorial dependency structure
- Utilizing this structure is good
- Discovering this structure has implications for density estimation and for knowledge discovery
- Many applications: medicine, biology, the Web, ...

The END
Thanks to Gal Elidan, Lise Getoor, Moises Goldszmidt, Matan Ninio, Dana Pe'er, Eran Segal, and Ben Taskar.
Slides will be available from:
http://www.cs.huji.ac.il/~nir/
http://robotics.stanford.edu/~koller/