Integrative Genomics BME 230

Probabilistic Networks
• Incorporate uncertainty explicitly
• Capture the sparseness of wiring
• Incorporate multiple kinds of data

Most models incomplete
• Molecular systems are complex & rife with uncertainty
• Data and models are often incomplete:
– some gene structures are misidentified in some organisms by a gene-structure model
– some true binding sites for a transcription factor can’t be found with a motif model
– not all genes in the same pathway are predicted to be coregulated by a clustering model
• Even with perfect learning and inference methods, an appreciable amount of data is left unexplained

Why infer system properties from data?
• Knowledge acquisition bottleneck:
– knowledge acquisition is an expensive process
– often we don’t have an expert
• Data is cheap:
– the amount of available information is growing rapidly
– learning allows us to construct models from raw data
• Discovery:
– we want to identify new relationships in a data-driven way

Graphical models for joint learning
• Combine probability theory & graph theory
• Explicitly link our assumptions about how a system works with its observed behavior
• Incorporate the notion of modularity: complex systems are often built from simpler pieces
• Ensure we find consistent models in the probabilistic sense
• Flexible and intuitive for modeling
• Efficient algorithms exist for drawing inferences and for learning
• Many classical formulations are special cases
(Michael Jordan, 1998)

Motif Model Example (Barash ’03)
• Sites can have arbitrary dependence on each other.
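As a toy illustration of why position dependencies matter, the sketch below scores a 3 bp site under an independent-position PSSM and under a hypothetical first-order model in which each position depends on the previous one. All probabilities are made-up numbers, not Barash et al.’s actual model or parameters.

```python
from math import log

pssm = [  # pssm[i][base] = P(base at position i), positions independent
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

# first_order[i][prev][base] = P(base at position i+1 | base at position i);
# position 0 still uses pssm[0]. Invented numbers encoding A->C and C->G.
first_order = [
    {p: ({"A": 0.1, "C": 0.8, "G": 0.05, "T": 0.05} if p == "A"
         else {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})
     for p in "ACGT"},
    {p: ({"A": 0.05, "C": 0.05, "G": 0.8, "T": 0.1} if p == "C"
         else {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})
     for p in "ACGT"},
]

def pssm_loglik(site):
    # Log-likelihood assuming every position is independent
    return sum(log(pssm[i][b]) for i, b in enumerate(site))

def first_order_loglik(site):
    # Log-likelihood when each position depends on the previous one
    ll = log(pssm[0][site[0]])
    for i in range(1, len(site)):
        ll += log(first_order[i - 1][site[i - 1]][site[i]])
    return ll

site = "ACG"
print(pssm_loglik(site), first_order_loglik(site))
```

For a site whose positions are correlated, the dependency model assigns a higher log-likelihood than the PSSM, which is the kind of signal the dependency models above exploit.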
Barash ’03 Results
• Many TFs have binding sites that exhibit dependency

Bayesian Networks for joint learning
• Provide an intuitive formulation for combining models
• Encode a notion of causality that can guide model formulation
• Formally express the decomposition of the system state into modular, independent sub-pieces
• Make learning in complex domains tractable

Unifying models for molecular biology
• We need knowledge-representation systems that can maintain our current understanding of gene networks
• E.g., DNA damage response and promotion into S-phase:
– a highly linked sub-system
– experts united around the sub-system
– but we probably need a combined model to understand either sub-system
• Graphical models offer one solution to this

“Explaining Away”
• Causes “compete” to explain the observed data
• So, if we observe the data and one of the causes, this provides information about the other cause
• Intuition for “V-structures” (sprinkler → wet grass ← rain): observing that the grass is wet and then finding out that the sprinklers were on decreases our belief that it rained. So sprinkler and rain are dependent given that their child is observed.

Conditional independence: the “Bayes Ball” analogy
• Diverging or serial connections: the ball passes through when the middle variable is unobserved; it does not pass through when that variable is observed
• Converging connections: the ball does not pass through when the variable is unobserved; it passes through when the variable is observed
(Ross Shachter, 1995)

Inference
• Given some set of evidence, what is the most likely cause?
• A BN allows us to ask any question that can be posed about the probabilities of any combination of variables
• Since the BN provides the joint distribution, we can answer questions by computing any probability over the set of variables. For example: what is the probability that p53 is activated, given that ATM is off and the cell is arrested before S-phase?
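Queries like this are answered by marginalizing the joint distribution. A brute-force sketch on the sprinkler/rain example above, with invented CPT numbers, shows the mechanics and also verifies the “explaining away” claim numerically:

```python
from itertools import product

# Invented CPTs for the V-structure: sprinkler -> wet grass <- rain
p_rain = {True: 0.2, False: 0.8}        # prior P(rain)
p_sprinkler = {True: 0.3, False: 0.7}   # prior P(sprinkler on)

def p_wet(rain, sprinkler):
    # P(grass wet | rain, sprinkler)
    if rain and sprinkler:
        return 0.99
    if rain or sprinkler:
        return 0.9
    return 0.01

def p_rain_given(wet=True, sprinkler=None):
    # P(rain | evidence) by brute-force summation over the joint
    num = den = 0.0
    for r, s in product([True, False], repeat=2):
        if sprinkler is not None and s != sprinkler:
            continue  # inconsistent with the evidence
        pw = p_wet(r, s)
        p = p_rain[r] * p_sprinkler[s] * (pw if wet else 1 - pw)
        den += p
        if r:
            num += p
    return num / den

p1 = p_rain_given(wet=True)                  # belief in rain given wet grass
p2 = p_rain_given(wet=True, sprinkler=True)  # ...after the sprinkler is observed
print(p1, p2)
```

Observing the sprinkler lowers the posterior on rain (p2 < p1), exactly the explaining-away behavior described above.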
• We need to marginalize (sum out) the variables we are not interested in.

Variable Elimination
• Inference amounts to distributing sums over products
• Message passing in the BN
• A generalization of the forward-backward algorithm
(Pearl, 1988)

Variable Elimination Procedure
• The initial potentials are the CPTs in the BN.
• Repeat until only the query variable(s) remain:
– Choose another variable to eliminate.
– Multiply all potentials that contain the variable.
– If there is no evidence for the variable, sum it out and replace the original potentials by the new result.
– Otherwise, restrict the potentials to the variable’s observed value.
• Normalize the remaining potentials to get the final distribution over the query variable(s).
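A minimal sketch of this elimination procedure, using hand-written factor tables for a 3-variable chain A → B → C (all binary, all CPT numbers invented):

```python
from itertools import product

# A factor is (variables, table) with table keyed by assignment tuples.
def make_factor(vars_, table):
    return (tuple(vars_), table)

def multiply(f, g):
    # Pointwise product of two factors over the union of their variables
    fv, ft = f
    gv, gt = g
    vars_ = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for assign in product([0, 1], repeat=len(vars_)):
        a = dict(zip(vars_, assign))
        table[assign] = (ft[tuple(a[v] for v in fv)] *
                         gt[tuple(a[v] for v in gv)])
    return (vars_, table)

def sum_out(f, var):
    # Marginalize one variable out of a factor
    fv, ft = f
    i = fv.index(var)
    table = {}
    for assign, val in ft.items():
        key = assign[:i] + assign[i + 1:]
        table[key] = table.get(key, 0.0) + val
    return (fv[:i] + fv[i + 1:], table)

# Initial potentials = the CPTs: P(A), P(B|A), P(C|B)
fA = make_factor(["A"], {(0,): 0.6, (1,): 0.4})
fB = make_factor(["A", "B"], {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8})
fC = make_factor(["B", "C"], {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.6})

# Query P(C): eliminate A, then B, multiplying only the factors that
# mention the variable being eliminated, then normalize.
f = sum_out(multiply(fA, fB), "A")   # -> factor over B
f = sum_out(multiply(f, fC), "B")    # -> factor over C
total = sum(f[1].values())
posterior = {c: v / total for c, v in f[1].items()}
print(posterior)
```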
Learning Networks from Data
• We have a dataset X
• What is the best model that explains X?
• We define a score, score(G, X), that rates a network G by how likely X is given G
• We then search for the network with the best score

Bayesian Score
• The log Bayesian score decomposes as
log P(G | X) = M · Σ_i [ I(X_i ; Pa_i) − H(X_i) ] − (log M / 2) · dim(G) + O(1)
where M is the number of data instances: the mutual-information term fits the dependencies in the empirical distribution, and the dim(G) term is a complexity penalty
• As M (the amount of data) grows:
– there is increasing pressure to fit the dependencies in the distribution
– the complexity term avoids fitting noise
• Asymptotically equivalent to the MDL score
• The Bayesian score is consistent: the observed data eventually overrides the prior
(Friedman & Koller ‘03)

Learning Parameters: Summary
• Estimation relies on sufficient statistics
– for multinomials: the counts N(x_i, pa_i)
• Parameter estimation:
– MLE: θ̂_{x_i | pa_i} = N(x_i, pa_i) / N(pa_i)
– Bayesian (Dirichlet): θ̃_{x_i | pa_i} = (N(x_i, pa_i) + α(x_i, pa_i)) / (N(pa_i) + α(pa_i))
• Both are asymptotically equivalent and consistent
• Both can be implemented in an online manner by accumulating sufficient statistics

Incomplete Data
• Data is often incomplete: some variables of interest are not assigned values
• This happens when we have:
– missing values: some variables are unobserved in some instances
– hidden variables: some variables are never observed, and we might not even know they exist
(Friedman & Koller ‘03)

Hidden (Latent) Variables
• Why should we care about unobserved variables?
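One concrete answer: a hidden variable can drastically reduce the number of parameters. The count below assumes binary variables and the structure usually used in this example (three inputs X1..X3, three outputs Y1..Y3, one hidden H between them; removing H forces each Y to depend on all the X’s and on the preceding Y’s), which is an assumption reconstructed from the parameter counts, not stated code from the source.

```python
def cpt_params(n_parents):
    # For binary variables, a CPT over k parents needs 2**k free parameters
    return 2 ** n_parents

# With hidden H: X1..X3 are roots, H | X1,X2,X3, and each Yi | H
with_h = 3 * cpt_params(0) + cpt_params(3) + 3 * cpt_params(1)

# Without H (H marginalized out): Y1 | X1..X3, Y2 | X1..X3,Y1, Y3 | X1..X3,Y1,Y2
without_h = 3 * cpt_params(0) + cpt_params(3) + cpt_params(4) + cpt_params(5)

print(with_h, without_h)  # 17 59
```

These are the 17 vs. 59 parameter counts quoted in the example that follows.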
[Figure: two networks over X1, X2, X3 and Y1, Y2, Y3. With a hidden variable H mediating between the X’s and the Y’s, the model needs 17 parameters; with H removed, 59 parameters.]
(Friedman & Koller ‘03)

Expectation Maximization (EM)
• A general-purpose method for learning from incomplete data
• Intuition:
– if we had the true counts, we could estimate the parameters
– but with missing values, the counts are unknown
– so we “complete” the counts using probabilistic inference based on the current parameter assignment
– and use the completed counts as if they were real to re-estimate the parameters
(Friedman & Koller ‘03)

Expectation Maximization (EM): example
• Current model: P(Y=H | X=H, Z=T, θ) = 0.3 and P(Y=H | X=T, θ) = 0.4
• Data (values of X, Y, Z; “?” marks a missing value):
H T H
H T T
? ? T
H ? ?
H T T
• Expected counts N(X, Y):
X Y N
H H 1.3
T H 0.4
H T 1.7
T T 1.6
(Friedman & Koller ‘03)

Expectation Maximization (EM): iteration
• Start from an initial network (G, Θ0) over X1, X2, X3, H, Y1, Y2, Y3
• E-step: compute expected counts from the training data, e.g. N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H)
• M-step: reparameterize to obtain the updated network (G, Θ1)
• Reiterate
(Friedman & Koller ‘03)

Expectation Maximization (EM): cost
• Computational bottleneck: computing the expected counts in the E-step
– need to compute the posterior for each unobserved variable in each instance of the training set
– all posteriors for an instance can be derived from one pass of standard BN inference
(Friedman & Koller ‘03)

Why Struggle for Accurate Structure?
• Example network: Earthquake and Burglary cause Alarm Set, which causes Sound
• Missing an arc:
– cannot be compensated for by fitting parameters
– wrong assumptions about the domain structure
• Adding an arc:
– increases the number of parameters to be estimated
– wrong assumptions about the domain structure

Learning BN structure
• Treat it as a search problem
• The Bayesian score measures how well a BN fits the data
• Search for the BN that fits the data best
• Start with an initial network B0 = <G0, Θ0>
• Define successor operations on the current network

Structural EM
• Recall that with complete data we had decomposition, and hence efficient search
• Idea:
– instead of optimizing the real score, find a decomposable alternative score
– such that maximizing the new score yields an improvement in the real score

Structural EM
• Idea: use the current model to help evaluate new structures
• Outline:
– perform search in (Structure, Parameters) space
– at each iteration, use the current model to find either better-scoring parameters (a “parametric” EM step) or a better-scoring structure (a “structural” EM step)
• Reiterate: the E-step computes expected counts for the current structure, e.g. N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H), and also for candidate structures, e.g. N(X2, X1), N(H, X1, X3), N(Y1, X2), N(Y2, Y1, H); candidate structures are then scored & parameterized

Structure Search Bottom Line
• A discrete optimization problem
• In some cases the optimization is easy (example: learning trees)
• In general it is NP-hard
– need to resort to heuristic search
– in practice, search is relatively fast (~100 variables in ~2-5 minutes), thanks to decomposability and sufficient statistics
– adding randomness to the search is critical

Heuristic Search
• Define a search space:
– search states are possible structures
– operators make small changes to a structure
• Traverse the space looking for high-scoring structures
• Search techniques:
– greedy hill-climbing
– best-first search
– simulated annealing
– ...
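A skeleton of greedy hill-climbing over DAG structures. The scoring function here is a hypothetical stand-in; a real implementation would plug in the decomposable Bayesian score computed from data.

```python
from itertools import permutations

VARS = ["A", "B", "C"]

def toy_score(edges):
    # Hypothetical score for illustration only: rewards the edges A->B and
    # B->C, with a penalty per edge standing in for model complexity.
    target = {("A", "B"), ("B", "C")}
    return 2.0 * len(edges & target) - 0.5 * len(edges)

def is_acyclic(edges):
    # A DAG admits a topological order; trying all orders is fine for 3 nodes
    return any(all(order.index(u) < order.index(v) for u, v in edges)
               for order in permutations(VARS))

def neighbors(edges):
    # Small local changes: toggle (add or delete) one directed edge
    for u in VARS:
        for v in VARS:
            if u == v:
                continue
            cand = edges ^ {(u, v)}
            if is_acyclic(cand):
                yield cand

def hill_climb(edges=frozenset(), score=toy_score):
    # Evaluate all changes, apply the best one, stop when nothing improves
    current, best = edges, score(edges)
    while True:
        nxt = max(neighbors(current), key=score, default=None)
        if nxt is None or score(nxt) <= best:
            return current, best
        current, best = set(nxt), score(nxt)

g, s = hill_climb()
print(sorted(g), s)
```

Starting from the empty network, the search adds A→B and then B→C and stops, since every further toggle lowers the toy score.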
(Friedman & Koller ‘03)

Local Search
• Start with a given network:
– the empty network
– the best tree
– a random network
• At each iteration:
– evaluate all possible changes
– apply the best change, based on the score
• Stop when no modification improves the score
(Friedman & Koller ‘03)

Heuristic Search: score updates
• [Figure: local changes to a network over S, C, E, D, such as adding, deleting, or reversing a single edge]
• To update the score after a local change, only re-score the families that changed, e.g. when adding the edge C → D:
Δscore = S({C, E} → D) − S({E} → D)
(Friedman & Koller ‘03)

Naive Approach to Structural EM
• Perform EM for each candidate graph G1, G2, ..., Gn: parametric optimization in each graph’s parameter space, converging to a local maximum
• Computationally expensive:
– parameter optimization via EM is non-trivial
– need to perform EM for all candidate structures
– time is spent even on poor candidates
– in practice, only a few candidates can be considered
(Friedman & Koller ‘03)

Bayesian Approach
• Compute the posterior distribution over structures
• Estimate the probability of features:
– an edge X → Y
– a path X → ... → Y
– ...
• P(f | D) = Σ_G f(G) P(G | D), where P(G | D) is the Bayesian score for G and f(G) is the indicator function for the feature (e.g., the edge X → Y)

Discovering Structure
• Current practice: model selection
– pick a single high-scoring model
– use that model to infer domain structure
• [Figure: posterior P(G | D) over several alternative networks on the variables E, R, B, A, C]
• Problem: with a small sample size there are many high-scoring models
– an answer based on one model is often useless
– we want features common to many models

Application: Gene expression
• Input: measurements of gene expression under different conditions
– thousands of genes
– hundreds of experiments
• Output: models of gene interaction
– uncover pathways
(Friedman & Koller ‘03)

“Mating response” substructure
• [Figure: automatically constructed sub-network of high-confidence edges over the genes KAR4, SST2, TEC1, NDJ1, KSS1, YLR343W, YLR334C, MFA1, STE6, FUS1, PRM1, AGA1, AGA2, TOM6, FIG1, FUS3, YEL059W]
• Almost exact reconstruction of the yeast mating pathway
(N. Friedman et al. 2000)

Bayesian Network Limitation: models a pdf over instances
• Bayesian nets use a propositional representation
• The real world has objects, related to each other
• Can we take advantage of general properties of objects of the same class? (e.g., Intelligence and Difficulty determining Grade)

Bayesian Networks: Problem
• These “instances” are not independent!
• E.g., Intell_JDoe and Diffic_CS101 determine Grade_JDoe_CS101 = A; Intell_FGump and Diffic_Geo101 determine Grade_FGump_Geo101; Intell_FGump and Diffic_CS101 determine Grade_FGump_CS101 = C

Relational Schema
• Specifies the types of objects in the domain, the attributes of each type of object, and the types of links between objects
• Classes and attributes: Professor (Teaching-Ability), Student (Intelligence), Course (Difficulty), Registration (Grade, Satisfaction)
• Links: Teach, Take, In

Modeling real-world relationships
• Bayesian nets use a propositional representation, but a biological system has objects, related to each other
(from E. Segal’s PhD thesis, 2004)

Probabilistic Relational Models (PRMs)
• A skeleton σ defines the classes & their relations
• The random variables are the attributes of the objects
• Dependencies exist at the class level
(from E. Segal’s PhD thesis, 2004)

Converting a PRM into a BN
• We can “unroll” (or instantiate) a PRM into its underlying BN, e.g. for 2 genes & 3 experiments
(from E. Segal’s PhD thesis, 2004)

A PRM has important differences from a BN
• Dependencies are defined at the class level, and thus can be reused for any objects of the class
• The relational structure allows attributes of an object to depend on attributes of related objects
• Parameters are shared across instances of the same class (thus, more data for each parameter)

Example of joint learning (Segal et al. 2003)
• Joint learning of motifs & modules
• The joint model combines a motif model, a regulation model, and an expression model
• The PRM for the joint model unrolls into a “ground BN”
• TFs are associated to modules; the same machinery yields a PRM for genome-wide location data
(Segal et al. 2003)
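To make the “unrolling” concrete, the sketch below instantiates a hypothetical PRM-style skeleton for 2 genes and 3 experiments. The class names, attributes, and dependency are illustrative stand-ins, not Segal et al.’s actual schema.

```python
from itertools import product

# Hypothetical skeleton: the objects of each class
skeleton = {"Gene": ["g1", "g2"], "Experiment": ["e1", "e2", "e3"]}
# Hypothetical class-level attributes
class_attrs = {"Gene": ["Module"], "Experiment": ["Condition"]}

# One ground variable per object/attribute pair...
ground_vars = [f"{obj}.{attr}"
               for cls, objs in skeleton.items()
               for obj in objs
               for attr in class_attrs[cls]]

# ...plus one Expression variable per (gene, experiment) pair. The
# class-level dependency (Expression depends on the related gene's and
# experiment's attributes) is reused for every instance, so the CPD
# parameters are shared across all six ground Expression variables.
for g, e in product(skeleton["Gene"], skeleton["Experiment"]):
    ground_vars.append(f"Expression({g},{e})")

print(len(ground_vars))  # 2 + 3 + 2*3 = 11 ground variables
```

Note how the ground BN grows with the skeleton (more genes or experiments add variables) while the class-level model, and hence the parameter count, stays fixed.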