Bayesian Networks & Gene Expression

The Problem
• Data: a matrix of expression measurements, where Aij is the mRNA level of gene j in experiment i (rows = experiments, columns = genes).
• Goal: learn regulatory/metabolic networks and identify causal sources of the biological phenomena of interest.

Analysis Approaches
• Clustering: groups together genes with similar expression patterns, but does not reveal structural relations between genes.
• Boolean networks: deterministic models of the logical interactions between genes; deterministic and static.
• Linear models of expression data: deterministic, fully connected linear models; under-constrained, and they assume linearity of interactions.

Probabilistic Network Approach
• Characterizes stochastic (non-deterministic!) relationships between the expression patterns of different genes.
• Goes beyond pair-wise interactions => structure! Many interactions are explained by intermediate factors, and regulation involves the combined effects of several gene products.
• Flexible in the types of interactions it can express (not necessarily linear or Boolean!).

Probabilistic Network Structure
• Example: the noisy, stochastic process of hair heredity (a pedigree of Homer, Al, Marge, Bart, Lisa and Maggie).
• A node represents an individual's genotype; an edge represents direct influence/interaction.
• Note: Bart's genotype is dependent on Al's genotype, but independent of it given Marge's genotype.

A Model: Bayesian Network
• Structure: a DAG, with the usual notions of ancestor, parent, child, descendant and non-descendant.
• Meaning: a child is conditionally independent of its non-descendants, given the values of its parents.
• The DAG does not impose causality, but it suits the modeling of causal processes.

Local Structures & Independencies
• Common parent: A <- C -> B
• Cascade: A -> C -> B
• V-structure: A -> C <- B
• The language is compact, the concepts are rich!
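The Markov condition stated above (a node is independent of its non-descendants given its parents) can be sketched in a few lines of code. This is an illustrative sketch over the cascade structure A -> C -> B; the graph representation and helper names are my own, not from the slides.

```python
# Illustrative sketch of the Markov condition: each node is conditionally
# independent of its non-descendants given its parents.
# Graph and helper names are hypothetical, not from the slides.

parents = {"A": [], "C": ["A"], "B": ["C"]}  # cascade A -> C -> B

def children(node):
    """Nodes that list `node` among their parents."""
    return [n for n, ps in parents.items() if node in ps]

def descendants(node):
    """All nodes reachable from `node` along directed edges."""
    out, stack = set(), children(node)
    while stack:
        n = stack.pop()
        if n not in out:
            out.add(n)
            stack.extend(children(n))
    return out

def non_descendants(node):
    """Everything that is neither `node` itself nor one of its descendants."""
    return set(parents) - descendants(node) - {node}
```

For the cascade, `non_descendants("B")` is `{"A", "C"}`; since C is B's only parent, the Markov condition asserts that B is independent of A given C, matching the cascade case above.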
Bayesian Network – CPDs
• Local probabilities: each node Xi carries a CPD (conditional probability distribution) P(Xi | Pai), where Pai are its parents.
• Discrete variables: a multinomial distribution (can represent any kind of statistical dependency).
• Example network: Earthquake (E) and Burglary (B) -> Alarm (A); Earthquake -> Radio (R); Alarm -> Call (C).
• CPD table P(A | E, B), with e/¬e and b/¬b denoting the two values of E and B:

    E    B     P(a)   P(¬a)
    e    b     0.9    0.1
    e    ¬b    0.2    0.8
    ¬e   b     0.9    0.1
    ¬e   ¬b    0.01   0.99

Bayesian Network – CPDs (cont.)
• Continuous variables: e.g. a linear Gaussian CPD:

    P(X | y1, ..., yk) = N(a0 + Σ_{i=1..k} ai·yi, σ²)

Bayesian Network Semantics
• Qualitative part: the DAG specifies conditional independence statements.
• Quantitative part: the local probability models (CPDs).
• Together they define a unique joint distribution over the domain, and the joint distribution decomposes nicely:

    chain rule:               P(C,A,R,E,B) = P(B)·P(E|B)·P(R|E,B)·P(A|R,B,E)·P(C|A,R,B,E)
    using the independencies: P(C,A,R,E,B) = P(B)·P(E)·P(R|E)·P(A|B,E)·P(C|A)

• Every node may be dependent on many others, BUT with at most k parents per node we need only O(2^k · n) parameters instead of O(2^n) => good for memory conservation and learning robustness.

So, Why Bayesian Networks?
• A flexible representation of the (in)dependency structure of multivariate distributions and interactions.
• Natural for modeling global processes with local interactions => good for biology.
• Clear probabilistic semantics: natural for statistical confidence analysis of the results and for answering queries.
• Stochastic in nature: models stochastic processes and deals with ("sums out") noise in the measurements.

Inference
• The network encodes the answers to valuable queries, e.g.:
  - Is node X independent of node Y given nodes Z, W?
  - What is the probability of X = true given (Y = false and Z = true)?
  - What is the joint distribution of (X, Y) given R = false?
  - What is the likelihood of some full assignment?
  - What is the most likely assignment of values to all the nodes of the network?
• These queries can be answered relatively efficiently.

Learning Bayesian Networks
• The goal: given a set of independent samples (assignments to the random variables), find the best (the most likely?) Bayesian network, i.e. both the DAG and the CPDs.
• Example samples:
    { (B,E,A,C,R) = (T,F,F,T,F)
      (B,E,A,C,R) = (T,F,T,T,F)
      ...
      (B,E,A,C,R) = (F,T,T,T,F) }
• The target is the DAG over B, E, A, C, R together with CPD tables such as P(A | E, B):

    E    B     P(a)   P(¬a)
    e    b     0.9    0.1
    e    ¬b    0.2    0.8
    ¬e   b     0.9    0.1
    ¬e   ¬b    0.01   0.99

Learning Bayesian Networks (cont.)
• Learning the best CPDs for a given DAG is easy: collect statistics of the values of each node under each specific assignment to its parents. But...
• The structure (G) learning problem is NP-hard => a heuristic search for the best model must be applied, which generally yields a locally optimal network.
• It turns out that richer structures give a higher likelihood P(D|G) to the data (adding an edge is always preferable), because...

Learning Bayesian Networks (cont.)
• If we add B to Pa(C), we have more parameters to fit => more freedom => we can always optimize the CPD of C so that P(C | A, B) fits the data at least as well as P(C | A).
• But we prefer simpler (more explanatory) networks (Occam's razor!).
• Therefore, practical scores for Bayesian networks offset the likelihood improvement with a penalty ("fine") on complex networks.

Modeling Biological Regulation
• Variables of interest: expression levels of genes; concentration levels of proteins.
• Exogenous variables: nutrient levels, metabolite levels, temperature, phenotype information, ...
• Bayesian network structure: captures the dependencies among these variables.

Possible Biological Interpretation
• Measured expression level of each gene -> random variables.
• Gene interaction -> probabilistic dependencies.
• Interactions are represented by a graph: each gene is represented by a node, and edges between nodes represent direct dependency.

More Local Structures
• Dependencies can be mediated through other nodes:
  - Intermediate gene: A -> X -> B
  - Common cause: A <- C -> B
  - Common/combinatorial effects: A -> C <- B

The Approach of Friedman et al.
• Expression data -> Bayesian network learning algorithm -> a learned network.
• Use the learned network to make predictions about the structure of the interactions between genes.
• No prior biological knowledge is used!

The Discretization Problem
• The expression measurements are real numbers.
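Since the measurements are real-valued, a discretization step precedes the learning of multinomial CPDs. A minimal sketch of one such scheme follows; the three-level encoding and the 0.5 threshold are illustrative assumptions, not values fixed by the slides.

```python
# Sketch: discretize real-valued log-expression ratios into three levels
# (under-expressed / unchanged / over-expressed) so that multinomial CPDs
# can be learned. The threshold of 0.5 is a hypothetical choice.

def discretize(log_ratio, threshold=0.5):
    if log_ratio > threshold:
        return 1    # over-expressed relative to control
    if log_ratio < -threshold:
        return -1   # under-expressed
    return 0        # roughly unchanged

profile = [1.2, 0.1, -0.9, 0.5]          # one gene across four experiments
levels = [discretize(x) for x in profile]
```

A coarser or finer binning trades off the information lost against the number of CPD parameters that must be estimated from the sparse data.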
• => We need to discretize them in order to learn general CPDs => we lose information.
• => If we don't, we must assume some specific type of CPD (such as linear Gaussian) => we lose generality.

The Problem of Sparse Data
• There are many more genes than experiments (= samples) => many different networks fit the data well => we cannot rely on the usual learning algorithm alone.
• Shrink the network search space: e.g., use the observation that in biological systems each gene is directly regulated by only a few regulators.
• Don't take the resulting network for granted; instead, extract pieces of reliable information from it.

Learning With Many Variables
• The Sparse Candidate algorithm: an efficient heuristic search that relies on the sparseness of regulation networks.
  - For each gene, choose a promising set of "candidate parents" for direct influence.
  - Find a (locally) optimal BN constrained so that each gene's parents are drawn from its candidate set.
  - Iteratively improve the candidate sets using the parents found in the learned BN.

The Experiment
• Data from Spellman et al. (Mol. Biol. of the Cell, 1998), containing 76 samples over the whole yeast genome:
  - Different methods for synchronizing the cell cycle in yeast.
  - Time series at intervals of a few minutes (5-20 min).
• Spellman et al. identified 800 cell-cycle-regulated genes.

Network Learned – Challenge: Statistical Significance
• Sparse data: a small number of samples gives a "flat posterior" -- many networks fit the data.
• Solution: estimate the confidence in network features. E.g., two types of features:
  - Markov neighbors: X directly interacts with Y (they share an edge or a common child).
  - Order relations: X is an ancestor of Y.

Confidence Estimates
• Bootstrap approach [FGW, UAI99]: resample the data D (with replacement) into datasets D1, D2, ..., Dm and learn a network Gi from each Di.
• Estimate the "confidence level" of a feature f as the fraction of the learned networks Gi that contain it:

    C(f) = (1/m) · Σ_{i=1..m} 1{f ∈ Gi}

In summary...
• Preprocess: normalization and discretization of the expression data.
• Learn model: Bayesian network learning algorithm, mutation modeling + bootstrap.
• Feature extraction: Markov edges, separators, ancestor relations.
• Result: a list of features with high confidence, which can be biologically interpreted.

Resulting Features: Markov Relations
• Question: do X and Y directly interact?
• Parent-child (one gene regulating the other): STE6 -> SST2 (confidence 0.91); STE6 is an exporter of mating factor, SST2 a regulator in the mating pathway.
• Hidden parent (two genes co-regulated by a hidden factor): ARG3 -- ARG5 (confidence 0.84); both are in arginine biosynthesis, co-regulated by the transcription factor GCN4.

Resulting Features: Separators
• Question: given that X and Y are indirectly dependent, who mediates this dependence?
• Separator relation: either X affects Z, who in turn affects Y, or Z regulates both X and Y.
• Example: KAR4 (mating transcriptional regulator of nuclear fusion) separates AGA1 and FUS1 (both cell fusion genes).

Separators: Intra-cluster Context
• SLT2 is the MAPK of the cell wall integrity pathway; CRH1, YPS3 and YPS1 are cell wall proteins; SLR3 is a protein of unknown function.
• All pairs among these genes have high correlation, so they are clustered together.
• SLT2 acts as the separator: the pathway regulator explains the dependence among them.
• Many signaling and regulatory proteins are identified as direct and indirect separators.

Next Step: Sub-Networks
• Goal: automatic reconstruction of dense sub-networks with highly confident pair-wise features.
• Score: statistical significance. Search: for high-scoring sub-networks.
• Advantages: a global picture; a structured context for interactions; mid-confidence features can be incorporated.
• The pipeline is extended accordingly: preprocess the expression data (normalization, discretization) -> learn model (BN learning algorithm, mutation modeling + bootstrap) -> feature extraction (Markov edges, separators) -> reconstruct sub-networks.
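The bootstrap confidence estimate C(f) above can be sketched as a resampling loop. Here `learn_network` and `has_feature` are hypothetical placeholder callables standing in for the actual network learner and the feature test; they are not functions from the slides.

```python
import random

# Sketch of the bootstrap confidence estimate for a network feature f:
# C(f) = (1/m) * sum_{i=1..m} 1{f appears in the network learned from D_i}.
# `learn_network` and `has_feature` are hypothetical placeholders.

def bootstrap_confidence(data, learn_network, has_feature, m=100, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(m):
        # D_i: resample the dataset with replacement, same size as the original
        resample = [rng.choice(data) for _ in data]
        g = learn_network(resample)      # learn G_i from D_i
        hits += 1 if has_feature(g) else 0
    return hits / m
```

With a stub learner whose output always contains the feature, the estimate is 1.0; in practice the interesting features are those whose confidence stays high across many resampled datasets.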
• Feature assembly: local features (Markov edges, separators, ancestor relations) are assembled into sub-networks and then into a global network.

Sub-Network Results
• Six well-structured sub-networks representing coherent molecular responses:
  - Mating
  - Iron metabolism
  - Low-osmolarity cell wall integrity pathway
  - Stationary phase and stress response
  - Amino acid metabolism, mitochondrial function and sulfate assimilation
  - Citrate metabolism
• The uncovered interactions are regulatory, signaling and metabolic.

"Mating Response" Substructure
• Two branches: cell fusion, under KAR4 (the transcriptional regulator of nuclear fusion), and the outgoing mating signal, under SST2 (the signaling pathway regulator).
• Genes in the substructure: KAR4, SST2, TEC1, NDJ1, KSS1, YLR343W, YLR334C, MFA1, STE6, FUS1, PRM1, AGA1, AGA2, TOM6, FIG1, FUS3, YEL059W -- including the genes that participate in cell fusion.
• We missed STE12 (the main TF), and FUS3 (the main MAPK) is only marginal.

More Insights to Incorporate into the BN Approach
• Sequence data (mainly promoters).
• Protein-protein interaction measurements.
• Cluster analysis of genes and/or experiments.
• Incorporating prior knowledge: the large mass of existing biological knowledge, and insights from sequence/structure databases.