Using Bayesian Networks to Analyze Expression Data
N. Friedman, M. Linial, I. Nachman, D. Pe’er
Hebrew University, Jerusalem

Central Dogma
[Figure: Gene, Transcription, mRNA, Translation, Protein]
Cells express different subsets of the genes in different tissues and under different conditions

Microarrays (aka “DNA chips”)
New technological breakthrough:
  Measure RNA expression levels of thousands of genes in one experiment
  Measure expression on a genomic scale
  Opens up new experimental designs
Many major labs are using, or will use, this technology in the near future

The Problem
[Figure: expression matrix with axes labeled Genes and Experiments, indexed by i and j]
A_ij: the mRNA level of gene j in experiment i
Goal:
  Learn regulatory/metabolic networks
  Identify causal sources of the biological phenomena of interest

Our Approach
Characterize statistical relationships between expression patterns of different genes
Beyond pair-wise interactions:
  Many interactions are explained by intermediate factors
  Regulation involves combined effects of several gene products
We build on the language of Bayesian networks

Network: Example
Noisy stochastic process
Example: Pedigree
[Figure: pedigree with nodes Homer, Marge, Bart, Lisa, Maggie]
A node represents an individual’s genotype
Modeling assumptions:
  Ancestors can affect descendants' genotype only by passing genetic material through intermediate generations

Network Structure
[Figure: DAG with nodes labeled Ancestor, Parent, X, Y1, Y2]
Generalizing to DAGs:
  A child is conditionally independent of its non-descendants, given the value of its parents
  Often a natural assumption for causal processes if we believe that we capture the relevant state of each intermediate stage
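The Markov assumption stated above can be checked numerically on a toy chain. The sketch below assumes a hypothetical three-node network Ancestor -> Parent -> Child with binary variables; the probability values are made up for illustration and do not come from the slides. It confirms that, once the Parent is known, the Ancestor adds no further information about the Child.

```python
from itertools import product

# Hypothetical local probabilities for the chain Ancestor -> Parent -> Child.
# All variables are binary (0 or 1); the numbers are illustrative only.
P_ANCESTOR = {0: 0.7, 1: 0.3}
P_PARENT_GIVEN_ANCESTOR = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
P_CHILD_GIVEN_PARENT = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}


def joint(a, p, c):
    """Joint probability defined by the product of the local distributions."""
    return (
        P_ANCESTOR[a]
        * P_PARENT_GIVEN_ANCESTOR[a][p]
        * P_CHILD_GIVEN_PARENT[p][c]
    )


def child_given(c, p, a=None):
    """P(Child=c | Parent=p), or P(Child=c | Parent=p, Ancestor=a) if a is set."""
    ancestors = [a] if a is not None else [0, 1]
    numerator = sum(joint(x, p, c) for x in ancestors)
    denominator = sum(joint(x, p, y) for x in ancestors for y in (0, 1))
    return numerator / denominator


def max_markov_violation():
    """Largest gap between P(C | P, A) and P(C | P) over all value combinations."""
    return max(
        abs(child_given(c, p, a) - child_given(c, p))
        for a, p, c in product((0, 1), repeat=3)
    )


if __name__ == "__main__":
    # A gap at the level of floating-point rounding means the Child is
    # conditionally independent of the Ancestor given the Parent.
    print(max_markov_violation())
```

The independence holds here by construction, because the joint distribution is defined as the product of the local distributions. That is exactly the sense in which the graph structure encodes the assumption.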
[Network Structure figure, remaining labels: Non-descendant, Descendant]

Local Probabilities
Associated with each variable X_i is a conditional probability distribution P(X_i | Pa_i)
Discrete variables:
  Multinomial distribution
  [Table P(Y | X): one row per value of X, with entries (0.9, 0.1) and (0.3, 0.7)]
Continuous variables:
  Choice: for example, linear Gaussian
  [Figure: P(Y | X) plotted over X and Y]

Bayesian Network Semantics
[Figure: DAG over nodes B, E, R, A, C]
Qualitative part: DAG specifies conditional independence statements
Quantitative part: local probability models
Qualitative part + quantitative part = unique joint distribution over domain
Compact & efficient representation:
  k parents: O(2^k n) vs. O(2^n) params
  parameters pertain to local interactions
P(C,A,R,E,B) = P(B)*P(E|B)*P(R|E,B)*P(A|R,B,E)*P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B)*P(E)*P(R|E)*P(A|B,E)*P(C|A)

Why Bayesian Networks?
Bayesian Networks:
  Flexible representation of dependency structure of multivariate distributions
  Natural for modeling processes with local interactions
Learning of Bayesian Networks:
  Can learn dependencies from observations
Handles stochastic processes:
  “true” stochastic behavior
  noise in measurements

Modeling Regulatory Interactions
Variables of interest:
  Expression levels of genes
  Concentration levels of proteins (proteomics!)
  Exogenous variables: nutrient levels, metabolite levels, temperature, phenotype information, …
Bayesian Network Structure:
  Capture dependencies among these variables

Examples
Measured expression level of each gene: random variables
Gene interaction: probabilistic dependencies
Interactions are represented by a graph:
  Each gene is represented by a node in the graph
  Edges between the nodes represent direct dependency
[Figure: nodes A and B shown twice, with an X mark]

More Complex Examples
Dependencies can be mediated through other nodes
[Figure: two graphs over nodes A, B, C, labeled "Common cause" and "Intermediate gene"]
Common effects can imply conditional dependence
[Figure: graph over nodes A, B, C]

Outline of Our Approach
Expression data -> Bayesian Network Learning Algorithm -> learned network
[Figure: DAG over nodes B, E, R, A, C]
Use learned network to make predictions about structure of the interactions between genes

Experiment
Data from Spellman et al. (Mol. Bio. of the Cell, 1998)
Contains 76 samples of all the yeast genome:
  Different methods for synchronizing cell-cycle in yeast
  Time series at intervals of a few minutes (5-20 min)
Spellman et al. identified 800 cell-cycle regulated genes.

Methods
Treat samples as IID (ignoring temporal order)
Experiment 1:
  Discretized into three levels of expression by log(ratio to control): "-" below -0.5, "0" between -0.5 and 0.5, "+" above 0.5
  Learn multinomial probabilities
Experiment 2:
  Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used

Network Learned
[Figure not preserved]

Challenge: Statistical Significance
Sparse data:
  Small number of samples
  “Flat posterior”: many networks fit the data
Solution:
  Estimate confidence in network features
Two types of features:
  Markov neighbors: X directly interacts with Y
  Order relations: X is an ancestor of Y

Confidence Estimates
Bootstrap approach [FGW, UAI99]:
  Resample D to obtain datasets D1, D2, ..., Dm
  Learn a network Gi from each Di
  Estimate:
  $$C(f) = \frac{1}{m} \sum_{i=1}^{m} 1\{f \in G_i\}$$
[Figure: D is resampled into D1, D2, ..., Dm, and a network over nodes B, E, R, A, C is learned from each]

Testing for Significance
We run our procedure on randomized data where we reshuffled the order of values for each gene
[Figure: histograms of number of Markov features at each confidence level; panels: Original Data, Randomized Data]
[Figure: Markov w/ Gaussian Models; features with confidence above t, plotted against t; series: Random, Real]
[Figure: Markov w/ Multinomial Models; features with confidence above t, plotted against t; series: Random, Real]

Local Map
[Figure not preserved]

Finding Key Genes
Key gene: a gene that precedes many other genes
  YLR183C
  MCD1: Mitotic Chromosome Determinant
  RAD27: DNA repair protein
  CLN2: role in cell cycle START
  SRO4: involved in cellular polarization during budding
  YOX1: Homeodomain protein that binds leu-tRNA gene
  POL30: required for DNA replication and repair
  YLR467W
  CDC5
  MSH6: Homolog of the human GTBP protein
  YML119W
  CLN1: role in cell cycle START

Future Work
Finding suitable local distribution models
Correct handling of hidden variables:
  Can we recognize hidden causes of coordinated regulation events?
Incorporating prior knowledge:
  Incorporate large mass of biological knowledge, and insight from sequence/structure databases
Abstraction:
  Combine with cluster analysis of higher confidence conclusions
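
Returning to the Confidence Estimates and Testing for Significance slides, the bootstrap estimate and the per-gene reshuffling can be sketched in a few lines. This is a minimal illustration under loud assumptions: `learn_features` is a hypothetical stand-in that calls two genes "neighbors" when their absolute Pearson correlation exceeds a threshold. It is not the Bayesian network structure learner used in the talk. `make_toy_data` generates synthetic values, not the Spellman et al. measurements.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def learn_features(samples, threshold=0.5):
    """Hypothetical stand-in for structure learning: returns a set of gene pairs."""
    n_genes = len(samples[0])
    columns = [[row[g] for row in samples] for g in range(n_genes)]
    return {
        (i, j)
        for i, j in combinations(range(n_genes), 2)
        if abs(pearson(columns[i], columns[j])) > threshold
    }


def bootstrap_confidence(samples, m=100, seed=0):
    """Estimate C(f) = (1/m) * sum_i 1{f in G_i} over m resampled datasets."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(m):
        # Resample the experiments with replacement, keeping the sample size.
        resampled = [rng.choice(samples) for _ in samples]
        for feature in learn_features(resampled):
            counts[feature] = counts.get(feature, 0) + 1
    return {feature: count / m for feature, count in counts.items()}


def shuffle_each_gene(samples, seed=0):
    """Randomized control: independently permute the values of every gene."""
    rng = random.Random(seed)
    columns = []
    for g in range(len(samples[0])):
        column = [row[g] for row in samples]
        rng.shuffle(column)
        columns.append(column)
    return [list(row) for row in zip(*columns)]


def make_toy_data(n=76, seed=0):
    """Synthetic data: gene 1 tracks gene 0 closely; gene 2 is independent."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        g0 = rng.gauss(0, 1)
        rows.append([g0, g0 + 0.3 * rng.gauss(0, 1), rng.gauss(0, 1)])
    return rows


if __name__ == "__main__":
    data = make_toy_data()
    print("real:", bootstrap_confidence(data))
    print("shuffled:", bootstrap_confidence(shuffle_each_gene(data)))
```

Comparing the confidence values on the real and shuffled data mirrors the Real versus Random comparison in the significance plots: features that keep high confidence only on the unshuffled data are the ones worth attention.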