Download Presentation - CS

Using Bayesian Networks to Analyze Expression Data
N. Friedman, M. Linial, I. Nachman, D. Pe’er
Hebrew University, Jerusalem

Central Dogma
[Figure: Gene -> (transcription) -> mRNA -> (translation) -> Protein]
Cells express different subsets of the genes in different tissues and under different conditions

Microarrays (aka “DNA chips”)
• New technological breakthrough:
  - Measure RNA expression levels of thousands of genes in one experiment
  - Measure expression on a genomic scale
  - Opens up new experimental designs
• Many major labs are using, or will use, this technology in the near future

The Problem
[Figure: expression matrix A, with genes indexed by i and experiments indexed by j]
Aij - the mRNA level of gene i in experiment j
Goal:
• Learn regulatory/metabolic networks
• Identify causal sources of the biological phenomena of interest

Our Approach
• Characterize statistical relationships between expression patterns of different genes
• Beyond pair-wise interactions
  - Many interactions are explained by intermediate factors
  - Regulation involves combined effects of several gene products
We build on the language of Bayesian networks

Network: Example
Noisy stochastic process:
Example: Pedigree
[Figure: pedigree with nodes Homer, Marge, Bart, Lisa, Maggie]
• A node represents an individual’s genotype
• Modeling assumptions: Ancestors can affect descendants' genotype only by passing genetic materials through intermediate generations

Network Structure
[Figure: DAG around a node X, with labels Ancestor, Parent, Y1, Y2]
Generalizing to DAGs:
• A child is conditionally independent of its non-descendants, given the value of its parents
• Often a natural assumption for causal processes
  - if we believe that we capture the relevant state of each intermediate stage.
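
The conditional independence statement for DAGs can be made concrete with a minimal sketch. It assumes the pedigree has Homer and Marge as the parents of Bart, Lisa, and Maggie (the edges are read off the names on the slide, not stated there), and the helper names `descendants` and `local_markov_statements` are hypothetical. For each node it lists the non-descendants the node is independent of once its parents are known.

```python
# Hypothetical pedigree DAG: each node maps to its list of parents.
# The parent-child edges are an assumption; the slide only names the individuals.
PARENTS = {
    "Homer": [],
    "Marge": [],
    "Bart": ["Homer", "Marge"],
    "Lisa": ["Homer", "Marge"],
    "Maggie": ["Homer", "Marge"],
}


def descendants(node, parents):
    """Return every node reachable from `node` by following child edges."""
    children = {n: [c for c, ps in parents.items() if n in ps] for n in parents}
    found, stack = set(), list(children[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children[current])
    return found


def local_markov_statements(parents):
    """For each node X, list the non-descendants (other than its parents)
    that X is conditionally independent of given its parents."""
    statements = {}
    for node, pa in parents.items():
        excluded = descendants(node, parents) | set(pa) | {node}
        statements[node] = sorted(set(parents) - excluded)
    return statements


for node, others in local_markov_statements(PARENTS).items():
    given = ", ".join(PARENTS[node]) or "nothing"
    print(f"{node} is independent of {others} given {given}")
```

Under these assumed edges, each child is independent of its siblings once both parents' genotypes are known, which is the pedigree version of the modeling assumption above.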
[Figure labels, continued: Non-descendant, Descendant]

Local Probabilities
• Associated with each variable Xi is a conditional probability distribution P(Xi | Pai)
• Discrete variables: Multinomial distribution
  [Table: P(Y | X), one row per value of X, with rows (0.9, 0.1) and (0.3, 0.7)]
• Continuous variables: Choice: for example linear Gaussian
  [Figure: plot of P(Y | X) over X and Y]

Bayesian Network Semantics
[Figure: DAG over the nodes B, E, R, A, C]
Qualitative part (DAG specifies conditional independence statements) + Quantitative part (local probability models) = Unique joint distribution over domain
• Compact & efficient representation:
  - k parents: O(2^k n) vs. O(2^n) params
  - parameters pertain to local interactions
P(C,A,R,E,B) = P(B)*P(E|B)*P(R|E,B)*P(A|R,B,E)*P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B)*P(E)*P(R|E)*P(A|B,E)*P(C|A)

Why Bayesian Networks?
Bayesian Networks:
• Flexible representation of dependency structure of multivariate distributions
• Natural for modeling processes with local interactions
Learning of Bayesian Networks
• Can learn dependencies from observations
• Handles stochastic processes:
  - “true” stochastic behavior
  - noise in measurements

Modeling Regulatory Interactions
Variables of interest:
• Expression levels of genes
• Concentration levels of proteins (proteomics!)
• Exogenous variables: Nutrient levels, Metabolite Levels, Temperature
• Phenotype information
• …
Bayesian Network Structure:
• Capture dependencies among these variables

Examples
Measured expression level of each gene -> Random variables
Gene interaction -> Probabilistic dependencies
Interactions are represented by a graph:
• Each gene is represented by a node in the graph
• Edges between the nodes represent direct dependency
[Figure: example graphs over the nodes X, A, B]

More Complex Examples
• Dependencies can be mediated through other nodes
  [Figure: graphs over A, B, C labeled "Intermediate gene" and "Common cause"]
• Common effects can imply conditional dependence
  [Figure: graph over A, B, C]

Outline of Our Approach
[Figure: Expression data -> Bayesian Network Learning Algorithm -> network over B, E, R, A, C]
Use learned network to make predictions about structure of the interactions between genes

Experiment
Data from Spellman et al. (Mol. Bio. of the Cell 1998)
• Contains 76 samples of all the yeast genome:
  - Different methods for synchronizing cell-cycle in yeast
  - Time series at few minutes (5-20 min) intervals
• Spellman et al. identified 800 cell-cycle regulated genes.

Methods
• Treat samples as IID (ignoring temporal order)
Experiment 1:
• Discretized into three levels of expression: -, 0, + (cut points at -0.5 and 0.5 on the Log(ratio to control) axis)
• Learn multinomial probabilities
Experiment 2:
• Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used

Network Learned
[Figure: learned network]

Challenge: Statistical Significance
Sparse Data
• Small number of samples
• “Flat posterior” -- many networks fit the data
Solution
• estimate confidence in network features
• Two types of features
  - Markov neighbors: X directly interacts with Y
  - Order relations: X is an ancestor of Y

Confidence Estimates
Bootstrap approach [FGW, UAI99]
[Figure: data set D is resampled into D1, D2, ...; a network over B, E, R, A, C is learned from each resample]
[Figure, continued: resample Dm, with a network over B, E, R, A, C learned from it]
Estimate: $C(f) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\{f \in G_i\}$

Testing for Significance
• We run our procedure on randomized data where we reshuffled the order of values for each gene
• Histograms of number of Markov features at each confidence level
[Figure: two histogram panels, "Original Data" and "Randomized Data"]

Testing for Significance
Markov w/ Gaussian Models
[Figure: "Features with Confidence above t", plotted against t for Random and Real data]

Testing for Significance
Markov w/ Multinomial Models
[Figure: "Features with Confidence above t", plotted against t for Random and Real data]

Local Map
[Figure]

Finding Key Genes
Key gene: a gene that precedes many other genes
• YLR183C
• MCD1 Mitotic Chromosome Determinant
• RAD27 DNA repair protein
• CLN2 role in cell cycle START
• SRO4 involved in cellular polarization during budding
• YOX1 Homeodomain protein that binds leu-tRNA gene
• POL30 required for DNA replication and repair
• YLR467W
• CDC5
• MSH6 Homolog of the human GTBP protein
• YML119W
• CLN1 role in cell cycle START

Future Work
• Finding suitable local distribution models
• Correct handling of hidden variables
  - Can we recognize hidden causes of coordinated regulation events?
• Incorporating prior knowledge
  - Incorporate large mass of biological knowledge, and insight from sequence/structure databases
• Abstraction
  - Combine with cluster analysis of higher confidence conclusions
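
The bootstrap estimate on the Confidence Estimates slide and the reshuffling check on the Testing for Significance slides can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' procedure: the learner is a simple stand-in that reports a gene pair as a feature when the absolute correlation exceeds a threshold, in place of Bayesian network structure learning, and the data are synthetic. All function names, the threshold, and the value of m are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def learn_features(samples, threshold=0.6):
    """Stand-in learner (NOT Bayesian network structure learning):
    report the gene pair (a, b) as a feature when |correlation| > threshold."""
    genes = sorted(samples[0])
    features = set()
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            r = pearson([s[a] for s in samples], [s[b] for s in samples])
            if abs(r) > threshold:
                features.add((a, b))
    return features


def bootstrap_confidence(samples, m=200, seed=0):
    """C(f) = (1/m) * sum_i 1{f in G_i}, where G_i is learned from resample D_i."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(m):
        resample = [rng.choice(samples) for _ in samples]  # draw with replacement
        for f in learn_features(resample):
            counts[f] = counts.get(f, 0) + 1
    return {f: c / m for f, c in counts.items()}


def shuffle_each_gene(samples, seed=1):
    """Randomized control: permute each gene's values across samples independently."""
    rng = random.Random(seed)
    columns = {}
    for g in samples[0]:
        values = [s[g] for s in samples]
        rng.shuffle(values)
        columns[g] = values
    return [{g: columns[g][i] for g in columns} for i in range(len(samples))]


# Synthetic toy data, made up for illustration only: g2 tracks g1, g3 is noise.
rng = random.Random(42)
data = []
for _ in range(76):
    g1 = rng.gauss(0, 1)
    data.append({"g1": g1, "g2": g1 + rng.gauss(0, 0.3), "g3": rng.gauss(0, 1)})

real = bootstrap_confidence(data)
randomized = bootstrap_confidence(shuffle_each_gene(data))
print("real:", real)
print("randomized:", randomized)
```

Comparing the two dictionaries mirrors the Real versus Random curves: a feature that reflects a dependency in the data should keep a high confidence under resampling, while the same feature on reshuffled data should not.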
