Bayesian Networks in Bioinformatics

Kyu-Baek Hwang
Biointelligence Lab
School of Computer Science and Engineering
Seoul National University
[email protected]

Copyright (c) 2002 by SNU CSE Biointelligence Lab

Contents
- Bayesian networks – preliminaries
  * Bayesian networks vs. causal networks
  * Partially DAG representation of the Bayesian network
  * Structural learning of the Bayesian network
  * Classification using Bayesian networks
- Microarray data analysis with Bayesian networks
  * Experimental results on the NCI60 data set
- Term Project #3
  * Diagnosis using Bayesian networks

Bayesian Networks
- The joint probability distribution over all the variables in the Bayesian network:

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathbf{Pa}_i)$$

- Local probability distribution for $X_i$
  * $\mathbf{Pa}_i$: the set of parents of $X_i$
  * $\theta_i = (\theta_{i1}, \ldots, \theta_{iq_i})$: parameter for $P(X_i \mid \mathbf{Pa}_i)$
  * $q_i$: # of configurations for $\mathbf{Pa}_i$
  * $r_i$: # of states for $X_i$
- [Figure: example DAG over the nodes A, B, C, D, E]

$$P(A, B, C, D, E) = P(A)\,P(B \mid A)\,P(C \mid A, B)\,P(D \mid A, B, C)\,P(E \mid A, B, C, D)$$
$$= P(A)\,P(B)\,P(C \mid A, B)\,P(D \mid B)\,P(E \mid C)$$

Knowing the Joint Probability Distribution
- We can calculate any conditional probability from the joint probability distribution in principle.
- This Bayesian network can classify the examples by calculating the appropriate conditional probabilities: $P(\text{Class} \mid \text{other variables})$.
- [Figure: Bayesian network over Gene A, Gene B, Gene C, Gene D, Gene E, Gene F, Gene G, Gene H, and Class]

Classification by Bayesian Networks I
- Calculate the conditional probability of the 'Class' variable given the values of the other variables.
- Infer the conditional probability from the joint probability distribution.
For example,

$$P(\text{Class} \mid A, B, C, D, E, F, G, H) = \frac{P(\text{Class}, A, B, C, D, E, F, G, H)}{\sum_{\text{Class}} P(\text{Class}, A, B, C, D, E, F, G, H)},$$

where $A, \ldots, H$ stand for Gene A, ..., Gene H and the summation is taken over all the possible class values.

Knowing the Causal Structure
- Gene C regulates Gene E and F.
- Gene D regulates Gene G and H.
- Class has an effect on Gene F and G.
- [Figure: network over Gene A, Gene B, Gene C, Gene D, Gene E, Gene F, Gene G, Gene H, and Class]

Bayesian Networks vs. Causal Networks
- What the network structure represents
  * Bayesian networks: conditional independencies
  * Causal networks: causal relationships
- The conditional independencies follow from the d-separation property of the Bayesian network structure.
- The network structure asserts that every node is conditionally independent of all of its nondescendants given the values of its immediate parents.

Two Equivalent DAGs
- [Figure: the two DAGs X → Y and X ← Y]
- These two DAGs assert that X and Y are dependent on each other.
- DAGs that assert the same conditional independencies form an equivalence class.
- Causal relationships are hard to learn from observational data.

Verma and Pearl's Theorem
- Theorem: Two DAGs are equivalent if and only if they have the same skeleton and the same v-structures.
- v-structure (X, Z, Y): X and Y are parents of Z and are not adjacent to each other.
- [Figure: X → Z ← Y]

PDAG Representations
- Minimal PDAG representation of the equivalence class
  * The only directed edges are those that participate in v-structures.
- Completed PDAG representation
  * Every directed edge corresponds to a compelled edge, and every undirected edge corresponds to a reversible edge.
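Verma and Pearl's criterion can be checked mechanically. The following is a minimal sketch, not part of the original slides: it assumes a DAG is given as a list of (parent, child) pairs over the same node set, and the helper names `skeleton`, `v_structures`, and `equivalent` are hypothetical.

```python
from itertools import combinations


def skeleton(edges):
    """Undirected version of a DAG given as (parent, child) pairs."""
    return {frozenset(edge) for edge in edges}


def v_structures(edges):
    """All triples (X, Z, Y) with X -> Z <- Y where X and Y are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for z, pa in parents.items():
        # sorted() gives each pair a canonical order, so (X, Z, Y) is unique
        for x, y in combinations(sorted(pa), 2):
            if frozenset((x, y)) not in skel:
                found.add((x, z, y))
    return found


def equivalent(edges_a, edges_b):
    """Same skeleton and same v-structures (assumes identical node sets)."""
    return (skeleton(edges_a) == skeleton(edges_b)
            and v_structures(edges_a) == v_structures(edges_b))


chain = [("X", "Z"), ("Z", "Y")]      # X -> Z -> Y
fork = [("Z", "X"), ("Z", "Y")]       # X <- Z -> Y
collider = [("X", "Z"), ("Y", "Z")]   # X -> Z <- Y

print(equivalent(chain, fork))        # same skeleton, no v-structures
print(equivalent(chain, collider))    # collider has the v-structure (X, Z, Y)
```

The chain and the fork fall into one equivalence class, while the collider is alone in its class, which is why only the edges of a v-structure are directed in the minimal PDAG.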

Example: PDAG Representations
- [Figure: an equivalence class of DAGs over the nodes X, W, Y, Z, V, together with its minimal PDAG and its completed PDAG]

Learning Bayesian Networks
- Metric approach
  * Use a scoring metric to measure how well a particular structure fits an observed set of cases.
  * A search algorithm is used.
  * Find a canonical form of an equivalence class.
- Independence approach
  * An independence oracle (approximated by some statistical test) is queried to identify the equivalence class that captures the independencies in the distribution from which the data was generated.
  * Search for a PDAG.

Scoring Metrics for Bayesian Networks
- Likelihood

$$L(G, \theta_G, C) = P(C \mid G^h, \theta_G)$$

  * $G^h$: the hypothesis that the data ($C$) was generated by a distribution that can be factored according to $G$.
- The maximum likelihood metric of $G$

$$M_{ML}(G, C) = \max_{\theta_G} L(G, \theta_G, C)$$

  * This metric prefers the complete graph structure.

Information Criterion Scoring Metrics
- The Akaike information criterion (AIC) metric

$$M_{AIC}(G, C) = \log M_{ML}(G, C) - \mathrm{Dim}(G)$$

- The Bayesian information criterion (BIC) metric

$$M_{BIC}(G, C) = \log M_{ML}(G, C) - \frac{1}{2}\,\mathrm{Dim}(G) \log N$$

MDL Scoring Metrics
- The minimum description length (MDL) metric 1

$$M_{MDL1}(G, C) = \log P(G) + M_{BIC}(G, C)$$

- The minimum description length (MDL) metric 2

$$M_{MDL2}(G, C) = \log M_{ML}(G, C) - |E_G| \log N - c \cdot \mathrm{Dim}(G)$$

Bayesian Scoring Metrics
- A Bayesian metric

$$M(G, C, \xi) = \log P(G^h \mid \xi) + \log P(C \mid G^h, \xi) + c$$

- The BDe (Bayesian Dirichlet & likelihood equivalence) metric

$$P(C \mid G^h, \xi) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})}$$

Greedy Search Algorithm for Bayesian Network Learning
- Generate the initial Bayesian network structure $G_0$.
- For m = 1, 2, 3, ..., until convergence:
  * Among all the possible local changes to $G_{m-1}$ (insertion of an edge, reversal of an edge, and deletion of an edge), perform the one that leads to the largest improvement in the score.
  * The resulting graph is $G_m$.
- Stopping criterion: Score($G_{m-1}$) == Score($G_m$).
- At each iteration (when learning a Bayesian network of n variables), $O(n^2)$ local changes must be evaluated to select the best one.
- Random restarts are usually adopted to escape local maxima.

Probabilistic Inference
- Calculate the conditional probability given the values of the observed variables.
  * Junction tree algorithm
  * Sampling method
- General probabilistic inference is intractable (NP-hard in the general case).
- However, calculating the conditional probability needed for classification is rather straightforward because of the property of the Bayesian network structure.

The Markov Blanket
- All the variables of interest: $X = \{X_1, X_2, \ldots, X_n\}$
- For a variable $X_i$, its Markov blanket $MB(X_i)$ is the subset of $X - X_i$ which satisfies the following:

$$P(X_i \mid X - X_i) = P(X_i \mid MB(X_i))$$

- Markov boundary: the minimal Markov blanket

Markov Blanket in Bayesian Networks
- Given the Bayesian network structure, the Markov blanket of a variable is straightforward to determine from the conditional independence assertions.
- The Markov blanket of a node in the Bayesian network consists of all of its parents, spouses, and children.
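The parents, spouses, and children rule can be sketched in a few lines. This is an illustration rather than code from the slides: the function name `markov_blanket` is hypothetical, and the edge list is an assumption read off the gene example used in this deck (Class has parents A and B, C regulates E and F, D regulates G and H, Class affects F and G).

```python
def markov_blanket(edges, node):
    """Parents, children, and spouses (other parents of the children) of node."""
    parents = {p for p, c in edges if c == node}
    children = {c for p, c in edges if p == node}
    spouses = {p for p, c in edges if c in children and p != node}
    return parents | children | spouses


# Assumed structure of the gene example: (parent, child) pairs.
EDGES = [
    ("A", "Class"), ("B", "Class"),
    ("C", "E"), ("C", "F"),
    ("Class", "F"), ("Class", "G"),
    ("D", "G"), ("D", "H"),
]

print(sorted(markov_blanket(EDGES, "Class")))
# ['A', 'B', 'C', 'D', 'F', 'G']
```

Under this structure, Gene E and Gene H lie outside the Markov blanket of Class, so their values do not affect P(Class | all other variables).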
- [Figure: Bayesian network over Gene A, Gene B, Gene C, Gene D, Gene E, Gene F, Gene G, Gene H, and Class]

Classification by Bayesian Networks II

Writing $A, \ldots, H$ for Gene A, ..., Gene H:

$$P(\text{Class} \mid A, B, C, D, E, F, G, H) = \frac{P(\text{Class}, A, B, C, D, E, F, G, H)}{\sum_{\text{Class}} P(\text{Class}, A, B, C, D, E, F, G, H)}$$

$$= \frac{P(A)P(B)P(C)P(\text{Class} \mid A, B)P(D)P(E \mid C)P(F \mid C, \text{Class})P(G \mid \text{Class}, D)P(H \mid D)}{\sum_{\text{Class}} P(A)P(B)P(C)P(\text{Class} \mid A, B)P(D)P(E \mid C)P(F \mid C, \text{Class})P(G \mid \text{Class}, D)P(H \mid D)}$$

$$= \frac{P(A)P(B)P(C)P(\text{Class} \mid A, B)P(D)P(F \mid C, \text{Class})P(G \mid \text{Class}, D)}{\sum_{\text{Class}} P(A)P(B)P(C)P(\text{Class} \mid A, B)P(D)P(F \mid C, \text{Class})P(G \mid \text{Class}, D)}$$

$$\propto P(\text{Class} \mid A, B)\,P(F \mid C, \text{Class})\,P(G \mid \text{Class}, D)$$

The factors $P(E \mid C)$ and $P(H \mid D)$ do not depend on Class, so they cancel between the numerator and the denominator.

DNA Microarrays
- Monitor thousands of gene expression levels simultaneously, in contrast to traditional one-gene experiments.
- Fabricated by high-speed robotics.
- Known probes

A Comparative Hybridization Experiment
- Image analysis
- [Figure: comparative hybridization experiment]

Mining on Gene Expression and Drug Activity Data
- Relationships among human cancer, gene expression, and drug activity
- Revealing these relationships leads to:
  * Causes and mechanisms of cancer development
  * New molecular targets for anti-cancer drugs

NCI (National Cancer Institute) Drug Discovery Program
- NCI 60 cell lines data set

NCI60 Cell Lines Data Set
- From 60 human cancer cell lines
  * Colorectal, renal, ovarian, breast, prostate, lung, and central nervous system origin cancers, as well as leukemias and melanomas
- Gene expression patterns
  * cDNA microarray
- Drug activity patterns
  * Sulphorhodamine B assay: changes in total cellular protein after 48 hours of drug treatment

Schematic View of the Modeling Approach
- [Figure: gene expression data and drug activity data pass through preprocessing (thresholding, clustering, discretization); the selected genes, drugs, and cancer type node are used to learn a Bayesian network over nodes such as Gene A, Gene B, Drug A, Drug B, and Cancer; the learned Bayesian network is used for dependency analysis and probabilistic inference]

Data Preparation
- cDNA microarray data
  * Gene expression profiles on 60 cell lines
  * 1376 × 60 matrix (1376 genes, 60 samples)
- Drug activity data
  * Drug activity patterns on 60 cell lines
  * 118 × 60 matrix (118 drugs, 60 samples)
- Combined: (1376 + 118) × 60 data matrix

Preprocessing
- Thresholding
  * Elimination of unknown ESTs: 1376 genes → 805 genes
  * Elimination of drugs which have more than 4 missing values: 118 drugs → 84 drugs
  * 60 samples throughout
- Discretization
  * Local probability model for Bayesian networks: multinomial distribution
  * Three levels (-1, 0, 1), with boundaries at $\mu - c\sigma$ and $\mu + c\sigma$

Bayesian Network Learning for Gene-Drug Analysis
- Large-scale Bayesian network
  * Several hundred nodes (up to 890)
  * General greedy search is inapplicable because of time and space complexity.
- Search heuristics
  * Local to global search heuristics
  * Exploit the locality of Bayesian networks to reduce the entire search space.
  * The local structure: Markov blanket
  * Find the candidate Markov blanket (of pre-determined size k) of each node → reduce the global search space

Local to Global Search Heuristics
Input:
- A data set $D$.
- An initial Bayesian network structure $B_0$.
- A decomposable scoring metric, $\mathrm{Score}(B, D) = \sum_i \mathrm{Score}(X_i \mid \mathbf{Pa}_B(X_i), D)$.
Output: A Bayesian network structure $B$.
Loop for n = 1, 2, ..., until convergence.
- Local Search Step:
  * Based on $D$ and $B_{n-1}$, select for each $X_i$ a set $CB_i^n$ ($|CB_i^n| \le k$) of candidate Markov blanket members of $X_i$.
  * For each set $\{X_i, CB_i^n\}$, learn the local structure and determine the Markov blanket of $X_i$, $BL^n(X_i)$, from this local structure.
  * Merge all Markov blanket structures $G(\{X_i, BL^n(X_i)\}, E_i)$ into a global network structure $H_n$ (could be cyclic).
- Global Search Step:
  * Find the Bayesian network structure $B_n \subset H_n$ which maximizes $\mathrm{Score}(B_n, D)$ and retains all noncyclic edges in $H_n$.

Dimensionality Problem
- The number of attributes (nodes) >> sample size
  * Unreliable structure of the learned Bayesian networks
  * Probabilistic inference is nearly impossible.
- Downsize the number of attributes by clustering
  * Prototype: mean of all members in a cluster
  * Done in the preprocessing step

Bayesian Network with 45 Prototypes
- Node types (46 nodes in all)
  * 40 gene prototypes
  * 5 drug prototypes
  * Cancer label
- Discretization boundary: $\mu - c\sigma$, $\mu + c\sigma$

  Distribution ratio
  c       -1        0         1
  0.43    33.3 %    33.3 %    33.3 %
  0.50    30.8 %    38.3 %    30.8 %
  0.60    27.4 %    45.1 %    27.4 %

- Bayesian network learning
  * Varying candidate Markov blanket size (k = 5 ~ 15)
  * Select the best one
  * Three data sets (c = 0.43, 0.50, 0.60) → three Bayesian networks
- Probabilistic inference

Correlations between ASNS and L-Asparaginase
- Part of the Bayesian network (c = 0.60)
- D2: prototype for L-Asparaginase
- G4: prototype for ASNS and SID W 484773, PYRROLINE-5-CARBOXYLATE REDUCTASE [5':AA037688, 3':AA037689]

  < Conditional probability table >
  P(D2|G4)   D2 = -1    D2 = 0     D2 = 1
  G4 = -1    0.32096    0.27086    0.40818
  G4 = 0     0.31387    0.41247    0.27366
  G4 = 1     0.32167    0.34920    0.32913

Bayesian Networks on Subset of Genes and Drugs
- Node types (17 nodes in all)
  * 12 genes
  * 4 drugs
  * Cancer label
- Discretization boundary: $\mu - c\sigma$, $\mu + c\sigma$

  Distribution ratio
  c       -1        0         1
  0.43    33.3 %    33.3 %    33.3 %
  0.50    30.8 %    38.3 %    30.8 %
  0.60    27.4 %    45.1 %    27.4 %

- Clustering of genes and drugs together
  * From neighboring clusters
- Bayesian network learning
  * General greedy search with restart (100 times)
  * Select the best one
  * Three data sets (c = 0.43, 0.50, 0.60) → three Bayesian networks
- Probabilistic inference

Around the L-Asparaginase
- [Figure: < Part of the Bayesian network (c = 0.6) >]

Probabilistic Relationships Around the L-Asparaginase
- D1: L-Asparaginase
- G1: ASNS gene
- G2: PYRROLINE-5-CARBOXYLATE REDUCTASE

Cancer type unobserved:

  P(D1|G1)   D1 = -1    D1 = 0     D1 = 1
  G1 = -1    0.19857    0.27471    0.52672
  G1 = 0     0.31110    0.49795    0.19095
  G1 = 1     0.42159    0.36279    0.21561

  P(D1|G2)   D1 = -1    D1 = 0     D1 = 1
  G2 = -1    0.27510    0.35226    0.37263
  G2 = 0     0.31621    0.41072    0.27307
  G2 = 1     0.33837    0.39664    0.26499

Cancer type observed (L = leukemia):

  P(D1|G1,L) D1 = -1    D1 = 0     D1 = 1
  G1 = -1    0.17536    0.22838    0.59626
  G1 = 0     0.27128    0.53790    0.19081
  G1 = 1     0.38500    0.42437    0.19063

  P(D1|G2,L) D1 = -1    D1 = 0     D1 = 1
  G2 = -1    0.23812    0.33853    0.42335
  G2 = 0     0.27978    0.42666    0.29356
  G2 = 1     0.30371    0.42108    0.27520

Term Project #3: Diagnosis Using Bayesian Networks

Outline
- Task 1: Structural learning of the Bayesian network
  * Data generation from the ALARM network
  * Structural learning of Bayesian networks using more than two kinds of algorithms and scores
  * Compare the learned results with respect to edge errors across the various sample sizes and learning algorithms
- Task 2: Classification using Bayesian networks
  * Arbitrarily divide the Leukemia data set into a training set and a test set
  * Learn the Bayesian network from the training data set using one of the metric-based approaches
  * Evaluate the performance of the Bayesian network as a classifier (classification accuracy)

Data Generation
- Using the Netica software (http://www.norsys.com)
- The ALARM network
  * # of nodes: 37
  * # of edges: 46

Structural Learning
- Independence method
  * BN Power constructor (http://www.cs.ualberta.ca/~jcheng/bnsoft.htm)
- Metric-based method
  * LearnBayes (http://www.cs.huji.ac.il/labs/compbio/LibB/)
  * MDL, BIC, BD, and likelihood scores can be used.

The Leukemia Data Set
- Class type
  * ALL (acute lymphoblastic leukemia) or AML (acute myeloid leukemia)
- Data set
  * # of attributes: 50 gene expression levels (0 or 1)
  * # of samples: 72

Submission
- Deadline: 2002. 11. 27
- Location: 301-419