Bayesian Networks
in Bioinformatics
Kyu-Baek Hwang
Biointelligence Lab
School of Computer Science and Engineering
Seoul National University
Contents

Bayesian networks – preliminaries
- Bayesian networks vs. causal networks
- Partial DAG (PDAG) representation of the Bayesian network
- Structural learning of the Bayesian network
- Classification using Bayesian networks

Microarray data analysis with Bayesian networks
- Experimental results on the NCI60 data set

Term Project #3
- Diagnosis using Bayesian networks

Bayesian Networks

The joint probability distribution over all the variables in the Bayesian network factorizes as

P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | Pa_i)

where each factor P(X_i | Pa_i) is the local probability distribution for X_i, and
- Pa_i : the set of parents of X_i
- \theta_i = (\theta_{i1}, \ldots, \theta_{i q_i}) : the parameters of P(X_i | Pa_i)
- q_i : the number of configurations of Pa_i
- r_i : the number of states of X_i

< Example network over the nodes A, B, C, D, E >

P(A, B, C, D, E)
= P(A) P(B|A) P(C|A,B) P(D|A,B,C) P(E|A,B,C,D)
= P(A) P(B) P(C|A,B) P(D|B) P(E|C)

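To make the factorization concrete, here is a minimal Python sketch that evaluates the factored joint for the five-node example above; the CPT values are invented for illustration and are not taken from the slides.

# Minimal sketch: evaluate the factored joint P(A,B,C,D,E) for binary variables.
# The CPT numbers below are illustrative only.

def p_a(a):       return {0: 0.7, 1: 0.3}[a]
def p_b(b):       return {0: 0.6, 1: 0.4}[b]
def p_c(c, a, b): return {0: 0.9, 1: 0.1}[c] if a == b else {0: 0.2, 1: 0.8}[c]
def p_d(d, b):    return {0: 0.5, 1: 0.5}[d] if b == 0 else {0: 0.1, 1: 0.9}[d]
def p_e(e, c):    return {0: 0.8, 1: 0.2}[e] if c == 0 else {0: 0.3, 1: 0.7}[e]

def joint(a, b, c, d, e):
    # P(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D|B) P(E|C)
    return p_a(a) * p_b(b) * p_c(c, a, b) * p_d(d, b) * p_e(e, c)

# Sanity check: the joint must sum to 1 over all 2^5 configurations.
total = sum(joint(a, b, c, d, e)
            for a in (0, 1) for b in (0, 1) for c in (0, 1)
            for d in (0, 1) for e in (0, 1))
print(total)  # ~1.0
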
Knowing the Joint Probability Distribution

We can calculate any conditional probability from the joint probability distribution, in principle.

< Figure: a Bayesian network over Gene A–H and the Class node >

This Bayesian network can classify examples by calculating the appropriate conditional probability P(Class | other variables).

Classification by Bayesian Networks I

Calculate the conditional probability of the 'Class' variable given the values of the other variables.
- Infer the conditional probability from the joint probability distribution. For example,

P(Class | Gene A, Gene B, Gene C, Gene D, Gene E, Gene F, Gene G, Gene H)
= P(Class, Gene A, Gene B, Gene C, Gene D, Gene E, Gene F, Gene G, Gene H) / \sum_{Class} P(Class, Gene A, Gene B, Gene C, Gene D, Gene E, Gene F, Gene G, Gene H),

where the summation is taken over all the possible class values.

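A minimal Python sketch of this computation, assuming some function joint(class_value, gene_values) that evaluates the joint probability (for instance, built from CPTs as in the earlier sketch):

# Minimal sketch: P(Class | genes) by normalizing the joint over all class values.

def posterior_over_class(joint, gene_values, class_values):
    """Return {c: P(Class=c | gene_values)}."""
    unnormalized = {c: joint(c, gene_values) for c in class_values}
    z = sum(unnormalized.values())          # summation over all class values
    return {c: p / z for c, p in unnormalized.items()}

# Toy usage with an illustrative joint (numbers are placeholders):
toy_joint = lambda c, genes: {0: 0.02, 1: 0.06}[c]
print(posterior_over_class(toy_joint, gene_values=None, class_values=(0, 1)))
# {0: 0.25, 1: 0.75}
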
Knowing the Causal Structure
< Figure: the same network read as a causal graph over Gene A–H and Class >

- Gene C regulates Gene E and Gene F.
- Gene D regulates Gene G and Gene H.
- Class has an effect on Gene F and Gene G.

Bayesian Networks vs. Causal Networks
What the network structure encodes:
- Bayesian networks: conditional independencies (by the d-separation property of the Bayesian network structure)
- Causal networks: causal relationships

The network structure asserts that every node is conditionally independent of all of its non-descendants given the values of its immediate parents.

Two Equivalent DAGs

< Figure: X → Y and X ← Y >

These two DAGs assert the same thing: that X and Y are dependent on each other.
- the same conditional independencies → the same equivalence class

Causal relationships are hard to learn from observational data.

Verma and Pearl’s Theorem

Theorem:
- Two DAGs are equivalent if and only if they have the same skeleton and the same v-structures.

< Figure: X → Z ← Y >

v-structure (X, Z, Y): X and Y are parents of Z and are not adjacent to each other.

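The theorem translates directly into an equivalence test. A minimal Python sketch, representing a DAG as a set of (parent, child) edges:

# Minimal sketch: test DAG equivalence by comparing skeletons and v-structures.

def skeleton(dag):
    return {frozenset(edge) for edge in dag}

def v_structures(dag):
    """Return triples (x, z, y): x and y are parents of z and not adjacent."""
    skel = skeleton(dag)
    vs = set()
    for (x, z1) in dag:
        for (y, z2) in dag:
            if z1 == z2 and x < y and frozenset((x, y)) not in skel:
                vs.add((x, z1, y))
    return vs

def equivalent(dag1, dag2):
    return skeleton(dag1) == skeleton(dag2) and v_structures(dag1) == v_structures(dag2)

# X -> Y and X <- Y: same skeleton, no v-structures, so equivalent.
print(equivalent({("X", "Y")}, {("Y", "X")}))                       # True
# X -> Z <- Y is not equivalent to the chain X -> Z -> Y.
print(equivalent({("X", "Z"), ("Y", "Z")}, {("X", "Z"), ("Z", "Y")}))  # False
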
PDAG Representations

Minimal PDAG representation of the equivalence class
- The only directed edges are those that participate in v-structures.

Completed PDAG representation
- Every directed edge corresponds to a compelled edge, and every undirected edge corresponds to a reversible edge.

Example: PDAG Representations
< Figure: an equivalence class of DAGs over the nodes V, W, X, Y, Z, together with its minimal PDAG and its completed PDAG >

Learning Bayesian Networks

Metric approach
- Use a scoring metric to measure how well a particular structure fits an observed set of cases.
- A search algorithm is used. → Find a canonical form of an equivalence class.

Independence approach
- An independence oracle (approximated by some statistical test) is queried to identify the equivalence class that captures the independencies in the distribution from which the data was generated. → Search for a PDAG.

Scoring Metrics for Bayesian Networks

Likelihood: L(G, \Theta_G, C) = P(C | G^h, \Theta_G)
- G^h: the hypothesis that the data (C) was generated by a distribution that can be factored according to G.

The maximum likelihood metric of G:

M_ML(G, C) = \max_{\Theta_G} L(G, \Theta_G, C)

- prefers the complete graph structure

Information Criterion Scoring Metrics

The Akaike information criterion (AIC) metric:

M_AIC(G, C) = \log M_ML(G, C) - Dim(G)

The Bayesian information criterion (BIC) metric:

M_BIC(G, C) = \log M_ML(G, C) - (1/2) Dim(G) \log N

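A minimal Python sketch of the two scores, assuming the maximized log-likelihood log M_ML(G, C), the dimension Dim(G), and the sample size N are already computed:

import math

def aic_score(log_ml, dim_g):
    # M_AIC(G, C) = log M_ML(G, C) - Dim(G)
    return log_ml - dim_g

def bic_score(log_ml, dim_g, n):
    # M_BIC(G, C) = log M_ML(G, C) - (1/2) Dim(G) log N
    return log_ml - 0.5 * dim_g * math.log(n)

# Example: equal fit, different complexity; the simpler structure scores higher.
print(aic_score(-1200.0, 40), aic_score(-1200.0, 80))
print(bic_score(-1200.0, 40, 60), bic_score(-1200.0, 80, 60))
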
MDL Scoring Metrics

The minimum description length (MDL) metric 1:

M_MDL1(G, C) = \log P(G) + M_BIC(G, C)

The minimum description length (MDL) metric 2:

M_MDL2(G, C) = \log M_ML(G, C) - |E_G| \log N - c \cdot Dim(G)

Bayesian Scoring Metrics

A Bayesian metric:

M(G, C, \xi) = \log P(G^h | \xi) + \log P(C | G^h, \xi) + c

The BDe (Bayesian Dirichlet & likelihood equivalence) metric:

P(C | G^h, \xi) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} [ \Gamma(N'_{ij}) / \Gamma(N'_{ij} + N_{ij}) ] \prod_{k=1}^{r_i} [ \Gamma(N'_{ijk} + N_{ijk}) / \Gamma(N'_{ijk}) ]

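A minimal Python sketch of the BDe marginal-likelihood term, assuming the data counts N_ijk and the prior counts N'_ijk have already been tallied per variable; it uses scipy.special.gammaln for the log-Gamma values:

import numpy as np
from scipy.special import gammaln

def bde_log_marginal_likelihood(counts, prior_counts):
    """log P(C | G^h, xi).

    counts[i] and prior_counts[i] are (q_i x r_i) arrays holding N_ijk and
    N'_ijk for variable X_i: one row per parent configuration j, one column
    per state k.
    """
    log_p = 0.0
    for n_ijk, n_ijk_prime in zip(counts, prior_counts):
        n_ij = n_ijk.sum(axis=1)               # N_ij  = sum_k N_ijk
        n_ij_prime = n_ijk_prime.sum(axis=1)   # N'_ij = sum_k N'_ijk
        log_p += np.sum(gammaln(n_ij_prime) - gammaln(n_ij_prime + n_ij))
        log_p += np.sum(gammaln(n_ijk_prime + n_ijk) - gammaln(n_ijk_prime))
    return log_p

# Toy usage: one binary variable with one binary parent, uniform prior N'_ijk = 1.
counts = [np.array([[3., 1.], [0., 4.]])]
priors = [np.ones((2, 2))]
print(bde_log_marginal_likelihood(counts, priors))
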
Greedy Search Algorithm for Bayesian Network Learning

Generate the initial Bayesian network structure G_0.
- For m = 1, 2, 3, …, until convergence: among all the possible local changes (insertion of an edge, reversal of an edge, and deletion of an edge) in G_{m-1}, perform the one that leads to the largest improvement in the score. The resulting graph is G_m.

Stopping criterion
- Score(G_{m-1}) == Score(G_m).

At each iteration (learning a Bayesian network of n variables)
- O(n^2) local changes must be evaluated to select the best one.

Random restarts are usually adopted to escape local maxima.

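A minimal Python sketch of this greedy loop (random restarts omitted), assuming some scoring function score(edges, data) such as one of the metrics above, with DAGs represented as sets of (parent, child) edges:

from itertools import permutations

def is_acyclic(edges, nodes):
    # Kahn-style topological check: the graph is a DAG iff every node gets processed.
    indegree = {v: 0 for v in nodes}
    for _, child in edges:
        indegree[child] += 1
    frontier = [v for v in nodes if indegree[v] == 0]
    seen = 0
    while frontier:
        v = frontier.pop()
        seen += 1
        for parent, child in edges:
            if parent == v:
                indegree[child] -= 1
                if indegree[child] == 0:
                    frontier.append(child)
    return seen == len(nodes)

def local_changes(edges, nodes):
    # All structures reachable by one edge insertion, deletion, or reversal.
    for x, y in permutations(nodes, 2):
        if (x, y) in edges:
            yield edges - {(x, y)}                  # deletion
            yield (edges - {(x, y)}) | {(y, x)}     # reversal
        elif (y, x) not in edges:
            yield edges | {(x, y)}                  # insertion

def greedy_search(nodes, data, score, initial=frozenset()):
    current, current_score = set(initial), score(set(initial), data)
    while True:
        best, best_score = None, current_score
        for candidate in local_changes(current, nodes):
            if is_acyclic(candidate, nodes):
                s = score(candidate, data)
                if s > best_score:
                    best, best_score = candidate, s
        if best is None:                            # Score(G_{m-1}) == Score(G_m): stop
            return current, current_score
        current, current_score = set(best), best_score
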
Probabilistic Inference

Calculate the conditional probability given the values of the observed variables.
- Junction tree algorithm
- Sampling methods
- General probabilistic inference is intractable.
- However, calculation of the conditional probability for classification is rather straightforward because of the properties of the Bayesian network structure.

The Markov Blanket

All the variables of interest: X = {X_1, X_2, …, X_n}

For a variable X_i, its Markov blanket MB(X_i) is the subset of X − {X_i} which satisfies

P(X_i | X − {X_i}) = P(X_i | MB(X_i)).

Markov boundary
- Minimal Markov blanket

Markov Blanket in Bayesian Networks

Given the Bayesian network structure, the determination of the Markov blanket of a variable is straightforward.
- By the conditional independence assertions.

< Figure: the gene network over Gene A–H and Class >

The Markov blanket of a node in the Bayesian network consists of all of its parents, spouses, and children.

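A minimal Python sketch of this rule, using the gene network from the classification example (Class has parents Gene A and Gene B, children Gene F and Gene G, and spouses Gene C and Gene D):

def markov_blanket(node, edges):
    """Parents, children, and spouses of `node` in a DAG given as (parent, child) edges."""
    parents  = {p for p, c in edges if c == node}
    children = {c for p, c in edges if p == node}
    spouses  = {p for p, c in edges if c in children and p != node}
    return parents | children | spouses

edges = {("Gene A", "Class"), ("Gene B", "Class"),
         ("Gene C", "Gene E"), ("Gene C", "Gene F"), ("Class", "Gene F"),
         ("Gene D", "Gene G"), ("Class", "Gene G"), ("Gene D", "Gene H")}
print(markov_blanket("Class", edges))
# parents {Gene A, Gene B}, children {Gene F, Gene G}, spouses {Gene C, Gene D}
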
Classification by Bayesian Networks II
P(Class | Gene A, Gene B, Gene C, Gene D, Gene E, Gene F, Gene G, Gene H)

= P(Class, Gene A, Gene B, Gene C, Gene D, Gene E, Gene F, Gene G, Gene H) / \sum_{Class} P(Class, Gene A, Gene B, Gene C, Gene D, Gene E, Gene F, Gene G, Gene H)

= [ P(A) P(B) P(C) P(Class|A,B) P(D) P(E|C) P(F|C,Class) P(G|Class,D) P(H|D) ] / \sum_{Class} [ P(A) P(B) P(C) P(Class|A,B) P(D) P(E|C) P(F|C,Class) P(G|Class,D) P(H|D) ]

= [ P(A) P(B) P(C) P(Class|A,B) P(D) P(F|C,Class) P(G|Class,D) ] / \sum_{Class} [ P(A) P(B) P(C) P(Class|A,B) P(D) P(F|C,Class) P(G|Class,D) ]

∝ P(Class|A,B) P(F|C,Class) P(G|Class,D)

(The factors that do not involve Class cancel between the numerator and the denominator, so only the Markov-blanket factors of Class remain.)

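A minimal Python sketch of this shortcut: only the three Markov-blanket factors of Class are evaluated and normalized. The CPT dictionaries below are placeholders; in practice they come from the learned network.

def class_posterior(evidence, p_class_given_ab, p_f_given_c_class, p_g_given_class_d,
                    class_values=(0, 1)):
    a, b, c, d, f, g = (evidence[k] for k in ("A", "B", "C", "D", "F", "G"))
    scores = {}
    for cls in class_values:
        # P(Class|A,B) * P(F|C,Class) * P(G|Class,D)
        scores[cls] = (p_class_given_ab[(cls, a, b)]
                       * p_f_given_c_class[(f, c, cls)]
                       * p_g_given_class_d[(g, cls, d)])
    z = sum(scores.values())
    return {cls: s / z for cls, s in scores.items()}

# Toy usage (illustrative CPT entries only):
p_cls = {(0, 1, 0): 0.8, (1, 1, 0): 0.2}
p_f   = {(1, 0, 0): 0.3, (1, 0, 1): 0.9}
p_g   = {(0, 0, 1): 0.6, (0, 1, 1): 0.4}
print(class_posterior({"A": 1, "B": 0, "C": 0, "D": 1, "F": 1, "G": 0},
                      p_cls, p_f, p_g))   # {0: 0.667, 1: 0.333}
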
DNA Microarrays


- Monitor thousands of gene expression levels simultaneously (in contrast to traditional one-gene-at-a-time experiments).
- Fabricated by high-speed robotics.

< Figure: microarray spotted with known probes >

A Comparative Hybridization Experiment

< Figure: comparative hybridization workflow, followed by image analysis >

Mining on Gene Expression and Drug Activity Data

Relationships among human cancer, gene expression, and drug activity

< Figure: triangle of relationships among human cancer, gene expression, and drug activity >

Revealing these relationships →
- Causes and mechanisms of cancer development
- New molecular targets for anti-cancer drugs

NCI (National Cancer Institute) Drug Discovery Program

< Figure: the NCI60 cell lines data set within the NCI drug discovery program >

NCI60 Cell Lines Data Set

From 60 human cancer cell lines
- Cancers of colorectal, renal, ovarian, breast, prostate, lung, and central nervous system origin, as well as leukemias and melanomas

Gene expression patterns
- cDNA microarray

Drug activity patterns
- Sulphorhodamine B assay → changes in total cellular protein after 48 hours of drug treatment

Schematic View of the Modeling Approach

< Figure: schematic of the modeling approach >
- Input: gene expression data and drug activity data
- Preprocessing: thresholding, clustering, discretization
- Learned Bayesian network over the selected gene, drug, and cancer type nodes
- Analysis: dependency analysis and probabilistic inference

Data Preparation

cDNA microarray data and drug activity data: a (1376 + 118) × 60 data matrix

Gene expressions
- Gene expression profiles on 60 cell lines
- 1376 × 60 matrix (1376 genes, 60 samples)

Drug activities
- Drug activity patterns on 60 cell lines
- 118 × 60 matrix (118 drugs, 60 samples)

Preprocessing

Thresholding
- Elimination of unknown ESTs: 1376 genes → 805 genes
- Elimination of drugs which have more than 4 missing values: 118 drugs → 84 drugs

Discretization
- Local probability model for the Bayesian network: multinomial distribution
- Each value is discretized to -1, 0, or 1 using the boundaries μ - cσ and μ + cσ.

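A minimal Python sketch of the three-level discretization, taking μ and σ to be the per-variable mean and standard deviation; with c = 0.43 the three levels each cover roughly one third of normally distributed values, matching the distribution ratios reported later:

import numpy as np

def discretize(x, c=0.43):
    """Map a 1-D array of expression values to {-1, 0, 1} using boundaries mu -/+ c*sigma."""
    mu, sigma = x.mean(), x.std()
    out = np.zeros(len(x), dtype=int)
    out[x < mu - c * sigma] = -1
    out[x > mu + c * sigma] = 1
    return out

# Example: on normally distributed data, c = 0.43 yields roughly a 1/3 : 1/3 : 1/3 split.
rng = np.random.default_rng(0)
levels = discretize(rng.normal(size=10_000), c=0.43)
print([(levels == v).mean() for v in (-1, 0, 1)])   # approx [0.33, 0.33, 0.33]
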
Bayesian Network Learning for Gene-Drug Analysis

Large-scale Bayesian network
- Several hundred nodes (up to 890)
- General greedy search is inapplicable because of time and space complexity.

Search heuristics
- Local-to-global search heuristics
- Exploit the locality of Bayesian networks to reduce the entire search space.
- The local structure: the Markov blanket
- Find the candidate Markov blanket (of pre-determined size k) of each node → reduce the global search space

Local to Global Search Heuristics
Input:
- A data set D.
- An initial Bayesian network structure B_0.
- A decomposable scoring metric, Score(B, D) = \sum_i Score(X_i | Pa_B(X_i), D).

Output: A Bayesian network structure B.

Loop for n = 1, 2, …, until convergence.
- Local search step:
  * Based on D and B_{n-1}, select for each X_i a candidate Markov blanket CB_i^n (|CB_i^n| <= k).
  * For each set {X_i, CB_i^n}, learn the local structure and determine the Markov blanket of X_i, BL^n(X_i), from this local structure.
  * Merge all Markov blanket structures G({X_i, BL^n(X_i)}, E_i) into a global network structure H^n (could be cyclic).
- Global search step:
  * Find the Bayesian network structure B_n ⊆ H^n which maximizes Score(B_n, D) and retains all non-cyclic edges in H^n.

Dimensionality Problem

The number of attributes (nodes) >> the sample size
- Unreliable structure of the learned Bayesian networks
- Probabilistic inference is nearly impossible.

Downsize the number of attributes by clustering (in the preprocessing step)
- Prototype: the mean of all members in a cluster

Bayesian Network with 45 Prototypes

Node types (46 nodes in all)
- 40 gene prototypes
- 5 drug prototypes
- Cancer label

Discretization boundary: μ - cσ, μ + cσ

< Distribution ratio >
  c      -1       0       1
0.43    33.3%   33.3%   33.3%
0.50    30.8%   38.3%   30.8%
0.60    27.4%   45.1%   27.4%

Bayesian network learning
- Varying candidate Markov blanket size (k = 5 ~ 15) → select the best one
- Three data sets (c = 0.43, 0.50, 0.60) → three Bayesian networks
- Probabilistic inference

Correlations between ASNS and L-Asparaginase

Part of the Bayesian network (c = 0.60)
- D2: prototype for L-Asparaginase
- G4: prototype for ASNS and SID W 484773, PYRROLINE-5-CARBOXYLATE REDUCTASE [5':AA037688, 3':AA037689]

< Conditional probability table >
P(D2|G4)   D2 = -1   D2 = 0    D2 = 1
G4 = -1    0.32096   0.27086   0.40818
G4 = 0     0.31387   0.41247   0.27366
G4 = 1     0.32167   0.34920   0.32913

Bayesian Networks on a Subset of Genes and Drugs

Node types (17 nodes in all)
- 12 genes
- 4 drugs
- Cancer label

Discretization boundary: μ - cσ, μ + cσ

< Distribution ratio >
  c      -1       0       1
0.43    33.3%   33.3%   33.3%
0.50    30.8%   38.3%   30.8%
0.60    27.4%   45.1%   27.4%

Clustering of genes and drugs together
- From neighboring clusters

Bayesian network learning
- General greedy search with restarts (100 times) → select the best one
- Three data sets (c = 0.43, 0.50, 0.60) → three Bayesian networks
- Probabilistic inference

Around the L-Asparaginase
< Part of the Bayesian network (c = 0.6) >
Probabilistic Relationships Around the L-Asparaginase

Nodes
- D1: L-Asparaginase
- G1: ASNS gene
- G2: PYRROLINE-5-CARBOXYLATE REDUCTASE

Cancer type unobserved:

P(D1|G1)   D1 = -1   D1 = 0    D1 = 1
G1 = -1    0.19857   0.27471   0.52672
G1 = 0     0.31110   0.49795   0.19095
G1 = 1     0.42159   0.36279   0.21561

P(D1|G2)   D1 = -1   D1 = 0    D1 = 1
G2 = -1    0.27510   0.35226   0.37263
G2 = 0     0.31621   0.41072   0.27307
G2 = 1     0.33837   0.39664   0.26499

Cancer type observed (= leukemia):

P(D1|G1,L)  D1 = -1   D1 = 0    D1 = 1
G1 = -1     0.17536   0.22838   0.59626
G1 = 0      0.27128   0.53790   0.19081
G1 = 1      0.38500   0.42437   0.19063

P(D1|G2,L)  D1 = -1   D1 = 0    D1 = 1
G2 = -1     0.23812   0.33853   0.42335
G2 = 0      0.27978   0.42666   0.29356
G2 = 1      0.30371   0.42108   0.27520

Term Project #3: Diagnosis Using Bayesian Networks

Outline

Task 1: Structural learning of the Bayesian network
- Data generation from the ALARM network
- Structural learning of Bayesian networks using more than two kinds of algorithms and scores
- Compare the learned results with respect to the edge errors, across various sample sizes and learning algorithms

Task 2: Classification using Bayesian networks
- Arbitrarily divide the Leukemia data set into a training set and a test set
- Learn a Bayesian network from the training set using one of the metric-based approaches
- Evaluate the performance of the Bayesian network as a classifier (classification accuracy); a minimal sketch of this evaluation loop follows the data set description below

Data Generation


Using the Netica software (http://www.norsys.com)

The ALARM network
- # of nodes: 37
- # of edges: 46

Structural Learning

Independence method
- BN PowerConstructor (http://www.cs.ualberta.ca/~jcheng/bnsoft.htm)

Metric-based method
- LearnBayes (http://www.cs.huji.ac.il/labs/compbio/LibB/)
- The MDL, BIC, BD, and likelihood scores can be used.

The Leukemia Data Set

Class type
- ALL (acute lymphoblastic leukemia) or AML (acute myeloid leukemia)

Data set
- # of attributes: 50 gene expression levels (0 or 1)
- # of samples: 72

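A minimal Python sketch of the Task 2 evaluation loop. Since the structural learning itself is done with the external tools above, a naive Bayes model over the 50 binary attributes stands in here for the learned Bayesian network; the file name leukemia.csv and its layout (last column = class label) are assumptions.

import numpy as np

def train_naive_bayes(X, y, alpha=1.0):
    """Class priors and per-class P(attribute = 1) with Laplace smoothing."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    p_one = {c: (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
             for c in classes}
    return priors, p_one

def predict(X, priors, p_one):
    preds = []
    for x in X:
        log_post = {c: np.log(priors[c])
                       + np.sum(x * np.log(p_one[c]) + (1 - x) * np.log(1 - p_one[c]))
                    for c in priors}
        preds.append(max(log_post, key=log_post.get))
    return np.array(preds)

data = np.loadtxt("leukemia.csv", delimiter=",")   # hypothetical file layout
X, y = data[:, :-1], data[:, -1]
rng = np.random.default_rng(0)
idx = rng.permutation(len(y))
split = len(y) * 2 // 3                            # arbitrary 2:1 train/test split
train, test = idx[:split], idx[split:]
priors, p_one = train_naive_bayes(X[train], y[train])
accuracy = np.mean(predict(X[test], priors, p_one) == y[test])
print(f"classification accuracy: {accuracy:.3f}")
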
Submission


- Deadline: 2002. 11. 27
- Location: 301-419