Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Functions, Networks, and
Phenotypes by Integrative
Genomics Analysis
Xianghong Jasmine Zhou
Molecular and Computational Biology
University of Southern California
IMS, NUS, Singapore, July 18, 2007
Information Flow
Cellular systems
1995
DNA
1997
Bacteria
Eukaryote
~1.6K genes ~6K genes
1998
2001
Human
Animal
~20K genes 30~100K genes
Objective:
$1000 human genome
Gene Expression Datasets: the Transcriptome
RNA
•Oligonucleotide Array
•cDNA Array
Protein abundance measurement (Mass Spec)
Protein
Protein interactions (yeast 2-hybrid system, protein arrays)
Protein complexes (Mass Spec)
Rapid accumulation of microarray data
in public repositories
• NCBI Gene Expression Omnibus
137231 experiments
• EBI Array Express
55228 experiments
The public microarray data increases by 3 folds per year
Multiple Microarray Technology Platforms
Microarray
Platforms
Graph-based Approach for the Integrative Microarray Analysis
Datasets
Coexpression
Networks
Recurrent
Patterns
Annotation
experiments
Transcriptional Annotation
f
f
gene
----------------
a
j
h
a
c
TF
e
e
b
b
k
d
j
h
c
h
k
d
g
i
e
i
g
h
e
f
f
----------------
j
a
c
h
c
e
b
k
d
e
g
----------------
i
g
Functional Annotation
f
j
j
a
h
c
a
e
e
b
k
k
d
f
h
c
b
d
i
g
i
k
d
f
a
i
h
b
i
g
g
j
a
b
i
g
d
f
f
a
----------------
j
j
a
h
c
h
c
e
e
Gene Ontology
b
b
k
k
d
g
i
d
g
i
Frequent Subgraph Mining Problem is hard!
Problem formulation: Given n graphs, identify
subgraphs which occur in at least m graphs (m n)
Our graphs are massive! (>10,000 nodes and >1 million edges)
The traditional pattern growth approach (expand frequent
subgraph of k edges to k+1 edges) would not work, since
the time and memory requirements increase exponentially
with increasing size of patterns and increasing number of
networks.
Novel Algorithms to identify diverse
frequent network patterns
• CoDense
(Hu et al. ISMB 2005)
– identify frequent coherent dense subgraphs
across many massive graphs
• Network Biclustering
(Huang et al, ISMB 2007)
– identify frequent subgraphs across many massive
graphs
• Network Modules (NeMo)
(Yan et al. ISMB 2007)
– identify frequent dense vertex sets across many
massive graphs
CODENSE: identify frequent
coherent dense subgraphs across
massive graphs
Hu et al, ISMB 2005
Identify frequent co-expression clusters
across multiple microarray data sets
c1 c2… cm
g1 .1 .2… .2
g2 .4 .3… .4
…
c1 c2… cm
g1 .8 .6… .2
g2 .2 .3… .4
…
f
a
c
d g
a
c
b
k
i
f
e
j
k
d
e
f
e
b
c
j
h
a c
k
b
f
j
a
d g
i
h
j
h
k
i
k
i
f
e
g
i
e
b
d
k
d g
e
a
j
h
..
.
b
d g
k
i
f
c
..
.
a c
h j
d g
i
g
c
b
a
h
f
c1 c2… cm
g1 .2 .5… .8
g2 .7 .1… .3
…
a
b
..
.
c1 c2… cm
g1 .9 .4… .1
g2 .7 .3… .5
…
e
f
h j
c
h
j
e
b
d g
k
i
The common pattern growth
approach
Find a frequent subgraph of k edges, and
expand it to k+1 edge to check occurrence
frequency
– Koyuturk M., Grama A. & Szpankowski W. An
efficient algorithm for detecting frequent subgraphs
in biological networks. ISMB 2004
– Yan, Zhou, and Han. Mining Closed Relational
Graphs with Connectivity Constraints. ICDE 2005
Problem of the Pattern-growth approach
The time and memory requirements increase
exponentially with increasing size of patterns
and increasing number of networks. The
number of frequent dense subgraphs is
explosive when there are very large frequent
dense subgraphs, e.g., subgraphs with
hundreds of edges.
Problem of the Pattern-growth approach
f
a
c
Pattern Expansion
k k+1
h j
e
d g
k
i
f
c
j
f
i
a c
j
h
k
d g
f
j
d g
i
d g
h
a
j
e
c
h
a c
k
i
f
h
d g
d g
a
k
i
i
c
h
b
k
d g
a c
f
j
h
k
d g
j
a
b
k
i
c
e
f
c
k
i
j
h
e
b
k
d g
a c
f
i
j
h
e
i
b
k
d g
i
f
h
j
a
e
b
d g
h j
d g
i
b
e
d g
e
c
b
a
h
f
f
j
j
e
b
k
c
k
i
f
i
j
a
e
d g
a
k
c
f
h j
b
h
e
b
e
b
b
d g
i
f
f
j
e
b
k
c
k
i
d g
e
b
e
f
i
h
a
b
a
h
e
c
f
h j
d g
b
k
e
c
c
k
i
f
h
e
d g
a
a
e
d g
a
j
b
a c
c
f
h j
b
b
a
f
a
k
i
c
h
k
j
e
b
d g
k
i
Our solution
We develop a novel algorithm, called CODENSE, to mine
frequent coherent dense subgraphs. The target subgraphs
have three characteristics:
(1) All edges occur in >= k graphs (frequency)
(2) All edges should exhibit correlated occurrences in
the given graph set. (coherency)
(3) The subgraph is dense, where density d is higher
than a threshold and d=2m/(n(n-1)) (density)
m: #edges, n: #nodes
CODENSE: Mine coherent dense
subgraph
(1) Builds a summary graph by eliminating infrequent edges
f
a
a
h
c
e
b
f
f
a
c
e
b
h
c
h
e
b
f
d
d
i
g
G1
d
i
g
G2
i
g
a
G3
h
c
e
b
f
a
b
f
a
h
c
e
d
b
g
G4
i
f
a
h
c
e
d
b
g
G5
i
d
h
c
i
summary graph Ĝ
e
d
g
g
G6
i
CODENSE: Mine coherent dense
subgraph
(2) Identify dense subgraphs of the summary graph
f
a
f
Step 2
h
c
e
h
c
e
b
d
g
i
summary graph Ĝ
MODES
g
i
Sub(Ĝ)
Observation: If a frequent subgraph is dense, it must be a dense
subgraph in the summary graph. However, the reverse conclusion
is not true.
CODENSE: Mine coherent dense
subgraph
(3) Construct the edge occurrence profiles for each
dense summary subgraph
E
f
h
c
Step 3
e
g
i
Sub(Ĝ)
G1
G2
G3
G4
G5
G6
c-e
0
0
1
1
0
1
c-f
0
1
0
1
1
1
c-h
0
0
0
1
1
1
c-i
0
0
1
1
1
0
e-f
0
0
0
1
1
1
…
…
…
…
…
…
…
edge occurrence profiles
CODENSE: Mine coherent dense
subgraph
(4) builds a second-order graph for each dense summary
subgraph
g-h
E
G1
G2
G3
G4
G5
G6
c-e
0
0
1
1
1
1
c-f
0
1
0
1
1
1
c-h
0
0
0
1
1
1
c-i
0
0
1
1
1
0
e-f
0
0
0
1
1
1
…
…
…
…
…
…
…
edge occurrence profiles
f-i
e-i
h-i
e-g
g-i
Step 4
e-h
c-h
f-h
c-f
e-f
c-e
c-i
second-order graph S
CODENSE: Mine coherent dense
subgraph
(5) Identify dense subgraphs of the second-order graph
g-h
g-h
f-i
e-i
h-i
e-i
h-i
e-g
g-i
Step 4
g-i
e-g
e-h
e-h
c-h
f-h
c-f
e-f
c-e
c-h
f-h
c-f
e-f
c-i
second-order graph S
c-e
Sub(S)
Observation: if a subgraph is coherent (its edges show high correlation
in their occurrences across a graph set), then its 2nd-order graph must
be dense.
CODENSE: Mine coherent dense
subgraph
(6) Identify the coherent dense subgraphs
g-h
h
e-i
h-i
e
i
g
Step 5
g-i
e-g
e-h
f
h
c
c-h
f-h
e
c-f
e-f
c-e
Sub(S)
Sub(G)
Our solution
The identified subgraphs by definition satisfy the three
criteria:
(1) All edges occur in >= k graphs (frequency)
(2) All edges should exhibit correlated occurrences in
the given graph set. (coherency)
(3) The subgraph is dense, where density d is higher
than a threshold and d=2m/(n(n-1)) (density)
m: #edges, n: #nodes
a
CODENSE: Mine coherent dense
subgraph
f
h
c
f
f
a
c
b
e
h
b
e
d
G1
a
b
e
b
h
c
e
d
i
g
G4
f
i
g
a
Step 1
G3
f
a
h
d
d
G2
f
c
e
i
g
h
c
b
d
i
g
a
g
c
b
e
d
i
G5
Step 2
e
Add/Cut
h
e
d
g
i
MODES
i
g
Sub(Ĝ)
summary graph Ĝ
g
h
c
b
f
a
f
h
c
i
G6
Step 3
g-h
g-h
f-i
h
e-i
e-i
h-i
h-i
E
e
Step 6
i
g
g-i
e-g
Step 5
g-i
e-g
e-h
e-h
f
h
c
e
Restore
G and
MODES
c-h
f-h
c-f
e-f
c-e
Sub(G)
Step 4
Sub(S)
MODES
c-h
f-h
c-f
e-f
c-e
c-i
second-order graph S
G1
G2
G3
G4
G5
G6
c-e
0
0
1
1
1
1
c-f
0
1
0
1
1
1
c-h
0
0
0
1
1
1
c-i
0
0
1
1
1
0
e-f
0
0
0
1
1
1
…
…
…
…
…
…
…
edge occurrence profiles
CODENSE
The design of CODENSE can solve the scalability issue.
Instead of mining each biological network individually,
CODENSE compresses the networks into two meta-graphs
(the summary graph and the second-order graph) and
performs clustering in these two graphs only. Thus,
CODENSE can handle any large number of networks.
MODES: Mine overlapped dense
subgraph
G
Sub(G) a
j
a
f
b
c
h
i
e
Step 1
b
HCS’
c
g
d
G’
j
f
h
h
Step 2
V
condense
i
e
j
i
g
g
d
HCS’
Sub(G’)
a
h
f
Step 4
f
b
h
h
Step 3
V
e
i
HCS’
c
e
i
restore
d
Hu et al. ISMB 2005
i
Applying CoDense to 39 yeast microarray data sets
f
c1 c2… cm
g1 .1 .2… .2
g2 .4 .3… .4
…
c1 c2… cm
g1 .8 .6… .2
g2 .2 .3… .4
…
c1 c2… cm
g1 .9 .4… .1
g2 .7 .3… .5
…
a
c
e
a
b
d g
a
c
b
k
i
f
e
j
k
d
a c
f
j
h
a
c
j
h
e
b
k
d g
f
a c
k
b
i
j
h
j
a
f
e
k
i
k
i
d g
i
h
g
k
i
e
b
d
e
f
c
e
b
d g
h j
d g
i
g
c
b
a
h
f
c1 c2… cm
g1 .2 .5… .8
g2 .7 .1… .3
…
f
h j
c
h
j
e
b
d g
k
i
Functional annotation
Annotation
Functional Annotation (Validation)
Method: leave-one-out approach - masking a
known gene to be unknown, and assign its
function based on the other genes in the
subgraph pattern.
Functional categories: 166 functional categories
at GO level at least 6
Results: 448 predictions with accuracy of 50%
Functional Annotation (Prediction)
We made functional predictions for 169
genes, covering a wide range of functional
categories, e.g. amino acid biosynthesis, ATP
biosynthesis, ribosome biogenesis, vitamin
biosynthesis, etc. A significant number of our
predictions can be supported by literature.
However…
• How about frequent non-dense graphs?
– Many biological modules may form paths
• How about subgraphs which are coherent
across only a subset of the graphs?
– Not all modules are activated across all
conditions, and genes may form modules with
diff. other genes under diff. conditions
Network Biclustering:
Identify frequent subgraphs across
massive graphs
Huang et al, ISMB 2007
Using 65 human co-expression network
as an illustration example
• 65 co-expression networks generated from
65 microarray data sets
• each graph contains 8297 genes, and 1%10% edges of a complete graph
Basically, it is a biclustering problem
graphs
edges
Network Biclustering
• Objective function
graphs
c'
f
mn c
c’: number of 1 in the bicluster
c: number of 1 in the whole matrix
mn: size of the bicluster
: regularization factor
1
0
1
0
0
…
1
0
1
0
0
1
…
0
0
1
1
1
1
…
0
0
1
0
1
1
…
1
0
1
1
0
1
…
0
1
1
1
1
1
…
0
0
0
1
0
0
…
1
edges
However, the matrix is very large with millions of edges …
We will first identify robust seed to narrow down the search space
Identify Bicluster seed
The property of relation graphs: edge labels are unique.
Hence, each graph can be treated as a collection of items
Thus, Frequent subgraph Mining can be modeled as frequent item set mining
Problem: current frequent item set mining algorithms can only
efficiently mine across many small item sets
In our problem, we have 65 very large item set…
We use a trick….
Identify Bicluster seed
E
Edge occurrence profiles:
G1
G2
G3
G4
G5
G6
...
G60
G61
G62
G63
G64
G65
e1
1
1
0
1
0
1
...
0
0
1
1
1
1
e2
1
1
0
1
0
1
…
0
1
0
1
1
1
e3
1
1
1
1
1
1
…
0
0
0
1
1
1
e4
0
0
1
1
1
0
...
0
0
1
1
1
0
e5
1
1
0
1
0
1
…
0
0
0
1
1
1
…
…
…
…
…
…
…
…
…
…
…
…
…
…
Frequent pattern tree
{G1, G3, G5, G6, G7,…}
common edges
{e1, e10, e56, e100, e1000,…}
Graph set with more than 5 members
and with > 1000 common edges
{G2, G3, G5, G7, G8,…} …. {G8, G9, G15, G26, G29}
common edges
{e4, e12, e33, e56, e890,…}
….
common edges
{e99, e220, e1545, e2629,…}
Very time consuming! It takes more than 2 weeks on 40 Pentium IV nodes
Huang et al. ISMB 2007
Expanding the Biclusters
E
G7
…
G61
G62
G63
G64
G65
1
0
0
0
1
1
1
1
1
0
1
0
1
0
1
1
1
1
1
1
0
0
0
0
1
1
1
1
1
1
1
.0
0
0
1
1
1
0
0
0
1
0
1
1
0
0
0
1
1
1
…
…
…
…
…
…
…
…
…
…
…
…
G1
G2
G3
G4
G5
G6
e1
1
1
0
1
1
e2
1
1
1
1
e3
1
1
1
e4
1
1
e5
1
…
…
Simulated Annealing
E
G7
…
G61
G62
G63
G64
G65
1
0
0
0
1
1
1
1
1
0
1
0
1
0
1
1
1
1
1
1
0
0
0
0
1
1
1
1
1
1
1
.0
0
0
1
1
1
0
0
0
1
0
1
1
0
0
0
1
1
1
…
…
…
…
…
…
…
…
…
…
…
…
G1
G2
G3
G4
G5
G6
e1
1
1
0
1
1
e2
1
1
1
1
e3
1
1
1
e4
1
1
e5
1
…
…
Identify connected components
Systematic identification of functional
modules in human genome
• We identified 143,400
network modules with
recurrence >= 5. They
vary in size from 4 to 180.
• 77.0% of the patterns are
functionally homogenous
(GO hyper-geometric Pvalue less than 0.01)
• The figure shows that the
functional homogeneity of
modules increase with
their recurrences.
Results (I): Examples of highly recurring
modules
Recurrence 20
Recurrence 18
Defense against oxygen and nitrogen species
Involved in spermatogenesis
Loosely connected network
patterns with high recurrence can
represent functional modules
Functional annotation
Annotation
We made functional predictions for 779 known and 116 unknown genes
by random forest classification with 71% accuracy.
Variables for random forest classification:
functional enrichment P-value
network connectivity
average node degree
Network size
network topology score
pattern recurrence numbers
unknown gene ratio
Network Modules (NeMo)
Identify frequent dense vertex sets
across many massive graphs
microarray
coexpression
graph
...
...
step 1
(neighbor association) clustering
summary graph
refinement
...
partitioning
step 2
step 3
step 4
Yan et al. ISMB 2007
105 human microarray data sets
NeMo
6477 recurrent coexpression clusters
(density > 0.7 and support > 10)
Validation based on ChIp-chip data
(9176 target genes for 20 TFs)
15.4% homogenous clusters
(vs. 0.2% by randomization test)
Validation based on human-mouse
Conserved Transfac prediction
(7720 target genes for 407 TFs)
12.5% homogenous clusters
(vs. 3.3% by randomization test)
Percentage of potential transcription modules validated by
ChIP-Chip data increase with cluster density and recurrence
Expression data
Microarray Data
Phenotype information
Phenotype Concepts (e.g. diseases, perturbations, tissues )
in Unified Medical Language System (UMLS)
Classifying microarray data based on phenotype
Adenocarcinoma Arthritis
Asthma
…
Glaucoma
HIV
For example, the current NCBI GEO database contains >60
cancer datasets, among which 11 leukemia datasets.
Identify phenotype-specific functional or
transcriptional modules
• Unsupervised approach
Cancer
Cancer
Cancer
Cancer
Frequent pattern mining
Module
Module 11, Module 2, Module 3, … Module k
105 microarray data sets
NeMo
6477 recurrent coexpression clusters
(density > 0.7 and support > 10)
A total of 459 of the clusters are statistically significant with a
hypergeometric P-value <0.01 in a specific phenotype:
e.g. malignant neoplasms, cardiovascular diseases or nervous
system disorders.
An example
5 out of the 9 support datasets are leukemia datasets (P-value 0.0039). It is
potentially regulated by E2F4, and majority genes are involved in cell cycle
and DNA repair.
Identify phenotype-specific functional or
transcriptional modules
• Supervised approach
Adenocarcinoma
Arthritis
…
AsthmaNon-Adenocarcinoma
Glaucoma
Functional and transcriptional modules
which are active ONLY in Adenocarcinoma
Related data sets
HIV
A case study: Identify Network Modules
Characterizing Cancer
32 Cancer Datasets
c1 c2… cm
c1 c.1
2… cm
g1 .9 .4…
c
1 c.1
2… cm
.4…
g2 g.71 .9
.3…
.5c1 c2… cm
g .9 .4…
.5c1.1c2… cm
…g2 .71 g.3…
.9
.4…
.5c1.1c2… cm
…g2 .71 g.3…
.9
.4…
.5 .1
…g2 .71 g.3…
.9
.4…
.1
…g2 .71 .3… .5
…g2 .7 .3… .5
…
f
a
j
h
c
f
e
a
c
d
g
j
h
c
…
e
b
b
k
a
h
e
b
f
j
k
k
d
i
d
i
g
i
g
25 Non-cancer Datasets
c1 c2… cm
c1 c.8
2… cm
g1 .2 .5…
c
1 c.8
2… cm
.5…
g2 g.71 .2
.1…
.3c1 c2… cm
g .2 .5…
.3c1.8c2… cm
…g2 .71 g.1…
.2
.5…
.3c1.8c2… cm
…g2 .71 g.1…
.2
.5…
.3 .8
…g2 .71 g.1…
.2
.5…
.8
…g2 .71 .1… .3
…g2 .7 .1… .3
…
f
a
f
h
c
j
a
k
g
a
i
k
g
…
e
b
d
j
h
c
e
b
d
h
c
e
f
j
i
b
k
d
g
i
Examples of identified modules
Cell cycle Module
Cell adhesion
PDGF-signaling
across all cancer datasets
across all solid tumor datasets
in breast cancer datasets
Reconstruct transcriptional cascades
by second-order analysis
Zhou et al. Nature Biotech 2005
Frequently occurring tight clusters
Frequently occurring tight clusters
Transcription
Factors
Co-occurrence of tight clusters
Coexpression network constructed with the dataset 1
Co-occurrence of tight clusters
Coexpression network constructed with the dataset 2
Co-occurrence of tight clusters
Coexpression network constructed with the dataset 3
Co-occurrence of tight clusters
Coexpression network constructed with the dataset 4
Co-occurrence of tight clusters
Coexpression network constructed with the dataset 5
Transcription
Factors Set 1
Coexpression Networks
Cooperativity
Transcription
Factor Set 2
Relevance Networks
Three types of transcription cascades
gene 1
TF2
Type I
TF1
gene 2
gene 3
TF3
gene 4
gene 5
gene 1
TF2
Type II
TF1
gene 4
gene 2
gene 3
gene 5
gene 1
TF1
Type III
TF2
gene 2
Transcription Regulation
gene 3
Protein Interaction
gene 4
gene 5
Applying to 39 yeast microarray data sets
• We identified 60 transcription modules. Among
them, we found 34 pairs that showed high 2ndorder correlation. A significant portion (29%, pvalue<10-5 by Monte Carlo simulation ) of those
modules pairs are participants in transcription
cascades: 2 pairs in Type I, 8 pairs in Type II,
and 3 pairs in type III cascades. In fact, these
transcription cascades inter-connect into a
partial cellular regulatory network.
HGH1 LOC1 NOC3
YDR152W YHL013C
BUD19 RPL13A RPL13B RPL16A RPL24B RPL28 RPL2B
RPL33B RPL39 RPL42B RPS16B RPS18A RPS21B RPS22A
RPS23B RPS4B RPS6A
BAT1 ILV2 LEU9
YBL113C YRF11 YRF1-7
RCS1
MET10 MET14 MET28 SUL2
RGM1
Leu3
GAL4
ALD5 ARG1 ARG2 ARG3 ARG4 ARG5,6
ARO1 ARO3 ARO4 ASN1 BNA1 CPA2
DED81 ECM40 HIS1 HIS4 HOM3 LEU4
LEU9 LYS20 MET22 ORT1 PCL5 TEA1
TRP2 YBR043C YDR341C YHM1
YHR162W YJL200C YJR111C
MET4
PDR1
GCN4
YBL113C YRF1-1 YRF1-3
YRF1-7
YAP5
SWI6
GAT3
YBL113C YIL177C YJL225C
YLR464W YML133C YRF1-1
YRF1-3 YRF1-4 YRF1-5 YRF16 YRF1-7
SWI5
MBP1
CDC45, RAD27, RNR1,
SPT21, YPL267W
MSN4
SWI4
NDD1
CDC21 CDC45 CLB5 CLB6 CLN1
GIN4 IRR1 MCD1 MSH6 PDS5
RAD27 RNR1 SPO16 SPT21
SWE1 TOF1 YBR070C
CLB6 RNR1
SPT21
YPL267W
GAS1
HTA1
HTB1
YEL077C YRF1-3 YRF1-7
ALK1 BUD4 CDC20 CDC5
CLB2 HST3 SWI5 YIL158W
YJL051W YOR315W
Regulation of transcription modules by transcription factors, based on ChIP-chip data and supported by recurrent expression clusters
Regulation between transcription factors, based on ChIP-chip data and supported by 2nd-order expression correlation
Protein interactions between two transcription factors, based on experimental data and supported by 2nd-order expression correlation
Two transcription modules with high 2nd-order expression correlations
Integrative Array Analyzer (iArray): a software package for
cross-platform and cross-species microarray analysis
Bioinformatics. 22(13):1665-7
Gene Aging Nexus (GAN):
An Integrated Genomics Data Mining Platform for Aging
Nucleic Acids Res. 2006 Nov
Acknowledgement
CoDense
Cancer Network Module
• Haiyan Hu
• Min Xu
NetBiclustering
• Haifeng Li
• Yu Huang
NeMo
• Xifeng Yan (IBM)
• Mike Mehan
Second-order TF cascade
• Ming-Chih Kao (U. Michigan)
• Haiyan Huang (UC Berkeley)
• Wing H. Wong (Stanford)
Consultation
• Jiawei Han (UIUC)
• Michael Waterman
Funding agencies: NIGMS, NCI, NSF, Seaver Foundation,
Sloan Foundation, Zumberg Foundation
Thank you!