Module Discovery in Gene Expression Data Using
Closed Itemset Mining Algorithm
Yoshifumi Okada, Wataru Fujibuchi, Paul Horton
Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and
Technology (AIST), 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
Keywords: gene expression data, module, biclustering, closed itemset, LCM
1 Introduction
An expression module is a set of genes with shared expression behavior under certain experimental
conditions. The data used to search for expression modules typically come from several microarray chip
measurements, each labeled by the experimental condition the sample was subjected to before the
measurement. In recent years, several biclustering methods have been proposed to discover modules in a gene
expression data matrix, where a bicluster (module) is defined as a subset of genes that exhibit a highly
correlated expression pattern over a subset of conditions. Biclustering, however, involves combinatorial
optimization in selecting the rows and columns that compose a module. Hence most existing algorithms are
based on heuristic or stochastic approaches and may produce sub-optimal solutions [1].
Our goal is to develop a fast biclustering method that enumerates every interesting bicluster within a
reasonable time. We conjecture that interesting biclusters (or at least their cores) can be obtained by
enumerating maximal biclusters whose genes share identical condition labels and discretized expression
values, a problem which can be solved in polynomial time. Exhaustive enumeration of such biclusters makes it
possible to select only those satisfying a given criterion, such as a user-specified bicluster size or Gene
Ontology (GO) term enrichment. Here, we propose a new biclustering method, BiModule, that enumerates
biclusters in polynomial time based on a closed itemset mining algorithm, a class of algorithms that has been
actively studied in data mining.
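To make the notion of a maximal bicluster with identical discretized values concrete, the following minimal sketch grows a seed gene set into a bicluster that cannot be extended by any further gene or condition. The toy matrix and the helper names (`shared_pattern`, `supporting_genes`, `maximal_bicluster`) are illustrative assumptions and are not part of BiModule.

```python
# Toy discretized matrix (illustrative values only): rows are genes,
# columns are conditions, entries are discretization levels.
levels = {
    "g1": {"c1": 2, "c2": 0, "c3": 1},
    "g2": {"c1": 2, "c2": 0, "c3": 3},
    "g3": {"c1": 2, "c2": 1, "c3": 1},
}

def shared_pattern(genes):
    """(condition, level) pairs common to every gene in `genes`."""
    patterns = [set(levels[g].items()) for g in genes]
    return set.intersection(*patterns)

def supporting_genes(pattern):
    """All genes whose row contains every (condition, level) pair."""
    return {g for g, row in levels.items() if pattern <= set(row.items())}

def maximal_bicluster(genes):
    """Extend a non-empty seed gene set to a bicluster that cannot be enlarged.

    The shared pattern is computed first, then every gene matching that
    pattern is added; adding those genes cannot shrink the pattern.
    """
    pattern = shared_pattern(genes)
    return supporting_genes(pattern), pattern

genes, pattern = maximal_bicluster({"g1", "g2"})
print(sorted(genes), sorted(pattern))
```

Starting from the seed {g1, g2}, the shared pattern is c1 at level 2 and c2 at level 0; g3 differs on c2, so the bicluster stays at two genes and two conditions.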
2 Method
In this study, biclustering is reduced to a closed itemset mining problem in a transaction database. A
transaction database is a set of records representing transactions, where each record consists of a number of
items. A closed itemset is a kind of maximal itemset: a set of items contained in t transactions that has no
proper superset also contained in t transactions. Extracting a closed itemset is equivalent to finding the set
of conditions of a bicluster. BiModule achieves fast enumeration of biclusters by using the LCM algorithm
(Linear time Closed itemset Miner) [2], which enumerates closed itemsets in time linear in the number of
closed itemsets. The input to BiModule is a gene expression data matrix and a minimum bicluster size.
BiModule performs its calculation in several steps. First, the given expression data are normalized by sample
to have mean 0 and variance 1, and the normalized data are then discretized into several levels by dividing the
range between the maximum and minimum into equal intervals (Fig. 1a). Next, we prepare an itemization table
of IDs representing each discretization level in each sample (Fig. 1b). The discretized data are converted to a
transaction database by reference to the itemization table (Fig. 1c). A transaction corresponds to a gene, and
an item is the ID of a discretization level in a sample; IDs for missing values are not included. Subsequently,
closed itemsets are enumerated by LCM (Fig. 1d). A bicluster is obtained by converting the IDs contained in a
closed itemset back into condition labels and discretized expression values, which in turn are used to identify
the genes (Fig. 1e). Finally, biclusters that overlap by more than 25% with a larger bicluster are removed, and
the remaining biclusters are output to the user.
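The steps from normalization to bicluster extraction can be sketched as follows. This is a minimal illustration under several assumptions, not the BiModule implementation: the discretization range is taken over the whole normalized matrix, item IDs are encoded as sample index times the number of levels plus the level, the minimum bicluster size is split into two hypothetical parameters (`min_genes`, `min_conditions`), and LCM is replaced by a naive enumeration that is only practical for toy inputs. The naive step relies on the fact that the closed itemsets with non-zero support are exactly the intersections of one or more transactions.

```python
from statistics import mean, pstdev

def discretize(matrix, n_levels):
    """Z-score each sample (column), then cut the overall range into equal bins.

    `matrix` maps gene -> list of values; None marks a missing value.
    Assumes every sample has at least one observed value.
    """
    n_samples = len(next(iter(matrix.values())))
    z = {g: [None] * n_samples for g in matrix}
    for j in range(n_samples):
        col = [row[j] for row in matrix.values() if row[j] is not None]
        mu, sd = mean(col), pstdev(col)
        for g, row in matrix.items():
            if row[j] is not None:
                z[g][j] = (row[j] - mu) / sd if sd > 0 else 0.0
    observed = [v for row in z.values() for v in row if v is not None]
    lo, hi = min(observed), max(observed)
    width = (hi - lo) / n_levels or 1.0  # guard against a constant matrix
    return {g: [None if v is None else min(int((v - lo) / width), n_levels - 1)
                for v in row]
            for g, row in z.items()}

def to_transactions(disc, n_levels):
    """One transaction per gene; an item ID encodes (sample, level).

    Missing values produce no item.
    """
    return {g: frozenset(j * n_levels + lv
                         for j, lv in enumerate(row) if lv is not None)
            for g, row in disc.items()}

def closed_itemsets(transactions):
    """Naive stand-in for LCM: collect all intersections of transactions."""
    closed = set()
    for t in transactions.values():
        closed |= {t & c for c in closed} | {t}
    closed.discard(frozenset())
    return closed

def biclusters(matrix, n_levels, min_genes, min_conditions):
    """Return (genes, [(sample, level), ...]) pairs meeting the size limits."""
    tx = to_transactions(discretize(matrix, n_levels), n_levels)
    result = []
    for items in closed_itemsets(tx):
        genes = sorted(g for g, t in tx.items() if items <= t)
        pattern = sorted(divmod(i, n_levels) for i in items)
        if len(genes) >= min_genes and len(pattern) >= min_conditions:
            result.append((genes, pattern))
    return result

toy = {"a": [0.0, 0.0], "b": [2.0, 2.0], "c": [2.0, 0.0]}
for genes, pattern in sorted(biclusters(toy, 2, 2, 1)):
    print(genes, pattern)
```

The naive enumeration can grow exponentially with the number of genes, which is the cost that a dedicated miner such as LCM is designed to avoid.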
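Figure 1: Procedure of BiModule. Panels: (a) normalization and discretization, (b) itemization table,
(c) transaction database, (d) closed itemset enumeration by LCM, (e) resulting bicluster.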
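The final filtering step of the procedure, which removes biclusters that overlap by more than 25% with a larger bicluster, might be sketched as below. The text does not define how the overlap is measured, so this sketch assumes it is the fraction of a bicluster's (gene, condition) cells that also lie in a larger bicluster that has already been kept. The function names are hypothetical.

```python
def cells(bicluster):
    """Set of (gene, condition) cells covered by a (genes, conditions) pair."""
    genes, conditions = bicluster
    return {(g, c) for g in genes for c in conditions}

def filter_overlapping(biclusters, max_overlap=0.25):
    """Visit non-empty biclusters largest first; drop one if too much of it
    is covered by a bicluster that was already kept (assumed definition)."""
    kept = []
    for b in sorted(biclusters, key=lambda b: len(cells(b)), reverse=True):
        area = cells(b)
        if all(len(area & cells(k)) / len(area) <= max_overlap for k in kept):
            kept.append(b)
    return kept

big = ({"g1", "g2", "g3", "g4"}, {"c1", "c2"})
inside = ({"g1", "g2"}, {"c1", "c2"})   # fully contained in `big`
edge = ({"g4", "g5"}, {"c2", "c3"})     # shares one of its four cells with `big`
print(filter_overlapping([inside, edge, big]))
```

Here `inside` is removed because all of its cells lie in `big`, while `edge` is kept because its overlap of 25% does not exceed the threshold.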
(Figure 2 plots: average match score versus noise level in panels (a) and (b), and versus overlap degree in
panel (c), for BiModule, Bimax, ISA, Samba, CC, OPSM and xMotif.)
Figure 2: Extraction accuracy for artificially embedded modules. Panels (a), (b) and (c) show the
precision (upper) and recall (lower) for constant, coherent and overlapping modules, respectively.
3 Results and Conclusions
To test the validity of the biclusters extracted by BiModule, we compared it with prominent existing
biclustering methods [1] (Bimax, ISA, Samba, CC, OPSM and xMotif) using synthetic datasets with artificially
embedded modules. Fig. 2 shows how accurately the genes of the true modules are extracted, based on two
indices, precision (upper) and recall (lower). The left two graphs are for constant module datasets, which
contain planted biclusters in which each gene has a constant value. The middle two graphs are for coherent
module datasets, which contain biclusters whose expression values vary with the condition label. The
accuracies for constant and coherent modules are plotted against the level of added noise. The right two
graphs are for overlapping module datasets, in which adjacent biclusters overlap each other; these accuracies
are plotted against the overlap degree. Fig. 2 shows that BiModule is robust against noise and module overlap,
and performs well and stably on every module type. Moreover, we have confirmed that biclusters generated by
BiModule reflect GO annotations and protein-protein interactions. BiModule is a promising tool for extracting
modules.
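The two accuracy indices can be illustrated with the gene match score used in the comparison study [1], which averages, over the biclusters of one set, the best Jaccard overlap of gene sets found in the other set. The sketch below assumes that the precision plotted in Fig. 2 corresponds to scoring the extracted biclusters against the planted modules and the recall to the reverse direction; this mapping and the toy gene sets are assumptions made for illustration.

```python
def gene_match_score(found, reference):
    """Average over `found` of the best Jaccard index against `reference`.

    Both arguments are lists of non-empty gene sets.
    """
    if not found or not reference:
        return 0.0
    total = 0.0
    for genes in found:
        total += max(len(genes & ref) / len(genes | ref) for ref in reference)
    return total / len(found)

planted = [{"g1", "g2", "g3", "g4"}]
extracted = [{"g1", "g2", "g3"}, {"g7", "g8"}]

precision_like = gene_match_score(extracted, planted)  # (0.75 + 0.0) / 2 = 0.375
recall_like = gene_match_score(planted, extracted)     # max(0.75, 0.0) = 0.75
print(precision_like, recall_like)
```

The score is asymmetric: the spurious bicluster {g7, g8} lowers the precision-like value but leaves the recall-like value unchanged.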
References
[1] Prelić, A., Bleuler, S., Zimmermann, P., Wille, A., Bühlmann, P., Gruissem, W., Hennig, L., Thiele, L.,
and Zitzler, E., A systematic comparison and evaluation of biclustering methods for gene expression
data, Bioinformatics, 22:1122-1129, 2006.
[2] Uno, T., Asai, T., and Arimura, H., An efficient algorithm for enumerating closed patterns in transaction
databases, Lecture Notes in Artificial Intelligence, 3245:16-31, 2004.