Module Discovery in Gene Expression Data Using Closed Itemset Mining Algorithm

Yoshifumi Okada [email protected]
Wataru Fujibuchi [email protected]
Paul Horton [email protected]

Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan

Keywords: gene expression data, module, biclustering, closed itemset, LCM

1 Introduction

An expression module is a set of genes with shared expression behavior under certain experimental conditions. The data used to search for expression modules typically come from several microarray chip measurements, each labeled by the experimental condition the sample was subjected to before the measurement. In recent years, several biclustering methods have been proposed to discover modules in a gene expression data matrix, where a bicluster (module) is defined as a subset of genes that exhibit a highly correlated expression pattern over a subset of conditions. Biclustering, however, involves combinatorial optimization in selecting the rows and columns that compose a module. Hence most existing algorithms rely on heuristic or stochastic approaches and may produce sub-optimal solutions [1].

Our goal is to develop a fast biclustering method that enumerates every interesting bicluster within a reasonable time. We conjecture that interesting biclusters (or at least their cores) can be obtained by enumerating maximal biclusters in which all genes share the same discretized expression value under each included condition label, a problem that can be solved in polynomial time. Once such biclusters are exhaustively enumerated, those satisfying a given criterion, such as a user-specified bicluster size or Gene Ontology (GO) term enrichment, can be selected. Here, we propose a new biclustering method, BiModule, that enumerates biclusters in polynomial time based on closed itemset mining, a class of algorithms that has been actively studied in data mining.

2 Method

In this study, biclustering is reduced to a closed itemset mining problem in a transaction database. A transaction database is a set of records representing transactions, where each record consists of a number of items. A closed itemset is a kind of maximal itemset: a set of items included in t transactions that has no superset also included in the same t transactions. Extracting a closed itemset is equivalent to finding the set of conditions for a bicluster. BiModule achieves fast enumeration of biclusters by using the LCM algorithm (Linear Closed itemset Miner) [2], which enumerates closed itemsets in time linear in the number of closed itemsets.

The input to BiModule is a gene expression data matrix and a minimum bicluster size. BiModule performs its calculation in several steps. First, the given expression data are normalized by sample to have mean 0 and variance 1, and the normalized data are then discretized into several levels by dividing the range between the maximum and minimum into equal-width intervals (Fig. 1a). Next, we prepare an itemization table of IDs representing each discretization level in each sample (Fig. 1b). The discretized data are converted to a transaction database by reference to the itemization table (Fig. 1c). A transaction corresponds to a gene, and an item is the ID of a discretization level in a sample; IDs for missing values are not included. Subsequently, closed itemsets are enumerated by LCM (Fig. 1d). A bicluster is obtained by converting the IDs contained in a closed itemset back into condition labels and discretized expression values, which in turn are used to identify the genes (Fig. 1e). Finally, biclusters that overlap by more than 25% with a larger bicluster are removed, and the remaining biclusters are output to the user.
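
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the closed itemsets are enumerated by a naive intersection-closure routine rather than LCM, the equal-width bins are taken over the global range of the normalized matrix, "minimum bicluster size" is read as a minimum number of genes and of conditions, and overlap is measured as the fraction of a bicluster's (gene, sample) cells shared with a strictly larger bicluster. All function and parameter names (discretize, closed_itemsets, drop_overlapping, bimodule_sketch, levels, min_genes, min_conditions) are hypothetical.

```python
from statistics import mean, pstdev


def discretize(matrix, levels):
    """Z-score each sample (column), then bin into `levels` equal-width levels.

    Missing values are given as None and stay None. Assumption: the bin range
    is the global min/max of the normalized matrix.
    """
    n_samples = len(matrix[0])
    z = [[None] * n_samples for _ in matrix]
    for s in range(n_samples):
        col = [row[s] for row in matrix if row[s] is not None]
        mu = mean(col)
        sd = pstdev(col) or 1.0  # guard against constant columns
        for g, row in enumerate(matrix):
            if row[s] is not None:
                z[g][s] = (row[s] - mu) / sd
    flat = [v for row in z for v in row if v is not None]
    lo = min(flat)
    width = (max(flat) - lo) / levels or 1.0
    return [[None if v is None else min(int((v - lo) / width), levels - 1)
             for v in row] for row in z]


def closed_itemsets(transactions):
    """Naive stand-in for LCM: all non-empty intersections of transactions.

    An itemset is closed exactly when it equals the intersection of the
    transactions containing it. Worst-case exponential; illustration only.
    """
    closed = set()
    for t in transactions:
        t = frozenset(t)
        closed |= {t} | {t & c for c in closed}
    closed.discard(frozenset())
    return closed


def cells(bicluster):
    """All (gene, sample) pairs covered by a bicluster."""
    genes, conditions = bicluster
    return {(g, s) for g in genes for s in conditions}


def drop_overlapping(biclusters, max_overlap=0.25):
    """Drop biclusters sharing more than `max_overlap` of their cells with a
    strictly larger bicluster (one assumed reading of the 25% rule)."""
    kept = []
    for b in biclusters:
        larger = [k for k in biclusters if len(cells(k)) > len(cells(b))]
        if all(len(cells(b) & cells(k)) / len(cells(b)) <= max_overlap
               for k in larger):
            kept.append(b)
    return kept


def bimodule_sketch(matrix, levels=2, min_genes=2, min_conditions=2):
    """Return biclusters as (gene indices, {sample index: level}) pairs."""
    disc = discretize(matrix, levels)
    # Itemization: item ID = sample * levels + level; missing values skipped.
    transactions = [{s * levels + lv for s, lv in enumerate(row) if lv is not None}
                    for row in disc]
    found = []
    for items in closed_itemsets(transactions):
        genes = [g for g, t in enumerate(transactions) if items <= t]
        if len(genes) >= min_genes and len(items) >= min_conditions:
            conditions = {i // levels: i % levels for i in sorted(items)}
            found.append((genes, conditions))
    return sorted(drop_overlapping(found), key=lambda b: b[0])


# Toy matrix: 4 genes (rows) x 3 samples (columns).
expression = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 10.0],
    [10.0, 10.0, 0.0],
    [10.0, 10.0, 10.0],
]
for genes, conditions in bimodule_sketch(expression):
    print(genes, conditions)
# [0, 1] {0: 0, 1: 0}
# [2, 3] {0: 1, 1: 1}
```

In the toy matrix, genes 0 and 1 share the low level in samples 0 and 1, and genes 2 and 3 share the high level in the same samples, so each pair forms one bicluster. For real data sizes an actual LCM implementation would replace closed_itemsets.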

Figure 1: Procedure of BiModule. Panels (a) to (e); LCM is applied in step (d).

Figure 2: Extraction accuracy of artificially-embedded modules. Panels (a), (b) and (c) show the precision (upper) and recall (lower) for constant, coherent and overlapping modules. Each plot compares BiModule, Bimax, ISA, Samba, CC, OPSM and xMotif; the y-axis is the average match score, and the x-axis is the noise level in (a) and (b) and the overlap degree in (c).

3 Results and Conclusions

To test the validity of the biclusters extracted by BiModule, we compared it with salient existing biclustering methods [1] (Bimax, ISA, Samba, CC, OPSM and xMotif) using synthetic datasets with artificially-embedded modules. Fig. 2 shows the extraction accuracy for genes in the true modules based on two indices, precision (upper) and recall (lower). The left two graphs are for constant module datasets, which contain planted biclusters in which each gene has a constant value. The middle two graphs are for coherent module datasets, which contain biclusters with expression values that vary with the condition label. The accuracies for constant and coherent modules are plotted against the level of added noise. The right two graphs are for overlapping module datasets, in which adjacent biclusters overlap each other; their accuracies are plotted against the overlap degree. Fig. 2 shows that BiModule is robust against noise and module overlap, and performs well and stably on every module type.
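
As a small illustration of the two indices, the sketch below computes set-based precision and recall for one extracted gene set against one planted module. This is an assumption about the general idea only: the figure axes read "average match score", following the evaluation protocol of [1], whose exact scoring is not given in this text. The function name precision_recall and the gene labels are hypothetical.

```python
def precision_recall(found_genes, true_genes):
    """Precision: share of extracted genes that belong to the true module.
    Recall: share of the true module's genes that were extracted."""
    found, true = set(found_genes), set(true_genes)
    hits = len(found & true)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(true) if true else 0.0
    return precision, recall


# Hypothetical example: 3 of the 4 extracted genes lie in a 5-gene module.
extracted = {"g1", "g2", "g3", "g4"}
planted = {"g2", "g3", "g4", "g5", "g6"}
print(precision_recall(extracted, planted))  # (0.75, 0.6)
```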

Moreover, we have confirmed that biclusters generated by BiModule reflect GO annotations and protein-protein interactions. BiModule is a promising tool for extracting modules.

References

[1] Prelic, A., Bleuler, S., Zimmermann, P., Wille, A., Buhlmann, P., Gruissem, W., Hennig, L., Thiele, L., and Zitzler, E., A systematic comparison and evaluation of biclustering methods for gene expression data, Bioinformatics, 22:1122-1129, 2006.
[2] Uno, T., Asai, T., and Arimura, H., An efficient algorithm for enumerating closed patterns in transaction databases, Lecture Notes in Artificial Intelligence, 3245:16-31, 2004.
