Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Improving Biological Significance of Gene Expression Biclusters with Key Missing Genes ∗ Shufan Ji Xing Tian Jin Chen Department of Computer Science Beihang University Beijing, China 100191 Department of Computer Science Beihang University Beijing, China 100191 [email protected] [email protected] Department of Energy Plant Research Laboratory Department of Computer Science and Engineering Michigan State University East Lansing, MI [email protected] ABSTRACT 1. Identifying condition-specific co-expressed gene groups is critical for gene functional and regulatory analysis. However, given that genes with critical functions (such as transcription factors) may not co-express with their target genes, it is insufficient to uncover gene functional associations only from gene expression data. In this paper, we propose a novel integrative biclustering approach to build high quality biclusters from gene expression data, and to identify critical missing genes in biclusters based on Gene Ontology as well. Our approach delivers a complete inter- and intra-bicluster functional relationship, thus provides biologists a clear picture for gene functional association study. We experimented with the Yeast cell cycle and Arabidopsis cold-response gene expression datasets. Experimental results show that a clear inter- and intra-bicluster relationship is identified, and the biological significance of the biclusters is considerably improved. Gene expression clustering is a key step towards gene functional and regulatory analysis at the genome level [36]. Genes that have similar expression patterns under certain subsets of experimental conditions can be grouped together, and are considered to be functionally related [14, 8, 41]. Biclustering analysis, that identifies condition-specific coexpressed patterns from gene expression data, is essential to better understanding of the biological interactions between genes that are not apparent when clustered globally [7, 9]. In the last decade, many biclustering algorithms have been proposed to simultaneously cluster genes and experimental conditions together [7, 39, 40, 42, 13, 19, 21, 6, 9]. However, it has been shown that although these biclustering methods can capture gene groups that function together under a subset of experimental conditions, there is a lot of potential to improve towards uncovering gene regulation relationships, mainly because of the following two reasons: Categories and Subject Descriptors H.3.3 [Information Search and Retrieval]: Clustering; J.3 [Life and Medical Sciences]: Biology and genetics General Terms INTRODUCTION 1. Biclustering based on gene expression data would inevitably miss certain functionally related genes, mainly because some functionally related genes do not coexpress. For example, transcription factors (TF) usually do not co-express with their target genes, and if a key TF is missing from a bicluster, the functional association among the related genes is difficult to identify. Algorithms Keywords bi-clustering, Gene Ontology, gene expression, missing gene, biological network ∗contact author Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. BCB’15, September 9–12, 2015, Atlanta, GA, USA. Copyright 2015 ACM 978-1-4503-3853-0/15/09 ...$15.00. http://dx.doi.org/10.1145/2808719.2808747. 2. It is the inter- and intra- cluster relationship rather than the clustered genes that contributes towards better interpretation of the overall picture of gene regulatory networks [17, 22, 32, 2]. Although hierarchical biclustering algorithms[13, 21, 6] can reveal the interbicluster relationships, to our knowledge no one has extensively studied both the inter- and intra-bicluster relationships. In summary, some functionally related genes may be missing in a bicluster. As a result, the intra-bicluster connections are weakened with reduced functional association. The gene regulatory networks built upon such biclusters would then be incomplete and disconnected. In order to generate and interpret biological hypotheses on gene transcriptional regulations, a biclustering algorithm that can 1) build high quality biclusters from gene expression data, 2) reveal inter- and intra-bicluster relationships, and 1. New Concept of Biclustering. We picture the whole skeleton of biclusters. The proposed concept of bicluster skeleton (i.e. inter- and intra- cluster relationships) could reveal the hierarchical subset/superset of a gene bicluster and the relationships among genes within a bicluster. Moreover, we improve the bicluster quality by adding new genes and new connections into biclusters to reinforce the intra-bicluster relationships. root leaves seeds Experimental results on Yeast cell cycle [7] and Arabidopsis cold-response microarray datasets [37] show that SKB could build a clear structure of the inter- and intra-bicluster relationship, consequently the mined biclusters have higher biological significance than existing methods. Comparing with existing biclustering algorithms, we have made four main contributions: cluster graphs 2. Novel Way to Employ Domain Knowledge. It is generally difficult to integrate mining data with domain knowledge in unsupervised learning models. Work to date either combines domain knowledge with mining data to form a complicated feature space, or takes domain knowledge as a post-processing filter [23, 30, 35, 12, 4]. However, the former approach suffers from the complicated combination of heterogeneous features, while the latter approach results in data loss. In this paper, we employ the domain knowledge as an independent data space, onto which the mining results are mapped for further improvement. Figure 1: Framework of SKB. In the upper part, SKB mines biclusters from gene expression data and reveals the inter-bicluster relationship with a hierarchical tree. In the tree, each node is a bicluster, shown with gene expression changing trend. In the lower part, SKB reveals the intra-bicluster relationships by identifying missing genes (yellow) that could bridge the functionally distinct genes in each bicluster, so as to uncover the otherwise hidden gene associations. 3) identify missing genes and missing gene associations, is highly expected. In this paper, we propose to identify gene expression biclusters with a novel biclustering algorithm called SKeleton Biclustering (SKB). SKB not only builds biclusters and reveals bicluster skeleton (inter-bicluster and intra-bicluster relationships), but also identifies relevant missing genes that can bridge the functionally distinct biclustered genes, in order to uncover the otherwise hidden gene associations within biclusters. As shown in Figure 1, SKB has three phases. In phase one, a hierarchical biclustering method QHB [13] is adopted to generate biclusters from gene expression data, which reveals the hierarchical inter-bicluster relationship by a dendrogram. The inter-bicluster relationships are revealed by a hierarchical tree. In phase two, each bicluster is converted into a graph (made up of several disconnected subgraphs) by linking the genes based on their functional similarity, which is calculated with Gene Ontology (GO) annotation [1, 10]. Gene connections within the bicluster is defined as the intra-bicluster relationship. In phase three, a graph mining algorithm is proposed to efficiently add new genes, if any, to each bicluster, to re-connect the disconnected subgraphs generated in phase 2, in order to enhance the intra-bicluster relationship. Therefore, the overall functional connectivity of the bicluster is increased. 3. Improved Biological Significance. SKB improves the biological significance of biclusters by reinforcing the gene functional associations. Different from most existing biclustering algorithms, our method can identify key missing genes that express differently but function associatively with biclustered genes, thus delivers a complete skeleton of biclusters. 4. Efficient Algorithm. Both gene biclustering and missing gene identification are computationally expensive. We introduce a novel algorithm to achieve the goal efficiently. 2. RELATED WORK The biclustering algorithm on gene expression data was first introduced by Cheng and Church 2000 [7] to simultaneously cluster both genes and experimental conditions, which captures the coherence of a subset of gene under a subset of experimental conditions [7]. A two-phase iterative strategy is used to identify biclusters in the data: 1) remove rows and columns from the gene expression matrix to compose a submetrix with the mean squared residue score below a threshold δ; 2) add the deleted rows and columns to the submatrix as along as the mean squared residue score does not exceed the threshold. δ-cluster model [40] was later proposed to accelerate the mining process with a move-based algorithm FLOC, which randomly generates initial biclusters and iteratively optimizes biclusters by adding/deleting genes or conditions. [39] further accelerated the biclustering process by employing a depth-first algorithm to mine pClusters with a user specified minimum pScore. However, those biclustering algorithms fail to generate deterministic mining results. A deterministic algorithm DBF was proposed [42] to further improve the quality and efficiency of biclustering. DBF employs a frequent closed pattern mining algorithm to generate “good seeds” with better pattern similarity and then refines the seeds by adding genes and conditions to achieve a low MSR score as well as large bicluster volume. A minimum row variance threshold is set to remove biclusters with trivial changes in trends. In particular, on a binary gene expression matrix, we can view the biclustering formulation as the search for all maximal bicliques in a bipartite graph, where the nodes are genes or conditions and an edge connecting a gene gi and a condition cj indicates gene gi responses in the condition cj [38]. Subsequently, the biclustering problem on a binary matrix can solved by the backtracking BronKerbosch algorithm for maximal clique enumeration [5]. While these algorithms can generate biclusters with similar trends, they are limited in several ways. First, these schemes typically employ a similarity score (e.g., MSR and pScore) to determine the quality of biclusters, which, however, cannot adequately capture the trend consistency of biclusters. Second, these algorithms usually generate biclusters based on selected “seeds” that cover only a small part of the whole dataset. As such, interesting patterns may be missed resulting in loss of relevant information. Third, the seed improvement process follows the hill-climbing paradigm, and hence, is computationally expensive. To overcome these limitations, a quick hierarchical biclustering algorithm called QHB was proposed to generate consistent-trends biclusters (a subsets of genes homogeneously expressed in a subset of conditions) and produces a hierarchical tree to reveal the inter-bicluster relations [13]. More recently, model-based biclustering algorithms [20, 21], which approximate the data density by a mixture of Gaussian distributions, adopt a hierarchical structure of biclusters to define a flexible distribution of data density. In [6], a graphical model was adopted to identify hierarchical biclusters. The resulting tree structure of conditions, where each node corresponds to a bicluster, provides an interpretation basis for the grouping [6]. LateBiclustering identifies biclusters with time lags, thus recovering genes that are missed because they co-expressed with delays [11]. DeBi improves the biological significance of biclusters by discovering maximum size homogeneous biclusters, in which each gene is strongly associated with a subset of samples, providing functionally more coherent gene sets compared to standard clustering or biclustering algorithms [33]. Note that biclustering methods solely based on gene expression data would inevitably leave some functionally related genes out of the biclusters [28], and thus weaken the overall gene relationships. Hence, incorporating biological domain knowledge in biclustering [23, 30, 35] has been recognized as a reliable way to enhance the interpretability of biclustered genes. Gene Ontology (GO) [1, 10] has now been accepted as the de facto language for the attribute description of biological entities in three key domains that are shared by all organisms, namely molecular function, biological process and cellular component [29, 24, 27, 26, 25]. In each of these domains, the corresponding GO is structured as a directed acyclic graph to reflect the complex hierarchy of biological terminologies. To incorporate Gene Ontology to increase biclustering performance, we propose a biclustering algorithm called Skeleton Biclustering (SKB) that could build high quality biclusters with clear inter- and intra-bicluster relationships, as well as identify missing genes and missing gene associations. Table 1: Original Data Matrix A A c1 c2 c3 c4 g1 2.4 2.95 2.45 2.99 g2 1.95 1.71 1 0.29 g3 0.5 1.1 0.38 1.56 Table 2: Slope Angle Matrix A0 A0 c1 c2 c2 c3 c3 c4 g1 28.81◦ −26.57◦ 28.37◦ g2 −13.50◦ −35.37◦ −35.37◦ g3 30.96◦ −35.75◦ 49.72◦ Table 3: Binary Matrix A00 (BinningT hreshold = 26.5◦ ) A00 c1 c2 (c1 c2 )0 c2 c3 (c2 c3 )0 c3 c4 (c3 c4 )0 g1 0 1 1 0 0 1 g2 0 0 1 0 1 0 g3 0 1 1 0 0 1 3. METHOD In this section, we introduce the three phases of SKB respectively. The overall framework of SKB is in Figure 1. 3.1 SKB phase 1, inter-bicluster relationship identification We employ an efficient top-down hierarchical biclustering algorithm QHB [13] (without modification) as the first phase of SKB to identify biclusters from gene expression data. We generate biclusters with consistent rising / falling trends and produce a hierarchical tree to reveal the interbicluster relationship. Here we adopt QHB algorithm in that it could deliver a well-organized hierarchical tree with each node representing a bicluster, from which we could easily identify the subset/superset and the nearest neighbors of a bicluster, which will facilitate gene family studies. Now that we will briefly describe the algorithm in two steps with a running example. Step 1: Trend Separation In Step 1, different changing trends of gene expression data are separated. Firstly, the gene expression matrix A (Table 1), where rows represent genes and columns represent experimental conditions, is transformed into a slope angle matrix A0 (Table 2) to reflect different changing trends when the experimental conditions change. Then A0 is binned into a binary matrix A00 (Table 3) where the rising(01) and falling(10) trends are separated into two consecutive columns; trends with trivial change(00) are filtered out. Step 2: Matrix Partition In Step 2, genes with similar changing trends (rising and falling simultaneously) are grouped into biclusters through a hierarchical partition tree in Figure 2. From matrix A00 , the partitioning process constructs a hierarchical tree where all valuable upper level information is kept intact and propagated into the lower level. This helps in preventing any information loss. The binning and partitioning ensures that genes with consistent trends under condition transitions are kept together in the same biclusters while genes with incon- A'' g1 g2 g3 A'' g2 g3 c1c2 (c1c2)' c2c3 (c2c3)' c3c4 (c3c4)' 0 1 1 0 0 1 0 0 1 0 1 0 0 1 1 0 0 1 c1c2 (c1c2)' c2c3 (c2c3)' c3c4 (c3c4)' 0 0 1 0 1 0 0 1 1 0 0 1 NULL A'' g2 g3 NULL c2c3 1 1 c3c4 1 0 NULL A'' g1 g3 A'' g1 g2 g3 Gene Distance Measure Biologically, many genes are involved in multiple cellular processes and therefore labeled with more than one GO term. Thus, the distance between two genes is defined as the minimum dissimilarity between any two GO terms annotating the two genes [34]. Let Tgi and Tgj be the set of GO terms annotating gene gi and gj respectively, the gene distance measure is defined as: (c1c2)' c2c3 (c3c4)' 1 1 1 0 1 0 1 1 1 (c1c2)' c2c3 (c3c4)' 1 1 1 1 1 1 NULL Figure 2: Matrix (A00 ) Partition. Genes with similar changing trends (rising and falling simultaneously) are grouped into biclusters through a hierarchical partition tree, where all valuable upper level information is kept intact and propagated into the lower level. sistent trends are separated into different biclusters. According to QHB, the biclustering results can be further refined to reflect the similar magnitudes of change by the same binning and partitioning procedures, although only different changing trends are considered here for simplicity. This scheme helps maintain the bicluster quality effectively, while the simultaneous grouping of multiple genes and conditions makes the process rather efficient. Moreover, the hierarchical partition tree reveals to users a clear interbicluster relationships. Based on the tree, users can navigate up or down the tree to get a more general or detailed insight into biclusters. 3.2 SKB phase 2, gene distance measure and bicluster graph generation In this phase, each gene expression bicluster (on the leaf level of the hierarchical biclustering tree) is converted into an undirected graph by linking functionally associated genes together. In general, two genes are considered to be functionally associated if they share at least one biological feature. Here we adopt GO annotation for the function association measure, defined as gene distance. Thus, genes satisfying a certain distance threshold are linked together. Clearly, the algorithm could be easily extended to adopt other domain knowledge as gene distance measure. GO Term Similarity Measure As the GO terms are not equally informative [29], for each gene set, specific weights are assigned to the GO terms [18]: the weight of a GO term is defined as the occurrence ratio of the GO term (with all its descendants) to the total terms in the gene set. That is, for any GO term t ∈ T , where T is the full set ofPGO terms, the term weight is defined as w(t) = (f req(t)+ d∈Dt f req(d))/N , where f req(t) denotes the number of occurrences of GO term t; Dt is the set of descendants of GO term t in T ; and N is the total number of term occurrences. Given two GO terms ta and tb and their weights w(ta ) and w(tb ), their term similarity score sim(ta , tb ) is defined in Equation 1 [31]. Note that tab is the nearest common ancestor of ta and tb , and w(tab ) is its weight. sim(ta , tb ) = 2 × ln w(tab ) × (1 − w(tab )) ln w(ta ) + ln w(tb ) (1) d(gi , gj ) = min ta ∈Tgi ,tb ∈Tgj (1 − sim(ta , tb )) (2) Bicluster Graph Generation To reveal the biological relationship among the biclustered genes, a bicluster S is converted into an undirected bicluster graph G(V, E), with the vertex gi ∈ V representing gene gi , and the edge e(gi , gj ) ∈ E indicating that the gene distance d(gi , gj ) is smaller than the predefined distance threshold σ. The graph G is either connected or made up of several disconnected components, each of which is a connected subgraph of G (see Algorithm 1 line 4-9). 3.3 SKB phase 3, adding new genes to improve biological significance Without key functional genes, the originally associated genes in resulting bicluster graph might be separated into different components. As for the example in Figure 3, without gene gx (function f1 , f2 ), gene ga (function f1 , f3 ) is separated with gene gb (function f2 , f4 ) due to non-similar functions (long gene-to-gene distance). gx could bridge the gap of ga and gb , in that it shares the same function f1 , f2 with ga , gb respectively. Therefore, given gene ga and gb in a bicluster graph G(V, E), there might exist a third gene gx , such that d(ga , gb ) > d(ga , gx ) + d(gb , gx ). According to this property, adding new genes to the bicluster graph might help to connect the otherwise disconnected components, leading to the enriched functional associations within biclusters. The search space of new genes, which is set as the whole genome in our study, could be defined according to different applications. Given the whole gene set I, let S be a biclustered gene set, G(V, E) be its bicluster graph, and C = {C1 , C2 , . . . , Cn } be the disconnected component set of G. Two disconnected components Ci (Vi , Ei ) and Cj (Vj , Ej ) are defined as “connectable” if and only if there exists gx ∈ I − S such that min(d(gi , gx ) + d(gj , gx )) < σ (the gene distance threshold), where gi ∈ Vi and gj ∈ Vj . The strategy to identify new genes is to search for the smallest set of new genes that connect all the connectable components in a bicluster graph (see pseudocode in Algorithm 1 line 10-22). We decided to identify the smallest set of new genes because of three reasons. First, we notice that owing to incomplete biological knowledge in GO annotation, introducing too many new gene could cause the bicluster to be biased toward known information. Specifically, if two genes share the same function that is not yet known or annotated, the distance between them will be artificially large. Therefore, we decided to add the minimum number of new genes as a constraint to weaken such bias towards known information. Second, by minimizing the number of new genes to be included, we are maximizing the number of components each new gene connects to. This results in new genes that are more informative. Third, the objective of adding gx {f1, f2} Ci Cj ga gb {f1, f3} {f2, f4} cluster graph G Figure 3: Effect of Key Gene. Adding key genes to a bicluster graph may help connect the otherwise disconnected components, leading to the enriched functional associations within biclusters. Algorithm 1 SKB phase 2 and 3 1: procedure AddNewGenes(T, I, σ) 2: Input: S - bicluster genes; T - the full set of GO terms; I - the full set of genes in gene expression data; σ - gene distance threshold; 3: Output: Vx - new gene set; G0 - the new cluster graph; 4: V = ∅; E = ∅; . Initialization 5: for each gene pair gi , gj ∈ S (i 6= j) do . Bicluster graph generation 6: if getDistance(gi , gj , T ) < σ then 7: V = V ∪ {gi , gj }; E = E ∪ {< gi , gj >}; 8: end if 9: end for 10: G = (V, E); 11: C0 = getComponent(G); 12: if |C0 | = 1 then . No need to insert missing genes for a connected graph 13: Vx = ∅; G0 = G; 14: else 15: C = ∅; 16: for each component pair Ci , Cj ∈ C0 do 17: if (Connectable(Ci , Cj ) = true) C = C ∪ {Ci , Cj }; 18: end for 19: (Vx , Ex ) = getExtraGenes(C); . Identify missing genes 20: G0 = (V ∪ Vx , E ∪ Ex ); 21: end if 22: return Vx and G0 ; 23: end procedure Algorithm 2 getExtraGenes(C) 1: Input: S - bicluster genes; T - the full set of GO terms; I - the full set of genes in gene expression data; σ - gene distance threshold; C - the set of connectable components; c - a constant large number; 2: Output: Vx - new gene set; Ex - edge set that connect vertices in Vx & V ; 3: Vx = ∅; Ex = ∅; Λ = ∅; Cx = ∅; 4: for each gene g ∈ I − S do . Number of components g is connectable to 5: N (g) = getConnComp(g, C, σ); 6: if (N (g) ≥ 2) Λ = Λ ∪ {g}; 7: end for 8: while Λ 6= ∅ do 9: remove g with max N (g) from Λ; 10: Vx = Vx ∪ {g}; Cx = Cx ∪ {g}; 11: for each component Ci ∈ C that g connects to do 12: gi = getM inDist(Ci , g); Ex = Ex ∪ {< g, gi >}; 13: Cx = Cx ∪ Ci ; remove Ci from C; 14: end for 15: C = C ∪ {Cx }; Cx = ∅; 16: for each gene g ∈ Λ do 17: N (g) = getConnComp(g, C, σ); 18: if (N (g) < 2) remove g from Λ; 19: end for 20: end while 21: return Vx and Ex ; For a given bicluster S, instead of testing every subset of I − S exhaustively, which is a NP problem, we develop a heuristic algorithm to efficiently identify the local optimized smallest new gene set. Let Vx be the initial new gene set. The algorithm to identify Vx consists of two steps. First, candidate genes in I − S − Vx are sorted in descending order according to the number of components that it could connect in C. Candidate genes that could connect less than two components are ignored (see Algorithm 2 line 4-7). Second, the top candidate gene is added to Vx , connecting as many components in C as possible. Then the whole process is repeated until all components in C are connected or no more candidate gene exists (see Algorithm 2 line 8-20). This heuristic algorithm has polynomial time complexity. For example, given a cluster graph with 9 disconnected components C0 , C1 , . . . , C8 in Figure 4, and five new gene candidates g1 , g2 , . . . , g5 . In each iteration, the numbers of components that each gene could connect are listed in Table 4. As for this example, g5 is firstly ignored since it cannot connect any two components. Meanwhile, g1 is added to Vx in that it connects the most number of the components, i.e., 6. In the second iteration, the number of connections for each remaining gene candidate is recalculated after g1 connects C0 , C1 , C2 , C3 , C5 , C6 together. Similarly, g3 is added to Vx , connecting together two more components C7 , C8 with C5 . And finally, g4 is added to connect C4 with the others. As such, the whole process terminates with Vx = {g1 , g3 , g4 } to connect all the components together. 4. new genes is to understand the relationships among the biclustered genes, and too many new genes may reduce the gene expression coherence. EXPERIMENTAL RESULTS We performed experiments on two datasets, the Yeast cell cycle microarray data [7] and the Arabidopsis cold-response microarray data [37], with two GO categories, molecular C0 C8 g5 C2 g1 Gene Ontology category Biological Process Molecular Function Biological Process Molecular Function new genes per cluster 5.1 2.5 9.9 7.6 FOS QHB Rand 2.68 2.51 1.20 1.09 8.42 2.16 8.12 2.90 C3 C6 g2 C5 Arabi -dopsis Yeast g3 g4 C4 Process C7 Table 5: SKB performance summary dataset C1 transcription transport Table 4: Number of connections for candidate new genes Candidate The value of N(g) new gene Iter 1 Iter 2 Iter 3 Iter 4 g1 6X g2 4 1 g3 3 3X g4 2 2 2X g5 1 - function (MF) and biological process (BP) adopted as domain knowledge [10]. Note that the proportion of functional annotations, esp. BP, are not complete for most genomes. For example, only 37% of the Arabidopsis genes have annotations with experimental evidence [15], leading to challenges in using GO in gene clustering. We tested the performance of SKB on both rich annotated genome (yeast) and sparsely annotated genome (Arabidopsis). The results showed that the performance of SKB is robust. The Yeast microarray dataset contains 2884 genes whose expressions are altered during cell cycle under 17 time points. Among them, 1649 genes have at least one MF annotation, and 1973 genes have at least one BP annotation. The Arabidopsis microarray dataset contains 2255 cold-response genes under 6 time points with cold treatment at 4◦ C. Among them, 1142 genes have at least one MF annotation, and 975 genes have at least one BP annotation. Only non-IEA GO annotations are used in the experiments. 4.1 Results of SKB In SKB phase 1 of hierarchical biclustering, we mined biclusters and inter-cluster relationships, requiring that each bicluster has at least 50 genes and 5 conditions. For the Yeast dataset, 114 biclusters were mined with average 56.7 genes on each bicluster. For the Arabidopsis dataset, 44 biclusters were mined with average 55.0 genes on each bicluster. By drawing the hierarchical biclustering trees, the inter-cluster relationships are revealed. In SKB phase 2 of bicluster graph generation, we set distance thresholds σ for each gene expression dataset under different GO categories, such that for any edge in a bicluster graph, the false discovery rate of its distance value is smaller DNA/RNA binding transporter activity Component Figure 4: Illustrative example for new gene identification. Function TF activity nucleus New genes cell wall All cold-response genes 0 5 10 15 20 25 30 Percentage Figure 5: The functional characterization distribution of the new genes. than 0.01. Hence, all the edges in each bicluster graph are statistically sound. With the given distance thresholds, we convert all of the 114 Yeast and 44 Arabidopsis biclusters into bicluster graphs. Note that all of the un-annotated genes are removed from the bicluster graphs before further processing. In SKB phase 3 of new gene identification, a set of new genes are identified for each bicluster (see details in Table 5). On average, for Yeast biclusters, 7 and 10 new genes were found per cluster with MF and BP annotations. For Arabidopsis biclusters, 3 and 5 new genes were found per cluster with MF and BP annotations. 4.2 Performance evaluation The performance of SKB was tested on two ways: 1) a GO based measure on the identification of functionally related missing genes, and 2) network connectivity on the identification of hidden key genes. In addition, the biological significance of the new genes were tested with functional characterization. Effectiveness of SKB on the identification of functionally related missing genes To qualitatively show the biological significance of a gene cluster graph, we define the Functionally Overall Similarity measure (FOS) in Equation 3: |V | X F OS(G) = 1 − f (gi , gj ) i,j=1 1 |V |(|V 2 | − 1) (3) Disease Signaling Unknown TFs CPK28 LHY1 Circadian Rhythm CCA1 MPK3 Cell Expansion CBF1 Cold regulators & Targets CBF3 CBF CBF CBF2 Figure 6: Case study. Red colored are clustered genes, and yellow colored are new genes. To study why the different processes co-express under cold treatment and how these processes may work together, SKB introduces 7 new genes, in which MPK3 acts as a central node to bridge C1 with other components, providing clues for how the processes may be functioning together in cold conditions. 8 d(gi , gj ) X if< gi , gj >∈ E > > < min d(u, v) if< gi , gj >∈ / E & Φ 6= ∅ f (gi , gj ) = φ∈Φ(gi ,gj ) > (u,v)∈φ > : 1 otherwise (4) where Φ(gi , gj ) denotes the set of paths connecting gi and gj ; f (gi , gj ) is the gene distance value if gi and gj are directly connected, or the sum of distance along the shortest path connecting gi and gj if such path exists. If no such connections exist between two genes, the value of f (gi , gj ) is 1. The score of FOS indicates the overall functional similarity of a bicluster using GO annotation. The value close to 1 means that, for most of the genes in cluster graph G, the overall pair-wise similarity of the connected is high, or the similarity value along the path to connect the gene pair is high. In contrast, the value of F OS(G) close to 0 means most of the genes are dissimilar from each other. To evaluate the effectiveness of SKB in identifying functionally related missing genes, we compared SKB against QHB on FOS scores of the resulting biclusters. We chose QHB to compare because that i) QHB performs well in mining biclusters with consistent trends and trends with similar degrees of fluctuations; ii) such comparison shows the ef- fectiveness of the new genes since SKB generates the same biclusters as QHB. To evaluate SKB’s strategy to identify new genes, we also compared SKB with a random selection process, in which we randomly choose the same number of new genes as SKB does from the candidate gene set and add them to each bicluster. Experimental results in Table 5 show that the FOS scores of the biclusters are considerably increased comparing to QHB and the random selection. For example, by introducing an average of 5.1 new genes to the Arabidopsis biclusters with BP annotations, comparing to QHB and the random selection, the FOS score is increased by 2.67 and 2.51 times, respectively. Furthermore, in order to verify the performance of SKB independent to GO annotation, we adopt 11 known cold related motifs [3] and test their existence in the Arabidopsis cold-response genes. Experiment shows that 80.0% new genes share the same cold motifs with the biclustered genes with BP annotation. In contrast, randomly picked new genes only share 47.2% same cold motifs with the biclustered genes. The results correlate well with our biological knowledge that the new genes should have the same cold motifs as the biclustered genes that they connect, since all the biclustered genes are co-expressed under cold treatment. Graph connectivity on the identification of hidden key genes We further tested whether the selection of new genes is significantly different from random with a leave-k-out approach. The SKB phase 2 on the Arabidopsis dataset generated a bicluster graph with 975 nodes. In the graph, we identified and then deleted 255 key genes with the minimal cut algorithm [16]. In addition, we removed 56 marginal genes with the lowest node betweenness scores. This leavek-out process separated the whole network into 25 disconnected subnetworks. Let the candidate new genes be all the removed genes, we tested whether the SKB phase 3 is able to re-discover any bridge genes to connect the disconnected components in the resulting graph. The results showed that SKB added 13 missing genes, all of which are the bridge genes, and thus reduced the number of disconnected subgraphs from 25 to 7. In contrast, a random addition of the same amount (13) of new genes from the candidate gene set results in 22.0 disconnected subgraphs (the average of 1000 repetitions, with standard deviation at 1.1). The results indicate that the SKB results are far from random (t-test p-value< 0.001). 4.3 Functional characterization of the missing genes in Arabidopsis The functional characterization of the new genes for Arabidopsis shows in Figure 5 that there is an enrichment of TFs in the new genes compared to all the genes in the microarray data. TFs are regulate transcriptions and are generally located in nucleus. A TF usually does not co-express substantially with its target genes. Hence, such TFs are usually missing from biclusters based on expression data. The significant and consistent rise of TFs in the new gene set from the highly noisy background on all the three GO categories indicates that SKB is biologically sound. A case study on Arabidopsis dataset is shown in Figure 6. An Arabidopsis bicluster has 26 annotated genes (red). Using BP annotation, the cluster graph is formed with one large component belonging to the cold regulatory process, denoted as C1 , and the other 10 small components. These 11 components are disconnected because they are conceptually dissimilar based on GO annotation. To study why the different processes co-express under cold treatment and how these processes may work together, SKB introduces 7 new genes (yellow) that can connect the biclustered genes, in which M P K3 acts as a central node to bridge C1 with other components, providing clues for how the processes may be functioning together in cold conditions. 5. DISCUSSION AND CONCLUSIONS In this paper, we introduce SKB, a method to effectively improve the biological significance of gene expression biclustering. The delineation of functional relationships and incorporation of such missing genes may help biologists to discover biological processes that are important in a given study and provide clues for how the processes may function together. Experimental results show that SKB can reveal the inter- and intra-bicluster relationships efficiently and accurately, and greatly improve the biological significance of the biclusters. Note that, SKB achieves better performance on yeast than on Arabidopsis, probably because of the different ratio of Gene Ontology annotations of the two organisms. It may be argued that instead of designing a new algorithm to add new genes, a connected cluster graph may be easily obtained by simply increasing the distance threshold. To address this question, we increased the distance thresholds until the same number of components as those connected by SKB was reached. To reach this point, the distance threshold was increased on average from 0.122 to 0.665 for the Arabidopsis dataset with BP annotations. Consequently, the resultant cluster graphs are cliques or clique-like dense graphs, in which the edges are not statistically significant and numerous weak gene associations are present. By contrast, the distance threshold in SKB is constantly low and all the edges are guaranteed to be statistically significant, revealing that SKB can not only recognize functional modules in biclusters, but also increase the intra-cluster relationships and filter weak gene associations effectively. Therefore, SKB is significantly better on graph representation and noise control than simply increasing the distance threshold. It may also be argued that the effort to introduce new gene can be replaced by simply joining two related biclusters because the biclusters are highly overlapped and new genes found for one bicluster may already exist in another bicluster. To address this question, we tested how many new genes identified by SKB were already included in the biclusters. For Arabidopsis, only 12 out of 50 new genes are overlapped with biclustered genes using BP annotation. This indicates that the majority of the new genes cannot be captured by joining related biclusters. Therefore, SKB is more effective than joining existing biclusters because it finds considerably more new genes. In the framework of SKB, we chose QHB because it is effective in identifying hierarchical inter-bicluster relationships, suitable for our skeleton bicluster expectation. With simple extension, SKB can adopt any biclustering method as its first step. We will extend the capability of SKB in two ways in the future. First, heterogeneous domain knowledge, including protein motifs, cis-elements, sequence alignment and others, will be utilized to improve the bicluster performance in addition to the GO annotations. Second, SKB will be further extended to enhance the performance of a gene search engine by grouping functionally similar genes, recommending additional genes that are not in the search result, and visualizing the search result graphically. Competing interests [13] [14] none. Authors’ contributions [15] JC conceived the approach. SJ and XT implemented the algorithm and conducted all the experiments. All wrote the manuscript. Acknowledgement [16] This research was supported by Chemical Sciences, Geosciences and Biosciences Division, Office of Basic Energy Sciences, Office of Science, U.S. Department of Energy (award number DE-FG02-91ER20021). [17] 6. REFERENCES [1] M. Ashburner, C. Ball, J. Blake, et al. Gene ontology: tool for the unification of biology. Nat Genet, 25(1):25–29, 2000. [2] S. Awad, N. Panchy, S.-K. Ng, and J. Chen. Inferring the regulatory interaction models of transcription factors in transcriptional regulatory networks. Journal of bioinformatics and computational biology, 10(05):1250012, 2012. [3] C. Benedict, M. Geisler, J. Trygg, N. Huner, and V. Hurry. Consensus by democracy. using meta-analyses of microarray and genomic data to model the cold acclimation signaling pathway in arabidopsis. Plant Physiol, 141(4):1219–1232, 2006. [4] G. M. Boratyn, S. Datta, and S. Datta. Incorporation of biological knowledge into distance for clustering genes. Bioinformation, 1(10):396, 2007. [5] C. Bron and J. Kerbosch. Algorithm 457: finding all cliques of an undirected graph. Comm ACM, 16(9):575–577, 1973. [6] J. Caldas and S. Kaski. Hierarchical generative biclustering for microrna expression analysis. J Comp Bio, 18(3):251–261, 2011. [7] Y. Cheng and G. Church. Biclustering of expression data. In ISMB, pages 93–103, 2000. [8] P. D’haeseleer. How does gene expression clustering work? Nat Biotech, 23(12):1499–1502, 2005. [9] J. Flores, I. Inza, P. Larrañaga, and B. Calvo. A new measure for gene expression biclustering based on non-parametric correlation. Comput Meth Prog Bio, 112(3):367–397, 2013. [10] T. Gene Ontology Consortium. Gene ontology annotations and resources. Nucl Acid Res, 41(D1):D530–D535, 2013. [11] J. Goncalves and S. Madeira. Latebiclustering: Efficient heuristic algorithm for time-lagged bicluster identification. IEEE ACM T Comput Bi, 11(5):801–813, 2014. [12] D. Huang and W. Pan. Incorporating biological knowledge into distance-based clustering analysis of [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] microarray gene expression data. Bioinformatics, 22(10):1259–1268, 2006. L. Ji, K. Mock, and K. Tan. Quick hierarchical biclustering on microarray gene expression data. In BIBE, pages 110–120, 2006. J. Jun, S. Chung, and D. McLeod. Subspace clustering of microarray data based on domain transformation. In VLDB Workshop on Data Mining in Bioinformatics, 2006. P. Lamesch, T. Z. Berardini, D. Li, D. Swarbreck, C. Wilks, R. Sasidharan, R. Muller, K. Dreher, D. L. Alexander, M. Garcia-Hernandez, et al. The arabidopsis information resource (tair): improved gene annotation and new tools. Nucleic acids research, 40(D1):D1202–D1210, 2012. U. Lauther. A min-cut placement algorithm for general cell assemblies based on a graph representation. In Proceedings of the 16th Design Automation Conference, pages 1–10, 1979. Y. Lazebnik. Can a biologist fix a radio?–or, what i learned while studying apoptosis. Cancer Cell, 2(3):179–182, 2002. P. Lord, D. Robert, B. Andy, and A. Carole. Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics, 19(10):1275–83, 2002. S. Madeira, M. Teixeira, I. Sá-Correia, and A. Oliveira. Identification of regulatory modules in time series gene expression data using a linear time biclustering algorithm. IEEE ACM T Comput Bi, 7(1):153–165, 2010. F. Martella, M. Alfò, and M. Vichi. Biclustering of gene expression data by an extension of mixtures of factor analyzers. Intl J Biostat, 4(1):1–19, 2008. M. Miller, L. Barkal, K. Jeng, A. Herrlich, M. Moss, L. Griffith, and D. Lauffenburger. Proteolytic activity matrix analysis (prama) for simultaneous determination of multiple protease activities. Integ Biol, 3(4):422–438, 2011. M. E. Newman. The structure and function of complex networks. SIAM review, 45(2):167–256, 2003. W. Pan. Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics, 22(7):795–801, 2006. J. Peng, J. Chen, and Y. Wang. Identifying cross-category relations in gene ontology and constructing genome-specific term association networks. BMC bioinformatics, 14(Suppl 2):S15, 2013. J. Peng, H. Li, Q. Jiang, Y. Wang, and J. Chen. An integrative approach for measuring semantic similarities using gene ontology. BMC systems biology, 8(Suppl 5):S8, 2014. J. Peng, S. Uygun, T. Kim, Y. Wang, S. Y. Rhee, and J. Chen. Measuring semantic similarities by combining gene ontology annotations and gene co-function networks. BMC bioinformatics, 16(1):44, 2015. J. Peng, Y. Wang, and J. Chen. Towards integrative gene functional similarity measurement. BMC bioinformatics, 15(Suppl 2):S5, 2014. A. Prelić, S. Bleuler, P. Zimmermann, A. Wille, P. Bühlmann, W. Gruissem, L. Hennig, L. Thiele, and [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] E. Zitzler. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9):1122–1129, 2006. S. Rhee, V. Wood, K. Dolinski, and S. Draghici. Use and misuse of the gene ontology annotations. Nat Rev Genet, 9(7):509–515, 2008. M. Robinson, J. Grigull, N. Mohammad, and T. Hughes. Funspec: a web-based cluster interpreter for yeast. BMC Bioinformatics, 3(35), 2002. A. Schlicker et al. A new measure for functional similarity of gene products based on gene ontology. BMC Bioinformatics, 7:302, 2006. E. Segal, M. Shapira, A. Regev, D. Pe’er, D. Botstein, D. Koller, and N. Friedman. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature genetics, 34(2):166–176, 2003. A. Serin and M. Vingron. Debi: Discovering differentially expressed biclusters using a frequent itemset approach. Algorithms for Molecular Biology, 6(1):18, 2011. J. Sevilla et al. Correlation between gene expression and go semantic similarity. In IEEE ACM T Comput Bi, volume 2, pages 330–333, 2005. F. Sohler, D. Hanisch, and R. Zimmer. New methods for joint analysis of biological networks and expression data. Bioinformatics, 20(10):1517–1521, 2004. M. E. Studham, A. Tjärnberg, T. E. Nordling, S. Nelander, and E. L. Sonnhammer. Functional association networks as priors for gene regulatory network inference. Bioinformatics, 30(12):i130–i138, 2014. J. Vogel, D. Zarka, H. A. Van B., S. Fowler, and M. Thomashow. Roles of the cbf2 and zat12 transcription factors in configuring the low temperature transcriptome of arabidopsis. 41:195–211, 2005. O. Voggenreiter, S. Bleuler, and W. Gruissem. Exact biclustering algorithm for the analysis of large gene expression data sets. BMC Bioinformatics, 13(Suppl 18):A10, 2012. H. Wang, W. Wang, J. Yang, and P. Yu. Clustering by pattern similarity in large data sets. In SIGMOD, pages 394–405, 2002. J. Yang, W.Wang, H.Wang, and P. Yu. δ-clusters: Capturing subspace correlation in a large data set. In ICDE, pages 517–528, 2002. Y. Yang, L. Han, Y. Yuan, J. Li, N. Hei, and H. Liang. Gene co-expression network analysis reveals common system-level properties of prognostic genes across cancer types. Nat Comm, 5, 2014. Z. Zhang, M. Teo, B. Ooi., and K. Tan. Mining deterministic biclusters in gene expresssion data. In BIBE, pages 283–290, 2004.