Mining Gene Expression Datasets using Density-based Clustering

Seokkyung Chung, Jongeun Jun, Dennis McLeod
Department of Computer Science and Integrated Media System Center
University of Southern California, Los Angeles, California 90089-0781, USA
[seokkyuc, jongeunj, mcleod]@usc.edu

ABSTRACT
We propose a mining framework that supports the identification of useful patterns based on data clustering. Given the recent advancement of microarray technologies, we focus our attention on mining gene expression datasets; in particular, we are interested in mining a yeast cell cycle dataset. In molecular biology, a set of co-expressed genes tends to share a common biological function. Moreover, co-expressed genes can be further used to identify mechanisms of gene regulation and interaction. Thus, it is essential to develop an effective clustering algorithm that identifies the set of co-expressed genes. Toward this end, we propose genome-wide expression clustering based on a k-nearest neighbor search. By addressing the strengths and limitations of previous density-based clustering approaches, we present a novel density-based clustering algorithm that utilizes a neighborhood defined by k-nearest neighbors. Experimental results indicate that the proposed method successfully identifies co-expressed gene clusters in a yeast cell cycle dataset.

Categories and Subject Descriptors: H.4 [Information Systems Applications]: Data mining; I.5.3 [Pattern Recognition]: Clustering

General Terms: Algorithms

Keywords: Clustering, Bioinformatics, Density Estimation, Gene Expression Analysis

1. INTRODUCTION
With the recent advancement of DNA microarray technologies, the expression levels of thousands of genes can be measured simultaneously [9]. The obtained data are usually organized as a matrix (also known as a gene expression profile), which consists of n columns and m rows. The columns represent genes (usually genes of the whole genome), and the rows correspond to the samples (e.g.,
various tissues, experimental conditions, or time points). Given this rich amount of gene expression data, the goal of microarray analysis is to extract hidden knowledge (e.g., similarity or dependency between genes) from this matrix. The analysis of gene expression may identify mechanisms of gene regulation and interaction, which can be used to understand the function of a cell [11]. Moreover, comparison between expression in a diseased tissue and a normal tissue can further enhance our understanding of the disease pathology [13]. Therefore, data mining, which transforms a raw dataset into useful higher-level knowledge, has become a must in life science [21].

One of the key steps in gene expression analysis is to cluster genes that show similar patterns. By identifying a set of gene clusters, we can hypothesize that the genes clustered together tend to be functionally related. With the abundance of microarray data, genome-wide expression data clustering has received significant attention in the bioinformatics research community during the past few years, with approaches ranging from hierarchical clustering [9, 22], self-organizing maps [25], and neural networks [14] to algorithms based on Principal Components Analysis [31] or Singular Value Decomposition [6, 9, 15], subspace clustering [27, 30], and graph-based approaches [29]. However, less clustering research has been conducted in terms of k-nearest neighbor density estimation. In this paper, we propose a density-based clustering algorithm that utilizes the density of a neighborhood defined by k-nearest neighbors. In addition, we explore optimization methods for fast KNN (k-nearest neighbor) density estimation.

1.1 Goal
Since gene expression datasets consist of measurements across various conditions (or time points), they are characterized by high dimensionality, huge volume, and noise. Thus, clustering algorithms must be able to address and exploit these features of the datasets.
Although many clustering algorithms have been studied in statistics, data mining, and machine learning in the past few decades, to address the special constraints of gene expression datasets we propose a new clustering algorithm that satisfies the following requirements.

1. Many clustering algorithms require a user to provide the number of clusters, which is hard to determine beforehand. Thus, the algorithm should be equipped with the ability to identify the number of clusters automatically.

2. Due to the high dimensionality and extremely large volume of gene expression datasets, successful clustering algorithms should be scalable in the number of genes and the number of dimensions (e.g., conditions).

3. Many clustering algorithms are very sensitive to noise and outliers. Since microarray data contain a significant amount of noise, clustering algorithms must be able to identify noise and remove it if necessary.

4. As discussed in Jiang et al. [17], co-expressed gene clusters may be highly connected by a large number of intermediate genes (i.e., genes located between one cluster and another). Clustering algorithms should not be confused by genes in such a transition region; that is, simply merging two clusters connected by a set of intermediate genes should be avoided. Thus, the ability to detect the "genes in the transition region" is helpful.

To address the above requirements, we propose a novel clustering algorithm that is relevant for gene expression datasets. Our clustering algorithm exploits density-based clustering, utilizing a neighborhood defined by k-nearest neighbors. The proposed algorithm first efficiently identifies a k-nearest neighbor list for each point. Next, the KNN density of each point is defined by utilizing its k-nearest neighbor list. Based on the density of each point, core points (genes with high density), border points (genes with medium density), and noise points (genes with low density) are identified.
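As a purely illustrative sketch of this classification step (not the paper's exact procedure), the following Python estimates a KNN density as the inverse of the summed distance to each gene's k nearest neighbors and labels genes core, border, or noise with two assumed thresholds; all function names and thresholds here are ours:

```python
import numpy as np

def knn_density(X, k):
    # X: n_genes x m_timepoints expression matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    np.fill_diagonal(d, np.inf)               # a gene is not its own neighbor
    knn_d = np.sort(d, axis=1)[:, :k]         # distances to the k nearest neighbors
    return 1.0 / knn_d.sum(axis=1)            # small neighbor distances = high density

def classify(density, delta, epsilon):
    # core: density above delta; noise: density below epsilon; border: in between
    labels = np.full(density.shape, "border", dtype=object)
    labels[density > delta] = "core"          # expected to lie well inside a cluster
    labels[density < epsilon] = "noise"       # discarded before clustering
    return labels

# A tight synthetic cluster of 30 genes plus 5 scattered genes:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (30, 5)), rng.normal(8, 1.0, (5, 5))])
dens = knn_density(X, k=5)
labels = classify(dens, np.quantile(dens, 0.8), np.quantile(dens, 0.2))
```

Here the thresholds are taken as density quantiles only for the demonstration; the paper's own threshold selection is discussed in Section 3.5.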
Since a core point has high KNN density, it is expected to be located well inside a cluster (i.e., to be a representative of a cluster). Thus, instead of performing clustering on the whole dataset, conducting clustering on the set of core points can produce a rough cluster structure. After that, border points are used to refine the cluster structure by assigning them to the most relevant cluster. Note that we do not aim to cluster all genes (i.e., noise points or points in a transition region may not be assigned to any cluster). Since our goal is to identify sets of genes with strong coherent patterns, it may be necessary to remove many of the genes during the clustering process. While this approach does not provide a complete organization of all genes, it can extract the "essentials" of the information in genome-wide expression data.

In this paper, we mainly focus on time-course gene expression data (i.e., expression levels of genes are monitored during some time interval). In particular, we focus on a yeast cell cycle dataset. However, the proposed algorithm can be easily extended to other kinds of microarray datasets.

Notation | Meaning
n        | The total number of genes
m        | The total number of time points
X        | An m × n gene expression profile matrix
x_i      | The i-th gene
x_ij     | The j-th feature of the i-th gene
N_k(x_i) | The k-nearest neighbor list for x_i (excluding x_i)
P        | The set of core points
|C_i|    | The size of cluster C_i
CP       | The set of core clusters (before refinement)
C        | The set of clusters (after refinement)
K        | The number of clusters
δ        | A threshold that determines a core point
ε        | A threshold that determines a noise point

Table 1: Summary of notations

1.2 Our Contributions
In this paper, we present a clustering algorithm that addresses the constraints discussed in Section 1.1. Recent database mining research has proposed density-based clustering algorithms such as DBSCAN [10] and Shared Nearest Neighbors (SNN) clustering [8].
In addition to incorporating the ideas of these approaches (e.g., core points, border points, noise points), and by addressing the limitations of previous density-based clustering methods, we present a novel KNN density estimation clustering algorithm that is suited to producing co-expressed gene clusters.

One of the key limitations of the proposed method is the high computational complexity of KNN density estimation. That is, since a k-nearest neighbor list needs to be constructed for each gene, the time complexity of our approach is O(n²), where n is the number of genes. However, the complexity can be reduced to O(n log n) by utilizing a dimensionality reduction scheme. We explore the details of these optimizations in Section 4.

1.3 Organization
The remainder of this paper is structured as follows. We present the background of this paper in Section 2. In Section 3, we explain the proposed clustering algorithm. Section 4 explores the dimensionality reduction step for an efficient neighborhood search. In Section 5, we briefly review the related work, and highlight the strengths and weaknesses of previous approaches in comparison with ours. Finally, we conclude the paper and present our future plans in Section 6. Table 1 summarizes the notation used throughout this paper.

2. BACKGROUND
Throughout this paper, we explain our methodology based on Spellman et al.'s yeast cell cycle dataset (a.k.a. Spellman's dataset) [22]. Using cDNA arrays, Spellman et al. measured the genome-wide mRNA levels for 6,108 yeast ORFs simultaneously over approximately two cell cycle periods in a yeast culture synchronized by α factor, relative to a reference mRNA from an asynchronous yeast culture. The yeast cells were sampled at 7-minute intervals for 119 minutes, yielding a total of 18 time points after synchronization.

Sampling time (min) | 0-7  | 14-21 | 28-35 | 42 | 49-56 | 63-70 | 77-84 | 91-98 | 105 | 112-119
Cell cycle phase    | M/G1 | G1    | S     | G2 | M     | M/G1  | G1    | S     | G2  | M

Table 2: Illustration of the cell cycle in Spellman's dataset

Among the 6,108 genes, Spellman et al. identified 800 genes whose expression is cell-cycle regulated. To find a threshold value that determines the significance of cell-cycle regulation, they utilized previously known gene sets and published datasets. Table 2 illustrates the cell cycle as defined by Spellman et al. After the 800 yeast genes were identified as cell-cycle regulated, clustering was performed to classify the genes into different clusters (according to similarity of expression).

Among the 6,108 genes, we removed the genes with missing values and obtained 4,418 genes. Thus, the dataset is organized as an 18 × 4,418 matrix with equally spaced sampling time points. In this paper, rather than trying to identify cell-cycle-regulated gene clusters (by relying on external knowledge), we perform unsupervised clustering on the 4,418 genes. Thus, non-cell-cycle-regulated gene clusters as well as cell-cycle-regulated gene clusters are expected to be discovered.

3. PROPOSED ALGORITHM
In Section 3.1, similarity metrics for density estimation are described. Section 3.2 introduces how to define the density of each gene. Section 3.3 explains how to identify a rough cluster structure, and Section 3.4 illustrates how to refine the cluster structure. In Section 3.5, we discuss the input parameters. Finally, in Section 3.6, we present experimental results.

3.1 Similarity Metric
The first step in KNN density estimation is to decide on a distance metric (or similarity metric). One of the most commonly used metrics to measure the distance between two data items is Euclidean distance. The distance between x_i and x_j in m-dimensional space is defined as follows:

d(x_i, x_j) = Euclidean(x_i, x_j) = \sqrt{\sum_{d=1}^{m} (x_{id} - x_{jd})^2}    (1)

Since Euclidean distance emphasizes the individual magnitudes of each feature, it does not account for shifting or scaling patterns very well. In gene expression datasets, the overall shape of a gene expression pattern is more important than its magnitude. To address the shifting and scaling problem, each gene can be standardized as follows:

\hat{x}_{id} = (x_{id} - \mu_i) / \sigma_i    (2)

where \hat{x}_i is the standardized vector of x_i, \mu_i is the mean of x_i, and \sigma_i is the standard deviation of x_i.

[Figure 1: Plot of the top k-nearest neighbors for (a) a high density gene and (b) a low density gene, when k = 30]

Another widely used metric for time-series similarity is Pearson's correlation coefficient. Given two genes x_i and x_j, Pearson's correlation coefficient r(x_i, x_j) is defined as follows:

r(x_i, x_j) = \frac{\sum_{d=1}^{m} (x_{id} - \mu_i)(x_{jd} - \mu_j)}{\sqrt{\sum_{d=1}^{m} (x_{id} - \mu_i)^2} \sqrt{\sum_{d=1}^{m} (x_{jd} - \mu_j)^2}}    (3)

Note that r(x_i, x_j) takes values between 1 (perfect positive linear correlation) and -1 (perfect negative linear correlation); a value of 0 indicates no linear correlation. If the data are standardized by subtracting the mean and dividing by the standard deviation, then it can be shown that Euclidean distance is related to the Pearson correlation coefficient as follows:

r(x_i, x_j) = 1 - \frac{d^2(\hat{x}_i, \hat{x}_j)}{2m}    (4)

Based on the above relation, the effectiveness of a clustering algorithm is expected to be similar regardless of the similarity metric. Thus, throughout this paper, we explain our methodology using the Pearson correlation coefficient; similarity and correlation are used interchangeably.

3.2 Density Estimation
One of the important steps in density-based clustering is how to estimate the density of each point. In DBSCAN [10], the density of an object is defined by the number of objects in a region of specified radius around the point. This approach is similar to a histogram-based method. Another approach is the kernel-based method, which assigns a weight to each point [7, 18]; that is, the points at the edge of the search area have less influence on the density estimator than the other points. A Gaussian kernel function is normally used for this purpose.

In this paper, we mainly focus on KNN density estimation. In the histogram-based or kernel-based approach, the volume around a point x is fixed. In contrast, KNN density estimation fixes the number of points k in advance, and the size of the volume around a point x is adjusted to include the k nearest neighbors of x. Based on this, a probability density function for x can be defined as follows:

p(x) = \frac{k/n}{V}    (5)

where n is the total number of points, and V is the size of the volume that includes the k nearest neighbors of x. Hence, in high-density regions the volume is expected to be small, while in low-density regions the volume is expected to be large. Another approach to defining KNN density is to utilize the sum of the distances from the k nearest neighbors to x. In this paper, we use this second notion of KNN density. Figure 1 illustrates the intuition behind this approach: for a high-density gene, the sum of the distances from its nearest neighbors (or the size of the volume) is relatively smaller than for a low-density gene.

[Figure 2: Sample examples of (a) core clusters and (b) the corresponding coherent expression patterns]

3.3 Rough Cluster Identification based on Core Points
In density-based clustering, clusters are defined as dense regions (i.e., sets of core points), and each dense region is separated from the others by low-density regions (i.e., sets of border points). Thus, once the density of each point is estimated, the next step is to identify core, border, and noise points. A core point is a point whose KNN density is greater than a user-defined threshold (δ). Similarly, a noise point is a point whose KNN density is less than a user-defined threshold (ε). Noise points are discarded in the clustering process, since we are mainly concerned with highly co-expressed patterns. A non-core, non-noise point is considered a border point. We discuss how to determine appropriate values for δ and ε in Section 3.5.

A rough cluster structure can then be derived by performing clustering on the core points. This step is devised based on the following two observations. First, since border and noise points are excluded in the rough cluster identification step, the clusters are expected to be well separated from each other: if two core points belong to different clusters, they are expected to be far from each other, and if two points belong to the same cluster, they are expected to be close to each other, since the densities of the points are high. Second, even though x_i and x_j are similar, and x_j and x_k are similar, x_i and x_k can be dissimilar, since the similarity relation does not satisfy transitivity. This can be partially addressed by adjusting x_k using the k-nearest neighbors of x_k; that is, a representative of the k-nearest neighbors of x_k can be used for the similarity computation.

Based on the above observations, the algorithm for rough cluster identification is outlined as follows:

Input: A set of core points (P)
Output: A set of core point clusters (CP)

1. Initially, the highest-density gene x_0 forms a singleton cluster C_0, and x_0 is removed from P.
2. The next highest-density gene x_i in P is chosen and removed from P. The similarity between x_i and the previously generated clusters is computed as the similarity between x_i and the representative (e.g., center) of each cluster.
3. The cluster C_i that has the maximum proximity to x_i is identified.
4. If the similarity between x_i and C_i exceeds a predefined threshold, then x_i is assigned to C_i.
5. If not, x_i is adjusted using the k-nearest neighbors of x_i, and the similarity between the adjusted x_i and C_i is recomputed. If this similarity exceeds the threshold, then x_i is assigned to C_i. Otherwise, x_i forms a new singleton cluster C_j, and C_j is added to CP.
6. Repeat steps 2-5 until P becomes empty.

Figure 2 plots sample clusters. Figure 2(a) shows sample core clusters, and Figure 2(b) illustrates the corresponding coherent patterns that characterize the trend of the expression levels of the genes within each cluster. A coherent pattern of a cluster is defined by a medoid of the cluster. As illustrated, the first two plots show clear cell-cycle-regulated patterns: the first cluster contains a set of genes whose expression values peak in the G1 phase, and the second cluster contains a set of genes whose expression values peak in the G2 phase. These clusters were also identified as cell-cycle-regulated patterns by Spellman et al. On the other hand, as illustrated by the third cluster, non-cell-cycle-regulated clusters were also identified. This can be explained from different perspectives (e.g., external effects or as-yet-unrevealed reasons). For example, if DNA is damaged, then it is necessary to block the cell cycle to repair the DNA. Thus, forming a cluster of non-cell-cycle-regulated genes (with similar expression patterns), and providing an interpretation of this cluster (based on external knowledge), is an interesting task.

[Figure 3: Plot of the 1000 lowest density genes (when k = 60)]

[Figure 4: Plot of the 600 highest density genes (when k = 60)]

3.4 Cluster Refinement based on Border Points
Once a rough cluster structure is obtained, the next step is to identify, for each border point, the relevant cluster that can host it. The proposed clustering algorithm exploits a characteristic of neighborhoods: the label of an object is influenced by the attributes of its neighbors. Examples of such attributes are the labels of the neighbors, or the percentage of neighbors that fulfill a certain constraint. This idea translates into the clustering setting as follows: the cluster label of an object depends on the cluster labels of its neighbors. To assign a gene x_i to an existing cluster, the cluster that can host x_i needs to be identified using the neighborhood of x_i. If such a cluster exists, then x_i is assigned to the cluster. Otherwise, x_i is identified as a transition point, and is not assigned to any cluster.

Toward this end, the set of candidate clusters (C_{x_i}) is identified by selecting the clusters that contain any gene belonging to N_k(x_i). Subsequently, the cluster that can host x_i is identified using one of the following two methods.

1. M1: Considering the size of the overlapped region. Select the cluster that has the largest number of its members in N_k^C(x_i), where N_k^C(x_i) is defined as follows:

N_k^C(x_i) = N_k(x_i) ∩ P    (6)

This approach only considers the number of genes in the overlapped region, and ignores the proximity between the neighbors and x_i.

2. M2: Exploiting weighted voting. The similarities between each neighbor of x_i and the candidate clusters are measured, and the similarity values are aggregated using weighted voting. Thus, each neighbor votes for its cluster with a weight proportional to its proximity to x_i. Let w_{ij} be a weight representing the proximity of a neighbor x_j to x_i.
Then, the most relevant cluster C_l is selected based on the following formula:

C_l = argmax_{C_k ∈ C_{x_i}} \sum_{x_j ∈ N_k^C(x_i)} w_{ij} r(x_j, C_k)    (7)

3.5 Parameterization
One of the key weaknesses of density-based clustering is determining the user-defined parameters. Many previously proposed density-based algorithms are known to be sensitive to their input parameters. For instance, in DBSCAN [10], SNN [8], or DHC [17], MinPts (the minimum number of points within a neighborhood) must be determined in order to decide whether a given point is a core point. In our approach, there are three important parameters that should be determined beforehand: δ (the user-defined threshold that separates core points from border points), ε (the user-defined threshold that separates border points from noise points), and k (the length of the nearest neighbor list). In this section, we discuss how to decide the values of these parameters.

The neighborhood list size k determines the granularity of the clusters. If k is too small, then the algorithm tends to identify a large number of small clusters. In contrast, if k is too large, then a few large clusters are formed. In what follows, we explain how to decide the values of δ and ε when k is fixed.

A sharp change in the slope of the density values can be used to identify the number of core and noise points. To this end, we plot the density of the 1000 genes that have the lowest density values (Figure 3), and the density of the 600 genes that have the highest density values (Figure 4). As shown in Figure 3, the density decreases sharply over the range 850-1000. Thus, we can determine the number of noise points based on Figure 3.
However, the slope in Figure 4 decreases rather smoothly. Consequently, it is difficult to identify the number of core points based on this plot alone.

     | C1   | C2    | C3    | C4   | C5    | C6    | C7
C1   | 1.00 | -0.44 | -0.70 | 0.36 | 0.16  | 0.67  | -0.03
C2   |      | 1.00  | 0.32  | 0.02 | 0.26  | -0.05 | -0.08
C3   |      |       | 1.00  | 0.27 | -0.27 | -0.60 | -0.41
C4   |      |       |       | 1.00 | 0.01  | 0.40  | -0.67
C5   |      |       |       |      | 1.00  | 0.16  | -0.07
C6   |      |       |       |      |       | 1.00  | -0.08
C7   |      |       |       |      |       |       | 1.00

Table 3: Illustration of between-cluster similarity before refinement

     | C1   | C2    | C3    | C4   | C5    | C6    | C7
C1   | 1.00 | -0.38 | -0.63 | 0.34 | 0.14  | 0.57  | -0.02
C2   |      | 1.00  | 0.28  | 0.12 | 0.25  | 0.12  | -0.18
C3   |      |       | 1.00  | 0.36 | -0.26 | -0.56 | -0.37
C4   |      |       |       | 1.00 | -0.02 | 0.31  | -0.63
C5   |      |       |       |      | 1.00  | 0.16  | -0.09
C6   |      |       |       |      |       | 1.00  | -0.05
C7   |      |       |       |      |       |       | 1.00

Table 4: Illustration of between-cluster similarity after refinement

To address this problem, we first estimate the following probability:

Prob_c(x_i) = \frac{|N_k(x_i) ∩ P|}{|N_k(x_i)|}    (8)

Assuming that the core point set P is fixed, Equation 8 measures the fraction of the k-nearest neighbors of x_i that are core points. For a point located well inside a cluster, Prob_c(x_i) is expected to be high; in contrast, for other points (e.g., border points), Prob_c(x_i) is expected to be relatively low. Thus, Prob_c(x_i) measures the actual "coreness" of each point based on the neighborhood of x_i. This probability can be used as a guideline when determining the value of δ. For example, to avoid producing singleton clusters in the rough cluster identification step (a core point should not form a singleton cluster in a high-density region), we choose a moderate value of δ such that Prob_c(x_i) ≠ 0 for all x_i ∈ P.
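Equation 8 is straightforward to compute once the k-nearest neighbor lists and the core set are known; a minimal sketch (function and variable names are illustrative, not the paper's):

```python
# Sketch of Equation 8: the fraction of a gene's k nearest neighbors
# that are themselves core points.
def prob_core(knn_lists, core_set):
    # knn_lists: gene index -> list of its k nearest neighbors, N_k(x_i)
    # core_set:  set of indices of the core points, P
    return {i: len(set(nbrs) & core_set) / len(nbrs)
            for i, nbrs in knn_lists.items()}

# Three genes, two of them core; gene 2's neighbors are both core:
probs = prob_core({0: [1, 2], 1: [0, 2], 2: [0, 1]}, {0, 1})
# probs == {0: 0.5, 1: 0.5, 2: 1.0}
```

A candidate value of δ can then be screened by checking that every point it would mark as core has a nonzero coreness probability.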
k    | K-means | M1     | M2
20   | 0.4027  | 0.6176 | 0.6191
30   | -       | 0.5859 | 0.5862
40   | -       | 0.6318 | 0.6328
50   | -       | 0.5824 | 0.5828
60   | -       | 0.5682 | 0.5692
70   | -       | 0.5997 | 0.6007

Table 5: Evaluation of the refined cluster structure based on φ1

k    | K-means | M1     | M2
20   | 0.2213  | 0.2323 | 0.2320
30   | -       | 0.2544 | 0.2542
40   | -       | 0.2307 | 0.2307
50   | -       | 0.2708 | 0.2702
60   | -       | 0.2673 | 0.2668
70   | -       | 0.2678 | 0.2674

Table 6: Evaluation of the refined cluster structure based on φ2

3.6 Experimental Results
For the empirical evaluation of the proposed clustering algorithm, we first describe the evaluation criteria. In order to measure the within-cluster similarity of a cluster structure, the sum of the average pairwise similarities between genes assigned to the same cluster is computed as follows:

φ1 = \frac{1}{K} \sum_{r=1}^{K} \left( \frac{1}{|C_r|^2} \sum_{x_i, x_j ∈ C_r} r(x_i, x_j) \right)    (9)

In non-uniformly distributed datasets, φ1 favors a large number of small clusters. To compensate for this characteristic of φ1, we also use the following criterion, which measures the average between-cluster similarity of a cluster structure:

φ2 = \frac{1}{K^2} \sum_{i=1}^{K} \sum_{j=1}^{K} |R(C_i, C_j)|    (10)

Note that R(C_i, C_j) is defined as the similarity between the centroid vectors of C_i and C_j. In addition, since identifying anti-correlated genes is not our goal, the absolute value of R is taken. In sum, we favor a clustering solution with the largest value of φ1 and the smallest value of φ2 (i.e., we maximize within-cluster similarity and minimize between-cluster similarity).

Table 3 and Table 4 illustrate sample results on between-cluster similarity before and after refinement, respectively (due to space limitations, we only show 7 clusters). As shown in Table 3, since only core points are considered, the similarities between different clusters are low. As illustrated in Table 4, since the centroids of the two clusters are used to compute similarity, the between-cluster similarity values do not necessarily increase even though border points are added to the clusters. That is, if the border points added to C_i are located in the opposite direction from C_j, then those border points move the centroid of C_i away from C_j; in contrast, if the border points added to C_i are located in the direction of C_j, then they move the centroid of C_i toward C_j, and the similarity value increases. In either case, we observed that the between-cluster similarity does not change significantly before and after the refinement step. This supports the effectiveness of our refinement step.

We evaluated our algorithm in terms of φ1 and φ2. K-means clustering [7] was used as a baseline for comparison. In K-means, since the number of clusters must be determined beforehand, we tried different values of K and chose the smallest K such that increasing K did not much decrease the average distance of points to their cluster centroids. To be fair, we ran K-means clustering multiple times and chose the best result. We fixed k = 40 and obtained the value of δ based on the method discussed in Section 3.5. We then performed clustering while changing the value of k (keeping δ fixed).

Table 5 compares the results based on φ1. As discussed in Section 3.4, M1 is the method that considers the size of the overlapped region, and M2 is the method that utilizes weighted voting in the refinement step. As shown, M1 and M2 outperform K-means clustering. This is due to the fact that K-means clustering is sensitive to noise (i.e., a small amount of noise can significantly influence the centroid values of clusters). Moreover, K-means clustering can only identify spherical clusters, while the proposed method can detect clusters of different shapes. In addition, as the value of k increases, φ1 tends to decrease, since φ1 favors small clusters. However, we observed that the value of φ1 increases at k = 40. This supports our argument that the relationship between δ and k is correctly determined.

Table 6 compares the results based on φ2. Since small local variations in similarity are ignored for large k, φ2 for M1 and M2 tends to increase as k increases (and the number of clusters decreases). However, at k = 40, we observed that φ2 drops. This also supports the effectiveness of our threshold strategy.

[Figure 5: Comparison of Φ = φ1/φ2 for M1, M2, and K-means as the neighborhood list size k varies]

Figure 5 illustrates the overall performance of the algorithms. The x-axis represents the value of k, and the y-axis represents Φ = φ1/φ2. Note that a large value of Φ implies a better clustering solution. As depicted, the graph has a peak at k = 40. Therefore, based on the above empirical observations, determining δ (when k is fixed) using the methods discussed in Section 3.5 is shown to be effective. This is a significant improvement over previous density-based clustering approaches, since previous work is known to be sensitive to input parameters (e.g., MinPts).

4. EFFICIENT DENSITY ESTIMATION
For each gene, since all pairwise distances need to be computed to find the k nearest neighbors, the worst-case time complexity of our clustering algorithm is O(n²), where n is the number of genes. For low-dimensional datasets, the time complexity of our method can be reduced to O(n log n) by utilizing spatial data structures [2, 19, 4]. However, as discussed in Weber et al. [28], even at moderate dimensionality (e.g., 10) a sequential scan outperforms the best known indexing structures, such as X-trees [4] or SR-trees [19], and the dimensionality of gene expression datasets is much higher still. Therefore, dimensionality reduction needs to be performed on gene expression datasets. Section 4.1 presents a dimensionality reduction algorithm based on Singular Value Decomposition.
In Section 4.2, we discuss why dimensionality reduction using Singular Value Decomposition is effective for gene expression datasets.

4.1 Singular Value Decomposition Approach
SVD (Singular Value Decomposition) has been widely used in time-series databases [20] and information retrieval [3]. The basic intuition behind SVD is to examine the entire dataset and rotate the original axes so that variance is maximized along the first few dimensions. Thus, a dimensionality reduction effect can be achieved by keeping the first few dimensions while losing the least information. The following theorem provides the mathematical background of SVD [12].

Theorem 1. Given an m × n matrix X, we can decompose X as follows:

X = U Σ V^t    (11)

where U is a column-orthonormal m × r matrix (of left singular vectors), r is the rank of X, Σ is a diagonal r × r matrix containing the singular values of X, and V is a column-orthonormal n × r matrix (of right singular vectors).

Proof. Refer to [12].

Without loss of generality, we can assume that the entries of Σ (the singular values σ_i of X) are arranged in decreasing order. The beauty of SVD lies in the fact that the number of dimensions can be reduced by discarding the insignificant dimensions (i.e., the smallest singular values). Hence, X_k can be obtained by keeping the first k singular values and discarding the remaining r - k singular values together with the corresponding left and right singular vectors of X. The reduced matrices are denoted X_k, U_k, Σ_k, and V_k, respectively.

Theorem 2. The matrix C = X X^t is a symmetric matrix, which can be decomposed as follows:

C = U Σ² U^t    (12)

The left and right singular vectors of C both correspond to the left singular vectors of X (i.e., U). In addition, the eigenvalues of C correspond to the squares of the singular values of X.

Proof. Refer to [12].
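The two theorems suggest a cheap way to compute the decomposition when m is much smaller than n: diagonalize the small m × m matrix X X^t and recover V from it. A sketch of this shortcut in Python (illustrative only; it assumes X has full row rank so that no singular value is zero):

```python
import numpy as np

def svd_via_gram(X):
    # For an m x n matrix with m << n: get U and the singular values from
    # the m x m Gram matrix C = X X^T (Theorem 2), then recover V via
    # V = X^T U Sigma^{-1} (Theorem 1).  Assumes full row rank.
    C = X @ X.T                           # m x m, cheap when m is small
    eigvals, U = np.linalg.eigh(C)        # eigenvalues of C = squared singular values
    order = np.argsort(eigvals)[::-1]     # arrange in decreasing order
    eigvals, U = eigvals[order], U[:, order]
    sigma = np.sqrt(np.clip(eigvals, 0.0, None))
    V = X.T @ U / sigma                   # right singular vectors, one per column
    return U, sigma, V
```

For an 18 × 4,418 profile this diagonalizes only an 18 × 18 matrix instead of factoring the full profile; the singular values should agree with `np.linalg.svd` applied to the original matrix.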
[Figure 6: Comparison of reconstruction error for DWT and SVD. The x-axis is the number of coefficients retained; the y-axis is the percent error.]

[Figure 7: Illustration of reconstructed SVD (original vs. 7 coefficients retained).]

Note that the time complexity of a naive SVD computation is O(nm² + mn² + n³) = O(n³) (since m ≪ n). However, we can reduce the computational complexity when dealing with gene expression datasets. Given the fact that the number of genes (n can be more than 10,000) is much larger than the number of conditions or time points (m is usually less than 100), instead of computing the SVD of the m × n matrix directly, U and Σ are first obtained by computing the SVD of X X^t, based on Theorem 2. After that, based on Theorem 1, V can be constructed as follows:

V = X^t U Σ⁻¹    (13)

Since matrix multiplication between an l × m matrix and an m × n matrix takes O(lmn), computing X X^t and constructing V take O(m²n) and O(nmr), respectively. Thus, the complexity of the SVD computation for X can be reduced to O(m²n + r³ + nmr). This is a significant improvement over O(n³) since r, m ≪ n.

Once the matrix is decomposed, each gene x can be projected onto a point x̂ in k-dimensional space as follows:

x̂ = x^t Uk Σk⁻¹    (14)

Thus, we can build a multi-dimensional index structure using x̂.

4.2 Discussion
Although SVD has been utilized in gene expression clustering research, the main purpose of the previous approaches was to preprocess the data before clustering [6, 9, 15]. In contrast, our main aim here is to efficiently support similarity search in the truncated SVD space. One of the most widely used techniques for dimensionality reduction in time-series datasets is the DWT (Discrete Wavelet Transform) [5]. By transforming time-series data into the time-frequency domain, the first few DWT coefficients are indexed through a multi-dimensional index structure. The basic motivation behind this approach is that the DWT can preserve the essentials of the data in the first few coefficients.
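The m ≪ n shortcut of Theorem 2 and Equations (13)–(14) can be sketched as follows; the sizes and random data are illustrative assumptions.

```python
import numpy as np

# Instead of running SVD on the full m x n matrix, eigendecompose the small
# m x m matrix C = X X^t to obtain U and the singular values, then recover
# V = X^t U Sigma^{-1} (Eq. 13).
rng = np.random.default_rng(1)
m, n = 18, 3000
X = rng.standard_normal((m, n))

C = X @ X.T                                   # m x m, cheap to decompose
eigvals, U = np.linalg.eigh(C)                # eigh returns ascending order
order = np.argsort(eigvals)[::-1]             # sort descending
U = U[:, order]
sigma = np.sqrt(np.clip(eigvals[order], 0.0, None))   # singular values of X

V = X.T @ U @ np.diag(1.0 / sigma)            # Eq. (13): V = X^t U Sigma^{-1}
assert np.allclose(X, U @ np.diag(sigma) @ V.T)       # X = U Sigma V^t holds

# Eq. (14): project one gene (a column x of X) onto k dimensions.
k = 5
x = X[:, 0]
x_hat = x @ U[:, :k] @ np.diag(1.0 / sigma[:k])
print(x_hat.shape)  # (5,)
```

The expensive steps touch the n-sized dimension only through the O(m²n) product X X^t and the O(nmr) construction of V, matching the complexity analysis above.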
However, the DWT is not effective for dimensionality reduction of gene expression datasets. To observe this, we randomly selected 3,000 genes (that had no missing values) from Spellman's dataset. Since the length of the time series needs to be an integral power of 2 for the Haar wavelet, for simplicity, instead of padding the time series with zeros, we removed the last two time points of each time series. Thus, the size of the gene expression profile X becomes 16 × 3,000.

[Figure 8: Illustration of reconstructed DWT when the largest 7 coefficients (in absolute value) are retained.]

Figure 6 compares the average reconstruction error of X. The x-axis represents the number of coefficients retained, and the y-axis represents the relative error, computed as ‖X − X′‖ / ‖X‖ × 100, where X is the original time-series data and X′ is the data reconstructed from the compressed representation. Although no false dismissals are guaranteed (since DWT and SVD are orthonormal transforms), if the reconstruction error is large, the number of false hits increases. Thus, the reconstruction error needs to be minimized in order to reduce the cost of post-processing. As shown in Figure 6, SVD outperforms DWT when the number of dimensions retained ranges from 1 through 11. Figure 7 also illustrates how well SVD can retain the basic shapes of a time series (up/down peaks).

Assuming that the transformation is orthonormal, in order to minimize the reconstruction error, keeping the largest i coefficients (in absolute value) is better than keeping the first few coefficients. Figure 8 illustrates this.
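The relative-error measure used in Figure 6 can be computed as in the sketch below for a truncated SVD; the 16 × 3,000 shape mirrors the trimmed dataset described above, but the data here are random, so the numbers are illustrative only.

```python
import numpy as np

# Percent reconstruction error 100 * ||X - X'|| / ||X|| for a rank-k SVD
# truncation, evaluated for every possible k.
rng = np.random.default_rng(2)
X = rng.standard_normal((16, 3000))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

def percent_error(k):
    Xr = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # reconstruction from k coeffs
    return 100.0 * np.linalg.norm(X - Xr) / np.linalg.norm(X)

errors = [percent_error(k) for k in range(1, 17)]

# Retaining more singular values never increases the error, and at full
# rank the reconstruction is exact (up to floating point).
assert all(a >= b - 1e-9 for a, b in zip(errors, errors[1:]))
assert errors[-1] < 1e-6
```

The same formula applies to a DWT-based compression by swapping the reconstruction step, which is how the two curves in Figure 6 are compared.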
However, since the indices of the retained coefficients need to be stored, keeping the largest coefficients requires additional indexing structures beyond the R-tree family. In addition, the distance computation between two coefficient sets becomes expensive, since the coefficient sets do not align with each other. Furthermore, a conventional DWT-based approach lacks the capability of dealing with unevenly spaced sampling time points. For instance, using cDNA microarrays, Iyer et al. [16] reported the physiological response of fibroblasts to serum at 12 time points for 8,613 genes over 24 hours. The sampling times are at 0, 0.15, 0.3, 1, 2, 4, 6, 8, 12, 16, 20, and 24 hours after serum stimulation. In this situation, unless we rely on interpolation or lifting [24], the DWT is not directly applicable to datasets with unevenly spaced sampling time points.

5. RELATED WORK
In this section, we briefly review previous gene expression clustering approaches. Note that this section should not be considered a comprehensive survey of all published gene expression clustering algorithms; it only aims to provide a concise overview of algorithms that are directly related to our approach. Jiang et al. [18] provide a comprehensive review of gene expression clustering; for details, refer to that paper.

Partition-based clustering decomposes a collection of genes into a partition that is optimal with respect to some pre-defined criterion; center-based approaches [11, 26] are representative. Center-based algorithms find clusters by partitioning the entire dataset into a pre-determined number of clusters [7, 11, 26]. Although center-based clustering algorithms have been widely used in gene expression clustering, they have the following drawbacks. First, the algorithms are sensitive to the initial seed selection; depending on the initial points, they are susceptible to local optima. Second, as discussed in Section 3.6, they are sensitive to noise. Third, the number of clusters must be determined beforehand.
Hierarchical agglomerative clustering (HAC) finds clusters by initially assigning each gene to its own cluster and then repeatedly merging pairs of clusters until a certain stopping condition is met [7, 9, 22]. Thus, its result is in the form of a tree, referred to as a dendrogram. The advantage of HAC lies in its ability to provide a view of the data at multiple levels of abstraction. However, a user must decide where to cut the dendrogram to produce actual clusters. This step is usually done by human visual inspection, which is a time-consuming and subjective process. Moreover, the computational complexity of HAC is high: HAC takes O(n³) if pairwise similarities between clusters are recomputed when two clusters are merged, although the complexity can be reduced to O(n² log n) if a priority queue is utilized.

The graph-based approach [29] utilizes graph algorithms (e.g., minimum spanning tree or minimum cut) to partition a graph into connected subgraphs. However, due to the points in transition regions, this approach may end up with a single highly connected set of genes.

To better explain time-course gene expression datasets, new models have been proposed to capture the relationships between time points [1]. However, they assume that the data fit a certain distribution, which does not hold in gene expression datasets, as discussed in Yeung et al. [32].

As we have discussed, our work is motivated by previous density-based clustering approaches such as DBSCAN [10] or SNN [8]. However, since these approaches utilize a notion of connectivity to build a cluster, they might not be suitable for datasets with moderately dense transition regions. In addition, both approaches need user-defined parameters (e.g., MinPts), which are difficult to determine in advance. Recently, density-based clustering algorithms have been applied to gene expression datasets [17, 18].
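The HAC procedure and the priority-queue speedup mentioned above can be sketched as follows. This is a minimal single-linkage variant with lazy deletion of stale heap entries; the function name and toy points are illustrative assumptions, not the paper's implementation.

```python
import heapq
import numpy as np

def hac(points, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    pts = [np.asarray(p, float) for p in points]
    clusters = {i: [i] for i in range(len(pts))}
    # Heap of (distance, cluster_id_a, cluster_id_b) over all initial pairs.
    heap = [(np.linalg.norm(pts[i] - pts[j]), i, j)
            for i in range(len(pts)) for j in range(i + 1, len(pts))]
    heapq.heapify(heap)
    while len(clusters) > n_clusters and heap:
        _, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters:
            continue                      # stale pair: a member was merged away
        clusters[i].extend(clusters.pop(j))
        # Push fresh single-linkage distances from the merged cluster.
        for k in clusters:
            if k != i:
                d = min(np.linalg.norm(pts[a] - pts[b])
                        for a in clusters[i] for b in clusters[k])
                heapq.heappush(heap, (d, min(i, k), max(i, k)))
    return sorted(sorted(c) for c in clusters.values())

print(hac([[0, 0], [0, 1], [10, 10], [10, 11]], 2))  # [[0, 1], [2, 3]]
```

Stopping at a fixed number of clusters stands in for the dendrogram cut that the text notes is usually chosen by visual inspection.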
Both approaches are promising in that a meaningful hierarchical cluster structure (rather than a dendrogram) can be built. However, these approaches have drawbacks in that MinPts must be determined beforehand [17], or they are computationally expensive due to kernel density estimation [18].

6. CONCLUSION AND FUTURE WORK
We presented a mining framework that is vital to microarray data analysis. An experimental prototype system has been developed, implemented, and tested to demonstrate the effectiveness of the proposed model. In order to identify co-expressed genes in a yeast cell cycle dataset, we developed a clustering algorithm based on KNN density estimation. For an efficient k-nearest neighbor search, we also explored different dimensionality reduction methods that are relevant for gene expression data.

We intend to extend this work in the following three directions. First, besides evaluating our approach against K-means clustering, we plan to use other gene expression clustering methods for a comprehensive comparison. Second, as discussed by Shatkay et al. [23], two genes with strong anti-correlation in their expression levels may be functionally related to each other; that is, a gene may be strongly suppressed to allow another gene to be expressed. Since such anti-correlated genes may be involved in the same biological pathway, rather than separating those genes into different clusters, it is essential to detect anti-correlated gene clusters and investigate the functional similarity between those clusters. Finally, in order to interpret the obtained gene clusters, external knowledge needs to be incorporated. Toward this end, we plan to explore the relationships between clusters and known biological knowledge by utilizing gene ontologies (e.g., Gene Ontology or MIPS) or published biomedical literature (e.g., PubMed).
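The anti-correlation idea from the future-work discussion can be sketched as a search for gene pairs with strongly negative Pearson correlation; the threshold of −0.9 and the toy sine/cosine profiles are assumptions for illustration.

```python
import numpy as np

def anti_correlated_pairs(profiles, threshold=-0.9):
    """Return index pairs of rows whose Pearson correlation <= threshold."""
    corr = np.corrcoef(profiles)
    n = corr.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if corr[i, j] <= threshold]

# Toy expression profiles over 16 time points: gene 1 mirrors gene 0,
# gene 2 is phase-shifted (uncorrelated with both).
t = np.linspace(0, 2 * np.pi, 16)
profiles = np.vstack([np.sin(t), -np.sin(t), np.cos(t)])
print(anti_correlated_pairs(profiles))  # [(0, 1)]
```

Such pairs could then be kept together (or cross-linked) rather than split into separate clusters, as the text suggests.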
7. ACKNOWLEDGMENTS
This research has been funded in part by the Integrated Media Systems Center, a National Science Foundation Engineering Research Center, Cooperative Agreement No. EEC-9529152.

8. REFERENCES
[1] Z. Bar-Joseph et al. A new approach to analyzing gene expression time series data. In Proceedings of the Annual Conference on Research in Computational Molecular Biology, 2002.
[2] N. Beckmann, H. P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. ACM SIGMOD Record, 19(2):322-331, 1990.
[3] M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573-595, 1995.
[4] S. Berchtold, D. A. Keim, and H. P. Kriegel. The X-tree: an index structure for high-dimensional data. In Proceedings of the 22nd International Conference on Very Large Data Bases, 1996.
[5] K. Chan and A. W. Fu. Efficient time series matching by wavelets. In Proceedings of the IEEE International Conference on Data Engineering, 1999.
[6] C. H. Q. Ding, X. He, H. Zha, and H. D. Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the 2002 IEEE International Conference on Data Mining, 2002.
[7] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd Ed.). Wiley, New York, 2001.
[8] L. Ertoz, M. Steinbach, and V. Kumar. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.
[9] M. B. Eisen et al. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863-14868, 1998.
[10] M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1996.
[11] A. Gasch and M. Eisen.
Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology, 3(11):1-22, 2002.
[12] G. H. Golub et al. Matrix Computations. North Oxford Academic, Oxford, UK, 1996.
[13] T. R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(15):531-537, 1999.
[14] J. Herrero et al. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17(2):126-136, 2001.
[15] D. Horn and I. Axel. Novel clustering algorithm for microarray expression data in a truncated SVD space. Bioinformatics, 19(9):1110-1115, 2003.
[16] V. R. Iyer et al. The transcriptional program in the response of human fibroblasts to serum. Science, 283(5398):83-87, 1999.
[17] D. Jiang, J. Pei, and A. Zhang. DHC: a density-based hierarchical clustering method for time series gene expression data. In Proceedings of the 3rd IEEE International Symposium on BioInformatics and BioEngineering, 2003.
[18] D. Jiang, J. Pei, and A. Zhang. Towards interactive exploration of gene expression patterns. ACM SIGKDD Explorations, 6(1):79-90, 2004.
[19] N. Katayama and S. Satoh. The SR-tree: an index structure for high-dimensional nearest neighbor queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1997.
[20] F. Korn, H. V. Jagadish, and C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1997.
[21] S. Morishita, T. Hishiki, and K. Okubo. Towards mining gene expression database. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 1999.
[22] P. T. Spellman et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9(12):3273-3297, 1998.
[23] H. Shatkay, S.
Edwards, and M. Boguski. Information retrieval meets gene analysis. IEEE Intelligent Systems, 17(2):45-53, 2002.
[24] W. Sweldens and P. Schroder. Building your own wavelets at home. Wavelets in Computer Graphics, ACM SIGGRAPH Course Notes, 1996.
[25] P. Tamayo et al. Interpreting patterns of gene expression with self-organizing maps. Proceedings of the National Academy of Sciences, 96(6):2907-2912, 1999.
[26] S. Tavazoie et al. Systematic determination of genetic network architecture. Nature Genetics, 22(3):281-285, 1999.
[27] H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2002.
[28] R. Weber, H. J. Schek, and S. Blott. Quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases, 1998.
[29] Y. Xu, V. Olman, and D. Xu. Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics, 18(4):536-545, 2002.
[30] J. Yang, H. Wang, W. Wang, and P. S. Yu. Enhanced biclustering on expression data. In Proceedings of the IEEE International Symposium on BioInformatics and BioEngineering, 2003.
[31] K. Y. Yeung and W. Ruzzo. An empirical study on principal component analysis for clustering gene expression data. Bioinformatics, 17(9):763-774, 2001.
[32] K. Y. Yeung et al. Model-based clustering and data transformations for gene expression data. Bioinformatics, 17(10):977-987, 2001.