A Hybrid Clustering Method for Gene Expression Data

Baoying Wang, William Perrizo
Computer Science Department, North Dakota State University, Fargo, ND 58105
Tel: (701) 231-6257, Fax: (701) 231-8255
{baoying.wang, william.perrizo}@ndsu.nodak.edu

Abstract. Data clustering has proven to be a successful data mining technique for the analysis of gene expression data. However, some concerns and challenges still remain in gene expression clustering. For example, many traditional clustering methods that originated in non-biological fields may not work well if the model is not sufficient to capture the genuine clusters among noisy gene expression data. In this paper, we propose an efficient hybrid clustering method using attractor trees based on both density factors and similarity factors. The combination of the density-based and similarity-based approaches accommodates clusters with diverse shapes, densities, and sizes, and is capable of dealing with noise. A vertical data structure, the P-tree¹, is used to make the clustering process more efficient by accelerating the calculation of density functions using P-tree based neighborhood rings. Experiments on common gene expression datasets demonstrate that our approach is more efficient and scalable, with competitive accuracy.

Keywords: gene expression data, clustering, P-trees, microarray.

1 INTRODUCTION

Clustering in data mining is a discovery process that partitions a data set into groups such that the data points in the same group are more similar to each other than to the data points in other groups. Clustering analysis of microarray gene expression data, which discovers groups that are homogeneous and well separated, has been recognized as an effective method for gene expression analysis. Eisen et al. first applied a hierarchical linkage clustering approach that groups closest pairs into a hierarchy of nested subsets based on similarity [7]. Golub et al. have also successfully discovered tumor classes based on the simultaneous expression profiles of thousands of genes from acute leukemia patients' test samples, using a self-organizing maps clustering approach [8]. Some other clustering approaches, such as k-means [21], fuzzy k-means [1], and CAST [3], have also proven to be valuable clustering methods for gene expression data analysis.

¹ Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308.

However, some concerns and challenges still remain in gene expression clustering. For example, many traditional clustering methods that originated in non-biological fields may not work well if the model is not sufficient to capture the genuine clusters among noisy data. Partitioning clustering, whether distance-based or density-based, depends on input parameters: distance-based clustering needs the number of clusters to be predetermined, while density-based approaches depend on density-related parameters. Hierarchical clustering is more flexible than partitioning clustering; it provides a nested series of partitions instead of a single partition. However, hierarchical clustering is computationally expensive for large data sets.

In this paper, we propose an efficient agglomerative hybrid clustering method: Clustering using Attractor trees and Merging Process (CAMP). The contribution of CAMP is that it combines the features of both density-based and distance-based clustering, which accommodates clusters of various kinds and is capable of dealing with noisy data. A vertical data structure, the P-tree, is used to make the algorithm more efficient by accelerating the calculation of the density function. P-trees are also used as bit indexes to clusters. In the merging process, only summary information of the attractor trees is used to find the most similar cluster pair. When two clusters are to be merged, only their P-tree indexes are retrieved to perform the merging.
This paper is organized as follows. Section 2 gives an overview of related work. Section 3 presents our new clustering method, CAMP. Section 4 discusses the implementation of CAMP using P-trees. Section 5 presents discussions on noise handling and the determination of the optimal cutting level. An experimental performance study is described in Section 6. Finally, we conclude the paper in Section 7.

2 RELATED WORK

Generally, clustering techniques can be categorized in many ways [9][13][4]. The categorization shown in Figure 1 is based on the structure of the clusters.

Figure 1. Categorization of clustering: hierarchical (agglomerative/bottom-up, divisive/top-down), partitioning (distance-based, density-based), and hybrid.

Clustering can be subdivided into partitioning clustering, hierarchical clustering, and hybrid clustering. A hierarchical clustering is a nested sequence of partitions, whereas a partitioning clustering is a single partition. Hybrid clustering, as the name indicates, combines the features of both hierarchical and partitioning clustering. Hierarchical clustering methods can be further classified into agglomerative and divisive, depending on whether the hierarchical decomposition is accomplished in a bottom-up or a top-down fashion. Partitioning clustering consists of two approaches, distance-based and density-based, according to the similarity measure.

2.1 Partitioning Methods

Partitioning clustering methods generate a partition of the data in an attempt to recover natural groups present in the data. Partitioning clustering can be further subdivided into distance-based and density-based partitioning. A distance-based partitioning method breaks a data set into k subsets, or clusters, such that data points in the same cluster are more similar to each other than to data points in other clusters.
The most classical distance-based partitioning methods are k-means [10] and k-medoid, where each cluster has a gravity center. The time complexity of k-means is O(n), since each iteration is O(n) and only a constant number of iterations is computed. However, there are several problems with distance-based partitioning methods: (1) the number of clusters k is an input parameter and needs to be predetermined; (2) the methods are only suitable for clusters with spherical shapes; (3) they are not good for clusters that differ greatly in size; and (4) they are not robust to the selection of the initial partition and may converge to a local minimum of the criterion function if the initial partition is not properly chosen.

Density-based partitioning has been recognized as a powerful approach for discovering arbitrary-shape clusters. In density-based clustering, clusters are dense areas of points in the data space that are separated by areas of low density (noise). A cluster is regarded as a connected dense area of data points, which grows in any direction that density leads. Density-based clustering can usually discover clusters with arbitrary shapes without predetermining the number of clusters. However, density-based clustering is very sensitive to its input parameters. Figure 2 shows that the clustering results are sensitive to the density threshold [12]: the higher the density threshold, the fewer points fall into clusters and the more points become noise.

Figure 2. Clustering is sensitive to the density threshold: (a) high density threshold; (b) low density threshold.

2.2 Hierarchical Clustering

Hierarchical algorithms create a hierarchical decomposition of a data set X. The decomposition is represented by a dendrogram, a tree that iteratively splits X into smaller subsets until each subset consists of only one object. In such a hierarchy, each level of the tree represents a clustering of X.
Figure 3 shows the hierarchical decomposition process and the corresponding dendrogram.

Figure 3. Hierarchical decomposition (a) and the dendrogram (b).

Hierarchical clustering methods are subdivided into agglomerative (bottom-up) and divisive (top-down) approaches [9]. An agglomerative approach begins with each point in a distinct cluster and successively merges clusters together until a stopping criterion is satisfied. A divisive method begins with all points in a single cluster and performs splitting until a stopping criterion is met. Besides the algorithm itself, there are several ways to compute cluster similarity. Most hierarchical clustering algorithms are variations of the single-link and complete-link approaches. In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of points from the two clusters. In the complete-link method, the distance between two clusters is the maximum of all pairwise distances between points in the two clusters. In either case, two clusters are merged to form a larger cluster based on the minimum-distance (or maximum-similarity) criterion. The complete-link algorithm produces tightly bound, compact clusters, while the single-link algorithm suffers when there is a chain of noise points between two clusters. Figure 4 illustrates the different clustering results of the single-link and complete-link algorithms in the case of a noise chain [14]. Note that single-link clustering produces a skewed result, while the complete-link algorithm can still produce a correct result.

Figure 4. Chain effect on clustering results: (a) single-link; (b) complete-link.

In summary, hierarchical algorithms are more flexible than partitioning algorithms and do not need input parameters from users.
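As a minimal illustration (not code from the paper), the single-link and complete-link cluster distances described above can be computed by taking the minimum or maximum over all cross-cluster point pairs:

```python
import itertools

def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(c1, c2):
    """Single-link: minimum distance over all cross-cluster point pairs."""
    return min(euclidean(p, q) for p, q in itertools.product(c1, c2))

def complete_link(c1, c2):
    """Complete-link: maximum distance over all cross-cluster point pairs."""
    return max(euclidean(p, q) for p, q in itertools.product(c1, c2))

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (4.0, 0.0)]
print(single_link(a, b))    # 2.0, from (1,0) to (3,0)
print(complete_link(a, b))  # 4.0, from (0,0) to (4,0)
```

A single noise point midway between the two clusters would pull `single_link` toward zero, which is exactly the chaining effect shown in Figure 4.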
However, the computational complexities of hierarchical algorithms are typically higher than those of partitioning algorithms: single-link clustering is O(n²) and complete-link is O(n³), while k-means is only O(n). Hence, many hybrid algorithms have been developed to exploit the good features of both hierarchical and partitioning clustering.

2.3 Hybrid Clustering

Several hybrid clustering methods have been proposed to combine the features of hierarchical and partitioning clustering algorithms. In general, these algorithms first partition the data set into preliminary clusters and then construct a hierarchical structure upon these sub-clusters based on some similarity measure. Figure 5 shows a data set that is first partitioned into 15 sub-clusters, which are then merged into two clusters [16].

Figure 5. Hybrid clustering process: (a) obtain sub-clusters; (b) merge sub-clusters into clusters.

The early hybrid algorithm was developed by combining k-means and a hierarchical method [18]. This algorithm first partitions the data set into several groups and then performs k-means on each partition to obtain several sub-clusters. A hierarchical method is then used to build up levels using the centroids of the sub-clusters in the previous level. This process continues until exactly k clusters are formed. Finally, the algorithm reassigns all points of each sub-cluster to the cluster of their centroids. In this method, the dissimilarity between two clusters is defined as the distance between their centroids.

BIRCH is one of the most efficient clustering algorithms [22]. The algorithm performs a linear scan of all data points, and the cluster summaries are stored in memory in a data structure called a CF-tree. A non-leaf node represents a cluster consisting of all the sub-clusters represented by its entries.
BIRCH first partitions the data set into many small sub-clusters and then applies a global clustering algorithm on those sub-clusters to achieve the final result. The main contribution of BIRCH is as an efficient data preprocessor for a large input data set, so that the global clustering algorithm can be executed efficiently.

CHAMELEON operates on a k-nearest-neighbor graph [15]. The algorithm consists of three basic steps: (1) construct a k-nearest-neighbor graph; (2) partition the graph into many small sub-clusters; and (3) merge those sub-clusters to get the final clustering result. CHAMELEON has been found to be very effective in clustering isotropic shapes. However, the algorithm cannot handle outliers and needs parameter tuning to work effectively. The time complexity of building a k-nearest-neighbor graph of a high-dimensional data set is as high as O(d·n²), which makes CHAMELEON infeasible for large data sets.

2.4 Clustering Methods for Gene Expression Data

Many newly developed clustering methods [3][7][5][11] are dedicated to gene expression data. These clustering algorithms partition genes into groups of co-expressed genes. Eisen et al. [7] adopted a hierarchical approach using UPGMA (Unweighted Pair Group Method with Arithmetic Mean) to group the closest gene pairs. This method displays the clustering results in a colored graph pattern: the gene expression data is colored according to the measured fluorescence ratio, and genes are re-ordered based on the hierarchical dendrogram structure.

Ben-Dor et al. [3] proposed a graph-based algorithm, CAST (Cluster Affinity Search Technique), to improve gene clustering accuracy. Two points are linked in the graph if they are similar. The problem of clustering a set of genes is then converted into a classical graph-theoretical problem.
CAST takes as input a parameter called the affinity threshold t, where 0 < t < 1, and tries to guarantee that the average similarity within each generated cluster is higher than t. However, this method has a high complexity. Hartuv et al. [11] presented a polynomial algorithm, HCS (Highly Connected Subgraphs). HCS recursively splits the weighted graph into a set of highly connected subgraphs along the minimum cut; each highly connected subgraph is called a cluster. Later, the same research group developed another algorithm, CLICK (CLuster Identification via Connectivity Kernels) [20]. CLICK builds a statistical framework to measure the coherence within a subset of genes and determines the criterion for stopping the recursive splitting process.

3 HYBRID CLUSTERING USING ATTRACTOR TREES

In this section, we present CAMP, an efficient hybrid agglomerative clustering method using attractor trees and a merging process. CAMP consists of two processes: (1) Clustering using Local Attractor Trees (CLAT) and (2) a cluster Merging Process based on similarity (MP). The final clustering results consist of an attractor tree and a set of P-tree indexes to the clusters corresponding to each level of the attractor tree. The attractor tree is composed of leaf nodes, which are the local attractors constructed in the CLAT process, and interior nodes, which are virtual attractors resulting from the MP process. Figure 6 shows an example of an attractor tree.

Figure 6. The attractor tree, with local attractors as leaves and virtual attractors as interior nodes.

The data set is first grouped into local attractor trees by a density-based approach in the CLAT process. Each local attractor tree represents a preliminary cluster, the root of which is a density attractor of the cluster. The small clusters are then merged level-by-level in the MP process based on cluster similarity.
3.1 Density Function

Given a data point x in a data space X, the density function of x is defined as the sum of the influence functions of all data points in X on x. There are many ways to calculate the influence function. In general, the influence of a data point on x is inversely proportional to its distance from x. If we divide the neighborhood of x into neighborhood rings, then points within inner rings have more influence on x than those in outer rings. We define the neighborhood ring as follows:

Definition 1. The Neighborhood Ring of a data point c with radii r1 and r2 is defined as the set R(c, r1, r2) = {x ∈ X | r1 < |x − c| ≤ r2}, where |x − c| is the distance between x and c. The number of neighbors falling in R(c, r1, r2) is denoted ||R(c, r1, r2)||.

Definition 2. The Equal Interval Neighborhood Ring (EINring) of a data point c with radii r1 = kδ and r2 = (k+1)δ is defined as the k-th equal interval neighborhood ring EINring(c, k, δ) = R(c, kδ, (k+1)δ), where δ is a constant interval. Figure 7 shows 2-D EINrings with k = 1, 2, and 3. The number of neighbors falling in the k-th EINring is denoted ||EINring(c, k, δ)||.

Figure 7. Diagram of EINrings.

Let y be a data point within the k-th EINring of x. The EINring-based influence function of y on x is defined as:

f_k(x, y) = 1/k,  k = 1, 2, ..., n   (1)

The density function of x is defined as the summation of the influence over every EINring neighborhood of x, i.e.

DF(x) = Σ_k f_k(x) · ||EINring(x, k, δ)|| = Σ_k (1/k) · ||EINring(x, k, δ)||   (2)

3.2 Clustering by Local Attractor Trees

The basic idea of clustering by local attractor trees (CLAT) is to partition the data set into clusters in terms of local density attractor trees. Given a data point x, if we follow the steepest density-ascending path, the path will finally lead to a local density attractor. All points whose steepest ascending paths lead to the same local attractor form a local attractor tree. If x does not have such a path, it is either a local attractor or a noise point.
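As a minimal sketch (not the paper's implementation), the EINring density of equations (1)-(2) and the steepest-ascent step can be written as follows. One assumption is made where the indexing is ambiguous: the innermost shell (0, δ] is treated as ring k = 1, so the nearest neighbors carry the largest weight.

```python
import math
from collections import Counter

def einring_density(points, x, delta, n_rings=10):
    """DF(x) from eqs. (1)-(2): each neighbor in the k-th ring contributes 1/k.
    Assumption: the innermost shell (0, delta] is ring k = 1."""
    counts = Counter()
    for y in points:
        d = math.dist(x, y)
        if d == 0:
            continue
        k = math.ceil(d / delta)  # ring index: (k-1)*delta < d <= k*delta
        if k <= n_rings:
            counts[k] += 1
    return sum(c / k for k, c in counts.items())

def local_attractor(points, x, delta):
    """Follow the steepest density-ascending path from x (Section 3.2):
    jump to the densest neighbor within delta until none is denser."""
    cur = x
    while True:
        nbrs = [y for y in points if 0 < math.dist(cur, y) <= delta]
        if not nbrs:
            return cur
        best = max(nbrs, key=lambda y: einring_density(points, y, delta))
        if einring_density(points, best, delta) <= einring_density(points, cur, delta):
            return cur
        cur = best
```

All points whose ascent ends at the same attractor would then form one local attractor tree, i.e. one preliminary cluster.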
The local attractor trees are the preliminary clusters. The result of the CLAT process is a collection of local attractor trees with local attractors as the roots. Given the neighborhood ring interval δ, CLAT proceeds as follows:

1. Compute the density function for each point.
2. For an arbitrary point x, find the point y with the highest density within its neighborhood R(x, 0, δ). If the density of y is higher than the density of x, build a link between x and y.
3. If the density of y is lower than the density of x, assign x a new cluster label (x can be an attractor or a noise point).
4. Go back to step 2 with the next point.
5. Finally, the data points in each attractor tree are assigned the same cluster label as their attractor.

CLAT produces a set of local attractor trees. Some local attractor trees may contain only the root, in the case of noise points.

3.3 Similarity between Clusters

There are many cluster similarity measures. As discussed in Section 2.2, the most popular similarity measures in hierarchical clustering are complete-link and single-link similarity. Complete-link similarity is measured by the maximum distance between two clusters, while single-link similarity is measured by the minimum distance. However, these traditional similarity measures are only suitable for clusters with similar densities. For example, they can distinguish the two pairs in Figure 8 (a), but will not distinguish the two pairs in Figure 8 (b). In fact, the left pair of clusters in Figure 8 (b) is relatively closer than the pair on the right.

Figure 8. Cluster similarity: (a) two pairs of clusters with similar densities; (b) two pairs of clusters with different densities.

Therefore, we consider relative closeness in developing our cluster similarity measure.
We define the similarity between cluster i and cluster j as follows:

CS(i, j) = (Vi + Vj) / d(Ai, Aj)   (3)

where d(Ai, Aj) is the distance between the attractors Ai and Aj, and Vi is the average variation between the points in the i-th attractor tree and its attractor Ai, calculated as:

Vi = Σ_{x∈Ci} (x − Ai)² / ||Ci||   (4)

where Ci is the cluster represented by the i-th attractor tree and ||Ci|| is the size of Ci.

3.4 Cluster Merging Process

After the local attractor trees (preliminary clusters) are built in the CLAT process, the cluster merging process (MP) combines the most similar cluster pairs level-by-level based on the similarity measure defined above. When two clusters are merged, the two local attractor trees are combined into a new tree, called a virtual attractor tree. It is called "virtual" because the new root is not an existing point; it is only a virtual attractor which could attract all points of the two sub-trees. The merging process, shown in Figure 9, is applied recursively to combine (virtual) attractor trees.

Figure 9. Cluster merging process: (a) before merging; (b) after merging.

After merging, we need to compute the attractor Av of the new virtual attractor tree. Taking two clusters Ci and Cj as an example, and assuming the size of Cj is greater than or equal to that of Ci, i.e. ||Cj|| ≥ ||Ci||, each attribute of Av is the size-weighted mean of the corresponding attributes of Ai and Aj:

Avl = (||Ci||·Ail + ||Cj||·Ajl) / (||Ci|| + ||Cj||),  l = 1, 2, ..., d   (5)

where Ail is the l-th attribute of the attractor Ai and ||Ci|| is the size of cluster Ci.

4 IMPLEMENTATION OF CAMP USING P-TREES

CAMP is implemented using the data-mining-ready vertical bitwise data structure, the P-tree, to make the clustering process much more efficient and scalable. The P-tree technology was initially developed by the DataSURG research group for spatial data [19][6]. In this section, we first briefly discuss the representation of a gene dataset in P-tree structures and the computation of P-tree based neighborhoods.
Then we detail the implementation of CAMP using P-trees.

4.1 Data Representation

Given a gene table G = (E1, E2, ..., Ed), with the binary representation of the j-th attribute Ej written as bj,m bj,m-1 ... bj,i ... bj,1 bj,0, the table is projected into columns, one for each attribute. Each attribute column is then further decomposed into separate bit vectors, one for each bit position of the values in that attribute. Figure 10 shows an example gene table with three attributes, and Figure 11 shows the decomposition of the table into a set of bit vectors.

Figure 10. An example gene table G(E1, E2, E3), with rows (5,2,7), (2,3,2), (7,2,2), (7,2,5), (2,5,5), (4,7,1), (3,2,1), (1,3,4).

Figure 11. Decomposition of the gene table: each attribute column in binary (e.g., E1 = 101, 010, 111, 111, 010, 100, 011, 001) is split into one bit vector per bit position (E13, E12, E11, and so on).

After the decomposition process, each bit vector is converted into a P-tree. A P-tree is built by recording the truth of the predicate "purely 1-bits" recursively on halves of the bit vector until purity is reached. Three example P-trees, for the bit vectors E23, E22, and E21, are illustrated in Figure 12.

Figure 12. P-trees of bit vectors E23, E22, and E21.

The P-tree logical operations are pruned bit-by-bit operations performed level-by-level starting from the root: for instance, ANDing a pure-0 node with any node results in a pure-0 node, and ORing a pure-1 node with any node results in a pure-1 node. A detailed description of the P-tree logical operations is given in [6].

4.2 P-tree Based Neighborhood Computation

The major computational cost of CAMP lies in the computation of densities. To improve the efficiency of density computation, we adopt P-tree based neighborhood computation by means of the optimized P-tree operations.
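As an illustrative sketch of the vertical decomposition in Section 4.1 (using plain Python lists rather than real compressed P-trees):

```python
def decompose(column, width):
    """Vertically decompose one attribute column into bit vectors, one per
    bit position: vectors[i][r] is bit i of the value in row r (Section 4.1).
    A real P-tree would then compress each bit vector; plain lists are used
    here purely for illustration."""
    return {i: [(v >> i) & 1 for v in column] for i in range(width)}

# Attribute E2 of the example gene table (Figures 10-11)
E2 = [2, 3, 2, 2, 5, 7, 2, 3]
bits = decompose(E2, 3)
print(bits[2])  # high-order bit vector E23: [0, 0, 0, 0, 1, 1, 0, 0]
```

Each resulting bit vector would be compressed into a P-tree by recursively testing halves for purity, as described above.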
In this section, we first review the P-tree predicate operations and then present the P-tree based neighborhood computation.

P-tree predicate operations: Let A be the j-th dimension of the data set X, let m be its bit-width, and let Pm, Pm-1, ..., P0 be the P-trees for the vertical bit files of A. Let c = bm...bi...b0, where bi is the i-th binary bit of c, and let PA>c and PA≤c be the P-trees representing the data points satisfying the predicates A > c and A ≤ c, respectively. Then we have

PA>c = Pm opm Pm-1 opm-1 ... opk+1 Pk,  k ≤ i ≤ m   (6)

where opi is ∧ (AND) if bi = 1 and ∨ (OR) otherwise, and

PA≤c = P′m opm P′m-1 opm-1 ... opk+1 P′k,  k ≤ i ≤ m   (7)

where opi is ∨ (OR) if bi = 0 and ∧ (AND) otherwise. In the equations above, k is the rightmost bit position with value 0, P′ denotes the complement of P, and the operators are right binding.

Calculation of the neighborhood: Let Pc,r be the P-tree representing the data points within the neighborhood R(c, 0, r) = {x ∈ X | 0 < |c − x| ≤ r}. Note that Pc,r is just the P-tree representing the data points satisfying the predicate c − r < x ≤ c + r. Therefore

Pc,r = Pc-r<x≤c+r = Px>c-r ∧ Px≤c+r   (8)

where Px>c-r and Px≤c+r are calculated by means of the P-tree predicate operations above.

Calculation of the EINring neighborhood: EINring(c, k, δ) = {x ∈ X | kδ < |c − x| ≤ (k+1)δ} is the intersection of R(c, 0, (k+1)δ) with the complement of R(c, 0, kδ). Hence its P-tree, denoted Pc,k,δ, is

Pc,k,δ = Pc,(k+1)δ ∧ P′c,kδ   (9)

where Pc,(k+1)δ and Pc,kδ are the neighborhood P-trees of radii (k+1)δ and kδ from (8), and P′c,kδ is the complement of Pc,kδ. The count of 1's in Pc,k,δ, denoted ||Pc,k,δ||, is the number of data points within the EINring neighborhood, i.e. ||EINring(c, k, δ)|| = ||Pc,k,δ||. Each 1 in Pc,k,δ indicates a specific neighbor point.

4.3 Implementation of CAMP Using P-trees

The critical implementation steps in CAMP are the computation of the density function and the similarity function, and the manipulation of the (virtual) attractor trees during the clustering process.
We discuss these steps as follows.

Computation of the density function (in the CLAT process): The neighborhood P-trees and the EINring P-trees are computed exactly as in equations (8) and (9) of Section 4.2. According to equations (2) and (9), the density function is then calculated using P-trees as

DF(x) = Σ_k f_k(x) · ||Px,k,δ|| = Σ_k (1/k) · ||Px,k,δ||   (12)

Calculation of cluster similarity: In equation (3), we need the average variation of each cluster, Vi and Vj, and the distance between the two attractors, d(Ai, Aj). Vi is calculated by equation (4), which can likewise be implemented very efficiently using P-trees.

Structure of a (virtual) attractor tree: An attractor tree consists of two parts: (1) a collection of summary data, such as the size of the tree, the attractor, and the average variation; and (2) a P-tree used as an index to the points in the attractor tree. As an example of an index P-tree, assume the data set size is 8, and the first four points and the sixth point of the data set are in an attractor tree. The corresponding bit index is 11110100; its P-tree is shown in Figure 13.

Figure 13. The P-tree for an attractor tree with bit index 11110100.
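The index structure and the merge step (the OR of two indexes, plus the virtual attractor of equation (5)) can be sketched as follows. This is illustrative only: a plain Python integer stands in for the compressed P-tree bit index, and the class and function names are our own.

```python
class AttractorTree:
    """Summary data plus a bit index over the data set (Section 4.3).
    A plain integer stands in for the compressed P-tree index; written
    MSB-first, 0b11110100 mirrors the paper's example of points 1-4 and 6."""
    def __init__(self, index_bits, attractor, variation):
        self.index = index_bits      # bitmask of member points
        self.attractor = attractor   # tuple of d attribute values
        self.variation = variation   # average variation V of the cluster

    def size(self):
        return bin(self.index).count("1")  # ||C||: count of 1s in the index

def merge(ti, tj):
    """MP step: OR the two bit indexes (Pv = Pi OR Pj) and place the new
    virtual attractor at the size-weighted mean of Ai and Aj (eq. 5)."""
    ni, nj = ti.size(), tj.size()
    av = tuple((ni * a + nj * b) / (ni + nj)
               for a, b in zip(ti.attractor, tj.attractor))
    # The paper does not specify how V combines on merge; max is a placeholder.
    return AttractorTree(ti.index | tj.index, av, max(ti.variation, tj.variation))
```

Because only the integer OR and the small summary tuple are touched, a merge never rescans the member points themselves, which is the efficiency argument made for P-tree indexes in the text.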
Creating an attractor tree (in the CLAT process): When a steepest ascending path (SAP) from a point stops at a new local maximum, we create a new attractor tree; the stop point is the attractor.

Updating an attractor tree (in the CLAT process): If the SAP encounters a point already in an attractor tree, the whole SAP is inserted into that attractor tree; the attractor does not change. As a result, the attractor tree's index needs to be updated by a P-tree OR operation:

Pnew = Pold ∨ Psap   (13)

where Pnew is the P-tree of the new attractor tree, Pold is that of the old attractor tree, Psap represents the points in the SAP, and ∨ is the OR operation.

Merging attractor trees (in the MP process): When two attractor trees are combined into a new virtual attractor tree, the new P-tree is formed simply by ORing the two old P-trees, i.e. Pv = Pi ∨ Pj. The new attractor is calculated by equation (5).

5 DISCUSSION

In this section, we discuss noise handling and the location of the optimal cutting level in the hierarchical structure, i.e. the level at which the clustering results are best.

5.1 Delayed Noise Handling Process

In the case of noisy data, it is important to have a proper noise handling process. Naively, the points which stand alone after the CLAT process should be noise. However, some sparse clusters might be mistakenly eliminated as noise if noise is handled at this stage. Therefore, we delay the noise handling process until a later stage. The neighborhoods of noise points are generally sparser than those of points in clusters, so in the cluster merging process noise points have much less chance of merging with other points. The cluster merging process is therefore tracked to capture clusters which are growing slowly. We mainly check the case when a large cluster is merged with a very small cluster: if the large cluster did not grow within a certain number of previous iterations, we stop and eliminate the small cluster as noise. Figure 14 shows a slow-growing cluster merged from a large cluster and a small cluster.
In this case, the small cluster is eliminated as noise.

Figure 14. Noise handling: a small cluster merging into a large, slowly growing cluster is eliminated as noise.

5.2 Determination of the Optimal Cutting Level

Hierarchical clustering generates a nested sequence of clusters ordered level by level. But which level should the user pick for the best results? In this section, we introduce a way to locate the optimal cutting level. The method is a modification of one of the traditional cluster validation approaches discussed in [13], the Davies-Bouldin index. Given a cluster partition {C1, C2, ..., Ck}, we define the relative similarity between two clusters Ci and Cj as

RSi,j = (Ei + Ej) / d(mi, mj)²   (14)

where d(mi, mj) is the distance between mi and mj, the means of clusters i and j, and Ei is the average square distance from the points in the i-th cluster to the mean of that cluster:

Ei = (1/ni) Σ_{x∈Ci} (x − mi)²   (15)

where ni is the number of points in Ci. With RSi,j, we obtain the maximum relative similarity between cluster i and every other cluster, denoted MRSi:

MRSi = max_{j≠i} RSi,j   (16)

The modified Davies-Bouldin (MDB) index for the partition {C1, C2, ..., Ck} is the average of the MRSi (i = 1, 2, ..., k), denoted MDB(k):

MDB(k) = (1/k) Σ_{i=1}^{k} MRSi   (17)

The smaller MDB(k), the better the partition. To find the optimal level of clustering, we can plot MDB against k and search for a minimum. We tested this approach on two data sets: the Iris data from UCI (University of California, Irvine) and the data set used in OPTICS [1]. The clustering results were generated using CAMP. Figure 15 shows the MDB-k diagram for the two data sets.

Figure 15. MDB(k) for the two data sets.

From Figure 15, we can see that MDB reaches its minimum at k = 3 for the Iris data and at k = 6 for the OPTICS data. Therefore the optimal levels for the Iris data and the OPTICS data are at 3 and 6 clusters, respectively. These results conform to the ground truth of the two data sets.
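A minimal sketch of the MDB computation in equations (14)-(17), assuming each cluster is given as a list of d-dimensional points:

```python
import math

def mdb_index(clusters):
    """Modified Davies-Bouldin index, eqs. (14)-(17): the average over
    clusters of the maximum relative similarity to any other cluster.
    `clusters` is a list of lists of d-dimensional tuples."""
    means = [tuple(sum(dim) / len(pts) for dim in zip(*pts)) for pts in clusters]
    errs = [sum(math.dist(x, m) ** 2 for x in pts) / len(pts)      # eq. (15)
            for pts, m in zip(clusters, means)]
    k = len(clusters)
    mrs = [max((errs[i] + errs[j]) / math.dist(means[i], means[j]) ** 2
               for j in range(k) if j != i)                        # eqs. (14), (16)
           for i in range(k)]
    return sum(mrs) / k                                            # eq. (17)
```

To pick a cutting level, one would evaluate `mdb_index` on the partition at each level of the hierarchy and take the level with the smallest value; compact, well-separated partitions score lower than overlapping ones.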
6 PERFORMANCE STUDY

We used three microarray expression datasets: DS1, DS2, and DS3. DS1 is the dataset used by CLICK [20]; it contains expression levels of 8,613 human genes measured at 12 time points. DS2 and DS3 were obtained from Michael Eisen's lab [17]. DS2 is a 6,221 × 80 gene expression matrix. DS3 is the largest dataset, with 13,413 genes under 36 experimental conditions. The raw expression data was first normalized [3]; the datasets were then decomposed and converted to P-trees. We implemented the HK-means [21], BIRCH [22], CAST [3], and CAMP algorithms in C++ on a Debian Linux 3.0 PC with a 1 GHz Pentium CPU and 1 GB of main memory. To make the algorithms comparable, we ran each method up to the level where the number of clusters equals the number of clusters that CAST produces: 12 for DS1, 8 for DS2, and 22 for DS3.

6.1 Run Time Comparison

The total run times of the different algorithms on DS1, DS2, and DS3 are shown in Figure 16. CAMP is the fastest among the four methods; in particular, CAMP outperforms HK-means and CAST substantially when the data set is large.

Figure 16. Run time comparison.

6.2 Clustering Results Comparison

The clustering results are evaluated by means of the MDB statistical index discussed in Section 5.2. The MDB value of each method on data sets DS1, DS2, and DS3 is shown in Figure 17; a low MDB value means good clustering results. From Figure 17, we can see that CAMP and CAST both have lower MDB values (better clustering results) than the other two methods.

Figure 17. Clustering results measured by MDB.

In summary, CAMP outperforms the other three methods in terms of execution time, with clustering results comparable to CAST.

6.3 Visualization of Clustering Results

It is often useful to visualize clustering results. Eisen et al. [7] developed a software tool to visualize hierarchical clustering results of gene expression data.
The software, Java TreeView, was written by Alok Saldanha at Stanford University. The program reads in files with matching .cdt and .gtr (or other) extensions and visualizes gene clustering results as a gene tree graph. A .cdt (clustered data table) file contains the original data, reordered to reflect the clustering results. It has the same format as the input files, except that an additional column, GID, is added. The GID column contains a unique identifier for each gene, which is used in conjunction with the .gtr file to build a gene tree. The .gtr (gene tree) file records the order in which the genes are joined during clustering. Given a .cdt and a .gtr file, Java TreeView generates a gene tree graph. The expression values of each gene are rendered in a red-green color scale, where red represents higher expression and green indicates lower expression in the given experiment. Figure 18 is a partial gene tree graph.

Figure 18. A snapshot of a partial gene tree

Java TreeView was designed for traditional hierarchical clustering, so it is necessary to extend the tool to accommodate hybrid clustering. One simple way is to take a representative point from each preliminary cluster and treat it as a leaf node. For example, we can take the local density attractors as the representative points to visualize our clustering results. In DS1, CAMP generates 92 local density attractors. We create one file containing all the local density attractors in .cdt format and another file in .gtr format based on the merging process in CAMP, and then input them into Java TreeView to generate a gene tree of the hybrid clustering results. Figure 19 shows the hybrid gene tree of DS1.

Figure 19. The gene tree of hybrid clustering on DS1

7 CONCLUSION

In this paper, we have proposed CAMP, an efficient hybrid clustering method using attractor trees, which combines the features of both the density-based and the similarity-based clustering approaches.
A vertical data structure, P-tree, is used to make the algorithm more efficient by accelerating the calculation of the density function. The process of building local attractor trees prunes a large portion of the bottom of the hierarchical structure, and the relative cluster similarity measure and the noise-handling process improve clustering accuracy. Experiments on common gene expression datasets demonstrated that our approach is more efficient and scalable, with competitive accuracy. In the future, we will apply our approach to large-scale time-series gene expression data, where efficient and scalable analysis approaches are in demand. We will also explore a comprehensive tool to visualize hybrid clustering as well as hierarchical clustering.

REFERENCES

1. Ankerst, M., Breunig, M., Kriegel, H.-P. and Sander, J. OPTICS: Ordering points to identify the clustering structure. ACM SIGMOD Conference on Management of Data (SIGMOD'99), Philadelphia, PA, 1999, pp. 49-60.
2. Arima, C. and Hanai, T. "Gene Expression Analysis Using Fuzzy K-Means Clustering", Genome Informatics, 14, pp. 334-335, 2003.
3. Ben-Dor, A., Shamir, R. and Yakhini, Z. "Clustering gene expression patterns", Journal of Computational Biology, Vol. 6, 1999, pp. 281-297.
4. Berkhin, P. Survey of Clustering Data Mining Techniques. Technical report, Accrue Software, 2002.
5. Cho, R. J. et al. "A Genome-Wide Transcriptional Analysis of the Mitotic Cell Cycle." Molecular Cell, 2:65-73, 1998.
6. Ding, Q., Khan, M., Roy, A. and Perrizo, W. "The P-Tree Algebra", ACM SAC, 2002.
7. Eisen, M. B., Spellman, P. T., et al. "Cluster analysis and display of genome-wide expression patterns". Proceedings of the National Academy of Sciences USA, pp. 14863-14868, 1998.
8. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., et al. "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring". Science, 286, pp. 531-537, 1999.
9. Han, J. and Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
10. Hartigan, J. A. and Wong, M. A. A k-means clustering algorithm. Applied Statistics, 28, 1979, pp. 100-108.
11. Hartuv, E. and Shamir, R. A clustering algorithm based on graph connectivity. Information Processing Letters, 76(4-6):175-181, 2000.
12. Hinneburg, A. and Keim, D. A. An Efficient Approach to Clustering in Large Multimedia Databases with Noise. Proceedings of the 4th Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, 1998.
13. Jain, A. K. and Dubes, R. C. Algorithms for Clustering Data. Prentice-Hall advanced reference series. Prentice-Hall, Inc., 1988.
14. Jain, A. K., Murty, M. N. and Flynn, P. J. Data Clustering: A Review. ACM Computing Surveys, Vol. 31, No. 3, 1999, pp. 264-323.
15. Karypis, G., Han, E.-H. and Kumar, V. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8):68-75, August 1999.
16. Lin, C. and Chen, M. Combining Partitional and Hierarchical Algorithms for Robust and Efficient Data Clustering with Cohesion Self-Merging. IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 2, 2005, pp. 145-159.
17. Michael Eisen's gene expression data is available at http://rana.lbl.gov/EisenData.htm
18. Murty, M. N. and Krishna, G. A Hybrid Clustering Procedure for Concentric and Chain-Like Clusters. International Journal of Computer and Information Sciences, Vol. 10, No. 6, 1981, pp. 397-412.
19. Perrizo, W. "Peano Count Tree Technology". Technical Report NDSU-CSOR-TR-01-1, 2001.
20. Shamir, R. and Sharan, R. CLICK: A clustering algorithm for gene expression analysis. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB '00), AAAI Press, 2000.
21. Tavazoie, S., Hughes, J. D., et al. "Systematic determination of genetic network architecture". Nature Genetics, 22, pp. 281-285, 1999.
22. Zhang, T., Ramakrishnan, R. and Livny, M. BIRCH: an efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD Int'l Conf. on Management of Data, 1996.