OMICS A Journal of Integrative Biology, Volume 10, Number 4, 2006. © Mary Ann Liebert, Inc.

Clustering Methods for Microarray Gene Expression Data

NABIL BELACEL,1 QIAN (CHRISTA) WANG,1 and MIROSLAVA CUPERLOVIC-CULF2

ABSTRACT

Within the field of genomics, microarray technologies have become a powerful technique for simultaneously monitoring the expression patterns of thousands of genes under different sets of conditions. A main task now is to propose analytical methods to identify groups of genes that manifest similar expression patterns and are activated by similar conditions. The corresponding analysis problem is to cluster multi-condition gene expression data. The purpose of this paper is to present a general view of clustering techniques used in microarray gene expression data analysis.

INTRODUCTION

With advances of deoxyribonucleic acid (DNA) microarray technology, it became possible to monitor the expression levels of tens of thousands of genes simultaneously. To analyze the large amount of data obtained by this technology, researchers usually resort to clustering methods that identify groups of genes sharing similar expression profiles. Clustering problems are based on the notion of unsupervised learning, in which data objects within the same cluster are similar to one another and dissimilar to the objects in other clusters (Han and Kamber, 2001). In the case of clustering gene expression data, a cluster may contain a number of genes or samples with similar expression patterns. In gene expression clustering, the analysis is performed on a data matrix X = {x_ij}_{n×d}, where x_ij represents the expression level of gene i in sample j. More precisely, each row vector is the expression pattern of a particular gene across all d conditions, while each column vector is the profile of all n genes in a particular condition. The clustering of gene expression data can be divided into two main categories: gene-based clustering and sample-based clustering.
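As a concrete illustration of the two orientations of the data matrix (the values below are invented for illustration, not taken from any real experiment), the same matrix X yields gene objects as its rows and sample objects as its columns:

```python
# Hypothetical 3-gene x 4-sample expression matrix X = {x_ij} (n = 3, d = 4).
X = [
    [2.1, 1.9, 0.3, 0.2],  # gene 1 across the d = 4 conditions
    [2.0, 2.2, 0.4, 0.1],  # gene 2
    [0.1, 0.3, 1.8, 2.0],  # gene 3
]

# Gene-based clustering: the objects are the n row vectors X_i.
gene_objects = X

# Sample-based clustering: the objects are the d column vectors,
# i.e., the rows of the transposed matrix.
sample_objects = [list(col) for col in zip(*X)]

print(len(gene_objects))    # n = 3 objects with d = 4 features each
print(len(sample_objects))  # d = 4 objects with n = 3 features each
```

Clustering the rows groups co-expressed genes; clustering the columns groups samples with similar overall profiles.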
In gene-based clustering, genes are treated as objects and samples as features or attributes for clustering. The dataset to be clustered contains n objects: X_i = {x_i1, x_i2, . . . , x_id}, where 1 ≤ i ≤ n. The resulting clusters can be represented as C = {C1, . . . , Ck}, where the Cj's are disjoint clusters. The goal is to group genes with similar expression patterns (co-expressed genes). This is the cornerstone for further understanding of gene function, gene regulation, and cellular processes (Jiang et al., 2004). Similarly, sample-based clustering takes samples as objects and considers genes as features or attributes. The dataset to be clustered contains d objects: X_j = {x_1j, . . . , x_nj}, where 1 ≤ j ≤ d. The resulting clusters can be represented as C = {C1, . . . , Cm}, where the Cj's are disjoint clusters. Sample-based clustering can be used to reveal sample types that are possibly indistinguishable by traditional morphology-based approaches (Jiang et al., 2004).

1National Research Council Canada, Institute for Information Technology, Scientific Park, Moncton, New Brunswick, Canada.
2Atlantic Cancer Research Institute, Hôtel-Dieu Pavilion, Moncton, New Brunswick, Canada.

Traditional clustering techniques can be classified into two main categories: hierarchical and nonhierarchical algorithms (Stanford et al., 2003; Yeung, 2003). Hierarchical clustering algorithms group objects and provide a natural way for graphical representation of data. The graphical representation resulting from hierarchical clustering is a dendrogram, in which each branch forms a group of genes or samples that share similar behavior (Eisen et al., 2002). These types of clustering algorithms have been used extensively for the analysis of DNA microarray data (Alizadeh et al., 2000; Nielsen et al., 2002; Ramaswamy et al., 2003; Welcsh et al., 2002).
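The bottom-up grouping that produces such a dendrogram can be sketched as follows. This is a toy illustration assuming Euclidean distance and average linkage between clusters (published analyses often use correlation-based similarity instead), not the implementation of any cited package:

```python
import math

def dist(a, b):
    # Euclidean distance between two expression profiles
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2):
    # average pairwise distance between the members of two clusters
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate(profiles, k=1):
    """Bottom-up merging: start with one profile per cluster and
    repeatedly merge the closest pair until k clusters remain.
    The sequence of merges is what a dendrogram records."""
    clusters = [[p] for p in profiles]
    merges = []  # history of which groups were joined, in order
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i][:], clusters[j][:]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters, merges
```

For instance, `agglomerate([(0, 0), (0, 1), (5, 5), (5, 6)], k=2)` first joins the two nearby pairs, yielding two clusters.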
Nonhierarchical (i.e., partitional) clustering algorithms, on the other hand, partition genes into K clusters so that expression patterns in the same cluster are more similar to one another (homogeneity) than to those in different clusters (separation). Several partitional algorithms, such as K-means, partitioning around medoids, and self-organizing maps (SOM), have been applied extensively to DNA microarray data generated from different biological sources (Sherlock, 2000). For instance, K-means has been used to identify molecular subtypes of brain tumors (Shai et al., 2003), to cluster transcriptional regulatory cell cycle genes in yeast (Tavazoie et al., 1999), and to correlate changes in gene expression with major physiological events in potato biology (Ronning et al., 2003). Although both hierarchical and nonhierarchical algorithms have yielded encouraging results in clustering DNA microarray data and have been used extensively, they still suffer from several limitations. When analyzing large-scale gene expression datasets collected under various conditions, hierarchical algorithms generate nonunique dendrograms and have high time and space complexities (Morgan and Ray, 1995), while nonhierarchical methods group gene expression data into a fixed number of predefined clusters. Moreover, when different clustering algorithms are used to resolve the same DNA microarray dataset, different conclusions can be drawn (Chu et al., 1998). Recently, several new clustering algorithms (e.g., graph-theoretical clustering, model-based clustering) have been developed with the intention of combining and improving the features of traditional clustering algorithms. However, clustering algorithms are based on different assumptions, and the performance of each clustering algorithm depends on the properties of the input dataset.
Therefore, no single best clustering algorithm exists for all datasets, and the optimization of existing clustering algorithms is still a vibrant research area. This paper aims to provide a survey of the various methods available for gene clustering and to illustrate the impact of clustering methodologies on the fascinating and challenging area of genomic research. A taxonomy of clustering algorithms used in gene expression data analysis will be introduced and illustrated with emerging applications, and the strengths and weaknesses of each clustering technique will be pointed out. Subsequently, the development of software tools for clustering will be emphasized, and some of the existing commercial and open source software implementing the reviewed clustering algorithms will be discussed. Finally, conclusions and future research directions will be outlined.

CLUSTERING ALGORITHMS

Clustering algorithms can be divided into groups that separate simple clustering (each gene belongs to exactly one cluster, i.e., hard or crisp clustering) from complex clustering (each gene can belong to more than one cluster with a certain degree of membership, i.e., soft or relaxed clustering). The first group includes three conventional, widely used clustering algorithms: hierarchical clustering and two nonhierarchical clustering algorithms, K-means and SOM. The second group includes new clustering methods specifically designed for clustering gene expression data, among them DHC from density-based clustering as well as CLICK and CAST from graph-theoretical clustering. The last group contains complex clustering algorithms that represent new advances over heuristic clustering; representative algorithms include fuzzy clustering and probabilistic clustering.
CONVENTIONAL CLUSTERING ALGORITHMS

Hierarchical clustering

The principle behind hierarchical clustering is to group data objects into a tree of clusters through either an agglomerative or a divisive process. Agglomerative clustering represents a bottom-up approach, where each data object is initially placed into its own cluster. Subsequently, the closest pairs of clusters are merged until either all the data objects are in a single cluster or a certain termination condition is satisfied. Divisive clustering follows a top-down strategy. It starts with all objects in one cluster and splits the cluster until either each object forms its own cluster or a certain termination condition is met (Fig. 1). Based on the linkage metric used between two clusters, agglomerative hierarchical clustering can be further divided into single linkage, complete linkage, and average linkage. In the single linkage method, the distance between two clusters is the distance between their closest members. Each object in any cluster produced by this method is more closely related to at least one object of its cluster than to any point outside it. In the complete linkage method, the distance between two clusters is given by the distance between their most distant objects. This method produces clusters whose objects lie within some known maximum distance of one another. In the third method, average linkage, the distance between two clusters is the average of the pairwise distances between their members (often approximated by the distance between the cluster centroids). In divisive clustering, for a cluster with N objects, there are (2^(N−1) − 1) possible two-subset divisions. Determining all the possible divisions is thus too computationally expensive, particularly for gene clustering (Xu and Wunsch, 2005), and divisive clustering is therefore not commonly used in practice.

Eisen et al. (1998) developed a clustering software package (Cluster and TreeView, http://rana.lbl.gov/EisenSoftware.htm) based on the average linkage agglomerative clustering algorithm. Cluster iteratively merges the groups with the highest similarity value. The output of the algorithm is a two-dimensional dendrogram (Fig. 2). The branches of a dendrogram record the formation of groups. The length of the horizontal branches indicates the similarity between the clusters.

FIG. 1. Two approaches in hierarchical clustering. a–g show different genes/samples. Agglomerative clustering starts from individual genes (one gene in one cluster); divisive clustering starts from all genes in one cluster.

FIG. 2. Dendrogram—the most popular method to cluster microarray data. Starting from a root, the dendrogram splits into multiple clusters according to how the genes (x, y, z, . . . ) are related. Genes branch off as nodes. The branches record the formation of the clusters. The length of the horizontal branches indicates the similarity between the clusters. (Adapted from UCL Oncology, 2005.)

Hierarchical clustering methods have been used extensively in the analysis of gene expression data (Fig. 3), as well as of other types of microarray data (e.g., CGH arrays, protein arrays). In the field of cancer research, for example, hierarchical clustering has been used to identify cancer types (Nielsen et al., 2002; Ramaswamy et al., 2003), to discover new subtypes of cancer (Alizadeh et al., 2000), and to investigate cancer tumorigenesis mechanisms (Welcsh et al., 2002) from gene expression data. Au et al. (2004) explored hierarchical clustering for the identification of different subgroups in non-small cell lung carcinoma. Their results showed that hierarchical clustering analysis of an extended immunoprofile can identify two main cluster groups corresponding to adenocarcinoma and squamous cell carcinoma. The hierarchical clustering analysis of Makretsov et al.
(2004) on multiple-marker microarray immunostaining data yielded improved prognostic stratification of patients with invasive breast cancer. Hierarchical clustering was also used by Mougeot et al. (2006) for gene expression profiling of ovarian tissues, showing that clustering can distinguish between low-malignant-potential/early cancer and possible precancerous stages. The graphic representation of the results of hierarchical clustering allows users to "visualize global patterns in expression data" (Tseng, 2004), making this method a favorite among biologists. However, several key issues in hierarchical clustering still need to be addressed. The most serious problem with this method is its lack of robustness to noise, high dimensionality, and outliers (Jiang et al., 2004). Hierarchical clustering algorithms are also expensive, both computationally and in terms of space complexity (Xu and Wunsch, 2005), and thus their applicability to the analysis of large datasets is limited. Furthermore, both agglomerative and divisive approaches follow a greedy strategy that prevents cluster refinement. In other words, once a decision is made to merge (or split) clusters, it is never reconsidered or optimized. The iterative merging of clusters is determined locally at each step (i.e., by a local objective function) from pairwise distances rather than by a global criterion (Tan et al., 2005).

FIG. 3. Hierarchical clustering schema of combined yeast datasets. The color codes represent the measured fluorescence ratios. Genes with unchanged expression are colored black. Red represents relatively high and green relatively low gene expression levels. The intensity of the colors reflects different degrees of expression. (From Eisen et al., 1998. Proc Natl Acad Sci USA 95, 14863–14868.)

Several alternative approaches have been proposed to address the problems of standard hierarchical clustering.
These approaches deploy a partitional clustering algorithm (see below), such as K-means, to generate small clusters first and then perform hierarchical clustering using these small clusters as initial points (Tan et al., 2005). Ward's method (Ward, 1963) implements agglomerative hierarchical clustering using the same proximity function as K-means (Tan et al., 2005). To remedy the problems in handling large datasets, several new hierarchical clustering approaches have been proposed. One of them, CURE (clustering using representatives), is an extension of hierarchical clustering developed with the objective of being robust to outliers and of finding clusters with nonspherical shapes and varying sizes (Guha et al., 1998). In CURE, each cluster is represented by a fixed number of well-scattered points. Having more than one representative point per cluster allows CURE to capture more sophisticated cluster shapes. Furthermore, CURE is less sensitive to outliers, since the effects of outliers are weakened during the process of shrinking the scattered points toward the centroid.1 CURE utilizes random sampling and partitioning in order to scale to large datasets without sacrificing clustering quality. The space complexity of this approach increases linearly with the dataset size, while its time complexity is no worse than that of conventional hierarchical algorithms. Another noteworthy hierarchical clustering approach is BIRCH (balanced iterative reducing and clustering using hierarchies) (Zhang et al., 1996). BIRCH creates a special data structure called a clustering feature (CF) tree. The BIRCH algorithm scans the database, and cluster summaries are stored in memory in the form of a CF tree. During this preclustering phase, crowded data points are grouped into subclusters while scattered data points are removed as outliers.
The algorithm then applies centroid-based hierarchical clustering to perform global clustering, incrementally eliminate more outliers, and refine clusters. BIRCH can work with any given amount of memory, and it is also capable of handling outliers effectively. BIRCH represents the state of the art in clustering large-scale datasets.

Partitional clustering by K-means

K-means is a very straightforward, commonly used partitioning method. The first step in the clustering process is to randomly select K objects, each representing an initial cluster mean, or centroid. The objects are then assigned to clusters by finding each object's nearest centroid. The algorithm then computes the new mean for each cluster and reassigns the objects. The iteration stops when the cluster boundaries stop changing. Figure 4 illustrates the standard K-means procedure for clustering gene expression data. Tavazoie et al. (1999) used K-means clustering of whole-genome mRNA data to identify transcriptional regulatory subnetworks in yeast. By iteratively relocating cluster members and minimizing the overall intra-cluster dispersion, 3,000 genes were grouped into 30 clusters. From the gene coregulation information it was then possible to infer the biological significance of newly discovered genes from the functions of known genes and motifs. Shai et al. (2003) performed a K-means clustering analysis to identify molecular subtypes of gliomas; the three resulting clusters corresponded to glioblastomas, lower-grade astrocytomas, and oligodendrogliomas. K-means is relatively scalable and efficient when processing large datasets, and it can converge to a local optimum in a small number of iterations. However, K-means still has several drawbacks. First, the user has to specify the initial number of clusters, and the resulting centroids vary with the initial partitions (Xu and Wunsch, 2005). One of the characteristics of gene expression clustering is that such prior knowledge is not available.
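The iteration just described, with K fixed in advance, can be sketched as follows. This is a minimal illustration assuming Euclidean distance, not the code used in the cited studies:

```python
import math, random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means: random initial centroids, assign, recompute, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: pick K initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                       # step 2: assign to nearest centroid
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        new_centroids = [                      # step 3: recompute cluster means
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centroids[i]
            for i, pts in enumerate(clusters)
        ]
        if new_centroids == centroids:         # stop when centroids are stable
            break
        centroids = new_centroids
    return clusters, centroids
```

Note that the result depends on the random initialization, which is one of the drawbacks discussed in the text.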
Thus, in order to detect the optimal number of clusters, users have to run the algorithm repeatedly with different K values, compare the clustering results, and decide on the optimal number of clusters accordingly. For a large gene expression dataset, this extensive fine-tuning process is not practical. A second problem with K-means clustering is its sensitivity to noise and outliers. Gene expression data are noisy and contain a significant number of outliers, which can substantially influence the mean values and thus the cluster positions. Finally, K-means often terminates at a local, possibly suboptimal, minimum.

1Outliers lie further from the cluster centroid than other scattered points.

FIG. 4. K-means for clustering gene expression data. Genes are represented as points in space, where similarly expressed genes are close together. (1) The process is initiated by randomly partitioning the genes into three groups. (2) The centroids of the three groups are given different colors. Genes are assigned to the closest centroid. (3) The results of gene assignment are shown. (4 and 5) Steps 2 and 3 are iteratively repeated. (6) Centroids are stable; the termination condition is met. (Modified from Gasch and Eisen, 2002. Genome Biol 3, 1–22.)

To improve robustness to noise and outliers, the K-medoid algorithm was introduced (Mercer and College, 2003). A medoid is a representative point for a cluster selected by the algorithm (k medoids representing k clusters). Using medoids has two advantages: first, there is no limitation on attribute types; and second, medoids are existing data points and thus, unlike centroids, they require no computation to generate (Berkhin, 2002). The K-medoid algorithm is therefore less sensitive to outliers. The most popular K-medoid algorithm is partitioning around medoids (PAM) (Kaufman and Rousseeuw, 1990), together with its extension PAMSIL (van der Laan et al., 2003).
PAMSIL replaces the objective function used in PAM with the average silhouette (first proposed by Kaufman and Rousseeuw, 1990). The data points are assigned to k clusters so as to maximize the average silhouette. The partition in this case depends not only on how well a data point fits its current cluster, but also on how well it fits the next closest cluster. Experiments on simulated microarray data demonstrated that PAMSIL can find small homogeneous clusters (van der Laan et al., 2003). Like most partitional clustering methods, K-means requires prior knowledge of the number of clusters. Several research efforts have aimed at developing methods that can determine the number of clusters. Yeung et al. (2001) used probabilistic models to resolve the optimal number of clusters. Hastie et al. (2000) proposed the "gene shaving" method, a statistical method to identify distinct clusters of genes. Hruschka et al. (2006) introduced an evolutionary algorithm for clustering (EAC). EAC includes the K-means algorithm as a local search procedure, applies a centroid-based objective function, eliminates the crossover operation, and adds a sophisticated mutation operation. EAC extends a clustering genetic algorithm and is capable of automatically discovering an optimal number of clusters. A common criticism of K-means is that it only converges to a local optimum. Optimization techniques such as tabu search, simulated annealing, and genetic algorithms have been utilized to achieve global optimization. However, these algorithms suffer from high computational costs. To alleviate the cost, Krishna and Narasimha Murty (1999) introduced a new clustering method called genetic K-means (GKA). GKA hybridizes the genetic algorithm with a gradient descent algorithm and the K-means algorithm.
By using distance-based mutation instead of the expensive crossover operation, GKA converges to the global optimum faster than other evolutionary algorithms. Inspired by GKA, Lu et al. (2004b) proposed a fast genetic K-means algorithm (FGKA). It features several improvements, including an efficient evaluation of the objective value (the total within-cluster variation, TWCV), avoidance of illegal strings through lowered mutation probabilities, and a simplification of the mutation operation. FGKA converges to the global optimum faster than GKA. Based on FGKA, Lu et al. (2004a) introduced incremental genetic K-means (IGKA), which has better time performance when the mutation probability is small.

Self-organizing maps

The self-organizing map (SOM) is one of the best-known unsupervised neural network learning algorithms; it was first developed by Kohonen (1984). It is based on a single-layered artificial neural network (ANN), and the SOM is constructed by training. The data objects are the input of the network. The output units, called output neurons, are organized as a one-, two-, or three-dimensional map (depending on the type of SOM). Each neuron is associated with a weight vector (reference vector). During the learning process, each data object acts as a training example, which directs the movement of the initially randomly assigned weight vectors towards the denser areas of the input vector space. Clustering is performed by having neurons compete for the current data object. The neuron whose weight vector is closest to the current object becomes the winning unit. The weight vector of the best-matching neuron and those of its neighbors move towards the current object, and the weights of the winning neuron and its neighbors are adjusted. As the learning proceeds, the adjustment to the weight vectors diminishes. When the training is complete, clusters are identified by mapping all data objects to the output neurons.
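The competitive learning loop described above can be sketched for a one-dimensional map as follows. The exponential neighborhood falloff and linear decay schedule below are simplifying assumptions for illustration; Kohonen's formulation admits many variants:

```python
import math, random

def train_som(data, n_neurons, epochs=50, lr0=0.5, weights=None, seed=1):
    """Minimal 1-D SOM sketch: the winning neuron and its map neighbors
    move toward each training vector; the adjustment decays over time."""
    rng = random.Random(seed)
    dim = len(data[0])
    if weights is None:  # random initial reference vectors
        weights = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for epoch in range(epochs):
        lr = lr0 * (1.0 - epoch / epochs)          # diminishing adjustment
        for x in data:
            # competition: the neuron whose weight vector is closest wins
            win = min(range(n_neurons), key=lambda i: math.dist(weights[i], x))
            for i in range(n_neurons):
                influence = lr * math.exp(-abs(i - win))  # neighborhood falloff
                for d in range(dim):
                    weights[i][d] += influence * (x[d] - weights[i][d])
    # clusters: map every object to its best-matching neuron
    return [min(range(n_neurons), key=lambda i: math.dist(weights[i], x))
            for x in data]
```

Objects mapped to the same output neuron end up in the same cluster.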
SOM is a vector quantization method that can simplify and reduce the high dimensionality of raw expression data (Wang et al., 2002c). It allows easy visualization of complex data, as well as analysis of large-scale datasets (Tseng, 2004). Toronen et al. (1999) developed and employed the prototype software GenePoint (a tree-based SOM) and Sammon's mapping to analyze and visualize gene expression during a diauxic shift in yeast. The work demonstrated that SOM is a reliable and fast cluster analysis tool for the analysis and visualization of gene expression profiles. The Genecluster software (Tamayo et al., 1999) is also based on SOM (http://www.broad.mit.edu/cancer/software). Genecluster takes the expression levels from any gene-profiling method and the topology of the neurons as input. It uses a web-based interface to visualize the clusters. Each cluster is represented by its average expression pattern, with error bars showing the standard deviation at each condition. Genecluster has been applied to hematopoietic differentiation with the aim of determining the optimal treatment for acute promyelocytic leukemia. SOM is a good alternative to traditional clustering methods. One of its appealing features is that it provides an intuitive mapping of a high-dimensional dataset. The neuron learning process makes SOM more robust to noisy data than the K-means algorithm (Jiang et al., 2004). In addition, SOM is efficient in handling large-scale datasets, as it has a linear run time (Herrero and Dopazo, 2002). However, SOM still has several pitfalls. As in K-means, users are required to predefine the number of initial clusters as well as the topology of the neurons. The convergence is controlled by parameters such as the learning rate and the neuron topology. SOM can converge to a local optimum rather than the global optimum if the initial weights are not chosen properly (Jain et al., 1999).
When clustering gene expression data, SOM results are more dependent on the size of the clusters than on the actual differences among gene profiles (Herrero and Dopazo, 2002). Jiang et al. (2004) also state that SOM can be inaccurate for datasets abundant in irrelevant and invariant genes; such data will populate the majority of clusters, and therefore most of the meaningful patterns might go unidentified. To tackle these problems of SOM, Su and Chang (2001) proposed a novel model called double SOM (DSOM). In DSOM, each node is associated not only with an n-dimensional weight vector but also with a two-dimensional position vector. During the learning process, both the weight and the position vectors are updated. By plotting the grouping of the two-dimensional position vectors, the number of clusters can be determined. Wang et al. (2002a) applied DSOM to cluster gene expression data in yeast, using the figure of merit method for validation. This experiment showed that DSOM can reveal the number of clusters based on the final locations of the position vectors. Inspired by the idea of integrating the merits of hierarchical clustering and SOM, Hsu et al. (2003) recommended a hierarchical dynamic self-organizing approach, which combines a dynamic SOM tree and growing SOM (GSOM). This approach was applied to leukemia and colon cancer microarray data. The results showed that GSOM can automatically generate an appropriate number of clusters and perform cancer class discovery and marker gene identification.

FIG. 5. Topology of a self-organizing map with a 5 × 5 output array. The input vector is connected to all output nodes (only node j is shown here). Note that the arrangement of bins in the output array represents the similarity of patterns: similar patterns and bins are adjacent, and different ones are well separated.
Herrero and Dopazo (2002) proposed and implemented a new approach (SOMTree) combining hierarchical clustering and SOM (http://bioinfo.cnio.es/wwwsomtree/). SOMTree was used for exploratory analysis of gene expression profiles during a diauxic shift in yeast. The results provided strong evidence that the combination of SOM and hierarchical clustering methods constitutes a fast and accurate way to perform exploratory analysis of large datasets. To improve the rate of convergence, Xiao et al. (2003) proposed a hybrid clustering approach based on SOM and a simple evolutionary method, particle swarm optimization (PSO) (Kennedy and Eberhart, 1999). The hybrid approach deploys PSO to evolve the weights for the SOM, and the rate of convergence is improved by introducing a conscience factor into the SOM. The method was tested on the rat and yeast benchmark datasets. The results show that the proposed approach not only maintains the desirable topological ordering of SOM, but also converges faster to a more refined clustering.

NEW CLUSTERING ALGORITHMS

Density-based clustering

As its name implies, density-based clustering is designed to recognize dense areas in the object space. Clusters are defined as regions with a high density of points (Kröger, 2004). In other words, for a point in a cluster, its neighborhood within a given radius must contain a minimum predefined number of points (i.e., the density in the neighborhood has to exceed a predefined threshold) (Mercer and College, 2003). A conventional density-based algorithm, density-based spatial clustering of applications with noise (DBSCAN) (Ester et al., 1996), is built on the concepts of density, connectivity, and boundary. DBSCAN requires two predefined parameters: ε, the radius defining the neighborhood of any data object; and MinPts, a threshold giving the minimum number of points required within distance ε of a point.
A data object O is a core object if the number of points within its neighborhood exceeds MinPts. The goal of the clustering process is to connect neighboring core objects. The non-core points inside a cluster form the boundary of the cluster, and the data points that are not connected to any core point are defined as outliers. The performance of this method is quite sensitive to the two predefined parameters, which limits its application to complex datasets. Other density-based algorithms, such as OPTICS (ordering points to identify the clustering structure) and DENCLUE (density-based clustering), were proposed to address this problem. OPTICS (Ankerst et al., 1999) is an extension of DBSCAN to an infinite number of distance parameters ε_i smaller than the generic radius ε (Fig. 6). Several distance parameters are processed simultaneously. OPTICS also stores the augmented order in which the data objects are processed. Instead of relying solely on the two parameters (ε and MinPts), additional distance parameters—the core distance and the reachability distance—are associated with each data object. OPTICS is therefore more robust to these global predefined parameters, at the cost of a higher run time than DBSCAN (roughly 1.6 times the runtime of DBSCAN) (Berkhin, 2002). Both DBSCAN and OPTICS are unsuitable for processing high-dimensional data. Another method, DENCLUE (Hinneburg and Keim, 1998), uses an influence function to describe the impact of a point within its neighborhood. The overall density function can be estimated as the sum of the influence functions of all data points. Clusters are discovered by identifying the maxima of the overall density function. DENCLUE is more efficient than other density-based clustering algorithms, and it can describe arbitrarily shaped clusters in high-dimensional datasets.
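The DBSCAN notions introduced above (ε-neighborhood, core object, boundary point, outlier) can be sketched as follows; this is a simplified illustration assuming Euclidean distance, not the original implementation:

```python
import math

def dbscan(points, eps, min_pts):
    """Simplified DBSCAN: grow clusters from core points; points
    reachable from no core point are labeled as outliers (-1)."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # a core object has at least min_pts points in its eps-neighborhood
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None or not core[i]:
            continue
        stack = [i]                        # expand a new cluster from this core point
        while stack:
            j = stack.pop()
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if core[j]:                    # only core points extend the cluster;
                stack.extend(neighbors[j]) # non-core members form the boundary
        cluster += 1
    # points attached to no cluster are outliers
    return [l if l is not None else -1 for l in labels]
```

For example, with eps = 1.5 and min_pts = 3, the three mutually close points in `[(0, 0), (0, 1), (1, 0), (10, 10)]` form one cluster, while the distant point is marked as an outlier.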
None of the above-mentioned density-based algorithms fulfills all of the essential requirements for a clustering method—visualization, robustness, and automatic determination of the number of clusters (Jiang et al., 2003b). A density-based hierarchical algorithm was specially designed for clustering gene expression data, especially time series gene expression data, with these requirements in mind.

FIG. 6. The density-based clustering technique known as OPTICS. It generalizes density-based clustering by ordering the points, allowing the extraction of clusters with arbitrary values for ε.

Density-based hierarchical clustering

Jiang et al. (2003b) proposed DHC, a density-based hierarchical clustering algorithm primarily for effectively clustering time series gene expression data. The algorithm interprets the cluster structure of a dataset by constructing a density tree in two steps. In the first step, all data objects are organized into a hierarchical structure based on the density and attraction properties of the data objects. A data object with high density attracts objects with lower density. The attraction of a data object O is a set of objects A(O), which can be defined as A(O) = {Oj : density(Oj) > density(O)}. The attractor of a data object O is the object Oj ∈ A(O) with the largest attraction. Each node of the attraction tree denotes a data object, and the parent of each node represents the attractor of that data object. During the second step, DHC summarizes the clusters and prunes the noise and outliers. The resulting structure is a density tree, with each node denoting a dense area. The density tree contains two types of nodes: cluster nodes and collection nodes. Cluster nodes are leaf nodes that cannot be decomposed further, while collection nodes are internal nodes that can be further decomposed. DHC recursively splits collection nodes until termination criteria are met.
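A simplified reading of the attraction step can be sketched as follows, taking an object's density to be the number of neighbors within a radius and its attractor to be the nearest object of strictly higher density. This is an illustrative approximation of DHC's first step, with invented helper names, not the authors' algorithm:

```python
import math

def attraction_tree(points, radius):
    """Toy version of a density-attraction step: each object's parent
    (attractor) is the closest object with strictly higher density;
    the densest object becomes a root (parent -1)."""
    n = len(points)
    # density: number of points (including itself) within the radius
    density = [
        sum(1 for j in range(n) if math.dist(points[i], points[j]) <= radius)
        for i in range(n)
    ]
    parent = []
    for i in range(n):
        denser = [j for j in range(n) if density[j] > density[i]]
        if not denser:
            parent.append(-1)  # root of the tree: a cluster medoid
        else:
            parent.append(min(denser,
                              key=lambda j: math.dist(points[i], points[j])))
    return density, parent
```

Following parent links upward leads every object to a locally densest point, mirroring how the attraction tree's root acts as the cluster medoid.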
Comparative experiments by Jiang et al. (2003b) have shown that DHC has several advantages. DHC detects the number of clusters automatically, and its clustering performance is robust to the predefined parameters (ε and MinPts). DHC uses the gene with the highest density within a group as the medoid of the cluster, making it more robust to noise and outliers. DHC provides users with an intuitive view of the relationships and connections between clusters and handles both embedded clusters and intersecting clusters. The result of DHC can be visualized as a hierarchical structure: DHC constructs two kinds of trees, an attraction tree and a density tree. In the attraction tree, the root represents the medoid of the cluster, the hierarchical level of the data objects reflects their similarity to the medoid, and outliers are placed at the leaf level, where they can be easily recognized. Finally, DHC is scalable to large-scale gene expression data. However, DHC still has some shortcomings. First, DHC has a high computational cost, primarily because it calculates the distance between each pair of data objects in the dataset in order to determine the density of each object. Also, DHC still requires users to predefine the two threshold parameters that are used to decompose dense areas (Jiang et al., 2004).

Graph-theoretical clustering

As the name suggests, graph-theoretical clustering describes clustering problems by means of graphs. Given a dataset X, we can construct a weighted graph G(V, E), in which the vertices V correspond to data objects and the edges E reflect the proximity between each pair of data objects. Based on a threshold value, each proximity is mapped to either 0 or 1, with an edge existing between a pair of objects only when their proximity maps to 1 (Jiang et al., 2004).
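The thresholding construction just described can be sketched directly (a minimal illustration; the similarity values and threshold below are arbitrary):

```python
def proximity_graph(similarity, threshold):
    """Map a pairwise similarity matrix to an unweighted graph: an edge
    (i, j) exists only when similarity[i][j] >= threshold. Returns an
    adjacency list, the input form used by graph-theoretical clustering."""
    n = len(similarity)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def connected_components(adj):
    """Connected components of the thresholded graph; in the simplest
    graph-theoretical view these are the candidate clusters."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps
```

Methods such as HCS, CLICK, and CAST refine this basic picture: rather than taking raw connected components, they cut or grow subgraphs according to connectivity or affinity criteria.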
Graph theory has been applied in the agglomerative hierarchical clustering algorithm Chameleon (Jain et al., 1999; Karypis et al., 1999), which uses a k-nearest-neighbor graph to eliminate irrelevant points. Using minimum edge cuts, Chameleon divides the weighted graph into a set of subclusters. It then merges these small subclusters according to their relative interconnectivity and relative closeness until the final clusters are found. Graph theory has also been utilized for nonhierarchical clustering. Hartuv and Shamir (2000) constructed a clustering algorithm, HCS (highly connected subgraphs), based on graph connectivity. HCS defines clusters as highly connected subgraphs whose edge connectivity exceeds half the number of vertices. The edge connectivity of a graph G is the minimum number of edges whose removal disconnects the graph. A cut is a set of edges whose removal results in a disconnected graph. A minimum cut separates a graph G with a minimum number of edges and is used recursively to find highly connected subgraphs (clusters). The HCS algorithm has been tested on both simulated and real gene expression data, and the experimental results show that HCS is a promising solution for clustering gene expression data even with a high volume of noise. CLICK and CAST are other graph-theoretical clustering algorithms commonly used in gene expression data analysis; they are discussed in more detail below.

Cluster identification via connectivity kernels

CLICK (cluster identification via connectivity kernels) (Sharan et al., 2003) is a commonly used graph-theoretical clustering approach. The concept behind CLICK is to identify kernels (clusters) of highly similar data objects. CLICK assumes a normal distribution of pairwise similarity values between all data objects. CLICK works in two phases. In the first phase, the clustering process iteratively finds the minimum cut in G and recursively splits the dataset into a set of connected components along the minimum cut. The output of this phase is a list of kernels and singletons. In the second phase, kernels are expanded into a set of final clusters. The adoption step repeatedly searches for a singleton and a kernel with the maximum similarity and assigns that singleton to the kernel. The merging step iteratively merges two clusters whose similarity exceeds a predefined threshold. The CLICK software has been implemented and tested on large-scale gene expression datasets with promising results. A Java-based gene expression analysis and visualization package, EXPANDER (expression analyzer and displayer; www.cs.tau.ac.il/~rshamir/expander/expander.html), was recently developed. It contains several clustering methods, including CLICK, biclustering, and conventional clustering algorithms, and it enables clustering, visualization, and functional enrichment and promoter analysis (Shamir, 2003). One of the appealing advantages of CLICK is that it does not require a predefined initial number of clusters (Sharan et al., 2003). Several refinement procedures guarantee the scalability of CLICK. In Sharan et al. (2002), CLICK was applied to two publicly available gene expression datasets and the results were compared with two standard clustering methods.

TABLE 5. COMPARISON BETWEEN CLICK AND GENECLUSTER

Program       Dataset           Homogeneity^a   Separation^b
CLICK         Yeast cell cycle  0.80            0.07
GeneCluster   Yeast cell cycle  0.74            0.02

^a Average similarity between a gene and the center (average profile) of its cluster.
^b Weighted average similarity between centers of clusters.

TABLE 2. COMPARISON BETWEEN CLICK AND HIERARCHICAL CLUSTERING

Program       Dataset                            Homogeneity^a   Separation^b
CLICK         Human fibroblasts response to serum  0.88          0.34
Hierarchical  Human fibroblasts response to serum  0.87          0.13

^a Average similarity between a gene and the center (average profile) of its cluster.
^b Weighted average similarity between centers of clusters.
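The homogeneity and separation scores reported in Tables 1 and 2 can be computed as sketched below. This is an illustration following the table footnotes; we assume Pearson correlation as the similarity measure, which is common in this literature but not stated explicitly here:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two expression profiles
    (undefined for constant profiles; not handled in this sketch)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def centroid(cluster):
    """Average profile (center) of a cluster of genes."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def homogeneity(clusters):
    """Average similarity between each gene and its cluster's center."""
    pairs = [(g, centroid(c)) for c in clusters for g in c]
    return sum(pearson(g, cen) for g, cen in pairs) / len(pairs)

def separation(clusters):
    """Average similarity between cluster centers, weighted by the
    product of cluster sizes (lower separation is better)."""
    cents = [centroid(c) for c in clusters]
    sizes = [len(c) for c in clusters]
    num = den = 0.0
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            w = sizes[i] * sizes[j]
            num += w * pearson(cents[i], cents[j])
            den += w
    return num / den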
The degrees of homogeneity (similarity within clusters) and separation (dissimilarity between clusters) produced by CLICK, GeneCluster, and a hierarchical clustering algorithm in this comparison are given in Tables 1 and 2. The conclusion of the study is that CLICK is superior to the other two algorithms (Sharan et al., 2002, 2003). In addition, CLICK clustering is very fast (Table 3), clustering thousands of elements in minutes and over 100,000 elements in a couple of hours on a regular workstation (Shamir, 2001; Sharan et al., 2003). However, CLICK still has its pitfalls. The search for kernels relies on a sequence of heuristic procedures to expand the kernels into a full clustering; this heuristic search is not guaranteed to be exhaustive and thus does not guarantee globally optimal results (Tseng, 2004). Furthermore, neither embedded clusters nor highly intersecting clusters can be handled by CLICK; they are recognized as a single cluster (Jiang et al., 2004).

TABLE 3. TIME PERFORMANCE OF CLICK

Elements   Problem                             Time (min)
517        Gene expression, fibroblasts        0.5
826        Gene expression, yeast cell cycle   0.2
2,329      cDNA OFP, blood monocytes           0.8
20,275     cDNA OFP, sea urchin eggs           32.5
117,835    Protein similarity                  126.3

Cluster affinity search technique

Cluster affinity search technique (CAST) (Ben-Dor et al., 1999) is a probabilistic model developed for discovering true clusters with high probability. It is also a graph-based heuristic clustering algorithm inspired by the idea of a corrupted clique graph (Ben-Dor et al., 1999). CAST assumes that the complexity of gene expression measurement introduces random errors into the true clustering (i.e., the similarity measure for any two genes is assumed to be wrong with probability α). Using a graph representation, the input data are represented by an undirected graph in which each node represents a gene and edges connect genes with similar expression patterns.
Therefore, the true clusters can be denoted by a clique graph H, a set of disjoint cliques. The corrupted random graph G is derived from H by randomly adding or removing edges with probability α. The whole process of clustering gene expression profiles can then be viewed as recovering the ideal clique graph H from the corrupted graph G with the fewest errors. CAST forms nonhierarchical (unrelated) clusters with clear boundaries (Ben-Dor et al., 1999). The heuristic implementation of CAST provides users with a choice of the preferred algorithm based on the goal of the experiment (Ben-Dor et al., 1999). CAST has a wide range of applications in gene expression data analysis. It has been utilized to analyze temporal gene expression patterns, to identify multicondition expression patterns, and to classify tumor tissues (Ben-Dor et al., 1999). The experimental results demonstrate that CAST is a useful, efficient tool in gene expression data analysis. Sharan et al. (2003) performed a comparative experiment on CLICK and CAST in terms of a compatibility score, defined as the number of tissue pairs that are mates or nonmates in both the true labeling and the clustering solution (Ben-Dor et al., 1999; Sharan et al., 2003). In this analysis, CLICK and CAST were applied to two datasets: colon epithelial cell samples and acute lymphoblastic leukemia (ALL) samples. The average classification accuracy results are comparable: CAST performed slightly better on the colon dataset, while CLICK performed better on the leukemia dataset. Table 4 shows the results of this comparative experiment (Sharan et al., 2002). Although CAST does not rely on a user-defined number of clusters, it still requires a predefined affinity parameter t. Also, the running time of the theoretical version is exponential. As noted by Bellaachia et al. (2002), CAST has an expensive cleaning step that moves data points from their current cluster to another cluster in which they have a higher affinity.
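The add/remove heuristic at the heart of CAST can be sketched as follows. This is a simplified illustration of the published algorithm (affinity here includes self-similarity, and a production implementation would also guard against add/remove oscillation); the affinity threshold t is supplied by the user:

```python
def cast(similarity, t):
    """Heuristic CAST: grow one open cluster at a time by adding the
    highest-affinity unassigned element when its affinity is at least
    t * |C|, and removing the lowest-affinity member when its affinity
    drops below t * |C|; close the cluster when stable and repeat."""
    n = len(similarity)
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        c = set()
        affinity = {i: 0.0 for i in unassigned}
        changed = True
        while changed:
            changed = False
            # addition step: best unassigned element outside the cluster
            if unassigned - c:
                u = max(unassigned - c, key=lambda i: affinity[i])
                if not c or affinity[u] >= t * len(c):
                    c.add(u)
                    for i in affinity:
                        affinity[i] += similarity[i][u]
                    changed = True
            # removal step: worst element inside the cluster
            if c:
                v = min(c, key=lambda i: affinity[i])
                if affinity[v] < t * len(c):
                    c.remove(v)
                    for i in affinity:
                        affinity[i] -= similarity[i][v]
                    changed = True
        clusters.append(c)
        unassigned -= c
    return clusters
```

The cleaning step criticized by Bellaachia et al. (2002) corresponds to a final pass that reassigns points to whichever closed cluster gives them the highest affinity; it is omitted from this sketch.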
TABLE 4. COMPARISON OF THE CLASSIFICATION QUALITY OF CLICK AND CAST

Dataset    Method   Correct (%)   Incorrect (%)   Unclassified (%)
Colon      CLICK    87.1          12.9            0.0
Colon      CAST     88.7          11.3            0.0
Leukemia   CLICK    94.4          2.8             2.8
Leukemia   CAST     87.5          12.5            0.0

COMPLEX CLUSTERING ALGORITHMS

Fuzzy clustering

The clustering algorithms discussed so far belong to hard, or crisp, clustering, based on the assumption that each gene can be assigned to only one cluster. However, this restriction to a one-to-one mapping may not be optimal in gene expression data analysis. In fact, many genes participate in different genetic networks and are governed by a variety of regulatory mechanisms (Futschik and Kasabov, 2002). Therefore, for the analysis of gene expression data, it is more desirable to use fuzzy clustering algorithms, which provide a one-to-many mapping in which a single gene can belong to multiple, distinct clusters with certain degrees of membership (Fig. 7). The memberships can further be used to discover more sophisticated relations between a data object and the clusters to which it belongs (Xu and Wunsch, 2005). In addition, fuzzy logic provides a systematic and unbiased way to transform precise numerical values into qualitative descriptors through a so-called fuzzification process (Woolf and Wang, 2002). This process is beneficial for the analysis of gene expression data, where no prior knowledge is available for the datasets. Furthermore, fuzzy clustering methods are robust to noise and biases. Recently, many fuzzy clustering approaches have been applied to clustering microarray data. Gasch and Eisen (2002) explored the conditional coregulation of yeast gene expression through fuzzy k-means; Wang et al. (2003) performed tumor classification and marker gene prediction using fuzzy c-means; Dembélé and Kastner (2003) applied fuzzy c-means to clustering microarray data by assigning membership values to genes. Belacel et al. (2004a) applied a new fuzzy clustering method, fuzzy j-means, to clustering microarray gene expression data. In the same paper, the authors also presented different methods for using cluster membership information to determine gene coregulation.

FIG. 7. Fuzzy C-means method. The centroids and clusters are determined similarly to the standard K-means method. However, in the final result, every input data point is assumed to belong to all of the clusters, with varying degrees of membership depending on its distance from the centroids.

Fuzzy C-means

Fuzzy C-means (FCM), first described by Bezdek in 1981, is still the most popular fuzzy clustering algorithm in gene expression data analysis. FCM considers each gene a member of all clusters, with different degrees of membership. The membership of a gene is closely related to the similarity between the gene and a given centroid. High similarity between a gene and the closest centroid indicates a strong association with that cluster, and the membership value is close to 1; otherwise, the membership value is close to 0. FCM starts with a predefined initial number of clusters c, a fixed fuzzification parameter m, and a small positive number ε. It iteratively updates the membership matrix and the centroid matrix until the change in the centroids is less than ε. FCM is a convenient method for selecting genes that are correlated with multiple clusters and for unraveling the complex regulatory pathways that control gene expression patterns. However, FCM is a local heuristic search algorithm that can easily become trapped in a local optimum, with no guarantee of finding the global optimum. It also still requires the user to define the initial parameters. Numerous FCM variants have been proposed to remedy these drawbacks. Belacel et al. (2004b) embedded FCM into a variable neighborhood search (VNS) meta-heuristic to address the problem of convergence to a local optimum.
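The FCM iteration just described (alternating membership and centroid updates until the centroids move less than ε) can be sketched as follows; the `init` parameter is our addition for reproducible examples, not part of the standard algorithm:

```python
import random
from math import dist

def fuzzy_c_means(points, c, m=2.0, eps=1e-6, max_iter=100, init=None):
    """Fuzzy C-means: alternate membership and centroid updates until
    the largest centroid shift falls below eps."""
    centroids = list(init) if init else random.sample(points, c)
    u = []
    for _ in range(max_iter):
        # Membership update: u[k][i] is the degree to which point k
        # belongs to cluster i; each row sums to 1 by construction.
        u = []
        for x in points:
            d = [max(dist(x, v), 1e-12) for v in centroids]
            u.append([1.0 / sum((d[i] / d[j]) ** (2 / (m - 1))
                                for j in range(c)) for i in range(c)])
        # Centroid update: weighted mean with weights u^m.
        new = []
        for i in range(c):
            w = [u[k][i] ** m for k in range(len(points))]
            tot = sum(w)
            new.append(tuple(
                sum(wk * x[dim] for wk, x in zip(w, points)) / tot
                for dim in range(len(points[0]))))
        shift = max(dist(a, b) for a, b in zip(centroids, new))
        centroids = new
        if shift < eps:
            break
    return centroids, u
```

Genes whose membership rows are spread over several clusters (e.g., 0.4/0.4/0.2 rather than 0.95/0.03/0.02) are exactly the candidates for multi-cluster association that the text highlights.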
VNS is a meta-heuristic developed for solving combinatorial and global optimization problems; its central idea is the systematic change of neighborhood within a local search (Hansen and Mladenovic, 1998). This method has been tested on four cDNA microarray datasets. The results demonstrated that VNS-FCM improves the performance of conventional FCM and provides superior accuracy in clustering cDNA microarray data. Dembélé and Kastner (2003) proposed a method aimed at alleviating the difficulty with predefined parameters. Instead of setting the fuzzification parameter m to the default value of 2, the new method computes an upper-bound value of the fuzzification parameter m that is independent of the number of clusters. The number of clusters in this method is calculated using the CLICK algorithm. The work of Dembélé and Kastner showed that FCM clustering is a convenient way to define subsets of genes that exhibit a tight association with given clusters.

Probabilistic clustering

Similar to fuzzy clustering, probabilistic clustering allows each data object to belong to multiple clusters with certain probabilities, thus facilitating the identification of overlapping groups of genes under conditional coregulation. Probabilistic clustering is based on statistical mixture models; that is, it assumes that the data are generated by a finite mixture of underlying probability distributions, such as Gaussian distributions, with each component corresponding to a distinct cluster (Yeung et al., 2001). Statistical models can be formulated and easily fitted to different datasets. Therefore, statistical model-based clustering offers a principled alternative to heuristic algorithms in terms of determining the number of clusters and suggesting an appropriate clustering method (Yeung et al., 2001).
Additionally, finite mixtures of distributions provide a sound mathematical basis for modeling a variety of random phenomena (McLachlan et al., 2002).

EM-based probabilistic clustering

The goal of statistical mixture models, such as EM clustering, is to identify, or at least estimate, the unknown parameters (the means and standard deviations) of the underlying probability distributions for each cluster in order to maximize the likelihood of the observed data. The EM algorithm was first proposed by Dempster et al. (1977) and is a widely used approach for learning with unobserved variables in machine learning. In probabilistic clustering, the EM algorithm attempts to approximate the observed distribution of values as a mixture of different distributions in different clusters. The results of EM clustering differ from those computed by k-means clustering. While the latter assigns observations to clusters so as to maximize the distances between clusters, the EM algorithm computes classification probabilities rather than actual assignments of observations to clusters. In other words, in this method each observation belongs to each cluster with a certain probability. Of course, from the final result it is usually possible to determine the actual assignment of observations to clusters, based on the largest classification probability. The EM algorithm can also accommodate categorical variables. The algorithm first randomly assigns different probabilities (or, more precisely, weights) to each category for each cluster. In successive iterations, these probabilities are refined (adjusted) to maximize the likelihood of the data given the specified number of clusters. Although probabilistic clustering has been widely used in gene expression data analysis, it still has limitations. Probabilistic clustering relies on the assumption that the dataset fits a specific distribution, which may not always be the case.
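The E-step/M-step alternation can be sketched for the simplest case, a two-component one-dimensional Gaussian mixture; this is an illustrative toy implementation (with a deterministic min/max initialization we chose for reproducibility), not production EM code:

```python
import math

def em_gmm_1d(data, n_iter=200):
    """EM for a two-component 1-D Gaussian mixture.
    E-step: compute the posterior probability (responsibility) that each
    point came from each component, i.e., the soft cluster memberships.
    M-step: re-estimate weights, means, and variances from those
    responsibilities, which never decreases the data likelihood."""
    w = [0.5, 0.5]
    mu = [min(data), max(data)]  # simple deterministic initialization
    var = [1.0, 1.0]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(n_iter):
        # E-step: responsibilities resp[k][j]
        resp = []
        for x in data:
            p = [w[j] * pdf(x, mu[j], var[j]) for j in range(2)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: weighted re-estimation of each component
        for j in range(2):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(data)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var[j] = max(sum(r[j] * (x - mu[j]) ** 2
                             for r, x in zip(resp, data)) / nj, 1e-6)
    return w, mu, var
```

The responsibilities computed in the E-step are precisely the classification probabilities described above; a hard clustering, when needed, assigns each point to the component with the largest responsibility.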
For instance, the Gaussian distribution model may not be effective for time-series data, since it treats the time points as unordered, static attributes and ignores the inherent dependency of gene expression on time (Jiang et al., 2003a). In fact, there is currently no well-established general model for gene expression data (Jiang et al., 2003a). Furthermore, EM converges only to a local maximum of the likelihood. McLachlan et al. (2002) applied the principles of probabilistic clustering in the software package EMMIX-GENE. EMMIX-GENE can be used to classify tissue samples based on genes and to cluster genes based on tissue samples. Instead of a Gaussian mixture model, it uses mixtures of t distributions in the gene selection stage. By adopting mixtures of factor analyzers, it models the distribution of high-dimensional gene expression data across tissues. Mar and McLachlan (2003) deployed EMMIX-GENE to cluster breast cancer samples on the basis of gene expression. In practice, the problem of classifying data samples based on expression values is nonstandard, since the number of genes is significantly larger than the number of tissue samples. The experimental results showed that EMMIX-GENE is a useful tool for reducing large numbers of genes to a more manageable size, making it very attractive for the classification of cancer tissue samples. Yeung et al. (2001) applied probabilistic clustering to three gene expression datasets. Their work showed that probabilistic clustering not only has superior clustering performance but also can be used to select the appropriate clustering model and determine the right number of clusters.
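The model selection described by Yeung et al. is commonly made concrete with the Bayesian information criterion discussed next. Below is a minimal sketch, using hard cluster assignments with one Gaussian per cluster as a simplified stand-in for the full mixture likelihood; the data values are invented for illustration:

```python
import math

def bic(log_likelihood, n_params, n_points):
    """Schwarz's Bayesian information criterion in the form used by
    Fraley and Raftery (1998): 2*loglik - k*log(n); larger favors
    the model."""
    return 2 * log_likelihood - n_params * math.log(n_points)

def gaussian_loglik(groups):
    """Log-likelihood of the data under one Gaussian per group
    (hard assignments; a simplification of the mixture likelihood)."""
    ll = 0.0
    for g in groups:
        m = sum(g) / len(g)
        v = max(sum((x - m) ** 2 for x in g) / len(g), 1e-6)
        ll += sum(-0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
                  for x in g)
    return ll

# Clearly bimodal toy data: a one-cluster model (2 parameters) should
# score a lower BIC than a two-cluster model (4 parameters).
data = [0.0, 0.1, -0.1, 0.05, 5.0, 5.1, 4.9, 5.05]
bic1 = bic(gaussian_loglik([data]), 2, len(data))
bic2 = bic(gaussian_loglik([data[:4], data[4:]]), 4, len(data))
```

The penalty term k*log(n) is what prevents the criterion from always preferring more clusters: each added component must improve the likelihood enough to pay for its extra parameters.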
It should be pointed out that several criteria, such as the Bayesian information criterion (BIC) (Fraley and Raftery, 1998), the approximate weight of evidence (AWE) criterion (Banfield and Raftery, 1993), and Bayes factors (Kass and Raftery, 1995), have been used with probabilistic clustering to find a suitable clustering model and determine the right number of clusters. Among these, BIC is the most popular for selecting the number of clusters and the model structure that best fit a given dataset (Fraley and Raftery, 1998). Based on these clustering algorithms, several clustering software packages have been developed for high-throughput gene expression analysis. Well-designed, user-friendly software provides scientists with an efficient way to extract, manage, analyze, and visualize DNA microarray data. In the remainder of this section, we give a brief review of some of the popular clustering software.

CLUSTERING SOFTWARE

Many commercial and open source software tools have been developed for clustering high-throughput biological data, either designed to perform a specific analysis or to provide the whole set of microarray analysis steps, including data preprocessing, dimensionality reduction, normalization, clustering, and visualization (Leung, 2004). Table 5 provides a comparison of microarray clustering software based on several online resources (Bolshakova, 2005; Génopole, 2006; Leung, 2004; Li, 2004). The summary below is not intended to enumerate all existing microarray software; instead, we aim to demonstrate how the clustering algorithms have been applied in analysis tools and which specific analytical problems they can solve. The comparison is focused on the unsupervised clustering domain.

CONCLUSIONS

Clustering is the process of grouping data objects into a set of disjoint clusters so that objects within a cluster have high similarity to each other, while objects in separate clusters are highly dissimilar.
In biological applications, clustering methods are utilized for the analysis of DNA and protein sequence information (for reviews of this topic, see Xu and Wunsch, 2005; Liew et al., 2005), as well as high-throughput OMICS data. In terms of gene expression analysis, clustering can be utilized for both gene and sample analysis. Genes that are coregulated are expected to have similar expression patterns, so cluster analysis can in principle find genes that share the same transcriptional regulation, providing a useful cue for understanding transcriptional regulatory networks. Further study of coexpressed genes can suggest the functions of uncharacterized genes. Many clustering algorithms have been developed to fulfill these analysis tasks. For instance, as mentioned in the previous section, Eisen et al. (1998) adopted the average-linkage agglomerative hierarchical clustering algorithm to discover coexpressed genes in Saccharomyces cerevisiae data. Tavazoie et al. (1999) used K-means clustering of whole-genome mRNA data and sequence motifs to identify transcriptional regulatory subnetworks in yeast. Other conventional clustering algorithms and their extensions have proven useful in pattern recognition (Hastie et al., 2000; Hruschka et al., 2006; Lu et al., 2004a; Su and Chang, 2001; Toronen et al., 1999). Newly developed clustering methods, such as CLICK, CAST, and fuzzy j-means, have also shown promising experimental results (Belacel et al., 2004a; Ben-Dor et al., 1999; Shamir, 2001; Sharan et al., 2002, 2003). Cluster analysis has also been applied to temporal gene expression data to identify different cell cycle phases. It is also part of many important applications in pharmaceutical and clinical research (Liew et al., 2005). Comparison of gene expression between normal and diseased cells, or clustering of tissue samples, can provide information about disease genes and subtypes of diseases (Xu and Wunsch, 2005). For instance, Shai et al.
(2003) performed K-means clustering analysis to identify molecular subtypes of gliomas. Wang et al. (2003) performed tumor classification and marker gene prediction by using fuzzy c-means. McLachlan et al. (2002) applied probabilistic clustering to classify tissue samples on the basis of genes and to cluster genes based on tissue samples. Fuzzy clustering and probabilistic clustering algorithms have been applied for the determination of multifunctional genes (Belacel et al. 2004a). Other methods, such as subspace clus522 Bayes group at Ames Research Center Optimal Design Autoclass Children’s Hospital’s informatics programs, Harvard Medical School Cleaver Stanford Biomedical Information Cluster and Eisen laboratory TreeView Lawrence Berkeley National Lab Alternative: Cluster3.0 (de Hoon, University of Tokyo) CLUSFAVOR Molecular Biology Computation Resource, Baylor College of Medicine CTWC Department of Physics Complex Systems, Weizmann Institute of Science Engene Computer Architecture Department, University of Malaga, Spain CAGED ArrayMiner University of Hong Kong Organization AMIADA Software OF <http://ctwc.we izmann.ac.il/> <http://www.engene. Web-based exploratory data analysis tool cnb.uam.es/> for visualizing, pre-processing, and clustering large sets of gene expression data Free Free Perform cluster and factor analysis to reveal unique expression profiles for genes and ESTs for which pathway and function information is unknown To identify subsets of genes and samples <http://condor.bcm. tmc.edu/genepi/ clusfavor.html> Free for nonprofit users Free Web-based visualization, classification, clustering tool Reveal coexpressed genes and co-regulated genes and visualize possible functional groups <http://classify. 
stanfford.edu/> <http://rana.lbl.gov/ EisenSoftware.htm> Analyze temporal gene expression data, automatically identify number of clusters Derive the maximum posterior probability classification, optimum number of clusters in gene expression data Reveal the true structure of gene expression data, find the best possible clusters, detect outliers Identify coexpressed genes Analytical features MICROARRAY CLUSTERING SOFTWARE <http://dambe. bio.uottawa.c a/amiada.asp> <http://ic.arc.na sa.gov/ic/projects/ bayes-group/autoclass <http://www.o ptimaldesign. com/ArrayMiner/ ArrayMiner.htm> <http://genome thods.org/caged/ about.htm> URL COMPARISON Free Free for nonprofit users Commercial (light version free) Free Free License TABLE 5. (continued) K-means, HAC, fuzzy and kernel C-means, PCA, Sammon’s map, SOM Coupled two way clustering Hierarachical clustering, PCA Linear discrimination classification, K-means, PCA Hierarchical clustering, SOM, K-means, PCA Bayesian model-based clustering (Bayesian clustering by dynamics) Gaussian mixture model-based clustering, genetic optimization algorithm PCA, single-linkage, completelinkage, and average-linkage algorithms Model-based clustering: unsupervised Bayesian classification system Included clustering algorithms Computational Genomics Lab, Bioinformmatics Graduate Program, Boston University Broad Institute Applied Math BioDiscovery GEMS GeneCluster GeneMaths GeneSight Commercial Commercial Free for academic users Free (open source) Commercial BioSieve, USA Expression Sieve License European Bioinformatics Free (open source) Institute (EBI) Organization OF <www.biodis covery.com/ index/genesight <http://www.ap plied-maths.com/ genemaths/gene maths.htm> <http://www.broad. mit.edu/cancer/ software/gene cluster2/gc2.html> <http://genomics10. bu.edu/terrence/ gems/gems.html <http://www.bio sieve.com/product. 
Web-based data analysis, visualization tool; export data from ArrayExpress; data analysis including clustering analysis, clustering comparison, between group analysis Link biological significance to expression patterns, data analysis for disease research, drug discovery, and systems biology Identify genes that are functionally related, participating in the same pathways, affected by the same drug or pathological condition, or coregulated genes controlled by a small group of transcription factors Standard SOM analysis tool. Gene Cluster 2.0 extends GeneCluster 1 by adding supervised classification, gene selection and permutation test, and marker gene finder Mathematics-based most versatile software, integrating with error handling, supervised learning: SVM, K-nearest neighbor, active history, analysis of template recording, hypothesis testing, cluster significance indication based on bootstrap techniques, powerful dendrogram layout Perform normalization, visualization, and statistical analysis; identify genes with true differential expression patterns between experimental conditions, disease states Analytical features MICROARRAY CLUSTERING SOFTWARE (CONT’D) <http://ep.ebi.ac. uk/EP/> URL COMPARISON Expression Profiler Software TABLE 5. Hierarchical clustering, K-means, SOM, PCA, time course analysis Hierarchical clustering, K-means, PCA, SOM, a variety of similarity distance, pair-group clustering methods, Ward’s method, pattern matching, time course analysis SOM Biclustering (based on Gibbs sampling paradigm) Hierarchical clustering, K-means, PCA, SOM, eight similarity search Hierarchical clustering, K-means, K-medoids, PCA, similarity search with a variety of distance measures Included clustering algorithms Agilent Technologies Institute for Genomics and Bioinformatics, Graz University of Technology ContentSoftAG Stanford University GeneSpring Genesis GeneViz GeneXpress Free Free for nonprofit users Free for nonprofit users Commercial <http://genexpress. 
stanford.edu/ Visualization and statistical analysis of outputs of clustering and motif finding algorithms, global and detailed views of expression profiles, promoter regions, and motifs, integrate with Gene Ontology to associate each cluster with cellular processes, identify motifs in the promoter regions of the genes in each cluster Advanced statistical tests for identifying differentially expressed genes; supervised K-nearest neighbors, SVM tools for finding clinically predictive patterns of gene expression data; unsupervised clustering methods for pattern recognition; reveal correlations between experimental parameters and gene expression profiles for hypothesis testing <http://genome. Simultaneously visualize and analyze a tugraz.at/Software/ whole set of gene expression GenesisCenter.html> experiments, integrating one-way ANOVA for detection of differentially expressed genes, SVM for classification of unknown genes and identifying functions of unknown genes, Gene Ontology for monitoring gene and protein roles in cellular process, mapping expression data onto chromosomal sequences <http://businessbox4. Advanced analysis and visualization of server-home.net/ microarray data, detecting sample user/index.php clusters and their correlated genes <www.chem. agilent.com/ scripts/pds.asp? 
lpage=27881 (continued) Double conjugated clustering, two-way clustering samples and genes simultaneously; Singular value decomposition sorting (SVD) (alter: orders genes according to entries of a left singular vector and samples according to entries of a right singular vector, PCA Hierarchical clustering, K-means or other clustering, motif finding algorithms Hierarchical clustering, K-means, SOM, PCA and more than 10 similarity distance measurements SOM, hierarchical clustering, K-means clustering, QT clustering, PCA Free for nonprofit users NCI Laboratory of Experimental and Computational Biology, open source: <sourceforge.net> Mozilla public (MPL) license 1.1 Bioinformatics Research Group, University of Bergen J-Express Free MAExplore Katholieke Universiteit Leuven INCLUSive Free Free for nonprofit users Bioinformatics Department, CIPF GEPAS License OF <www.ccrnp.ncifcrf. A comprehensive online tool with gov/MAExplorer normalization, data filtering, data Open source: <http:// filtering, viewing data with scatter maexplorer.source plots, histograms, expression profile forge.net/> plots, array pseudoimages, cluster analysis, comparison of expression patterns and outliers, access directly to genomic databases: GenBank, NCI’s mAdb <www.ii.uib.no/~ bjarted/jexpress/> J-Express Pro: <www.molmine.com/ frameset/frm_ jexpress.htm> <www.cs.tcd.ie/ To group samples or genes based on Nadia.Bolshakova/ similar expression patterns, evaluate Machaon.html the quality of the clusters obtained, support third-party clustering tools <http://homes.esat. 
kuleuven.be/~dna/ BioI/Software.html Included clustering algorithms Clustering algorithms: hierarchical clustering, K-means; Validation algorithms: C-index, Davis-Bouldin, Goodman-Kruskal, and silhouette indices; measure gene-to-gene, sample-to-sample, intercluster, and intracluster distances Hierarchical clustering, K-means, K-medoids Comprehensive tool for normalization, SOTA, hierarchical clustering, K-means, preprocessing, data analysis, and SOM, SOM tree, Caat: visualizing visualization. SVM classification for hierarchical trees class prediction, multiple testing for differentially expressed genes, integrating gene ontology for functional annotation Web portal service for the analysis of gene Adaptive quality based clustering expression data and discovery of cisregulatory sequence elements, a suite of tools including ANOVA normalization, filtering and clustering, functional scoring of gene clusters, sequence retrieval, and detection of known and unknown regulatory motifs in the upstream sequences Integrating with multidimensional scaling Hierarchical clustering with several method to visualize the data in two or similarity distance measurements, three dimensions K-means, PCA, SOM Analytical features MICROARRAY CLUSTERING SOFTWARE (CONT’D) <http://gepas.bio info.cipf.es/ URL COMPARISON Machaon CVE Trinity College Dublin Organization Software TABLE 5. 
Open source clustering software (Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo; <http://bonsai.ims.u-tokyo.ac.jp/%7Emdehoon/software/cluster/index.html>; Cluster 3.0 under the original Cluster/TreeView license, C clustering library and Python extension module under the Python license, Perl extension module under the Artistic License). Implements K-means, hierarchical clustering, and SOM as a C clustering library of routines; Cluster 3.0 is an improved version of Eisen's Cluster, with Python and Perl interfaces to the C clustering library. K-means, K-medoids, hierarchical (pairwise single-, average-, maximum-, and centroid-linkage) clustering, and SOM, with a variety of similarity measurements.

Rosetta Resolver (Rosetta Inpharmatics, LLC; <www.rosettabio.com/products/resolver/default.htm>; commercial license). Cluster analysis of gene expression data to determine similar genes under particular conditions and unknown genes; error models and statistical analysis for testing analytical results; scalable gene expression analysis for pharmaceutical and biological applications. Hierarchical divisive and agglomerative clustering, SOM, K-means, K-medians.

s-Net-SOM (Department of Medical Physics, School of Medicine, University of Patras, Greece; <http://heart.med.upatras.gr/bio/>; free for nonprofit users). Overcomes the drawback of most clustering methods that require prior knowledge of the number of clusters: adaptively determines the number of clusters through a dynamic extension process, with an inhomogeneous measure that balances unsupervised, supervised, and model-complexity criteria. Supervised Network SOM.

TIGR MeV (TIGR: The Institute for Genomic Research, Rockville, MD; <www.tm4.org/mev.html>; free, open source). Comprehensive tool for clustering, visualization, classification, statistical analysis, and implementation of many algorithms including bootstrapping; ANOVA and t-test for differentially expressed genes, SVM for classification, SAM for correlation of gene expression data to a variety of clinical parameters, and viewing of pathways and genome/chromosomal maps of gene expression data. Hierarchical clustering, K-means, SOM, SOTA, CAST, PCA, QT clustering, gene shaving.

XCluster (Stanford University; <http://genetics.stanford.edu/~sherlock/cluster.html>; free for noncommercial use). Similar to Cluster/TreeView; integrated into some databases as a data analysis tool. SOM, K-means, average-linkage hierarchical clustering.

Subspace clustering methods have been developed specifically for genomics applications (Agrawal et al., 1998; Cheng and Church, 2000; Getz et al., 2000; Parsons et al., 2004). Unlike other clustering algorithms, subspace clustering uncovers subspaces defined over genes and samples symmetrically.

Microarray technologies have made it possible to monitor the expression levels of tens of thousands of genes in parallel. Discovering the patterns hidden in gene expression data offers tremendous potential for advanced investigation in molecular and systems biology. The high volume of gene expression data and the complexity of biological networks increase the difficulty of interpreting these hidden patterns, and clustering gene expression data is the first step in addressing this challenge. Some of the conventional clustering methods reviewed in this paper have proven useful for clustering gene expression data. At the same time, some recently developed clustering methods, such as CLICK, show promising experimental results. However, there is still no absolute winner among clustering methods. Future research therefore needs to address the remaining problems and to design algorithms that more accurately serve the research directions and needs of the biomedical research community.

REFERENCES

AGRAWAL, R., GEHRKE, J., GUNOPULOS, D., and RAGHAVAN, P. (1998).
Automatic subspace clustering of high dimensional data for data mining applications. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pp. 94–105.
ALIZADEH, A., EISEN, M.B., DAVIS, R.E., MA, C., LOSSOS, I.S., ROSENWALD, A., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.
ANKERST, M., BREUNIG, M.M., KRIEGEL, H., and SANDER, J. (1999). OPTICS: ordering points to identify the clustering structure. Proceedings of the ACM SIGMOD '99 International Conference on Management of Data (Philadelphia, PA).
AU, N.H., CHEANG, M., HUNTSMAN, D.G., YORIDA, A., COLDMAN, A., and ELLIOTT, W.M. (2004). Evaluation of immunohistochemical markers in non-small cell lung cancer by unsupervised hierarchical clustering analysis: a tissue microarray study of 284 cases and 18 markers. J Pathol 204, 101–109.
BANFIELD, J., and RAFTERY, A. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821.
BELACEL, N., CUPERLOVIC-CULF, M., LAFLAMME, M., and OUELLETTE, R. (2004a). Fuzzy J-means and VNS methods for clustering genes from microarray data. Bioinformatics 20, 1690–1701.
BELACEL, N., CUPERLOVIC-CULF, M., OUELLETTE, R., and BOULASSEL, M. (2004b). The variable neighborhood search metaheuristic for fuzzy clustering cDNA microarray gene expression data. Proceedings of the IASTED AIA-04 Conference (Innsbruck, Austria).
BELLAACHIA, A., PORTNOY, D., CHEN, Y., and ELKAHLOUN, A.G. (2002). E-CAST: a data mining algorithm for gene expression data. Proceedings of the 2nd ACM SIGKDD Workshop on Data Mining in Bioinformatics, pp. 49–54.
BEN-DOR, A., SHAMIR, R., and YAKHINI, Z. (1999). Clustering gene expression patterns. J Comput Biol 6, 281–297.
BERKHIN, P. (2002). Survey of clustering data mining techniques. Available at: http://citeseer.ist.psu.edu/berkhin02survey.html. Accessed April 27, 2006.
BEZDEK, J.C. (1981).
Pattern Recognition with Fuzzy Objective Function Algorithms. (Plenum Press, New York, NY).
BOLSHAKOVA, N. (2005). Microarray software catalogue. Available at: www.cs.tcd.ie/Nadia.Bolshakova/softwarecatalogue.html. Accessed May 24, 2006.
CHENG, Y., and CHURCH, G.M. (2000). Biclustering of expression data. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, pp. 93–103.
CHU, S., DERISI, J., EISEN, M., MULHOLLAND, J., BOTSTEIN, D., BROWN, P.O., et al. (1998). The transcriptional program of sporulation in budding yeast. Science 282, 699–705.
DEMBÉLÉ, D., and KASTNER, P. (2003). Fuzzy C-means for clustering microarray data. Bioinformatics 19, 973–980.
DEMPSTER, A.P., LAIRD, N.M., and RUBIN, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J Royal Stat Soc B 39, 1–38.
EISEN, M.B., SPELLMAN, P.T., BROWN, P.O., and BOTSTEIN, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95, 14863–14868.
ESTER, M., KRIEGEL, H.P., SANDER, J., and XU, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96). (AAAI Press, Menlo Park, CA), pp. 226–231.
FRALEY, C., and RAFTERY, A. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41, 578–588.
FUTSCHIK, M.E., and KASABOV, N.K. (2002). Fuzzy clustering of gene expression data. WCCI 1, 414–419.
GASCH, A.P., and EISEN, M.B. (2002). Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biol 3, 1–22.
GÉNOPOLE (2006). Microarray data analysis: software for unsupervised clustering. Available at: http://genopole.toulouse.inra.fr/bioinfo/microarray/index.php?pagelogiciels&sousdomaineUnsupervisedClustering&langen. Accessed May 24, 2006.
GETZ, G., LEVINE, E., and DOMANY, E. (2000).
Coupled two-way clustering analysis of gene expression data. Proc Natl Acad Sci USA 97, 12079–12084.
GUHA, S., RASTOGI, R., and SHIM, K. (1998). CURE: an efficient clustering algorithm for large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data.
HAN, J., and KAMBER, M. (2001). Data Mining: Concepts and Techniques. (Morgan Kaufmann, San Francisco, CA), Ch. 8.
HANSEN, P., and MLADENOVIC, N. (1998). Variable neighborhood search. In Meta-Heuristics—Advances and Trends in Local Search Paradigms for Optimization. S. Voss, et al., eds. (Kluwer Academic, Amsterdam), pp. 433–458.
HARTUV, E., and SHAMIR, R. (2000). A clustering algorithm based on graph connectivity. Inf Process Lett 76, 175–181.
HASTIE, T., TIBSHIRANI, R., EISEN, M.B., ALIZADEH, A., LEVY, R., and STAUDT, L. (2000). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol 1, research0003.
HERRERO, J., and DOPAZO, J. (2002). Combining hierarchical clustering and self-organizing maps for exploratory analysis of gene expression patterns. J Proteome Res 1, 467–470.
HINNEBURG, A., and KEIM, D.A. (1998). An efficient approach to clustering in large multimedia databases. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, pp. 58–65.
HRUSCHKA, E.R., CAMPELLO, R., and DE CASTRO, L.N. (2006). Evolving clusters in gene-expression data. Inf Sci 176, 1898–1927.
HSU, A.L., TANG, S., and HALGAMUGE, S.K. (2003). An unsupervised hierarchical dynamic self-organizing approach to cancer class discovery and marker gene identification in microarray data. Bioinformatics 19, 2131–2140.
JAIN, A.K., and DUBES, R.C. (1988). Algorithms for Clustering Data. (Prentice-Hall, Englewood Cliffs, NJ).
JAIN, A.K., MURTY, M.N., and FLYNN, P.J. (1999). Data clustering: a review. ACM Comput Surveys 31, 264–323.
JIANG, D., PEI, J., and ZHANG, A. (2003a). Towards interactive exploration of gene expression patterns.
ACM SIGKDD Explor Newslett 5, 79–90.
JIANG, D., PEI, J., and ZHANG, A. (2003b). DHC: a density-based hierarchical clustering method for time series gene expression data. Proceedings of the Third IEEE Symposium on Bioinformatics and Bioengineering, March 2003, pp. 393–400.
JIANG, D., TANG, C., and ZHANG, A. (2004). Cluster analysis for gene expression data: a survey. IEEE Trans Knowledge Data Eng 16, 1370–1386.
KARYPIS, G., HAN, E.H., and KUMAR, V. (1999). Chameleon: hierarchical clustering using dynamic modeling. Computer 32, 68–75.
KASS, R., and RAFTERY, A. (1995). Bayes factors. J Am Stat Assoc 90, 773–795.
KAUFMAN, L., and ROUSSEEUW, P. (1990). Finding Groups in Data. (Wiley, New York, NY).
KENNEDY, J., and EBERHART, R.C. (1999). Particle swarm optimization. Proceedings of the IEEE International Conference on Neural Networks (Piscataway, NJ), IV, pp. 1942–1948.
KOHONEN, T. (1984). Self-Organization and Associative Memory. (Springer-Verlag, Berlin).
KOLEHMAINEN, M.T. (2004). Data exploration with self-organizing maps in environmental informatics and bioinformatics. Kuopio Univ Publ C Nat Environ Sci 167, 1–73.
KRISHNA, K., and NARASIMHA MURTY, M. (1999). Genetic K-means algorithm. IEEE Trans Syst Man Cybern B 29, 433–439.
KRÖGER, P. (2004). Coping with new challenges for density-based clustering [Ph.D. thesis]. Ludwig-Maximilians-Universität, München, Germany. Available at: http://edoc.ub.uni-muenchen.de/archive/00002396/01/Kroeger_Peer.pdf. Accessed May 16, 2006.
LEUNG, Y.F. (2004). My microarray software comparison. Available at: http://ihome.cuhk.edu.hk/~b400559/arraysoft.html. Accessed May 23, 2006.
LI, W. (2004). Bibliography on microarray data analysis. Available at: http://www.cbi.pku.edu.cn/mirror/microarray/soft.html. Accessed May 24, 2006.
LIEW, A.W., YAN, H., and YANG, M. (2005). Pattern recognition techniques for the emerging field of bioinformatics: a review. Pattern Recognition 38, 2055–2073.
LU, Y., LU, S., DENG, Y., and BROWN, S.J. (2004a). Incremental genetic K-means algorithm and its application in gene expression data analysis. BMC Bioinformatics 5, 172–182.
LU, Y., LU, S., FOTOUHI, F., DENG, Y., and BROWN, S.J. (2004b). FGKA: a fast genetic K-means clustering algorithm. Proceedings of the 2004 ACM Symposium on Applied Computing (SAC), Nicosia, Cyprus, March 2004.
MAKRETSOV, N.A., HUNTSMAN, D.G., NIELSEN, T.O., YORIDA, E., PEACOCK, M., CHEANG, M.C.U., et al. (2004). Hierarchical clustering analysis of tissue microarray immunostaining data identifies prognostically significant groups of breast carcinoma. Clin Cancer Res 10, 6143–6151.
MAR, J.C., and MCLACHLAN, G.J. (2003). Model-based clustering in gene expression microarrays: an application to breast cancer data. Int J Software Eng Knowledge Eng 13, 579–592.
MCLACHLAN, G.J., BEAN, R.W., and PEEL, D. (2002). A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18, 413–422.
MERCER, D.P. (2003). Clustering large datasets. Linacre College. Available at: http://www.stats.ox.ac.uk/~mercer/documents/Transfer.pdf. Accessed May 11, 2006.
MORGAN, B.J., and RAY, A.P.G. (1995). Non-uniqueness and inversion in cluster analysis. Appl Stat 44, 699–705.
MOUGEOT, J.L., BAHRANI-MOSTAFAVI, Z., VACHRIS, J.C., MCKINNEY, K.Q., GURLOV, S., ZHANG, J., et al. (2006). Gene expression profiling of ovarian tissues for determination of molecular pathways reflective of tumorigenesis. J Mol Biol 358, 310–329.
NIELSEN, T.O., WEST, R.B., LINN, S.C., ALTER, O., KNOWLING, M.A., O'CONNELL, J.X., et al. (2002). Molecular characterization of soft tissue tumours: a gene expression study. Lancet 359, 1301–1307.
PARSONS, L., HAQUE, E., and LIU, H. (2004). Subspace clustering for high dimensional data: a review. SIGKDD Explorations 6, 90–105.
RAMASWAMY, S., ROSS, K.N., LANDER, E.S., and GOLUB, T.R. (2003). A molecular signature of metastasis in primary solid tumors. Nat Genet 33, 49–54.
RONNING, C.M., STEGALKINA, S.S., ASCENZI, R.A., BOUGRI, O., HART, A.L., UTTERBACH, T.R., et al. (2003). Comparative analysis of potato expressed sequence tag libraries. Plant Physiol 131, 419–429.
SHAI, R., SHI, T., KREMEN, T.J., HORVATH, S., LIAU, L.M., CLOUGHESY, T.F., et al. (2003). Gene expression profiling identifies molecular subtypes of gliomas. Oncogene 22, 4918–4923.
SHAMIR, R. (2001). Algorithm performance comparison. Available at: www.cs.tau.ac.il/~rshamir/algmb/00/scribe00/html/lec11/node41.html. Accessed May 18, 2006.
SHAMIR, R. (2003). A gene expression analysis and visualization software—EXPANDER: online documentation. Available at: www.cs.tau.ac.il/~rshamir/expander/ver3Help.html. Accessed May 18, 2006.
SHARAN, R., ELKON, R., and SHAMIR, R. (2002). Cluster analysis and its applications to gene expression data. Ernst Schering Res Found Workshop 38, 83–108.
SHARAN, R., MARON-KATZ, A., and SHAMIR, R. (2003). CLICK and EXPANDER: a system for clustering and visualizing gene expression data. Bioinformatics 19, 1787–1799.
SHERLOCK, G. (2000). Analysis of large-scale gene expression data. Curr Opin Immunol 12, 201–205.
STANFORD, D.C., CLARKSON, D.B., and HOERING, A. (2003). Clustering or automatic class discovery: hierarchical methods. In A Practical Approach to Microarray Data Analysis. D.P. Berrar, W. Dubitzky, and M. Granzow, eds. (Kluwer Academic, Amsterdam), pp. 246–260.
SU, M., and CHANG, H. (2001). A new model of self-organizing neural networks and its application in data projection. IEEE Trans Neural Networks 12, 153–158.
TAMAYO, P., SLONIM, D., MESIROV, J., ZHU, Q., KITAREEWAN, S., DMITROVSKY, E., et al. (1999). Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 96, 2907–2912.
TAN, P.N., STEINBACH, M., and KUMAR, V. (2005). Cluster analysis: basic concepts and algorithms. In Introduction to Data Mining. (Addison-Wesley, Boston), Ch. 8.
TAVAZOIE, S., HUGHES, J.D., CAMPBELL, M.J., CHO, R.J., and CHURCH, G.M. (1999). Systematic determination of genetic network architecture. Nat Genet 22, 281–285.
TORONEN, P., KOLEHMAINEN, M., WONG, G., and CASTREN, E. (1999). Analysis of gene expression data using self-organizing maps. FEBS Lett 451, 142–146.
TSENG, G. (2004). A comparative review of gene clustering in expression profile. Proceedings of the Eighth International Conference on Control, Automation, Robotics and Vision (ICARCV), pp. 1320–1324.
UCL ONCOLOGY (2005). Hierarchical clustering. Available at: www.ucl.ac.uk/oncology/MicroCore/HTML_resource/tut_frameset.htm. Accessed May 9, 2006.
VAN DER LAAN, M., POLLARD, K.S., and BRYAN, J. (2003). A new partitioning around medoids algorithm. J Stat Comput Simul 73, 575–584.
WANG, D., RESSOM, H., MUSAVI, M., and DOMNISORU, C. (2002a). Double self-organizing maps to cluster gene expression data. ESANN 2002 Proceedings—European Symposium on Artificial Neural Networks (Bruges, Belgium), pp. 45–50.
WANG, J., BO, T.H., JONASSEN, I., and HOVIG, E. (2003). Tumor classification and marker gene prediction by feature selection and fuzzy c-means clustering using microarray data. BMC Bioinformatics 4, 60.
WANG, J., DELABIE, J., AASHEIM, H.C., SMELAND, E., and MYKLEBOST, O. (2002c). Clustering of the SOM easily reveals distinct gene expression patterns: results of a reanalysis of lymphoma study. BMC Bioinformatics 3, 36.
WARD, J.H. (1963). Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58, 235–244.
WELCSH, P.L., LEE, M.K., GONZALEZ-HERNANDEZ, R.M., BLACK, D.J., MAHADEVAPPA, M., SWISHER, E.M., et al. (2002). BRCA1 transcriptionally regulates genes involved in breast tumorigenesis. Proc Natl Acad Sci USA 99, 7560–7565.
XIAO, X., DOW, E.R., EBERHART, R., BEN MILED, Z., and OPPELT, R.J. (2003). Gene clustering using self-organizing maps and particle swarm optimization.
Proceedings of the Seventeenth International Symposium on Parallel and Distributed Processing, p. 154b.
XU, R., and WUNSCH, D. (2005). Survey of clustering algorithms. IEEE Trans Neural Networks 16, 645–678.
YEUNG, K.Y. (2003). Clustering or automatic class discovery: non-hierarchical, non-SOM. Clustering algorithms and assessment of clustering results. In A Practical Approach to Microarray Data Analysis. D.P. Berrar, W. Dubitzky, and M. Granzow, eds. (Kluwer Academic, Amsterdam), pp. 274–288.
YEUNG, K.Y., FRALEY, C., MURUA, A., RAFTERY, A.E., and RUZZO, W.L. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics 17, 977–987.
ZHANG, T., RAMAKRISHNAN, R., and LIVNY, M. (1996). BIRCH: an efficient data clustering method for very large databases. Proceedings of the ACM SIGMOD Conference, pp. 103–114.

Address reprint requests to:
Dr. Nabil Belacel
National Research Council Canada
Institute for Information Technology
#55 Crowley Farm Road
Suite 212, Scientific Park
Moncton, NB E1A 7R1
Canada

E-mail: [email protected]