OMICS A Journal of Integrative Biology
Volume 10, Number 4, 2006
© Mary Ann Liebert, Inc.
Clustering Methods for Microarray Gene Expression Data
NABIL BELACEL,1 QIAN (CHRISTA) WANG,1 and MIROSLAVA CUPERLOVIC-CULF2

1National Research Council Canada, Institute for Information Technology, Scientific Park, Moncton, New Brunswick, Canada.
2Atlantic Cancer Research Institute, Hôtel-Dieu Pavilion, Moncton, New Brunswick, Canada.
ABSTRACT
Within the field of genomics, microarray technologies have become a powerful technique for
simultaneously monitoring the expression patterns of thousands of genes under different sets
of conditions. A main task now is to propose analytical methods to identify groups of genes
that manifest similar expression patterns and are activated by similar conditions. The corresponding analysis problem is to cluster multi-condition gene expression data. The purpose
of this paper is to present a general view of clustering techniques used in microarray gene
expression data analysis.
INTRODUCTION
WITH ADVANCES OF deoxyribonucleic acid (DNA) microarray technology, it became possible to monitor the expression levels of tens of thousands of genes simultaneously. To analyze the large amount
of data obtained by this technology, researchers usually resort to clustering methods that identify groups of
genes that share similar expression profiles. Clustering problems are based on the notion of unsupervised
learning in which data objects within the same cluster are similar to one another and dissimilar to the objects in other clusters (Han and Kamber, 2001). In the case of clustering gene expression data, a cluster
may contain a number of genes or samples with similar expression patterns.
In gene expression clustering, the analysis is performed on a data matrix X = {xij}n×d, where xij represents the expression level of gene i in sample j. More precisely, each row vector is the expression pattern of a particular gene across all d conditions, while each column vector is the profile of all n genes in a particular condition. The clustering of gene expression data can be divided into two main categories: gene-based clustering and sample-based clustering.
In gene-based clustering, genes are treated as objects and samples are treated as features or attributes for clustering. The dataset to be clustered contains n objects: Xi = {xi1, xi2, . . . , xid}, where 1 ≤ i ≤ n. The resulting clusters can be represented as C = {C1, . . . , Ck}, where the Cj's are disjoint clusters. The goal is to cluster and group genes with similar expression patterns (co-expressed genes). This is the cornerstone for further understanding of gene function, gene regulation, and cellular processes (Jiang et al., 2004). Similarly, sample-based clustering takes samples as objects and considers genes as features or attributes. The dataset to be clustered contains d objects: Xi = {x1i, . . . , xni}, where 1 ≤ i ≤ d. The resulting clusters can
be represented as C = {C1, . . . , Cm}, where the Cj's are disjoint clusters. Sample-based clustering can be used
to reveal sample types, which are possibly indistinguishable by traditional morphology-based approaches
(Jiang et al., 2004).
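As a minimal illustration of this distinction (an assumed Python sketch on synthetic data, not code from the cited studies), gene-based and sample-based clustering differ only in whether the matrix or its transpose is handed to the clustering routine:

```python
# Assumed sketch: gene-based vs. sample-based clustering of a matrix X
# with n = 100 genes (rows) and d = 8 samples (columns). Data are synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))

# Gene-based clustering: genes are the objects, samples are the features.
gene_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Sample-based clustering: samples are the objects, genes are the features.
sample_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X.T)

print(gene_labels.shape)    # (100,): one cluster label per gene
print(sample_labels.shape)  # (8,): one cluster label per sample
```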
Traditional clustering techniques can be classified into two main categories: hierarchical and nonhierarchical algorithms (Stanford et al., 2003; Yeung, 2003). Hierarchical clustering algorithms group objects and provide a natural way to represent the data graphically. The graphical representation resulting from hierarchical clustering is a dendrogram, in which each branch forms a group of genes or samples that share similar behavior (Eisen et al., 2002). These types of clustering algorithms have been used extensively for the analysis of DNA microarray data (Alizadeh et al., 2000; Nielsen et al., 2002; Ramaswamy et al., 2003; Welcsh et al., 2002).
Nonhierarchical clustering algorithms (i.e., partitional), on the other hand, perform a partition of genes
into K clusters so that expression patterns in the same cluster are more similar to each other (i.e., homogeneity) than those in different clusters (i.e., separation). Several partitional algorithms, such as K-means, partitioning around medoids, and self-organizing maps (SOM), have been applied extensively to DNA microarray data generated from different biological sources (Sherlock, 2000). For instance, K-means algorithms have been used to identify molecular subtypes of brain tumors (Shai et al., 2003), to cluster transcriptional regulatory cell cycle genes in yeast (Tavazoie et al., 1999), and to correlate changes in gene
expression with major physiologic events in potato biology (Ronning et al., 2003).
Although both hierarchical and nonhierarchical algorithms have yielded encouraging results in clustering DNA microarray data and have been used extensively, they still suffer from several limitations. When analyzing large-scale gene expression datasets collected under various conditions, hierarchical algorithms generate nonunique dendrograms and have high time and space complexities (Morgan and Ray, 1995), while nonhierarchical methods group gene expression data into a fixed number of predefined clusters. Moreover, when different clustering algorithms are used to resolve the same DNA microarray dataset, different conclusions can be drawn (Chu et al., 1998).
Recently, several new clustering algorithms (e.g., graph-theoretical clustering, model-based clustering) have been developed with the intention of combining and improving the features of traditional clustering algorithms. However, clustering algorithms are based on different assumptions, and the performance of each clustering algorithm depends on the properties of the input dataset. Therefore, no single clustering algorithm wins on all datasets, and the optimization of existing clustering algorithms remains a vibrant research area.
This paper aims to provide a survey of the various methods available for gene clustering and to illustrate the impact of clustering methodologies on the fascinating and challenging area of genomic research. A taxonomy of clustering algorithms utilized in the field of gene expression data analysis will be introduced and exemplified with emerging applications. The strengths and weaknesses of each clustering technique will be pointed out. Subsequently, the development of software tools for clustering will be emphasized, and some of the existing commercial and open source software utilizing the reviewed clustering algorithms will be discussed. Finally, conclusions and future research directions will be outlined.
CLUSTERING ALGORITHMS
Clustering algorithms can be divided into several groups, distinguishing simple clustering (each gene belongs to exactly one cluster, i.e., hard or crisp clustering) from complex clustering (each gene can belong to more than one cluster with a certain degree of membership, i.e., soft or relaxed clustering). The first group includes three conventional, widely used clustering algorithms: hierarchical clustering and two nonhierarchical clustering algorithms, K-means and SOM. The second group includes new clustering methods that are specifically designed for clustering gene expression data; these methods include DHC from density-based clustering as well as CLICK and CAST from graph-theoretical clustering. In the last group, complex clustering algorithms representing new advances over heuristic clustering are presented. Representative algorithms include fuzzy clustering and probabilistic clustering.
CONVENTIONAL CLUSTERING ALGORITHMS
Hierarchical clustering
The principle behind hierarchical clustering is to group data objects into a tree of clusters through either an agglomerative or a divisive process. Agglomerative clustering represents a bottom-up approach, where each
data object is initially placed into its own cluster. Subsequently, the closest pairs of clusters are merged until either all the data objects are in a single cluster or a certain termination condition is satisfied. Divisive
clustering follows a top-down strategy. It starts with all objects in one cluster and splits the cluster until either each object forms its own cluster or until a certain termination condition is met (Fig. 1).
Based on the linkage metric determined between two clusters, agglomerative hierarchical clustering can be further divided into single linkage, complete linkage, and average linkage. In the single linkage method, the
distance between two clusters is determined as the distance between their closest members. Each object in
any cluster produced using this method is more closely related to at least one object of its cluster than to
any point outside it. In the complete linkage method, the distance between two clusters is given by the distance between their most distant objects. This method produces clusters with objects that lie within some
known maximum distance of one another. In the third method, average linkage, the distance between two clusters is measured between their centroids (i.e., the clusters' average elements).
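For illustration, the three linkage criteria can be compared with standard tools; the sketch below (synthetic data, with SciPy assumed as the implementation rather than the software used in the cited studies) builds the merge tree and cuts it into four clusters:

```python
# Assumed sketch: agglomerative clustering with single, complete, and average
# linkage on synthetic "expression" data (50 genes, 6 conditions).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method, metric="euclidean")  # merge history (dendrogram)
    labels = fcluster(Z, t=4, criterion="maxclust")    # cut the tree into 4 clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes
```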
In divisive clustering, a cluster with N objects can be split into two subsets in 2^(N−1) − 1 possible ways. Determining all the possible divisions is thus too computationally expensive, particularly for gene clustering (Xu and Wunsch, 2005), and divisive clustering is therefore not commonly used in practice.
Eisen et al. (1998) developed a clustering software package (Cluster and TreeView, http://rana.lbl.gov/EisenSoftware.htm) based on the average linkage agglomerative clustering algorithm. Cluster iteratively merges groups with the highest similarity value. The output of the algorithm is a two-dimensional dendrogram (Fig. 2). The branches of a dendrogram record the formation of groups. The length of the horizontal branches indicates the similarity between the clusters.
Hierarchical clustering methods have been extensively used in analysis of gene expression data (Fig. 3),
as well as other types of microarray data (e.g., CGH arrays, protein array). In the field of cancer research,
for example, hierarchical clustering has been used to identify cancer types (Nielsen et al., 2002; Ramaswamy
FIG. 1. Two approaches in hierarchical clustering. a–g show different genes/samples. Agglomerative clustering starts
from individual genes (one gene in one cluster), divisive clustering starts from all genes in one cluster.
FIG. 2. Dendrogram—the most popular method to cluster microarray data. Starting from a root, the dendrogram splits
into multiple clusters according to how the genes (x, y, z, . . . ) are related. Genes are branched off as nodes. The
branches record the formation of the clusters. The length of the horizontal branches indicates the similarity between
the clusters. (Adapted from UCL Oncology, 2005.)
et al., 2003), to discover new subtypes of cancer (Alizadeh et al., 2000), and to investigate cancer tumorigenesis mechanisms (Welcsh et al., 2002) from gene expression data. Au et al. (2004) explored hierarchical clustering for the identification of different subgroups in non-small cell lung carcinoma. Their results
showed that hierarchical clustering analysis on an extended immunoprofile can identify two main cluster
groups corresponding to adenocarcinoma and squamous cell carcinoma. The hierarchical clustering analysis of Makretsov et al. (2004) on multiple-marker microarray immunostaining data resulted in improved prognostication for patients with invasive breast cancer. Hierarchical clustering was also used by Mougeot et al.
(2006) for gene expression profiling of ovarian tissues, showing that clustering can distinguish between low
malignant potential/early cancer and possible precancerous stages.
The graphic representation of the results of hierarchical clustering allows users to “visualize global patterns in expression data” (Tseng, 2004), making this method a favorite among biologists. However, several
key issues in hierarchical clustering still need to be addressed. The most serious problem with this method
is its lack of robustness to noise, high dimensionality, and outliers (Jiang et al., 2004). Hierarchical clustering algorithms are also expensive, both computationally and in terms of space complexity (Xu and
Wunsch, 2005), and thus their applicability for the analysis of large datasets is limited. Furthermore, both
agglomerative and divisive approaches follow a greedy strategy that prevents cluster refinement. In other
FIG. 3. Hierarchical clustering schema of combined yeast datasets. The color codes represent the measured fluorescence ratios. Genes with unchanged expression are colored black. Red represents relatively high and green represents
low gene expression level. The intensity of the colors reflects different degrees of expression. (From Eisen et al., 1998.
Proc Natl Acad Sci USA 95, 14863–14868.)
words, once a decision is made to merge (or split) clusters, it is never reconsidered or optimized. The iterative merging of clusters is determined locally at each step (i.e., local objective function) by the pair-wise
distances rather than a global criterion (Tan et al., 2005).
Several alternative approaches were proposed to address the problems of standard hierarchical clustering. These approaches deploy a partitional clustering algorithm (see below) such as K-means to generate
small clusters first and then perform hierarchical clustering using these small clusters as initial points (Tan
et al., 2005). Ward’s method (Ward, 1963) implemented agglomerative hierarchical clustering using the
same proximity function as K-means (Tan et al., 2005). In order to remedy the problems in handling large
datasets, several new hierarchical clustering approaches have been proposed. One of them, CURE (clustering using representatives), is an extension of hierarchical clustering. CURE was developed with the objective to be robust to outliers and to explore clusters with nonspherical shapes and variant sizes (Guha et
al., 1998). In CURE, each cluster is represented by a fixed number of well-scattered points. Having more than one representative point per cluster allows CURE to capture more sophisticated cluster shapes. Furthermore, CURE is less sensitive to outliers, since the effects of outliers are weakened during the process of shrinking the scattered points toward the centroid.1 CURE utilizes random sampling and partitioning in order to scale to large datasets without sacrificing clustering quality. The space complexity of this approach increases linearly with the dataset size, while its time complexity is no worse than that of a conventional hierarchical algorithm. Another noteworthy hierarchical clustering approach is BIRCH (Balanced Iterative
Reducing and Clustering using Hierarchies) (Zhang et al., 1996). BIRCH creates a special data structure,
called a clustering feature (CF) tree. The BIRCH clustering algorithm scans the database, and cluster summaries
are stored in memory in the form of a CF tree. During this preclustering phase, crowded data points are
grouped into subclusters while scattered data points are removed as outliers. The algorithm then applies
centroid-based hierarchical clustering to perform global clustering, incrementally eliminate more outliers,
and refine clusters. BIRCH can work with any given amount of memory, and it is also capable of handling
outliers effectively. BIRCH represents the state of the art in clustering large-scale datasets.
Partitional clustering by K-means
K-means is a very straightforward, commonly used partitioning method. The first step in the clustering process is to randomly select K objects, each representing an initial cluster mean or centroid. After that, the
objects are assigned to the clusters by finding the objects’ nearest centroids. The algorithm then computes
the new mean for each cluster and reassigns the objects. The iteration stops when the boundaries of the
clusters stop changing. Figure 4 illustrates the standard K-means procedure for clustering gene expression
data.
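A minimal NumPy sketch of this procedure (random initial centroids, nearest-centroid assignment, mean update, termination when memberships stop changing) might look as follows; it is illustrative only, not the implementation used in the studies cited below:

```python
# Assumed sketch of the standard K-means procedure on an expression matrix X.
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """X is (n_genes, d_conditions); returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct genes as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every gene to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # memberships stopped changing: terminate
        labels = new_labels
        # Step 3: recompute each cluster mean from its current members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```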
Tavazoie et al. (1999) used K-means clustering of whole-genome mRNA data to identify transcriptional
regulatory subnetworks in yeast. By iteratively relocating cluster members and minimizing the overall intracluster dispersion, 3,000 genes were grouped into 30 clusters. From the gene coregulation information it was then possible to infer the biological significance of newly discovered genes from the functions of known genes and motifs. Shai et al. (2003) performed K-means clustering analysis to identify molecular subtypes of gliomas. The three resulting clusters corresponded to glioblastomas, lower grade astrocytomas, and oligodendrogliomas.
K-means is relatively scalable and efficient when processing large datasets. In addition, K-means can
converge to a local optimum in a small number of iterations. However, K-means still has several drawbacks. First, the user has to specify the initial number of clusters, and the converged centroids vary with the initial partitions (Xu and Wunsch, 2005). One of the characteristics of gene expression clustering is that prior
knowledge is not available. Thus, in order to detect the optimal number of clusters, users have to run the
algorithm repeatedly with different k values, compare the clustering results, and make a decision about the
optimal number of clusters accordingly. For a large gene expression dataset, this extensive fine-tuning
process is not practical. A second problem in K-means clustering is its sensitivity to noise and outliers.
Gene expression data is noisy and has a significant number of outliers, and this can substantially influence
1Outliers lie farther from the cluster centroid than other scattered points.
FIG. 4. K-means for clustering gene expression data. Genes are represented as points in space, where similarly expressed genes are close together. (1) The process is initiated by randomly partitioning the genes into three groups. (2)
The centroids of three groups are given different colors. Genes are assigned to the closest centroid. (3) The results of
gene assignment are shown. (4 and 5) Steps 2 and 3 are iteratively repeated. (6) Centroids are stable; the termination
condition is met. (Modified from Gasch and Eisen, 2002. Genome Biol 3, 1–22.)
the mean values and thus cluster positions. Finally, K-means often terminates at a local, possibly suboptimal, minimum.
To improve the robustness to noise and outliers, the K-medoid algorithm was introduced (Mercer and College, 2003). A medoid is a representative point for a cluster, arbitrarily selected by the algorithm (k-medoids
representing k clusters). Using medoids has two advantages: first, there is no limitation on attribute types;
and second, medoids are existing data points and thus, unlike centroids, they are generated without any
computation (Berkhin, 2002). Therefore, the K-medoid algorithm is less sensitive to outliers. The most popular K-medoid algorithm is partitioning around medoids (PAM) (Kaufman and Rousseeuw, 1990) and its
extension PAMSIL (van der Laan et al., 2003). PAMSIL replaces the objective function used in PAM with the average silhouette (first proposed by Kaufman and Rousseeuw, 1990). The data points are assigned to k clusters so as to maximize the average silhouette. The partition in this case depends not only on how well
the data point belongs to its current cluster, but also on how well it belongs to the next closest cluster. The
experiment on simulated microarray data demonstrated that PAMSIL has the ability to find small homogeneous clusters (van der Laan et al., 2003).
Like most partitional clustering methods, K-means requires prior knowledge of the number of clusters. Several research efforts have aimed at developing methods that can determine the number of clusters. Yeung et al. (2001) used probabilistic models to resolve the optimal number of clusters. Hastie et al. (2000)
proposed the “gene shaving” method, a statistical method to identify distinct clusters of genes. Hruschka et
al. (2006) introduced an evolutionary algorithm for clustering (EAC). EAC includes the K-means algorithm
as a local search procedure, applies a centroid-based objective function, eliminates the crossover operation, and adds a sophisticated mutation operation. EAC extends a clustering genetic algorithm and is capable of
automatically discovering an optimal number of clusters.
A common criticism of K-means is that it converges only to a local optimum. Optimization techniques such as tabu search, simulated annealing, and genetic algorithms have been utilized to achieve global optimization. However, these algorithms suffer from high computational costs. Trying to alleviate
the cost, Krishna and Narasimha Murty (1999) introduced a new clustering method called genetic K-means
(GKA). GKA hybridizes the genetic algorithm with a gradient descent algorithm and K-means algorithm.
By using distance-based mutation instead of the expensive crossover operation, GKA converges to the global optimum faster than other evolutionary algorithms. Inspired by GKA, Lu et al. (2004b) proposed a fast genetic K-means algorithm (FGKA). It features several improvements, including an efficient evaluation of the objective value, the total within-cluster variation (TWCV); the avoidance of illegal strings by assigning them lower probabilities; and a simplification of the mutations. FGKA converges to the global optimum faster than GKA. Based on FGKA,
Lu et al. (2004a) introduced incremental genetic K-means (IGKA). IGKA has better time performance when
the mutation probability is small.
Self-organizing maps
The self-organizing map (SOM) is one of the best-known unsupervised neural network learning algorithms; it was first developed by Kohonen (1984). It is based on a single-layered artificial neural network (ANN), where the SOM is constructed by training. The data objects are the input of the network. The output units, called output neurons, are organized as a one-, two-, or three-dimensional map (depending on the
type of SOM). Each neuron is associated with a weight vector (reference vector). Along the learning process,
each data object acts as a training example, which directs the movement of the initially randomly associated weight vectors towards the denser areas of the input vector space. Clustering is performed by having
neurons compete for the current data object. The neuron whose weight vector is closest to the current object becomes the winning unit. The weight vector of the best-matching neuron and its set of neighbors move
towards the current object and the weights of the winning neuron and its neighbors are adjusted. As the
learning proceeds, the adjustment to the weight vectors diminishes. When the training is complete, clusters
are identified by mapping all data objects to the output neurons.
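A compact sketch of this training loop is given below (assumed choices: a Gaussian neighborhood function and a linearly decaying learning rate and neighborhood radius; actual SOM implementations differ in these details):

```python
# Assumed sketch of SOM training: a 5 x 5 map of neurons, each holding a
# weight vector that is pulled toward the data during learning.
import numpy as np

def train_som(X, grid=(5, 5), iters=2000, lr0=0.5, sigma0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.normal(size=(rows, cols, X.shape[1]))           # random initial weights
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)  # neuron grid positions
    for t in range(iters):
        x = X[rng.integers(len(X))]                         # one training example
        d = np.linalg.norm(W - x, axis=2)
        winner = np.unravel_index(d.argmin(), d.shape)      # best-matching neuron
        lr = lr0 * (1 - t / iters)                          # learning rate decays
        sigma = sigma0 * (1 - t / iters) + 1e-3             # neighborhood shrinks
        dist2 = ((coords - np.array(winner)) ** 2).sum(axis=2)
        h = np.exp(-dist2 / (2 * sigma ** 2))               # Gaussian neighborhood
        W += lr * h[:, :, None] * (x - W)                   # move winner + neighbors
    return W
```

After training, each data object is mapped to its nearest neuron, and neurons (or groups of adjacent neurons) define the clusters.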
SOM is a vector quantization method, which can simplify and reduce the high dimensionality of raw expression data (Wang et al., 2002c). It allows easy visualization of complex data, as well as analysis of large-scale datasets (Tseng, 2004). Toronen et al. (1999) developed and employed the prototype software GenePoint (a tree-based SOM) and Sammon's mapping to analyze and visualize gene expression during a diauxic
shift in yeast. The work demonstrated that SOM is a reliable and fast cluster analysis tool for analysis and
visualization of gene expression profiles. Genecluster software (Tamayo et al., 1999) is also based on SOM
(http://www.broad.mit.edu/cancer/software). Genecluster takes the expression levels from any gene-profiling method and a topology of neurons as input. It uses a web-based interface to visualize the clusters. Each cluster is represented by its average expression pattern, with error bars showing the standard deviation at each condition. Genecluster has been applied to hematopoietic differentiation data with the aim of determining the optimal treatment for acute promyelocytic leukemia.
SOM is a good alternative to traditional clustering methods. One of the appealing features of SOM is
that it provides an intuitive view for mapping of a high-dimensional dataset. The neuron learning process
makes SOM more robust than K-means algorithm to noisy data (Jiang et al., 2004). In addition, SOM is
efficient in handling large-scale datasets as it has a linear run time (Herrero and Dopazo, 2002).
However, SOM still has several pitfalls. As in K-means, users are required to predefine the number of
initial clusters and the topology of the neurons. The convergence is controlled by certain parameters, such
as the learning rate and the topology of the neurons. SOM can converge to a suboptimal solution rather than the global optimum if the initial weights are not chosen properly (Jain et al., 1999). When clustering gene expression
data, SOM results are more dependent on the size of the clusters than on the actual differences among gene
profiles (Herrero and Dopazo, 2002). Jiang et al. (2004) also state that SOM can be inaccurate for datasets that abound with irrelevant and invariant genes. Such data will populate the majority of clusters, and, therefore, most of the meaningful patterns might remain unidentified.
To tackle the problems of SOM, Su and Chang (2001) proposed a novel model called double SOM (DSOM). In DSOM, each node is associated not only with an n-dimensional weight vector but also with a two-dimensional position vector. During the learning process, both the weight and the position vectors are updated. By plotting the two-dimensional position vectors, the number of clusters can be determined. Wang
et al. (2002a) applied DSOM to cluster gene expression data in yeast, using the figure of merit method for validation.
FIG. 5. Topology of a self-organizing map with a 5 × 5 output array. The input vector (x1, x2, . . . , xn) is connected through weights (w1j, . . . , wnj) to all output nodes (only node j is shown here). Note that the arrangement of bins in the output array represents the similarity of patterns; similar patterns and bins are adjacent, and different ones are well separated.
This experiment proved that DSOM can reveal the number of clusters based on the final location of the position vectors.
Inspired by the idea of integrating the merits of hierarchical clustering and SOM, Hsu et al. (2003) recommended a hierarchical dynamic self-organizing approach, which combines a dynamic SOM tree and growing SOM (GSOM). This approach was applied to leukemia and colon cancer microarray data. The results
discovery and marker gene identification. Herrero and Dopazo (2002) proposed and implemented a new
approach (SOMTree) by combining hierarchical clustering and SOM (http://bioinfo.cnio.es/
wwwsomtree/). SOMTree was used for exploratory analysis of gene expression profiles during a diauxic
shift in yeast. The result provided strong evidence that combination of SOM and hierarchical clustering
methods constitute a fast and accurate way for exploratory analysis of large datasets. To improve the rate
of convergence, Xiao et al. (2003) proposed a hybrid clustering approach that is based on SOM and a simple evolutionary method, particle swarm optimization (PSO) (Kennedy and Eberhart, 1999). Hybrid clustering approach deploys PSO to evolve the weights for SOM. The rate of convergence is improved by introducing a conscience factor to the SOM. The method was used on the rat and yeast benchmark datasets.
The result shows that the proposed approach not only maintains the desirable topological ordering of SOM,
but also fast convergences to a more refined clustering.
NEW CLUSTERING ALGORITHMS
Density-based clustering
As its name implies, density-based clustering is the process developed to recognize dense areas in the
object space. Clusters are defined as regions with a high density of points (Kröger, 2004). In other words,
for a point in a cluster, its neighborhood within a given radius must contain a minimum predefined number of points (i.e., the density in the neighborhood has to exceed a predefined threshold) (Mercer and College, 2003). A conventional density-based algorithm, density-based spatial clustering of applications with
noise (DBSCAN) (Ester et al., 1996), is constructed based on the concepts of density, connectivity, and
boundary. DBSCAN requires two predefined parameters: the ε-radius, which defines a neighborhood of any data object; and MinPts, a predefined threshold (i.e., the minimum number of points within the ε-radius of any
point). A data object O is a core object if the point count within its neighborhood exceeds MinPts. The goal of the clustering process is to connect neighboring core objects together. The non-core points
inside a cluster form the boundary of the cluster; and the data points that are not connected to any core
point are defined as outliers. The performance of this method is quite sensitive to these two predefined parameters, which limits its application in complex datasets.
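For illustration, the two parameters correspond directly to the eps and min_samples arguments of scikit-learn's DBSCAN (an assumed sketch with arbitrary parameter values, not recommended settings):

```python
# Assumed sketch: DBSCAN on synthetic data; eps is the neighborhood radius
# and min_samples plays the role of MinPts.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))

labels = DBSCAN(eps=1.2, min_samples=5).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters)
print("outliers:", int(np.sum(labels == -1)))  # the label -1 marks noise points
```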
Other density-based algorithms, such as OPTICS (ordering points to identify the clustering structure) and DENCLUE (density-based clustering), were proposed to address this problem. OPTICS (Ankerst et al., 1999) is an extension of DBSCAN for an infinite number of distance parameters εi, which are smaller than the generic radius ε (Fig. 6). Several distance parameters are processed simultaneously. OPTICS also stores an augmented order in which the data objects are processed. Instead of relying solely on the two parameters (ε and MinPts), additional distance parameters, the core distance and the reachability distance, are associated with each data object. Therefore, OPTICS is more robust to these global predefined parameters, with a tradeoff of a higher run time than DBSCAN (roughly 1.6 times the DBSCAN runtime) (Berkhin, 2002). Both DBSCAN and OPTICS are unsuitable for processing high-dimensional data. Another method, DENCLUE
(Hinneburg and Keim, 1998), uses an influence function to describe the impact of a given point within its neighborhood. The overall density function can be estimated as the sum of the influence functions of all data points. Clusters are then discovered by identifying the maxima of the overall density function. DENCLUE is
more efficient than other density-based clustering algorithms and it can be used to describe arbitrarily shaped
clusters in high-dimensional datasets.
None of the above-mentioned density-based algorithms fulfills all of the essential requirements for a clustering method: visualization, robustness, and automatic determination of the number of clusters (Jiang et al., 2003b). Density-based hierarchical algorithms were specially designed, with these requirements in mind, for clustering gene expression data, especially time-series gene expression data.
Density-based hierarchical clustering
Jiang et al. (2003b) proposed DHC, a density-based hierarchical clustering algorithm designed primarily for effectively clustering time-series gene expression data. The algorithm interprets the cluster structure of a
dataset by constructing a density tree in two steps. In the first step, all data objects are organized into a hierarchical structure based on the density and attraction properties of data objects. A data object with high
density attracts objects with lower density. The attraction of a data object O is a set of objects A(O), which
can be defined as A(O) = {Oj : density(Oj) > density(O)}. The attractor of a data object O can be defined as the object Oj ∈ A(O) with the largest attraction. Each node of the attraction tree is denoted as a data object;
the parent of each node represents the attractor of the data object. During the second step, DHC summarizes the cluster and prunes the noise and outliers. The resulting structure is a density tree, with each node
denoting a dense area. The density tree contains two types of nodes, cluster nodes and collection nodes.
FIG. 6. The density-based clustering technique known as OPTICS. It generalizes density-based clustering by ordering the points, allowing the extraction of clusters with arbitrary values for ε.
Cluster nodes are leaf nodes that cannot be decomposed further, while collection nodes are internal nodes
that can be further decomposed. DHC recursively splits collection nodes until termination criteria are met.
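A rough sketch of the first step, under two simplifying assumptions not taken from the paper (density is estimated as a neighbor count within a fixed radius, and each object's attractor is taken as its nearest denser neighbor), shows how the objects link into an attraction tree:

```python
# Assumed sketch of building an attraction tree: every object points to its
# nearest denser neighbor; roots (self-pointing objects) are local density
# maxima and serve as cluster medoid candidates.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 5))

D = squareform(pdist(X))          # all pairwise distances (the costly step)
density = (D < 1.5).sum(axis=1)   # neighbor count within an assumed radius

attractor = np.arange(len(X))     # default: each object is its own root
for i in range(len(X)):
    denser = np.where(density > density[i])[0]
    if len(denser) > 0:
        attractor[i] = denser[D[i, denser].argmin()]
```

The full pairwise distance matrix computed here is also the source of the high computational cost noted below.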
Comparative experiments of Jiang et al. (2003b) have shown that DHC has several advantages. DHC can
detect the number of clusters automatically. Furthermore, its clustering performance is robust to the predefined parameters (ε and MinPts). DHC uses the gene with the highest density within a group as the medoid of the cluster, making it more robust to noise and outliers. DHC provides users with an intuitive view of
the relationship and connection between clusters and handles both embedded clusters and intersected clusters. Finally, the result of DHC can be visualized as a hierarchical structure. DHC constructs two kinds of
trees, attraction tree and density tree. In the attraction tree, the root of the tree represents the medoid of the
cluster. The hierarchical level of the data objects reflects their similarity with the medoid of the cluster, and
outliers are classified at the leaf level and can be easily recognized. Finally, DHC is scalable in processing large-scale gene expression data. However, DHC still has some shortcomings. First of all, DHC has a high computational cost, primarily because it calculates the distance between each pair of data objects in the
dataset in order to decide the density property of objects. Also, DHC still requires users to predefine the
two threshold parameters, which are used to decompose dense areas (Jiang et al., 2004).
Graph-theoretical clustering
As the name suggests, graph-theoretical clustering describes clustering problems by means of graphs.
Given a dataset X, we can construct a weighted graph G(V, E), in which the vertices V correspond to data objects and the edges E reflect the proximity between each pair of data objects. Based on a threshold value, proximity is mapped to either 0 or 1, with edges existing only between pairs of objects whose proximity equals 1 (Jiang et al., 2004). Graph theory has been applied in the agglomerative hierarchical clustering algorithm Chameleon (Jain et al., 1999; Karypis et al., 1999), which uses a k-nearest neighbor graph to eliminate irrelevant points. By using a minimum edge cut, the Chameleon method divides the weighted graph into
a set of subclusters. It then merges these small subclusters according to the relative interconnectivity and
relative closeness until the ultimate clusters are found. Graph theory has also been utilized for nonhierarchical clustering. Hartuv and Shamir (2000) constructed a clustering algorithm, HCS (highly connected subgraphs), based on graph connectivity. HCS defines clusters as highly connected subgraphs whose edge connectivity exceeds half the number of vertices. The edge connectivity of a graph G is the minimum number of edges whose removal disconnects the graph. A cut is a set of edges whose removal results in a disconnected graph. A minimum cut separates a graph G with a minimum number of edges and is used recursively for finding highly connected subgraphs (clusters). The HCS algorithm has been tested on both simulated and real gene expression data. The experimental results show that HCS is a promising solution for clustering gene expression data, even with a high volume of noise. CLICK and CAST are other graph-theoretical clustering algorithms that are commonly used in gene expression data analysis; they will be discussed in more detail below.
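The recursion can be sketched with NetworkX (an assumed reading of the published algorithm, not the authors' code; the input graph would be built by thresholding pairwise similarities):

```python
# Assumed sketch of HCS: recursively split the graph along minimum edge cuts
# until every remaining subgraph is highly connected.
import networkx as nx

def hcs(G):
    if G.number_of_nodes() < 3:
        return [set(G.nodes)]                    # trivially highly connected
    if not nx.is_connected(G):
        return [c for comp in nx.connected_components(G)
                for c in hcs(G.subgraph(comp).copy())]
    if nx.edge_connectivity(G) > G.number_of_nodes() / 2:
        return [set(G.nodes)]                    # highly connected subgraph
    cut = nx.minimum_edge_cut(G)                 # fewest edges that disconnect G
    H = G.copy()
    H.remove_edges_from(cut)
    return [c for comp in nx.connected_components(H)
            for c in hcs(G.subgraph(comp).copy())]

# Example on a random graph: print(hcs(nx.gnm_random_graph(30, 120, seed=0)))
```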
Cluster identification via connectivity kernels
CLICK (cluster identification via connectivity kernels) (Sharan et al., 2003) is a commonly used graph-theoretical clustering approach. The concept behind CLICK is to identify kernels (clusters) of highly similar data objects. CLICK assumes that the pairwise similarity values between all data are normally distributed. CLICK works in two phases. In the first phase, the clustering process iteratively finds the minimum cut in G and recursively splits the dataset into a set of connected components from the minimum cut. The output of this phase is a list of kernels and singletons. In the second phase, kernels are expanded into a set of final clusters. The adoption step repeatedly searches for a singleton and a kernel with the maximum similarity and assigns this singleton to the kernel. The merging step iteratively merges two clusters with similarity exceeding a predefined threshold. The CLICK software has been implemented and tested on large-scale gene expression datasets with promising results. A Java-based gene expression analysis and visualization software, EXPANDER (expression analyzer and displayer, www.cs.tau.ac.il/~rshamir/expander/expander.html), was recently developed. It contains several clustering methods, including CLICK, biclustering, and conventional clustering algorithms. It enables clustering, visualization, and functional enrichment and promoter analysis (Shamir, 2003).

TABLE 1. COMPARISON BETWEEN CLICK AND GENECLUSTER

Program      Dataset           Homogeneitya  Separationb
CLICK        Yeast cell cycle  0.80          0.07
GeneCluster  Yeast cell cycle  0.74          0.02

aAverage similarity between a gene and the center (average profile) of its cluster.
bWeighted average similarity between centers of clusters.

TABLE 2. COMPARISON BETWEEN CLICK AND HIERARCHICAL CLUSTERING

Program       Dataset                     Homogeneitya  Separationb
CLICK         Human fibroblasts to serum  0.88          0.34
Hierarchical  Human fibroblasts to serum  0.87          0.13

aAverage similarity between a gene and the center (average profile) of its cluster.
bWeighted average similarity between centers of clusters.
One of the appealing advantages of CLICK is that it does not require a predefined, initial number of clusters (Sharan et al., 2003). Several refinement procedures guarantee the scalability of CLICK. In Sharan et al. (2002), CLICK was applied to two publicly available gene expression datasets, and the results were compared with two standard clustering methods. The degrees of homogeneity (similarity within clusters) and separation (dissimilarity between clusters) produced by CLICK, GeneCluster, and a hierarchical clustering algorithm in this comparison are given in Tables 1 and 2. The conclusion from the study is that CLICK is superior to the other two algorithms (Sharan et al., 2002, 2003). In addition, CLICK clustering is very fast (Table 3), clustering thousands of elements in minutes and over 100,000 elements in a couple of hours on a regular workstation (Shamir, 2001; Sharan et al., 2003).

TABLE 3. TIME PERFORMANCE OF CLICK

Elements  Problems                           Time (min)
517       Gene expression, fibroblasts       0.5
826       Gene expression, yeast cell cycle  0.2
2,329     cDNA OFP, blood monocytes          0.8
20,275    cDNA OFP, sea urchin eggs          32.5
117,835   Protein similarity                 126.3
However, CLICK still has its pitfalls. The search for kernels involves a sequence of heuristic procedures to expand the kernels to a full clustering, and this heuristic search is not guaranteed to be exhaustive; thus, it does not guarantee globally optimal results (Tseng, 2004). Furthermore, neither embedded clusters nor highly intersected clusters can be handled by CLICK; they are recognized as a single cluster (Jiang et al., 2004).
Cluster affinity search technique
Cluster affinity search technique (CAST) (Ben-Dor et al., 1999) is a probabilistic model developed for discovering true clusters with high probability. It is also a graph-based heuristic clustering algorithm inspired by the idea of a corrupted clique graph (Ben-Dor et al., 1999). CAST assumes that the complexity of gene expression measurement introduces random errors into the true clustering (i.e., the similarity measure for any two genes is assumed to be wrong with probability α). Using a graph representation, the input data are represented by an undirected graph, in which each node represents a gene and edges connect genes with similar expression patterns. Therefore, the true clusters can be denoted by a clique graph H, a set of disjoint cliques. The corrupted random graph G is derived from H by randomly adding or removing edges with probability α. The whole process of clustering gene expression profiles can then be viewed as finding the ideal clique graph H from the corrupted graph G with the fewest errors. CAST forms nonhierarchical (unrelated) clusters with clear boundaries (Ben-Dor et al., 1999). The heuristic implementation of CAST provides users with the choice of the preferred algorithm based on the goal of the experiment (Ben-Dor et al., 1999). CAST has a wide range of applications in gene expression data analysis. It has been utilized to analyze temporal gene expression patterns, to identify multicondition expression patterns, and to classify tumor tissues (Ben-Dor et al., 1999).
The experimental results demonstrate that CAST is a useful, efficient analysis tool in gene expression data analysis. Sharan et al. (2003) performed a comparative experiment on CLICK and CAST in terms of a compatibility score, defined as the number of tissue pairs that are mates or nonmates in both the true labeling and the clustering solution (Ben-Dor et al., 1999; Sharan et al., 2003). In this analysis, CLICK and CAST were applied to two datasets: colon epithelial cell samples and acute lymphoblastic leukemia (ALL) samples. The average classification accuracy results are comparable: the performance of CAST was slightly better on the colon dataset, while CLICK performed better on the leukemia dataset. Table 4 shows the results of this comparative experiment (Sharan et al., 2002).

TABLE 4. COMPARISON OF THE CLASSIFICATION QUALITY OF CLICK AND CAST

Dataset   Method  Correct (%)  Incorrect (%)  Unclassified (%)
Colon     CLICK   87.1         12.9           0.0
Colon     CAST    88.7         11.3           0.0
Leukemia  CLICK   94.4         2.8            2.8
Leukemia  CAST    87.5         12.5           0.0
Although CAST does not rely on a user-defined cluster number, it still requires a predefined affinity parameter t. Also, the running time of the theoretical version is exponential. As claimed by Bellaachia et al. (2002), CAST has an expensive cleaning step, which moves data points from their current cluster to another cluster in which they may have a higher affinity.
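A hedged sketch of the CAST heuristic follows (S is a precomputed pairwise similarity matrix and t the affinity threshold; tie-breaking and the ordering of the ADD and REMOVE steps differ in the published implementation):

```python
# Assumed sketch of CAST: grow one open cluster at a time by adding the
# highest-affinity element, then clean out members whose affinity drops.
import numpy as np

def cast(S, t):
    unassigned = set(range(len(S)))
    clusters = []
    while unassigned:
        C = set()
        changed = True
        while changed:
            changed = False
            if unassigned:
                # ADD: the unassigned element with the highest affinity to C.
                aff = {x: sum(S[x][y] for y in C) for x in unassigned}
                x = max(aff, key=aff.get)
                if not C or aff[x] >= t * len(C):
                    C.add(x); unassigned.remove(x); changed = True
            # REMOVE (cleaning): drop members with low affinity to the rest.
            for x in list(C):
                if len(C) > 1 and sum(S[x][y] for y in C if y != x) < t * (len(C) - 1):
                    C.remove(x); unassigned.add(x); changed = True
        clusters.append(sorted(C))
    return clusters
```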
COMPLEX CLUSTERING ALGORITHMS
Fuzzy clustering
The clustering algorithms discussed so far belong to hard or crisp clustering, based on the assumption that each gene can be assigned to only one cluster. However, this restriction to one-to-one mapping might not be optimal in gene expression data analysis. In fact, a majority of genes can participate in different genetic networks and are governed by a variety of regulatory mechanisms (Futschik and Kasabov, 2002). Therefore, for the analysis of gene expression data, it is more desirable to use fuzzy clustering algorithms, which provide a one-to-many mapping in which a single gene can belong to multiple, distinct clusters with certain degrees of membership (Fig. 7). The memberships can further be used to discover more sophisticated relations between a data object and its disclosed clusters (Xu and Wunsch, 2005). In addition, fuzzy logic
provides a systematic and unbiased way to transform precise, numerical values into qualitative descriptors
through a so-called fuzzification process (Woolf and Wang, 2002). This process is beneficial for analysis
of gene expression data where no prior knowledge is available for the datasets. Furthermore, fuzzy clustering methods are robust to noise and biases.
Recently, many fuzzy clustering approaches have been applied for clustering microarray data. Gasch and Eisen (2002) explored the conditional coregulation of yeast gene expression through fuzzy k-means; Wang
et al. (2003) performed tumor classification and marker gene prediction by using fuzzy c-means; Dembélé
and Kastner (2003) applied fuzzy c-means for clustering microarray data by assigning membership values
to genes. Belacel et al. (2004a) have applied the new fuzzy clustering method, fuzzy j-means, for clustering microarray gene expression data. In the same paper, the authors also presented different methods for the utilization of cluster membership information in determining gene coregulation.

FIG. 7. Fuzzy C-means method. The centroids and clusters are determined similarly to the standard K-means method. However, in the final result, all input data points are assumed to belong to all clusters, with varying degrees of membership depending on their distance from the centroids.
Fuzzy C-means
Fuzzy C-means (FCM), first described in 1981 by Bezdek, is still the most popular fuzzy clustering algorithm in gene expression data analysis. FCM considers each gene as a member of all clusters, with different degrees of membership. The membership of a gene is closely related to the similarity between the gene and a given centroid. High similarity between a gene and its closest centroid indicates a strong association with that cluster, and the membership value is close to 1. Otherwise, the membership value is close to 0.
FCM starts with a predefined initial number of clusters c, a fixed fuzzification parameter m, and a small positive number ε. It iteratively updates the membership matrix and the centroid matrix until the changes in the centroids are less than ε.
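A minimal sketch of these updates (the standard FCM equations, with assumed defaults m = 2 and ε = 1e-4) is shown below; U is the membership matrix and C the centroid matrix:

```python
# Assumed sketch of fuzzy C-means: alternate centroid and membership updates
# until the centroids move less than eps.
import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-4, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                 # random initial memberships
    C = np.zeros((c, X.shape[1]))
    for _ in range(max_iter):
        Um = U ** m                                   # fuzzified memberships
        C_new = (Um.T @ X) / Um.sum(axis=0)[:, None]  # membership-weighted centroids
        d = np.linalg.norm(X[:, None, :] - C_new[None, :, :], axis=2) + 1e-12
        # u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
        done = np.abs(C_new - C).max() < eps
        C = C_new
        if done:
            break
    return U, C
```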
FCM is a convenient method to select genes that are correlated to multiple clusters and to unravel complex regulatory pathways that control the expression patterns of genes. However, FCM is a local heuristic search algorithm that can easily become stuck in a local optimum, with no guarantee of finding the global optimum. Also, it still requires the user to define the initial parameters. Numerous FCM variants have been proposed to address these drawbacks.
Belacel et al. (2004b) embedded FCM into a variable neighborhood search (VNS) meta-heuristic to address the problem of convergence to a local optimum. VNS is a meta-heuristic developed for solving combinatorial and global optimization problems, the idea of which is the systematic change of neighborhood
within a local search (Hansen and Mladenovic, 1998). This method has been tested on four cDNA microarray datasets. The results demonstrated that VNSFCM improves the performance of conventional
FCM and provides superior accuracy in clustering cDNA microarray data.
Dembélé and Kastner (2003) proposed a method aimed at alleviating the difficulty with predefined parameters. Instead of setting the fuzzification parameter m to the default value of 2, the new method computes an upper bound value of the fuzzification parameter m independent of the number of clusters. The number of clusters in this method is calculated using the CLICK algorithm. The work of Dembélé and Kastner elucidated that FCM clustering is a convenient way to define subsets of genes that exhibit tight association with given clusters.
Probabilistic clustering
Similar to fuzzy clustering, probabilistic clustering allows each data object to belong to multiple clusters with certain probabilities, thus facilitating the identification of overlapping groups of genes under conditional coregulation. Probabilistic clustering is based on statistical mixture models; that is, it assumes that the data are generated by a finite mixture of underlying probability distributions, such as Gaussian distributions, with each component corresponding to a distinct cluster (Yeung et al., 2001). Statistical models can be formulated and easily fit to different datasets. Therefore, statistical model-based clustering offers a principled alternative to heuristic algorithms in terms of determining the number of clusters and suggesting an appropriate clustering method (Yeung et al., 2001). Additionally, a finite mixture of distributions provides a sound mathematical approach to model a variety of random phenomena (McLachlan et al., 2002).
EM-based probabilistic clustering
The goal of statistical mixture models, such as EM clustering, is to identify or at least estimate unknown
parameters (the means and standard deviations) of underlying probability distributions for each cluster in
order to maximize the likelihood of the observed data distribution. The EM algorithm was first proposed
by Dempster et al. (1977) and is a widely used approach for learning unobserved variables in machine learning. In probabilistic clustering, the EM algorithm attempts to approximate the observed distributions of values based on mixtures of different distributions in different clusters.
The results of EM clustering differ from those computed by K-means clustering. While the latter assigns observations to clusters by trying to maximize the distances between clusters, the EM algorithm computes classification probabilities rather than actual assignments of observations to clusters. In other
words, in this method each observation belongs to each cluster with a certain probability. Of course, from
the final result it is usually possible to determine the actual assignment of observations to clusters, based
on the (largest) classification probability. The EM algorithm can also accommodate categorical variables.
The program will at first randomly assign different probabilities (or more precisely, weights) to each category for each cluster. In successive iterations, these probabilities are refined (adjusted) to maximize the
likelihood of the data given the specified number of clusters.
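As an illustrative sketch (scikit-learn's GaussianMixture used as an assumed off-the-shelf EM implementation on synthetic data), the classification probabilities and a hard assignment derived from them look like this:

```python
# Assumed sketch: EM clustering via a two-component Gaussian mixture.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1.0, (50, 4)),
               rng.normal(3.0, 1.0, (50, 4))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
P = gm.predict_proba(X)   # P[i, k]: probability that object i belongs to cluster k
hard = P.argmax(axis=1)   # hard assignment from the largest probability
print(P[:3].round(3))
print("BIC:", round(gm.bic(X), 1))  # criteria such as BIC can compare models
```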
Although probabilistic clustering has been widely used in gene expression data analysis, it still has some limitations. Probabilistic clustering relies on the assumption that the dataset fits a specific distribution. This
may not always be the case. For instance, the Gaussian distribution model may not be effective for time-series data, since it treats the time points as unordered, static attributes and ignores the inherent dependency of
the gene expression on time (Jiang et al., 2003a). In fact, currently, there is no well-established general model
for gene expression data (Jiang et al., 2003a). Furthermore, EM converges to local maximum likelihood.
McLachlan et al. (2002) applied the principle of probabilistic clustering in the software EMMIX-GENE. EMMIX-GENE can be used to classify tissue samples based on genes and to cluster genes based on tissue samples. Instead of using a Gaussian distribution model, it utilizes mixtures of t distributions in the gene selection stage. By adapting mixtures of factor analyzers, it models the distribution of high-dimensional gene
expression data on tissues. Mar and McLachlan (2003) deployed EMMIX-GENE to cluster breast cancer
data samples on the basis of gene expression. In practice, the problem of classifying data samples based on expression values is nonstandard, since the number of genes is significantly larger than the number
of tissue samples. The experimental results showed that EMMIX-GENE is a useful tool for reducing large
numbers of genes to a more manageable size, making it very attractive for classification of cancer tissue
samples.
Yeung et al. (2001) applied probabilistic clustering to three gene expression datasets. Their work proved
that the probabilistic clustering not only has superior clustering performance but also can be used to select
the appropriate clustering model and determine the right number of clusters.
It should be pointed out that several criteria, such as the Bayesian information criterion (BIC) (Fraley and Raftery, 1998), the approximate weight of evidence (AWE) criterion (Banfield and Raftery, 1993), and Bayes factors (Kass and Raftery, 1995), have been used with probabilistic clustering to find a suitable clustering model and determine the right number of clusters. Among these, BIC is in popular use for selecting the clustering and model structure that best fit a given dataset (Fraley and Raftery, 1998).
Based on these clustering algorithms, a number of clustering software packages have been developed for high-throughput gene expression analysis. Well-designed, user-friendly software provides scientists with an efficient way to extract, manage, analyze, and visualize DNA microarray data. In the remainder of this section, we give a brief review of some of the popular clustering software.
CLUSTERING SOFTWARE
Many commercial and open source software tools have been developed for clustering of high-throughput biological data, either designed to perform a specific analysis or to provide the whole set of microarray analysis steps, including data preprocessing, dimensionality reduction, normalization, clustering, and visualization (Leung, 2004). Table 5 provides a comparison of microarray clustering software based on several online resources (Bolshakova, 2005; Génopole, 2006; Leung, 2004; Li, 2004). The summary below is not intended to enumerate all existing microarray software. Instead, we aim to demonstrate how the clustering algorithms have been applied as clustering analysis tools and what specific analytical problems they can solve. The comparison is focused on the unsupervised clustering domain.
CONCLUSIONS
Clustering is the process of grouping data objects into a set of disjoint clusters so that objects within a cluster have high similarity to each other, while objects in separate clusters are highly dissimilar. In biological applications, clustering methods are utilized for the analysis of DNA and protein sequence information (for reviews of this topic, see Xu and Wunsch, 2005; Liew et al., 2005), as well as high-throughput OMICS data. In terms of gene expression analysis, clustering can be utilized for both gene and sample analysis. Genes that are coregulated are expected to have similar expression patterns, and thus cluster analysis can in principle find genes that share the same transcriptional regulation, providing a useful cue for an understanding of transcriptional regulatory networks. Further study of coexpressed genes can suggest the functions of uncharacterized genes.
Many clustering algorithms have been developed to fulfill these analysis tasks. For instance, as we mentioned in the previous section, Eisen et al. (1998) adopted the average linkage agglomerative hierarchical
clustering algorithm to discover coexpressed genes in Saccharomyces cerevisiae data. Tavazoie et al. (1999)
used K-means clustering of whole-genome mRNA data and sequence motif to identify transcriptional regulatory subnetworks in yeast. Some other conventional clustering algorithms and extensions have been
proven to be useful in pattern recognition (Hastie et al., 2000; Hruschka et al., 2006; Lu et al., 2004a; Su
and Chang, 2001; Toronen et al., 1999). Newly developed clustering methods, such as CLICK, CAST and
fuzzy j-means, have also shown promising experimental results (Belacel et al., 2004a; Ben-Dor et al., 1999;
Shamir, 2001; Sharan et al., 2002, 2003).
Cluster analysis has also been applied on temporal gene expression data to identify different cell cycles.
It is also a part of many important applications in pharmaceutical and clinical research (Liew et al., 2005).
Comparison of gene expressions between normal and disease cells or clustering of tissue samples can provide information about disease genes and subtypes of diseases (Xu and Wunsch, 2005). For instance, Shai
et al. (2003) performed K-means clustering analysis to identify molecular subtypes of gliomas. Wang et al.
(2003) performed tumor classification and marker gene prediction by using fuzzy c-means. McLachlan et
al. (2002) applied probabilistic clustering to classify tissue samples on the basis of genes and to cluster
genes based on tissue samples. Fuzzy clustering and probabilistic clustering algorithms have been applied
for the determination of multifunctional genes (Belacel et al., 2004a). Other methods, such as subspace clus-
TABLE 5. COMPARISON OF MICROARRAY CLUSTERING SOFTWARE

AMIADA (University of Hong Kong). License: Free. URL: <http://dambe.bio.uottawa.ca/amiada.asp>. Analytical features: identify coexpressed genes. Included clustering algorithms: hierarchical clustering, PCA.

ArrayMiner (Optimal Design). License: Commercial (light version free). URL: <http://www.optimaldesign.com/ArrayMiner/ArrayMiner.htm>. Analytical features: reveal the true structure of gene expression data, find the best possible clusters, detect outliers. Included clustering algorithms: Gaussian mixture model-based clustering, genetic optimization algorithm.

Autoclass (Bayes group at Ames Research Center). License: Free. URL: <http://ic.arc.nasa.gov/ic/projects/bayes-group/autoclass>. Analytical features: derive the maximum posterior probability classification and the optimum number of clusters in gene expression data. Included clustering algorithms: model-based clustering (unsupervised Bayesian classification system).

CAGED (Children's Hospital Informatics Program, Harvard Medical School). License: Free. URL: <http://genomethods.org/caged/about.htm>. Analytical features: analyze temporal gene expression data, automatically identify the number of clusters. Included clustering algorithms: Bayesian model-based clustering (Bayesian clustering by dynamics).

Cleaver (Stanford Biomedical Informatics). License: Free. URL: <http://classify.stanford.edu/>. Analytical features: Web-based visualization, classification, and clustering tool. Included clustering algorithms: linear discriminant classification, K-means, PCA.

CLUSFAVOR (Molecular Biology Computation Resource, Baylor College of Medicine). License: Free for nonprofit users. URL: <http://condor.bcm.tmc.edu/genepi/clusfavor.html>. Analytical features: perform cluster and factor analysis to reveal unique expression profiles for genes and ESTs for which pathway and function information is unknown. Included clustering algorithms: PCA; single-, complete-, and average-linkage algorithms.

Cluster and TreeView (Eisen laboratory, Lawrence Berkeley National Lab; alternative: Cluster 3.0, de Hoon, University of Tokyo). License: Free for nonprofit users. URL: <http://rana.lbl.gov/EisenSoftware.htm>. Analytical features: reveal coexpressed and co-regulated genes and visualize possible functional groups. Included clustering algorithms: hierarchical clustering, SOM, K-means, PCA.

CTWC (Department of Physics of Complex Systems, Weizmann Institute of Science). License: Free. URL: <http://ctwc.weizmann.ac.il/>. Analytical features: identify subsets of genes and samples. Included clustering algorithms: coupled two-way clustering.

Engene (Computer Architecture Department, University of Malaga, Spain). License: Free. URL: <http://www.engene.cnb.uam.es/>. Analytical features: Web-based exploratory data analysis tool for visualizing, preprocessing, and clustering large sets of gene expression data. Included clustering algorithms: K-means, HAC, fuzzy and kernel C-means, PCA, Sammon's map, SOM.

Expression Profiler (European Bioinformatics Institute, EBI). License: Free (open source). URL: <http://ep.ebi.ac.uk/EP/>. Analytical features: Web-based data analysis and visualization tool; export data from ArrayExpress; data analysis including clustering analysis, clustering comparison, and between-group analysis. Included clustering algorithms: hierarchical clustering, K-means, K-medoids, PCA, similarity search with a variety of distance measures.

Expression Sieve (BioSieve, USA). License: Commercial. URL: <http://www.biosieve.com/product.>. Analytical features: link biological significance to expression patterns; data analysis for disease research, drug discovery, and systems biology. Included clustering algorithms: hierarchical clustering, K-means, PCA, SOM, eight similarity measures.

GEMS (Computational Genomics Lab, Bioinformatics Graduate Program, Boston University). License: Free (open source). URL: <http://genomics10.bu.edu/terrence/gems/gems.html>. Analytical features: identify genes that are functionally related, participate in the same pathways, are affected by the same drug or pathological condition, or are coregulated by a small group of transcription factors. Included clustering algorithms: biclustering (based on the Gibbs sampling paradigm).

GeneCluster (Broad Institute). License: Free for academic users. URL: <http://www.broad.mit.edu/cancer/software/genecluster2/gc2.html>. Analytical features: standard SOM analysis tool; GeneCluster 2.0 extends GeneCluster 1 by adding supervised classification, gene selection and permutation tests, and a marker gene finder. Included clustering algorithms: SOM.

GeneMaths (Applied Maths). License: Commercial. URL: <http://www.applied-maths.com/genemaths/genemaths.htm>. Analytical features: versatile mathematics-based software integrating error handling, supervised learning (SVM, K-nearest neighbor), active history, analysis of template recording, hypothesis testing, cluster significance indication based on bootstrap techniques, and powerful dendrogram layout. Included clustering algorithms: hierarchical clustering, K-means, PCA, SOM, a variety of similarity distances, pair-group clustering methods, Ward's method, pattern matching, time course analysis.

GeneSight (BioDiscovery). License: Commercial. URL: <http://www.biodiscovery.com/index/genesight>. Analytical features: perform normalization, visualization, and statistical analysis; identify genes with true differential expression patterns between experimental conditions or disease states. Included clustering algorithms: hierarchical clustering, K-means, SOM, PCA, time course analysis.

GeneSpring (Agilent Technologies). License: Commercial. URL: <http://www.chem.agilent.com/scripts/pds.asp?lpage=27881>. Analytical features: advanced statistical tests for identifying differentially expressed genes; supervised K-nearest neighbors and SVM tools for finding clinically predictive patterns of gene expression; unsupervised clustering methods for pattern recognition; reveal correlations between experimental parameters and gene expression profiles for hypothesis testing. Included clustering algorithms: SOM, hierarchical clustering, K-means clustering, QT clustering, PCA.

Genesis (Institute for Genomics and Bioinformatics, Graz University of Technology). License: Free for nonprofit users. URL: <http://genome.tugraz.at/Software/GenesisCenter.html>. Analytical features: simultaneously visualize and analyze a whole set of gene expression experiments; integrates one-way ANOVA for detection of differentially expressed genes, SVM for classification of unknown genes and identification of their functions, Gene Ontology for monitoring gene and protein roles in cellular processes, and mapping of expression data onto chromosomal sequences. Included clustering algorithms: hierarchical clustering, K-means, SOM, PCA, and more than 10 similarity distance measurements.

GeneViz (ContentSoftAG). License: Free for nonprofit users. URL: <http://businessbox4.server-home.net/user/index.php>. Analytical features: advanced analysis and visualization of microarray data; detect sample clusters and their correlated genes. Included clustering algorithms: double conjugated clustering (two-way clustering of samples and genes simultaneously); singular value decomposition (SVD) sorting, which orders genes according to entries of a left singular vector and samples according to entries of a right singular vector; PCA.

GeneXpress (Stanford University). License: Free. URL: <http://genexpress.stanford.edu/>. Analytical features: visualization and statistical analysis of the outputs of clustering and motif-finding algorithms; global and detailed views of expression profiles, promoter regions, and motifs; integrates with Gene Ontology to associate each cluster with cellular processes; identifies motifs in the promoter regions of the genes in each cluster. Included clustering algorithms: hierarchical clustering, K-means or other clustering, motif-finding algorithms.

GEPAS (Bioinformatics Department, CIPF). License: Free for nonprofit users. URL: <http://gepas.bioinfo.cipf.es/>. Analytical features: comprehensive tool for normalization, preprocessing, data analysis, and visualization; SVM classification for class prediction; multiple testing for differentially expressed genes; integrates gene ontology for functional annotation. Included clustering algorithms: SOTA, hierarchical clustering, K-means, SOM, SOM tree, Caat (visualizing hierarchical trees).

INCLUSive (Katholieke Universiteit Leuven). License: Free. URL: <http://homes.esat.kuleuven.be/~dna/BioI/Software.html>. Analytical features: Web portal service for the analysis of gene expression data and discovery of cis-regulatory sequence elements; a suite of tools including ANOVA normalization, filtering and clustering, functional scoring of gene clusters, sequence retrieval, and detection of known and unknown regulatory motifs in upstream sequences. Included clustering algorithms: adaptive quality-based clustering.

J-Express (Bioinformatics Research Group, University of Bergen). License: Free. URL: <http://www.ii.uib.no/~bjarted/jexpress/>; J-Express Pro: <http://www.molmine.com/frameset/frm_jexpress.htm>. Analytical features: integrates a multidimensional scaling method to visualize the data in two or three dimensions. Included clustering algorithms: hierarchical clustering with several similarity distance measurements, K-means, PCA, SOM.

Machaon CVE (Trinity College Dublin). License: Free for nonprofit users. URL: <http://www.cs.tcd.ie/Nadia.Bolshakova/Machaon.html>. Analytical features: group samples or genes based on similar expression patterns, evaluate the quality of the clusters obtained, support third-party clustering tools. Included clustering algorithms: hierarchical clustering, K-means; validation algorithms: C-index, Davies-Bouldin, Goodman-Kruskal, and silhouette indices; measures gene-to-gene, sample-to-sample, intercluster, and intracluster distances.

MAExplorer (NCI Laboratory of Experimental and Computational Biology; open source at <http://maexplorer.sourceforge.net/>). License: Mozilla Public License (MPL) 1.1. URL: <http://www.ccrnp.ncifcrf.gov/MAExplorer>. Analytical features: a comprehensive online tool with normalization and data filtering; viewing data with scatter plots, histograms, expression profile plots, and array pseudoimages; cluster analysis; comparison of expression patterns and outliers; direct access to genomic databases (GenBank, NCI's mAdb). Included clustering algorithms: hierarchical clustering, K-means, K-medoids.

Open source clustering software (Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo). License: open source (Cluster 3.0 under the original Cluster/TreeView license; C clustering library and Python extension module under the Python license; Perl extension module under the Artistic License). URL: <http://bonsai.ims.u-tokyo.ac.jp/%7Emdehoon/software/cluster/index.html>. Analytical features: implements K-means, hierarchical clustering, and SOM as a C clustering library of routines; Cluster 3.0 is an improved version of Eisen's Cluster; Python and Perl interfaces to the C clustering library allow scripting. Included clustering algorithms: K-means, K-medoids, hierarchical (pairwise single-, maximum-, average-, and centroid-linkage) clustering, and SOM, with a variety of similarity measurements.

Rosetta Resolver (Rosetta Inpharmatics, LLC). License: Commercial. URL: <http://www.rosettabio.com/products/resolver/default.htm>. Analytical features: cluster analysis of gene expression data to determine similar genes under particular conditions and unknown genes; error models and statistical analysis for testing analytical results; scalable gene expression analysis for pharmaceutical and biological applications. Included clustering algorithms: hierarchical divisive and agglomerative clustering, SOM, K-means, K-medians.

s-Net-SOM (Department of Medical Physics, School of Medicine, University of Patras, Greece). License: Free for nonprofit users. URL: <http://heart.med.upatras.gr/bio/>. Analytical features: overcomes the drawback of most clustering methods that require prior knowledge of the number of clusters; adaptively determines the number of clusters with a dynamic extension process; an inhomogeneous measure balances unsupervised, supervised, and model-complexity criteria. Included clustering algorithms: supervised network SOM.

TIGR MeV (TIGR: The Institute for Genomic Research, Rockville, MD). License: Free (open source). URL: <http://www.tm4.org/mev.html>. Analytical features: comprehensive tool for clustering, visualization, classification, and statistical analysis, implementing many algorithms including bootstrapping; ANOVA and t test for differentially expressed genes; SVM for classification; SAM for correlation of gene expression data to a variety of clinical parameters; view pathways and genome/chromosomal maps of gene expression data. Included clustering algorithms: hierarchical clustering, K-means, SOM, SOTA, CAST, PCA, QT clustering, gene shaving.

XCluster (Stanford University). License: Free for noncommercial use. URL: <http://genetics.stanford.edu/~sherlock/cluster.html>. Analytical features: similar to Cluster/TreeView; integrated into some databases as a data analysis tool. Included clustering algorithms: SOM, K-means clustering, average-linkage hierarchical clustering.
Microarray technologies have made it possible to monitor the expression levels of tens of thousands of genes in parallel. Discovering the patterns hidden in gene expression data offers tremendous potential for advanced investigation in molecular biology and systems biology. However, the high volume of gene expression data and the complexity of biological networks make these hidden patterns difficult to interpret. Clustering gene expression data is the first step in addressing this challenge.
Several conventional clustering methods reviewed in this paper have proven useful for clustering gene expression data, and some recently developed clustering methods, such as CLICK, show promising experimental results. However, there is still no absolute winner among clustering methods. Future research therefore needs to address the remaining problems and to design algorithms that better serve the research directions and needs of the biomedical research community.
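In practice, the lack of an absolute winner means candidate clusterings are often compared with internal validity indices such as the silhouette index (one of the measures implemented in tools like Machaon CVE). Below is a minimal sketch of such a comparison, assuming scikit-learn is available and using random stand-in data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))  # stand-in for a genes x conditions matrix

candidates = [
    ("k-means", KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)),
    ("average linkage", AgglomerativeClustering(n_clusters=4, linkage="average").fit_predict(X)),
]
for name, labels in candidates:
    # Higher silhouette means tighter, better-separated clusters.
    print(name, silhouette_score(X, labels))
```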
REFERENCES
AGRAWAL, R., GEHRKE, J., GUNOPULOS, D., and RAGHAVAN, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pp. 94–105.
ALIZADEH, A., EISEN, M.B., DAVIS, R.E., MA, C., LOSSOS, I.S., ROSENWALD, A., et al. (2000). Distinct types
of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.
ANKERST, M., BREUNIG, M.M., KRIEGEL, H., and SANDER, J. (1999). OPTICS: ordering points to identify the clustering structure. Proceedings of the ACM SIGMOD '99 International Conference on Management of Data (Philadelphia, PA).
AU, N.H., CHEANG, M., HUNTSMAN, D.G., YORIDA, A., COLDMAN, A., and ELLIOTT, W.M. (2004). Evaluation of immunohistochemical markers in non-small cell lung cancer by unsupervised hierarchical clustering analysis: a tissue microarray study of 284 cases and 18 markers. J Pathol 204, 101–109.
BANFIELD, J., and RAFTERY, A. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821.
BELACEL, N., CUPERLOVIC-CULF, M., LAFLAMME, M., and OUELLETTE, R. (2004a). Fuzzy J-means and VNS methods for clustering genes from microarray data. Bioinformatics 20, 1690–1701.
BELACEL, N., CUPERLOVIC-CULF, M., OUELLETTE, R., and BOULASSEL, M. (2004b). The variable neighborhood search metaheuristic for fuzzy clustering cDNA microarray gene expression data. Proceedings of the IASTED AIA-04 Conference (Innsbruck, Austria).
BELLAACHIA, A., PORTNOY, D., CHEN, Y., and ELKAHLOUN A.G. (2002). E-CAST: a data mining algorithm
for gene expression data. Proceedings of 2nd ACM SIGKDD on Data Mining in Bioinformatics, pp. 49–54.
BEN-DOR, A., SHAMIR, R., and YAKHINI, Z. (1999). Clustering gene expression patterns. J Comput Biol 6, 281–297.
BERKHIN, P. (2002). Survey of clustering data mining techniques. Available at: http://citeseer.ist.psu.edu/berkhin02survey.html. Accessed April 27, 2006.
BEZDEK, J.C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms (Plenum Press, New York, NY).
BOLSHAKOVA, N. (2005). Microarray Software Catalogue. Available at: www.cs.tcd.ie/Nadia.Bolshakova/softwarecatalogue.html. Accessed May 24, 2006.
CHENG, Y., and CHURCH, G.M. (2000). Biclustering of expression data. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, pp. 93–103.
CHU, S., DERISI, J., EISEN, M., MULHOLLAND, J., BOTSTEIN, D., BROWN, P.O., et al. (1998). The transcriptional program of sporulation in budding yeast. Science 282, 699–705.
DEMBÉLÉ, D., and KASTNER, P. (2003). Fuzzy C-means for clustering microarray data. Bioinformatics 19, 973–980.
DEMPSTER, A.P., LAIRD, N.M., and RUBIN, D.B. (1977). Maximum likelihood from incomplete data via the EM
algorithm. J Royal Stat Soc B 39, 1–38.
EISEN, M.B., SPELLMAN, P.T., BROWN, P.O., and BOTSTEIN, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95, 14863–14868.
ESTER, M., KRIEGEL, H.P., SANDER, J., and XU, X. (1996). A density-based algorithm for discovering clusters in
large spatial databases with noise. Proceedings of the 2nd KDD ’96. (AAAI Press, Menlo Park, CA) pp. 226–231.
FRALEY, C., and RAFTERY, A. (1998). How many clusters? Which clustering method? Answers via model-based
cluster analysis. Comput J 41, 578–588.
FUTSCHIK, M.E., and KASABOV, N.K. (2002). Fuzzy clustering of gene expression data. WCCI 1, 414–419.
GASCH, A.P., and EISEN, M.B. (2002). Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biol 3, 1–22.
GÉNOPOLE (2006). Microarray data analysis: software of unsupervised clustering. Available at: http://genopole.toulouse.inra.fr/bioinfo/microarray/index.php?page=logiciels&sousdomaine=UnsupervisedClustering&lang=en. Accessed May 24, 2006.
GETZ, G., LEVINE, E., and DOMANY, E. (2000). Coupled two-way clustering analysis of gene expression data. Proc Natl Acad Sci USA 97, 12079–12084.
GUHA, S., RASTOGI, R., and SHIM, K. (1998). CURE: an efficient clustering algorithm for large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data.
HAN, J., and KAMBER, M. (2001). Data Mining: Concepts and Techniques. (Morgan Kaufmann, San Francisco, CA),
Ch 8.
HANSEN, P., and MLADENOVIC, N. (1998). Variable neighborhood search. Meta-heuristics—Advances and Trends
in Local Search Paradigms for Optimization, S. Voss, et al., eds. (Kluwer Academic, Amsterdam), pp. 433–458.
HARTUV, E., and SHAMIR, R. (2000). A clustering algorithm based on graph connectivity. Inf. Process. Lett. 76,
175–181.
HASTIE, T., TIBSHIRANI, R., EISEN, M.B., ALIZADEH, A., LEVY, R., and STAUDT, L. (2000). ‘Gene shaving’
as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol, 1, RESEARCH0003.
HERRERO, J., and DOPAZO, J. (2002). Combining hierarchical clustering and self-organizing maps for exploratory
analysis of gene expression patterns. J Proteome Res, 1, 467–470.
HINNEBURG, A., and KEIM, D.A. (1998). An efficient approach to clustering in large multimedia databases. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining. pp. 58–65.
HRUSCHKA, E.R., CAMPELLO, R., and DE CASTRO, L.N. (2006). Evolving clusters in gene-expression data. Inf
Sci, 176, 1898–1927.
HSU, A.L., TANG, S., and HALGAMUGE, S.K. (2003). An unsupervised hierarchical dynamic self-organizing approach to cancer class discovery and marker gene identification in microarray data. Bioinformatics 19, 2131–2140.
JAIN, A.K., and DUBES, R.C. (1988). Algorithms for Clustering Data. (Prentice-Hall, Englewood Cliffs, NJ).
JAIN, A.K., MURTY, M.N., and FLYNN, P.J. (1999). Data clustering: a review. ACM Comput Surveys 31, 264–323.
JIANG, D., PEI, J., and ZHANG, A. (2003a). Towards interactive exploration of gene expression patterns. ACM
SIGKDD Explor Newslett 5, 79–90.
JIANG, D., PEI, J., and ZHANG, A. (2003b). DHC: a density-based hierarchical clustering method for time series gene
expression data. Proceedings of the Third IEEE Symposium on Bioinformatics and Bioengineering, Mar. 2003, pp.
393–400.
JIANG, D., TANG, C., and ZHANG, A. (2004). Cluster analysis for gene expression data: a survey. IEEE Trans Knowledge and Data Eng 16, 1370–1386.
KARYPIS, G., HAN, E.H., and KUMAR, V. (1999). Chameleon: hierarchical clustering using dynamic modeling. Computer 32, 68–75.
KASS, R., and RAFTERY, A. (1995). Bayes factors. J Am. Stat Assoc 90, 773–795.
KAUFMAN, L., and ROUSSEEUW, P. (1990). Finding Groups in Data. (Wiley, New York, NY).
KENNEDY, J., and EBERHART, R.C. (1999). Particle swarm optimization. Proceedings of IEEE International Conference on Neural Networks. Piscataway, NJ, IV, pp. 1942–1948.
KOHONEN, T. (1984). Self-Organization and Associative Memory. (Springer-Verlag, Berlin).
KOLEHMAINEN, M.T. (2004). Data exploration with self-organizing maps in environmental informatics and bioinformatics. Kuopio Univ Publ. C Nat Environ Sci 167, 1–73.
KRISHNA, K., and NARASIMHA MURTY, M. (1999). Genetic K-means algorithm. IEEE Trans Syst Man Cybern,
Part B. 29, 433–439.
KRÖGER, P. (2004). Coping with new challenges for density-based clustering [PhD thesis]. München, Germany: Ludwig-Maximilians Universität. Available at http://edoc.ub.uni-muenchen.de/archive/00002396/01/Kroeger_Peer.pdf.
Accessed May 16, 2006.
LEUNG, Y.F. (2004). My microarray software comparison. Available at http://ihome.cuhk.edu.hk/~b400559/
arraysoft.html. Accessed May 23, 2006.
LI, W. (2004). Bibliography on microarray data analysis. Available at http://www.cbi.pku.edu.cn/mirror/microarray/
soft.html. Accessed May 24, 2006.
LIEW, A.W., YAN, H., and YANG, M. (2005). Pattern recognition techniques for the emerging field of bioinformatics: a review. Pattern Recognition 38, 2055–2073.
LU, Y., LU, S., DENG, Y., and BROWN, S. J. (2004a). Incremental genetic K-means algorithm and its application in
gene expression data analysis. BMC Bioinformatics 5, 172–182.
LU, Y., LU, S., FOTOUHI, F., DENG, Y., and BROWN, S.J. (2004b). FGKA: a fast genetic K-means clustering algorithm. Proceedings of the 2004 ACM symposium on Applied computing (SAC), Nicosia, Cyprus, March 2004.
MAKRETSOV, N.A., HUNTSMAN, D.G., NIELSEN, T.O., YORIDA, E., PEACOCK, M., CHEANG, M.C.U., et al.
(2004). Hierarchical clustering analysis of tissue microarray immunostaining data identifies prognostically significant groups of breast carcinoma. Clin Cancer Res 10, 6143–6151.
MAR, J.C., and MCLACHLAN, G.J. (2003). Model-based clustering in gene expression microarrays: an application
to breast cancer data. Int J Software Eng Knowledge Eng, 13, 579–592.
MCLACHLAN, G. J., BEAN, R.W., and PEEL, D. (2002). A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18, 413–422.
MERCER, D.P. (2003). Clustering large datasets. Linacre College, Oxford. Available at http://www.stats.ox.ac.uk/~mercer/documents/Transfer.pdf. Accessed May 11, 2006.
MORGAN, B.J., and RAY, A.P.G. (1995). Non-uniqueness and inversion in cluster analysis. Appl. Stat 44, 699–
705.
MOUGEOT, J.L., BAHRANI-MOSTAFAVI, Z., VACHRIS, J.C., MCKINNEY, K.Q., GURLOV, S., ZHANG, J., et
al. (2006). Gene expression profiling of ovarian tissues for determination of molecular pathways reflective of tumorigenesis. J Mol Biol 358, 310–329.
NIELSEN, T.O., WEST, R.B., LINN, S.C., ALTER, O., KNOWLING, M.A., O’CONNELL, J.X., et al. (2002). Molecular characterization of soft tissue tumours: a gene expression study. Lancet 359, 1301–1307.
PARSONS, L., HAQUE, E., and LIU, H. (2004). Subspace clustering for high dimensional data: a review. SIGKDD
Explorations 6, 90–105.
RAMASWAMY, S., ROSS, K.N., LANDER, E.S., and GOLUB, T.R. (2003). A molecular signature of metastasis
in primary solid tumors. Nat Genet. 33, 49–54.
RONNING, C.M., STEGALKINA, S.S., ASCENZI, R.A., BOUGRI, O, HART, A.L., UTTERBACH, T.R., et al. (2003).
Comparative analysis of potato expressed sequence tag libraries. Plant Physiol 131, 419–429.
SHAI, R., SHI, T., KREMEN, T.J., HORVATH, S., LIAU, L.M., CLOUGHESY, T.F., et al. (2003). Gene expression
profiling identifies molecular subtypes of gliomas. Oncogene 22, 4918–4923.
SHAMIR, R. (2001). Algorithm performance comparison. Available at www.cs.tau.ac.il/~rshamir/algmb/00/scribe00/
html/lec11/node41.html. Accessed May 18, 2006.
SHAMIR, R. (2003). A gene expression analysis and visualization software—expander: online documentation. Available at: www.cs.tau.ac.il/~rshamir/expander/ver3Help.html. Accessed May 18, 2006.
SHARAN, R., ELKON, R., and SHAMIR, R. (2002). Cluster analysis and its applications to gene expression data.
Ernst Schering Res Found Workshop 38, 83–108.
SHARAN, R., MARON-KATZ, A., and SHAMIR, R. (2003). CLICK and EXPANDER: a system for clustering and
visualizing gene expression data. Bioinformatics 19, 1787–1799.
SHERLOCK, G. (2000). Analysis of large-scale gene expression data. Curr Opin Immunol 12, 201–205.
STANFORD, D.C., CLARKSON, D.B., and HOERING, A. (2003). Clustering or automatic class discovery: hierarchical methods. In A Practical Approach to Microarray Data Analysis. D.P. Berrar, W. Dubitzky, and M. Granzow,
eds. (Kluwer Academic, Amsterdam), pp. 246–260.
SU, M., and CHANG, H. (2001). A new model of self-organizing neural networks and its application in data projection. IEEE Trans Neural Network 12, 153–158.
TAMAYO, P., SLONIM, D., MESIROV, J., ZHU, Q., KITAREEWAN, S., DMITROVSKY, E., et al. (1999). Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 96, 2907–2912.
TAN, P.N., STEINBACH, M., and KUMAR, V. (2005). Cluster analysis: basic concepts and algorithms. In Introduction to Data Mining. (Addison-Wesley, Boston), Ch 8.
TAVAZOIE, S., HUGHES, J.D., CAMPBELL, M.J., CHO, R.J., and CHURCH, G.M. (1999). Systematic determination of genetic network architecture. Nat Genet 22, 281–285.
TORONEN, P., KOLEHMAINEN, M., WONG, G., and CASTREN, E. (1999). Analysis of gene expression data using self-organizing maps. FEBS Lett 451, 142–146.
TSENG, G. (2004). A comparative review of gene clustering in expression profile. Eighth International Conference on
Control, Automation, Robotics and Vision (ICARCV). pp. 1320–1324.
UCL ONCOLOGY. (2005). Hierarchical clustering. Available at: www.ucl.ac.uk/oncology/MicroCore/HTML_
resource/tut_frameset.htm. Accessed May 9, 2006.
VAN DER LAAN, M., POLLARD, K.S., and BRYAN, J. (2003). A new partitioning around medoids algorithm. J Stat
Comput Simul 73, 575–584.
WANG, D., RESSOM, H., MUSAVI, M., and DOMNISORU, C. (2002a). Double self-organizing maps to cluster gene expression data. ESANN 2002 Proceedings—European Symposium on Artificial Neural Networks (Bruges, Belgium), pp. 45–50.
WANG, J., BO, T.H., JONASSEN, I., and HOVIG, E. (2003). Tumor classification and marker gene prediction by feature selection and fuzzy c-means clustering using microarray data. BMC Bioinformatics, 4, 60.
WANG, J., DELABIE, J., AASHEIM, H.C., SMELAND, E., and MYKLEBOST, O. (2002c). Clustering of the SOM easily reveals distinct gene expression patterns: results of a reanalysis of a lymphoma study. BMC Bioinformatics 3, 36.
WARD, J.H. (1963). Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 58, 235–244.
WELCSH, P.L., LEE, M.K., GONZALEZ-HERNANDEZ, R.M., BLACK, D.J., MAHADEVAPPA, M., SWISHER, E.M., et al. (2002). BRCA1 transcriptionally regulates genes involved in breast tumorigenesis. Proc Natl Acad Sci USA 99, 7560–7565.
XIAO, X., DOW, E.R., EBERHART, R., BEN MILED, Z., and OPPELT, R.J. (2003). Gene clustering using self-organizing maps and particle swarm optimization. Proceedings of the Seventeenth International Symposium on Parallel and Distributed Processing, p. 154b.
XU, R., and WUNSCH, D. (2005). Survey of clustering algorithms. IEEE Trans Neural Networks 16, 645–678.
YEUNG, K.Y. (2003). Clustering or automatic class discovery: non-hierarchical, non-SOM clustering algorithms and assessment of clustering results. In A Practical Approach to Microarray Data Analysis. D.P. Berrar, W. Dubitzky, and M. Granzow, eds. (Kluwer Academic, Amsterdam), pp. 274–288.
YEUNG, K.Y., FRALEY, C., MURUA, A., RAFTERY, A.E., and RUZZO, W.L. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics 17, 977–987.
ZHANG, T., RAMAKRISHNAN, R., and LIVNY, M. (1996). BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Conference, pp. 103–114.
Address reprint requests to:
Dr. Nabil Belacel
National Research Council Canada
Institute for Information Technology
#55 Crowley Farm Road
Suite 212, Scientific Park
Moncton, NB E1A 7R1
Canada
E-mail: [email protected]