A Hybrid Clustering Method for Gene Expression Data
Baoying Wang, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105
Tel: (701) 231-6257
Fax: (701) 231-8255
{baoying.wang, william.perrizo}@ndsu.nodak.edu
Abstract.
Data clustering has proven to be a successful data mining technique for the analysis of gene expression data. However, some concerns and challenges remain in gene expression clustering. For example, many traditional clustering methods that originated in non-biological fields may not work well if the model is not sufficient to capture the genuine clusters among noisy gene expression data. In this paper, we propose an efficient hybrid clustering method using attractor trees based on both density factors and similarity factors. The combination of the density-based and similarity-based approaches accommodates clusters with diverse shapes, densities, and sizes, and is capable of dealing with noise. A vertical data structure, the P-tree¹, is used to make the clustering process more efficient by accelerating the calculation of density functions using P-tree based neighborhood rings. Experiments on common gene expression datasets demonstrate that our approach is more efficient and scalable with competitive accuracy.
Keywords: gene expression data, clustering, P-trees, microarray.

¹ Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308.
1 INTRODUCTION
Clustering in data mining is a discovery process that partitions the data set into groups such that the data points in the same group are more similar to each other than to the data points in other groups. Cluster analysis of microarray gene expression data, which discovers groups that are homogeneous and well separated, has been recognized as an effective method for gene expression analysis.
Eisen et al. first applied a hierarchical linkage clustering approach that groups the closest pairs into a hierarchy of nested subsets based on similarity [7]. Golub et al. also successfully discovered tumor classes based on the simultaneous expression profiles of thousands of genes from acute leukemia patients' test samples using a self-organizing map clustering approach [8]. Other clustering approaches, such as k-means [21], fuzzy k-means [2], and CAST [3], have also been proven to be valuable clustering methods for gene expression data analysis. However, some concerns and challenges still remain in gene expression clustering. For example, many traditional clustering methods that originated in non-biological fields may not work well if the model is not sufficient to capture the genuine clusters among noisy data.
Partitioning clustering, whether distance-based or density-based, depends on input parameters. Distance-based clustering needs the number of clusters to be predetermined, while the density-based approach depends on density-related parameters. Hierarchical clustering is more flexible than partitioning clustering: it provides a nested series of partitions instead of a single partition. However, hierarchical clustering is computationally expensive for large data sets. In this paper, we propose an efficient agglomerative hybrid clustering method: Clustering using Attractor trees and Merging Process (CAMP).
The contribution of CAMP is that it combines the features of both the density-based and the distance-based clustering approaches, which accommodates clusters of various shapes, densities, and sizes and is capable of dealing with noisy data. A vertical data structure, the P-tree, is used to make the algorithm more efficient by accelerating the calculation of the density function. P-trees are also used as bit indexes to clusters. In the merging process, only summary information of the attractor trees is used to find the most similar cluster pair. When two clusters are to be merged, only their P-tree indexes are retrieved to perform the merging.
This paper is organized as follows. In section 2 we give an overview of the related
work. We present our new clustering method, CAMP, in section 3. Section 4 discusses the
implementation of CAMP using P-trees. Section 5 presents discussions on noise handling
and determination of the optimal cutting level. An experimental performance study is
described in section 6. Finally we conclude the paper in section 7.
2 RELATED WORK
Generally, clustering techniques can be categorized in many ways [9][13][4]. The
categorization shown in Figure 1 is based on the structure of clusters.
Figure 1. Categorization of Clustering: Hierarchical (Agglomerative/bottom-up, Divisive/top-down), Hybrid, and Partitioning (Distance-based, Density-based).
Clustering can be subdivided into partitioning clustering, hierarchical clustering, and
hybrid clustering. Hierarchical clustering is a nested sequence of partitions, whereas a
partitioning clustering is a single partition. Hybrid, as the name indicates, combines the
features of both hierarchical clustering and partitioning clustering.
Hierarchical clustering methods can be further classified into agglomerative and
divisive hierarchical clustering, depending on whether the hierarchical decomposition is
accomplished in a bottom-up or a top-down fashion. Partitioning clustering consists of two
approaches: distance-based and density-based, according to the similarity measure.
2.1 Partitioning Methods
Partitioning clustering methods generate a partition of the data in an attempt to
recover natural groups present in the data. Partitioning clustering can be further subdivided
into distance-based partitioning and density-based partitioning.
A distance-based partitioning method breaks a data set into k subsets, or clusters, such that data points in the same cluster are more similar to each other than to the data points in other clusters. The most classical similarity-based partitioning methods are k-means [10] and k-medoid, where each cluster has a center of gravity. The time complexity of k-means is O(n), since each iteration is O(n) and only a constant number of iterations is computed.
However, there are several problems with distance-based partitioning methods: (1) k is an input parameter and needs to be predetermined; (2) the methods are only suitable for clusters with spherical shapes; (3) they are not good for clusters which differ greatly in size; and (4) they are not robust to the selection of the initial partition and may converge to a local minimum of the criterion function value if the initial partition is not properly chosen.
Density-based partitioning clustering has been recognized as a powerful approach for
discovering arbitrary-shape clusters. In density-based clustering, clusters are dense areas of
points in the data space that are separated by areas of low density (noise). A cluster is
regarded as a connected dense area of data points, which grows in any direction that density
leads. Density-based clustering can usually discover clusters with arbitrary shapes without
predetermining the number of clusters. However, density-based clustering is very sensitive to
input parameters. Figure 2 shows that the clustering results are sensitive to the density threshold [12]: the higher the density threshold, the fewer points fall into clusters and the more points become noise.
Figure 2. Clustering is sensitive to the density threshold: (a) high density threshold; (b) low density threshold.
2.2 Hierarchical Clustering
Hierarchical algorithms create a hierarchical decomposition of a data set X. The
hierarchical decomposition is represented by a dendrogram, a tree that iteratively splits X into
smaller subsets until each subset consists of only one object. In such a hierarchy, each level
of the tree represents a clustering of X. Figure 3 shows the hierarchical decomposition
process and the dendrogram of hierarchical clustering.
Figure 3. Hierarchical decomposition and the dendrogram: (a) hierarchical decomposition; (b) dendrogram.
Hierarchical clustering methods are subdivided into agglomerative (bottom-up)
approaches and divisive (top-down) approaches [9]. An agglomerative approach begins with
each point in a distinct cluster, and successively merges clusters together until a stopping
criterion is satisfied. A divisive method begins with all points in a single cluster and performs
splitting until a stopping criterion is met.
Besides the algorithm itself, there are several ways to compute cluster similarity. Most
hierarchical clustering algorithms are variations of the single-link and the complete link
approaches. In the single-link method, the distance between two clusters is the minimum of
the distances between all pairs of points from the two clusters. In the complete-link algorithm, the distance between two clusters is the maximum of all pair-wise distances between points in the two clusters. In either case, two clusters are merged to form a larger cluster based on the minimum-distance (or maximum-similarity) criterion.
The complete-link algorithm produces tightly bound or compact clusters, while the single-link algorithm suffers when there is a chain of noise points between two clusters. Figure 4 illustrates the different clustering results of the single-link and complete-link algorithms in the case of a noise chain [14]. Note that single-link clustering produces a skewed result, while the complete-link algorithm can still produce a correct result.
Figure 4. Chain effect on a single-link clustering result: (a) single-link clustering results; (b) complete-link clustering results.
In summary, hierarchical algorithms are more flexible than partitioning algorithms and do not need input parameters from users. However, the computational complexities of hierarchical algorithms are typically higher than those of partitioning algorithms: single-link clustering is O(n²) and complete-link is O(n³), while k-means is only O(n). Hence, many hybrid algorithms have been developed to exploit the good features of both hierarchical clustering and partitioning clustering.
2.3 Hybrid Clustering
Several hybrid clustering methods have been proposed to combine the features of
hierarchical and partitioning clustering algorithms. In general, these algorithms first partition
the data set into preliminary clusters and then construct a hierarchical structure upon these
sub-clusters based on some similarity measure. Figure 5 shows that the data set is first
partitioned into 15 sub-clusters and these sub-clusters are then merged into two clusters [16].
Figure 5. Hybrid clustering process: (a) obtain sub-clusters; (b) merge sub-clusters into clusters.
An early hybrid algorithm was developed by combining k-means and a hierarchical method [18]. This algorithm first partitions the data set into several groups and
then performs the k-means on each partition to obtain several sub-clusters. Then a
hierarchical method is used to build up levels using the centroids of the sub-clusters in the
previous level. This process continues until exactly k clusters are formed. Finally, the
algorithm reassigns all points of each sub-cluster to the cluster of their centroids. In this
method, the dissimilarity between two clusters is defined as the distance between their
centroids.
Algorithm BIRCH is one of the most efficient clustering algorithms [22]. The
algorithm performs a linear scan of all data points and the cluster summaries are stored in
memory in the data structure called a CF-tree. A non-leaf node represents a cluster consisting
of all the sub-clusters represented by its entries. BIRCH first partitions the data set into many
small sub-clusters and then applies a global clustering algorithm on those sub-clusters to
achieve the final results. The main contribution of BIRCH is as an efficient data preprocessor
for a large input data set so that the global clustering algorithm can be executed efficiently.
CHAMELEON operates on a k-nearest neighbor graph [15]. The algorithm consists of
three basic steps: (1) Construct a k-nearest neighbor graph; (2) Partition the k-nearest
neighbor graph into many small sub-clusters; and (3) Merge those sub-clusters to get the final
clustering results. CHAMELEON has been found to be very effective in clustering isotropic
shapes. However, the algorithm cannot handle outliers and needs parameter setting to work
effectively. The time complexity of building a k-nearest-neighbor graph of a high-dimensional data set is as high as O(d·n²), which makes CHAMELEON infeasible for large data sets.
2.4 Clustering Methods of Gene Expression Data
There are many newly developed clustering methods [3][7][5][11] which are
dedicated to gene expression data. These clustering algorithms partition genes into groups of
co-expressed genes.
Eisen et al. [7] adopted a hierarchical approach using UPGMA (Unweighted Pair Group Method with Arithmetic Mean) to group the closest gene pairs. This method displays the clustering results in a colored graph pattern: the gene expression data is colored according to the measured fluorescence ratio, and genes are re-ordered based on the hierarchical dendrogram structure.
Ben-Dor et al. [3] proposed a graph-based algorithm, CAST (Cluster Affinity Search Technique), to improve gene clustering accuracy. Two points are linked in the graph if they are similar. The problem of clustering a set of genes is then converted to a classical graph-theoretical problem. CAST takes as input a parameter called the affinity threshold t, where 0 < t < 1, and tries to guarantee that the average similarity in each generated cluster is higher than the threshold t. However, this method has a high complexity.
Hartuv et al. [11] presented a polynomial algorithm, HCS (Highly Connected Subgraph). HCS recursively splits the weighted graph into a set of highly connected sub-graphs along the minimum cut. Each highly connected sub-graph is called a cluster. Later on, the same research group developed another algorithm, CLICK (Cluster Identification via Connectivity Kernels) [20]. CLICK builds a statistical framework to measure the coherence within a subset of genes and determines the criterion to stop the recursive splitting process.
3 HYBRID CLUSTERING USING ATTRACTOR TREES
In this section, we present CAMP, an efficient hybrid agglomerative clustering method based on attractor trees and a merging process. CAMP consists of two processes: (1) Clustering using Local Attractor Trees (CLAT) and (2) a cluster Merging Process based on similarity (MP). The final clustering results consist of an attractor tree and a set of P-tree indexes to the clusters corresponding to each level of the attractor tree.
The attractor tree is composed of leaf nodes, which are the local attractors constructed in the CLAT process, and interior nodes, which are virtual attractors resulting from the MP process.
Figure 6 is an example of an attractor tree.
Figure 6. The attractor tree, with local attractors as leaves and virtual attractors as interior nodes.
The data set is first grouped into local attractor trees by means of a density-based approach in the CLAT process. Each local attractor tree represents a preliminary cluster, the root of which is a density attractor of the cluster. The small clusters are then merged level by level in the MP process based on cluster similarity.
3.1 Density Function
Given a data point x in a data space X, the density function of x is defined as the sum
of the influence functions of all data points in the data space X on x. There are many ways to
calculate the influence function. In general, the influence of a data point on x is inversely
proportional to its distance to x. If we divide the neighborhood of x into neighborhood rings,
then points within inner rings have more influence on x than those in outer rings. We define
the neighborhood ring as follows:
Definition 1. The Neighborhood Ring of a data point c with radii r1 and r2 is defined as the set R(c, r1, r2) = {x ∈ X | r1 < |x − c| ≤ r2}, where |x − c| is the distance between x and c. The number of neighbors falling in R(c, r1, r2) is denoted as ||R(c, r1, r2)||.
Definition 2. The Equal Interval Neighborhood Ring (EINring) of a data point c with radii r1 = kε and r2 = (k+1)ε is defined as the kth equal interval neighborhood ring EINring(c, k, ε) = R(c, kε, (k+1)ε), where ε is a constant. Figure 7 shows 2-D EINrings with k = 1, 2, and 3. The number of neighbors falling in the kth EINring is denoted as ||EINring(c, k, ε)||.
Figure 7. Diagram of EINrings.
Let y be a data point within the kth EINring of x. The EINring-based influence function of y on x is defined as:

f(x, y) = f_k(x) = 1/k,  k = 1, 2, ..., n    (1)

The density function of x is defined as the summation of the influence functions over every EINring neighborhood of x, i.e.

DF(x) = Σ_{k≥1} f_k(x) · ||EINring(x, k, ε)||    (2)
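To make the definitions concrete, the following is a minimal brute-force sketch of the EINring-based density function of equations (1) and (2); the ring interval eps, the truncation max_k, and the toy data are illustrative choices of ours, and the P-tree-accelerated version is described in Section 4.

```python
import numpy as np

def density(X, x, eps, max_k=10):
    """DF(x) of equation (2), with f_k = 1/k from equation (1),
    truncated at max_k EINrings of width eps (illustrative parameters)."""
    d = np.linalg.norm(X - x, axis=1)                       # distances to every point
    df = 0.0
    for k in range(1, max_k + 1):
        ring = np.sum((d > k * eps) & (d <= (k + 1) * eps)) # ||EINring(x, k, eps)||
        df += ring / k
    return df

# toy usage: the density of the first point of a random expression matrix
X = np.random.rand(200, 12)                                 # 200 genes, 12 conditions
print(density(X, X[0], eps=0.25))
```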
3.2 Clustering by Local Attractor Trees
The basic idea of clustering by local attractor trees (CLAT) is to partition the data set
into clusters in terms of local density attractor trees. Given a data point x, if we follow the
steepest density ascending path, the path will finally lead to a local density attractor. All
points whose steepest ascending paths lead to the same local attractor form a local attractor
tree. If x does not have such a path, it can be either a local attractor or a noise point.
The local attractor trees are the preliminary clusters. The resultant graph of the CLAT process is a collection of local attractor trees with local attractors as the roots. Given the neighborhood ring interval ε, CLAT proceeds as follows:
1. Compute the density function for each point.
2. For an arbitrary point x, find the point y with the highest density within its neighborhood R(x, 0, ε). If the density of y is higher than the density of x, build a link between x and y.
3. If the density of y is lower than the density of x, x is assigned a new cluster label (x can be an attractor or a noise point).
4. Go back to step 2 with the next point.
5. Finally, the data points in each attractor tree are assigned the same cluster label as their attractor.
CLAT produces a set of local attractor trees. Some local attractor trees may contain only a root, in the case of noise points.
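The procedure can be sketched in a few lines of Python. This is a brute-force illustration of the steps above (pairwise distances and truncated ring densities), not the authors' P-tree implementation; eps and max_k are illustrative parameters.

```python
import numpy as np

def clat(X, eps, max_k=10):
    """Clustering by Local Attractor Trees (brute-force sketch).
    Returns (labels, parent); parent[i] == -1 marks a tree root
    (a local attractor, or a noise point if it stands alone)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    dens = np.zeros(n)                                         # eqs. (1)-(2), truncated
    for k in range(1, max_k + 1):
        ring = (D > k * eps) & (D <= (k + 1) * eps)
        dens += ring.sum(axis=1) / k
    parent = np.full(n, -1)
    for i in range(n):
        nbrs = np.where((D[i] > 0) & (D[i] <= eps))[0]         # R(x, 0, eps)
        if len(nbrs):
            j = nbrs[np.argmax(dens[nbrs])]
            if dens[j] > dens[i]:
                parent[i] = j                                  # steepest ascending link
    labels, roots = np.empty(n, dtype=int), {}
    for i in range(n):
        r = i
        while parent[r] != -1:                                 # follow links to the root
            r = parent[r]
        labels[i] = roots.setdefault(r, len(roots))            # one label per attractor
    return labels, parent
```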
3.3 Similarity between Clusters
There are many cluster similarity measures. As discussed in section 2.2, the most popular similarity measures in hierarchical clustering are complete-link similarity and single-link similarity. Complete-link similarity is measured by the maximum distance between two clusters, while single-link similarity is measured by the minimum distance between two clusters.
However, these traditional similarity measures are only suitable for clusters with similar densities. For example, they can distinguish the two pairs in Figure 8 (a), but will not distinguish the two pairs in Figure 8 (b). In fact, the left pair of clusters in Figure 8 (b) is relatively closer than the pair on the right.
Figure 8. Cluster similarity: (a) two pairs of clusters with similar densities; (b) two pairs of clusters with different densities.
Therefore, we consider relative closeness in developing cluster similarity. We define
similarity between cluster i and cluster j as follows:
CS(i, j) = (Vi + Vj) / d(Ai, Aj)    (3)

where Vi is the average squared distance between the points in the ith attractor tree and its attractor Ai, and d(Ai, Aj) is the distance between attractors Ai and Aj. Vi is calculated as follows:

Vi = Σ_{x ∈ Ci} (x − Ai)² / ||Ci||    (4)

where Ci is the cluster represented by the ith attractor tree and ||Ci|| is the size of Ci.
3.4 Cluster Merging Process
After the local attractor trees (preliminary clusters) are built in the CLAT process, the cluster merging process (MP) starts to combine the most similar cluster pairs level by level based on the similarity measure defined above. When two clusters are merged, the two local attractor trees are combined into a new tree, called a virtual attractor tree. It is called "virtual" because the new root is not an existing point; it is only a virtual attractor which could attract all points of the two sub-trees. The merging process is shown in Figure 9. The cluster merging proceeds recursively by combining (virtual) attractor trees.
Figure 9. Cluster merging process: (a) before merging (attractors Ai and Aj); (b) after merging (virtual attractor Av).
After merging, we need to update the attractor Av of the new virtual attractor tree. Take two clusters Ci and Cj, for example, and assume the size of Cj is greater than or equal to that of Ci, i.e. ||Cj|| ≥ ||Ci||. Then we have:

Avl = Ail + (||Cj|| / (||Ci|| + ||Cj||)) · (Ajl − Ail),  l = 1, 2, ..., d    (5)

where Ail is the lth attribute of the attractor Ai and ||Ci|| is the size of cluster Ci.
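The merging bookkeeping can be sketched with a small summary record per (virtual) attractor tree. CS(i, j) follows equations (3)-(4), and the attractor update is written as the size-weighted combination equivalent to equation (5); the Cluster container and the variation-update rule are our illustrative assumptions, since the paper only states that summary information is used.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Cluster:
    attractor: np.ndarray   # A_i (local or virtual attractor)
    size: int               # ||C_i||
    variation: float        # V_i, equation (4)

def variation(points, attractor):
    """V_i of equation (4): average squared distance to the attractor."""
    return float(np.mean(np.sum((points - attractor) ** 2, axis=1)))

def cluster_similarity(ci, cj):
    """CS(i, j) of equation (3)."""
    return (ci.variation + cj.variation) / np.linalg.norm(ci.attractor - cj.attractor)

def merge(ci, cj):
    """Combine two attractor trees into a virtual attractor tree.
    Attractor update: size-weighted combination (equivalent to equation (5));
    updating the variation from summaries only is our simplifying assumption."""
    n = ci.size + cj.size
    a_v = (ci.size * ci.attractor + cj.size * cj.attractor) / n
    v = (ci.size * ci.variation + cj.size * cj.variation) / n
    return Cluster(attractor=a_v, size=n, variation=v)
```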
4 IMPLEMENTATION OF CAMP USING P-TREES
CAMP is implemented using the data-mining-ready vertical bitwise data structure, the P-tree, to make the clustering process much more efficient and scalable. The P-tree technology was initially developed by the DataSURG research group for spatial data [19][6]. In this section, we first briefly discuss the representation of a gene dataset in P-tree structures and the computation of P-tree-based neighborhoods. Then we detail the implementation of CAMP using P-trees.
4.1 Data Representation
Given a gene table G = (E1, E2, ..., Ed) and the binary representation of the jth attribute Ej as bj,m bj,m−1 ... bj,i ... bj,1 bj,0, the table is projected into columns, one for each attribute. Each attribute column is then further decomposed into separate bit vectors, one for each bit position of the values in that attribute. Figure 10 shows a relational table with three attributes. Figure 11 shows the decomposition of the gene table G into a set of bit vectors.
G(E1, E2, E3):
E1  E2  E3
 5   2   7
 2   3   2
 7   2   2
 7   2   5
 2   5   5
 4   7   1
 3   2   1
 1   3   4
Figure 10. An example of a gene table.
G(E1, E2, E3) in binary:
E1: 101 010 111 111 010 100 011 001
E2: 010 011 010 010 101 111 010 011
E3: 111 010 010 101 101 001 001 100

Each attribute is split into one bit vector per bit position (most significant bit first), giving E13, E12, E11, E23, E22, E21, E33, E32, E31; for example, E23 = 00001100, E22 = 11110111, and E21 = 01001101.

Figure 11. Decomposition of the gene table into bit vectors.
After the decomposition, each bit vector is converted into a P-tree. A P-tree is built by recording the truth of the predicate "purely 1-bits" recursively on halves of the bit vector until purity is reached. Three example P-trees, for the bit vectors E23, E22 and E21, are illustrated in Figure 12.
Figure 12. P-trees of bit vectors E23, E22 and E21.
The P-tree logic operations are pruned bit-by-bit operations, performed level by level starting from the root level. For instance, ANDing a pure-0 node with any node results in a pure-0 node, and ORing a pure-1 node with any node results in a pure-1 node. A detailed description of the P-tree logic operations is given in [6].
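As a concrete illustration of the decomposition and of the AND/OR/complement operations, here is a minimal sketch that treats each bit vector as a plain Python integer; the hierarchical pure-1/pure-0 compression that makes a real P-tree space-efficient is deliberately omitted, so only the logical behaviour is shown. The example column values follow the E2 column of Figures 10-11.

```python
def bit_slices(column, width):
    """Decompose an attribute column into `width` bit vectors (MSB first).
    Each bit vector is returned as a Python int whose bit t is row t's bit."""
    slices = []
    for b in range(width - 1, -1, -1):                 # bit positions m-1 .. 0
        v = 0
        for row, value in enumerate(column):
            v |= ((value >> b) & 1) << row
        slices.append(v)
    return slices                                      # [P_{m-1}, ..., P_0]

def count_ones(v):
    """Root count of a P-tree, i.e. the number of 1-bits (||P||)."""
    return bin(v).count("1")

# AND, OR and complement correspond to the P-tree logic operations:
E2 = [2, 3, 2, 2, 5, 7, 2, 3]                          # example column E2 (Figure 10)
P23, P22, P21 = bit_slices(E2, 3)
mask = (1 << len(E2)) - 1                              # complement on 8 rows
print(count_ones(P23 & P22), count_ones(P23 | P21), count_ones(~P23 & mask))
```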
4.2 P-tree Based Neighborhood Computation
The major computational cost of CAMP lies in computation of densities. To improve
the efficiency of density computation, we adopt the P-tree based neighborhood computation
by means of the optimized P-tree operations. In this section, we first review the P-tree
predicate operations. Then we present the P-tree based neighborhood computation.
P-tree predicate operations: Let A be the jth dimension of data set X, m be its bit-width, and Pm, Pm−1, ..., P0 be the P-trees for the vertical bit files of A. Let c = bm...bi...b0, where bi is the ith binary bit value of c. Let P_A>c and P_A≤c be the P-trees representing the data points satisfying the predicates A > c and A ≤ c, respectively. Then we have

P_A>c = Pm opm ... Pi opi Pi−1 ... opk+1 Pk,  k ≤ i ≤ m    (6)

where opi is AND (∧) if bi = 1 and OR (∨) otherwise, and

P_A≤c = P'm opm ... P'i opi P'i−1 ... opk+1 P'k,  k ≤ i ≤ m    (7)

where opi is AND (∧) if bi = 0 and OR (∨) otherwise. In the equations above, k is the rightmost bit position with value 0, and the operators are right binding.
Calculation of the neighborhood: Let Pc,r be the P-tree representing the data points within the neighborhood R(c, 0, r) = {x ∈ X | 0 < |c − x| ≤ r}. Note that Pc,r is just the P-tree representing the data points satisfying the predicate c − r < x ≤ c + r. Therefore

Pc,r = Pc−r<x≤c+r = Px>c−r ∧ Px≤c+r    (8)

where Px>c−r and Px≤c+r are calculated by means of the P-tree predicate operations above.
Calculation of the EINring neighborhood: Let Pc,kε be the P-tree representing the data points within EINring(c, k, ε) = {x ∈ X | kε < |c − x| ≤ (k+1)ε}. The EINring(c, k, ε) neighborhood is the intersection of R(c, 0, (k+1)ε) and the complement of R(c, 0, kε). Hence

Pc,kε = Pc,(k+1)ε ∧ P'c,kε    (9)

where P'c,kε is the complement of Pc,kε.
The count of 1's in Pc,kε, ||Pc,kε||, is the number of data points within the EINring neighborhood, i.e. ||EINring(c, k, ε)|| = ||Pc,kε||. Each 1 in Pc,kε indicates a specific neighbor point.
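A hedged sketch of the range-predicate and EINring computations of equations (6)-(9), again on uncompressed bit vectors in the bit_slices() representation of the previous sketch. It handles a single attribute; for d attributes, the per-dimension band P-trees would be ANDed together (our assumption). Function names and the no-clamping assumption are ours; a real implementation would evaluate the same right-binding AND/OR chain on compressed P-trees.

```python
def ptree_greater_than(slices, c, width):
    """P_{A>c}, equation (6): right-binding AND/OR chain over the bit slices
    [P_{m-1}, ..., P_0] (MSB first); op_i is AND if bit i of c is 1, else OR;
    k is the rightmost 0-bit of c (c must not be all 1s)."""
    bits = [(c >> i) & 1 for i in range(width)]
    k = next(i for i in range(width) if bits[i] == 0)
    acc = slices[width - 1 - k]                          # start from P_k
    for i in range(k + 1, width):
        p_i = slices[width - 1 - i]
        acc = (p_i & acc) if bits[i] else (p_i | acc)
    return acc

def ptree_less_equal(slices, c, width, mask):
    """P_{A<=c}, equation (7): the complement of P_{A>c}."""
    return ~ptree_greater_than(slices, c, width) & mask

def band(slices, center, r, width, mask):
    """P_{c,r}, equation (8): rows with center - r < x <= center + r.
    Assumes 0 <= center - r and center + r < 2**width - 1 (no clamping)."""
    return ptree_greater_than(slices, center - r, width) & \
           ptree_less_equal(slices, center + r, width, mask)

def einring_count(slices, center, k, eps, width, mask):
    """||EINring(center, k, eps)||, equation (9): outer band AND NOT inner band."""
    outer = band(slices, center, (k + 1) * eps, width, mask)
    inner = band(slices, center, k * eps, width, mask)
    return bin(outer & ~inner & mask).count("1")

# usage with the representation from the previous sketch:
# slices = bit_slices(column, width); mask = (1 << len(column)) - 1
```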
4.3 Implementation of CAMP Using P-trees
The critical implementation steps in CAMP are the computation of the density and similarity functions and the manipulation of the (virtual) attractor trees during the clustering process. We discuss these steps below.
Computation of the density function (in the CLAT process): According to equations (2) and (9), the density function is calculated using P-trees as follows:

DF(x) = Σ_{k≥1} f_k(x) · ||Px,kε||    (10)
Calculation of cluster similarity: In equation (3), we need to calculate the average variation of each cluster, Vi and Vj, and the distance between the two attractors, d(Ai, Aj). Vi is calculated by equation (4). As mentioned in Section 4.2, equation (4) can be implemented very efficiently using P-trees.
Structure of a (virtual) attractor tree: An attractor tree consists of two parts: (1) a collection of summary data, such as the size of the tree, the attractor, and the average variation; and (2) a P-tree used as an index to the points in the attractor tree.
Here is an example of an index P-tree. Assume the data set size is 8, and the first four points and the sixth point of the data set are in an attractor tree. The corresponding bit index (11110100) and P-tree are shown in Figure 13.
Figure 13. The P-tree for an attractor tree (bit index 11110100).
Creating an attractor tree (in the CLAT process): When a steepest ascending path (SAP) from a point stops at a new local maximum, we need to create a new attractor tree. The stop point is the attractor.
Updating an attractor tree (in the CLAT process): If the SAP encounters a point in an existing attractor tree, the whole SAP is inserted into that attractor tree. The attractor does not change. As a result, the attractor tree's index needs to be updated by a P-tree OR operation:
Pnew = Pold ∨ Psap    (11)

where Pnew is the P-tree for the new attractor tree, Pold is the P-tree for the old attractor tree, Psap represents the points in the SAP, and ∨ is the OR operation.
Merging attractor trees (in the MP process): When two attractor trees are combined into a new virtual attractor tree, the new P-tree is formed simply by ORing the two old P-trees, i.e. Pv = Pi ∨ Pj. The new attractor is calculated by equation (5).
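At the bit-vector level, this index bookkeeping reduces to bitwise ORs. A minimal sketch follows; the helper name and the example rows are illustrative, and Python ints again stand in for uncompressed P-tree bit indexes.

```python
def index_from_rows(rows, n):
    """Bit index for an attractor tree: bit t is 1 iff row t belongs to the tree."""
    p = 0
    for t in rows:
        p |= 1 << t
    return p

# the example of Figure 13: rows 0-3 and 5 of an 8-row data set
# (the figure writes the index as the left-to-right string 11110100;
#  here row t is simply bit t of the integer)
p_tree = index_from_rows([0, 1, 2, 3, 5], 8)

# updating an attractor tree with a steepest ascending path, equation (11)
p_sap = index_from_rows([4, 6], 8)
p_new = p_tree | p_sap

# merging two attractor trees into a virtual attractor tree: P_v = P_i OR P_j
p_i, p_j = index_from_rows([0, 1], 8), index_from_rows([5, 7], 8)
p_v = p_i | p_j
print(bin(p_new), bin(p_v).count("1"))   # ||P|| = number of member points
```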
5 DISCUSSION
In this section, we discuss noise handling and the location of the optimal cutting level in the hierarchical structure, at which the clustering results are the best.
5.1 Delayed Noise Handling Process
In the case of noisy data, it is important to have a proper noise handling process. Naively, the points which stand alone after the CLAT process should be noise. However, some sparse clusters might be mistakenly eliminated as noise if noise is handled at this stage. Therefore, we delay the noise-handling process until a later stage.
The neighborhoods of noise points are generally sparser than those of points in clusters, so in the cluster merging process noise points have much less chance of merging with other points. Therefore, the cluster merging process is tracked to capture clusters which are growing slowly. We mainly check the case when a large cluster is merged with a very small cluster: if the large cluster did not grow within a certain number of previous iterations, we stop and eliminate the small cluster as noise. Figure 14 shows a slowly growing cluster merged from a large cluster and a small cluster; in this case, the small cluster is eliminated as noise.
Figure 14. Noise handling: a large cluster merged with a small cluster.
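The growth-tracking rule above is described only qualitatively; the following sketch shows one way such a check could look inside the merging loop, with the window length and the size ratio as illustrative parameters rather than values from the paper.

```python
def is_noise_merge(large_size, small_size, large_history, window=5, small_frac=0.02):
    """Delayed noise check (Section 5.1), with illustrative parameters:
    eliminate the small cluster as noise when the large cluster has not grown
    during the last `window` merge iterations and the small cluster is tiny
    relative to it. `large_history` is the large cluster's size per iteration."""
    stalled = (len(large_history) >= window
               and all(s == large_history[-1] for s in large_history[-window:]))
    return stalled and small_size <= small_frac * large_size
```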
5.2 Determination of the Optimal Cutting Level
Hierarchical clustering generates a nested sequence of clusters ordered level by level. But which level should the user pick for the best results? In this section, we introduce a way to locate the optimal cutting level. The method is a modification of one of the traditional cluster validation approaches discussed in [13], the Davies-Bouldin index.
Given a cluster partition {C1, C2 ... Ck}, we define the relative similarity between two
clusters, Ci and Cj, as
RSi,j = (Ei + Ej) / d(mi, mj)²    (14)

where d(mi, mj) is the distance between mi and mj, the means of cluster i and cluster j, and Ei is the average squared distance from the points in the ith cluster to the mean of that cluster. Ei is calculated as follows:

Ei = (1 / ||Ci||) Σ_{x ∈ Ci} (x − mi)²    (15)

With RSi,j, we can get the maximum relative similarity between cluster i and every other cluster, denoted MRSi. MRSi is calculated as follows:

MRSi = max_{j ≠ i} {RSi,j}    (16)

The modified Davies-Bouldin (MDB) index for the partition {C1, C2 ... Ck} is the average of MRSi (i = 1, 2 ... k), denoted MDB(k). MDB(k) is calculated as follows:

MDB(k) = (1/k) Σ_{i=1}^{k} MRSi    (17)
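A small sketch of the MDB computation of equations (14)-(17), assuming a data matrix X and integer cluster labels with at least two clusters (for example, as produced by the CLAT/merging sketches above):

```python
import numpy as np

def mdb_index(X, labels):
    """Modified Davies-Bouldin index, equations (14)-(17); assumes k >= 2."""
    ids = np.unique(labels)
    means = np.array([X[labels == i].mean(axis=0) for i in ids])
    # E_i: average squared distance to the cluster mean, equation (15)
    E = np.array([np.mean(np.sum((X[labels == i] - means[t]) ** 2, axis=1))
                  for t, i in enumerate(ids)])
    k = len(ids)
    mrs = np.zeros(k)
    for a in range(k):
        rs = [(E[a] + E[b]) / np.sum((means[a] - means[b]) ** 2)  # RS_{i,j}, eq. (14)
              for b in range(k) if b != a]
        mrs[a] = max(rs)                                          # MRS_i, eq. (16)
    return mrs.mean()                                             # MDB(k), eq. (17)
```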
The smaller MDB(k) is, the better the partition. To find the optimal level of clustering, we can draw an MDB–k diagram and search for a minimum. We tested this approach on two data sets: the Iris data from UCI (University of California, Irvine) and the data set used in OPTICS [1]. The clustering results are generated using CAMP. Figure 15 shows the MDB–k diagrams of the two data sets.
Figure 15. MDB(k) for the two data sets.
From Figure 15, we can see that MDB reaches its minimum at k = 3 for the Iris data and at k = 6 for the OPTICS data. Therefore, the optimal levels for the Iris data and the OPTICS data set are those where the numbers of clusters are 3 and 6, respectively. These results conform to the ground truth of the two data sets.
6 PERFORMANCE STUDY
We used three microarray expression datasets: DS1, DS2 and DS3. DS1 is the dataset used by CLICK [20]; it contains expression levels of 8,613 human genes measured at 12 time points. DS2 and DS3 were obtained from Michael Eisen's lab [17]. DS2 is a gene expression matrix of 6,221 × 80. DS3 is the largest dataset, with 13,413 genes under 36 experimental conditions. The raw expression data was first normalized [3], and the datasets were then decomposed and converted to P-trees. We implemented the HK-means [21], BIRCH [22], CAST [3], and CAMP algorithms in C++ on a Debian Linux 3.0 PC with a 1 GHz Pentium CPU and 1 GB of main memory. To make the algorithms comparable, we ran each method up to the level where the number of clusters equals the number of clusters that CAST produces, which is 12 for DS1, 8 for DS2, and 22 for DS3.
6.1 Run Time Comparison
The total run times of the different algorithms on DS1, DS2 and DS3 are shown in Figure 16. CAMP is the fastest among the four methods; in particular, CAMP outperforms HK-means and CAST substantially when the data set is large.
Figure 16. Run time comparison
6.2 Clustering Results Comparison
The clustering results are evaluated by means of the MDB statistical index discussed in section 5.2. The MDB value of each method on data sets DS1, DS2, and DS3 is shown in Figure 17. A low MDB value means good clustering results. From Figure 17, we can see that CAMP and CAST both have lower MDB values (better clustering results) than the other two methods.
Figure 17. Clustering results measured in MDB
In summary, CAMP outperforms the other three methods in terms of execution time, with clustering results comparable to those of CAST.
6.3 Visualization of Clustering Results
It is often useful to visualize clustering results. Eisen et al. [7] developed a software tool to visualize hierarchical clustering results of gene expression data. The software, Java TreeView, was written by Alok Saldanha at Stanford University. This program reads files with matching .cdt and .gtr (or other) extensions and visualizes gene clustering results as a gene tree graph.
A .cdt (clustered data table) file contains the original data, reordered to reflect the clustering results. It has the same format as the input files, except that an additional column, GID, is added. The GID column contains a unique identifier for each gene, which is used in conjunction with the .gtr file to build a gene tree. The .gtr (gene tree) file records the order in which the genes are joined during clustering.
Given a .cdt and a .gtr file, Java TreeView generates a gene tree graph. The
expression values of each gene are rendered in a red-green color scale, where red represents
higher expression and green indicates lower expression in the given experiment. Figure 18 is
a partial gene tree graph.
Figure 18. A snapshot of a partial gene tree
Java TreeView was designed for traditional hierarchical clustering, so it is necessary to extend the tool to accommodate hybrid clustering. One simple way is to take a representative point from each preliminary cluster and treat it as a leaf node. For example, we can take the local density attractors as the representative points to visualize our clustering results. On DS1, CAMP generates 92 local density attractors. We create a file which contains all the local density attractors in .cdt format, and another file in .gtr format based on the merging process in CAMP. We then input them into Java TreeView to generate a gene tree of the hybrid clustering results. Figure 19 shows the hybrid gene tree of DS1.
Figure 19. The gene tree of hybrid clustering on DS1
7 CONCLUSION
In this paper, we have proposed CAMP, an efficient hybrid clustering method using attractor trees, which combines the features of both the density-based and the similarity-based clustering approaches. A vertical data structure, the P-tree, is used to make the algorithm more efficient by accelerating the calculation of the density function. The process of building local attractor trees prunes a large portion of the bottom of the hierarchical structure. The relative cluster similarity measure and the noise-handling process improve clustering accuracy. Experiments on common gene expression datasets demonstrated that our approach is more efficient and scalable with competitive accuracy.
In the future, we will apply our approach to large-scale time series gene expression data, where efficient and scalable analysis approaches are in demand. We will also explore a comprehensive tool to visualize hybrid clustering as well as hierarchical clustering.
REFERENCES
1. Ankerst, M., Breunig, M. Kriegel, H.-P. and Sander, J. OPTICS: Ordering points to
identify the clustering structure. ACM-SIGMOD Conference on Management of Data
(SIGMOD’99), Philadelphia, PA, 1999. pp. 49-60.
2. Arima, C. and Hanai, T. “Gene Expression Analysis Using Fuzzy K-Means Clustering”,
Genome Informatics 14, pp. 334-335, 2003.
3. Ben-Dor, A., Shamir, R. & Yakhini, Z. “Clustering gene expression patterns,” Journal of
Computational Biology, Vol. 6, 1999, pp. 281-297.
4. Berkhin, P. Survey of Clustering Data Mining Techniques. Technical report, Accrue
Software, 2002
5. Cho, R. J. M. et al. “A Genome-Wide Transcriptional Analysis of The Mitotic Cell
Cycle.” Molecular Cell, 2:65-73, 1998.
6. Ding, Q., Khan, M., Roy, A., and Perrizo, W., “The P-Tree Algebra”, ACM SAC, 2002.
7. Eisen, M.B., Spellman, P.T., “Cluster analysis and display of genome-wide expression patterns”. Proceedings of the National Academy of Sciences USA, pp. 14863-14868, 1998.
8. Golub, T. R.; Slonim, D. K.; Tamayo, P.; Huard, C.; Gaasenbeek, M. et al. “Molecular
classification of cancer: class discovery and class prediction by gene expression
monitoring”. Science 286, pp. 531-537, 1999.
9. Han J. and Kamber M. Data Mining, Concepts and Techniques. Morgan Kaufmann, 2001.
10. Hartigan, J. A. and Wong, M. A. A k-means clustering algorithm. Applied Statistics, 28:
1979. pp.100-108
11. Hartuv, E. and Shamir, R. A clustering algorithm based on graph connectivity.
Information Processing Letters, 76(4-6):175-181, 2000.
12. Hinneburg, A., and Keim, D. A.: An Efficient Approach to Clustering in Large
Multimedia Databases with Noise. Proceedings of the 4th Int. Conf. on Knowledge Discovery
and Data Mining, AAAI Press, 1998.
13. Jain, A K. and Dubes, R. C. Algorithms for Clustering Data. Prentice-Hall advanced
reference series. Prentice-Hall, Inc. 1988.
14. Jain, A. K., Murty M.N., and Flynn P. J. Data Clustering: A Review, ACM Computing
Surveys, Vol 31, No. 3, 1999. pp. 264-323.
15. Karypis, G., Han, E.-H. and Kumar, V. CHAMELEON: A hierarchical clustering
algorithm using dynamic modeling. IEEE Computer, 32(8):68–75, August 1999.
16. Lin, C. and Chen, M. Combining Partitional and Hierarchical Algorithms for Robust and
Efficient Data Clustering with Cohesion Self-Merging, IEEE Transaction on Knowledge
and Data Engineering, Vol 17, No. 2, 2005. pp. 145-159.
17. Michael Eisen's gene expression data is available at http://rana.lbl.gov/EisenData.htm
18. Murty, N. M. and Krishna, G. A Hybrid Clustering Procedure for Concentric and Chain-Like Clusters, International Journal of Computer and Information Sciences, vol. 10, no. 6,
1981. pp. 397-412.
19. Perrizo, W., “Peano Count Tree Technology”. Technical Report NDSU-CSOR-TR-01-1,
2001.
20. Shamir R. and Sharan R. CLICK: A clustering algorithm for gene expression analysis. In
Proceedings of the 8th International Conference on Intelligent Systems for Molecular
Biology (ISMB '00). AAAI Press. 2000.
21. Tavazoie, S., Hughes, J. D., et al. “Systematic determination of genetic network
architecture”. Nature Genetics, 22, pp. 281-285, 1999.
22. Zhang, T., Ramakrishnan, R. and Livny, M. BIRCH: an efficient data clustering method for very large databases. In Proceedings of the Int’l Conf. on Management of Data, ACM
SIGMOD 1996.