A Comprehensive Hierarchical Clustering Method for Gene Expression Data
Baoying Wang, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105
Tel: (701) 231-6257
Fax: (701) 231-8255
{baoying.wang, william.perrizo}@ndsu.nodak.edu
Abstract. Data clustering methods have proven to be a successful data mining technique in the analysis of gene expression data. However, some concerns and challenges still remain in gene expression clustering. For example, many traditional clustering methods that originated in non-biological fields may not work well if the model is not sufficient to capture the genuine clusters among noisy gene expression data. In this paper, we propose an efficient comprehensive hierarchical clustering method using attractor trees based on both density factors and similarity factors. The combination of the density-based approach and the similarity-based approach takes into consideration clusters with diverse shapes, densities, and sizes, and is capable of dealing with noise. A vertical data structure, the P-tree¹, is used to make the clustering process more efficient by accelerating the calculation of density functions using P-tree based neighborhood rings. Experiments on common gene expression datasets demonstrate that our approach is more efficient and scalable with competitive accuracy.
Keywords: gene expression data, hierarchical clustering, P-trees, microarray.
1 Introduction
Clustering in data mining is a discovery process that partitions a data set into groups such that the data points in the same group are more similar to each other than to the data points in other groups. Clustering analysis of microarray gene expression data, which discovers groups that are homogeneous and well separated, has been recognized as an effective method for gene expression analysis.
Eisen et al. first applied a hierarchical linkage clustering approach that groups the closest pairs into a hierarchy of nested subsets based on similarity [5]. Golub et al. successfully discovered tumor classes based on the simultaneous expression profiles of thousands of genes from acute leukemia patients' test samples using a self-organizing map clustering approach [7]. Some other clustering approaches, such as k-means [14], fuzzy k-means [1], and CAST [2], have also proven to be valuable clustering methods for gene expression data analysis. However, some concerns and challenges still remain in gene expression clustering. For example, many traditional clustering methods that originated in non-biological fields may not work well if the model is not sufficient to capture the genuine clusters among noisy data.

¹ Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308.
There are two main categories of clustering methods: similarity-based partitioning methods and density-based clustering methods. A similarity-based partitioning algorithm breaks a dataset into k subsets, which are assumed to be convex in shape and similar in size. Density-based clustering assumes that the clusters in a dataset have similar densities. Most hierarchical clustering methods belong to the similarity-based partitioning family; as a result, they can only handle clusters with convex shapes and similar sizes. In this paper, we propose an efficient comprehensive agglomerative hierarchical clustering method using attractor trees: Clustering using Attractor trees and a Merging Process (CAMP). CAMP combines the features of both the density-based and the similarity-based clustering approaches, takes into consideration clusters with diverse shapes, densities, and sizes, and is capable of dealing with noisy data.
A vertical data structure, the P-tree, is used to make the algorithm more efficient by accelerating the calculation of the density function. P-trees are also used as bit indexes to clusters. In the merging process, only the summary information of the attractor trees is used to find the closest cluster pair. When two clusters are to be merged, only their P-tree indexes are retrieved to perform the merge. The clustering result consists of an attractor tree and a collection of P-tree indexes to clusters. Experiments on common gene expression datasets demonstrate that our approach is more efficient and scalable with competitive accuracy.
This paper is organized as follows. In section 2 we
give an overview of the related work. We present our
new clustering method, CAMP, in section 3. Section 4
discusses the implementation of CAMP using P-trees.
An experimental performance study is described in
section 5. Finally we conclude the paper in section 6.
2 Related Work
2.1 Similarity-based clustering vs. density-based clustering
There are two major clustering categories: similarity-based partitioning methods and density-based clustering methods. A similarity-based partitioning algorithm breaks a dataset into k subsets, called clusters. The major problems with similarity-based partitioning methods are: (1) k has to be predetermined; (2) it is difficult to identify clusters with different sizes; and (3) they only find convex clusters. Density-based clustering methods have been developed to discover clusters with arbitrary shapes. The most typical algorithm is DBSCAN [6]. The basic idea of DBSCAN is that, for each point of a cluster, the neighborhood of a given radius (ε) has to contain at least a minimum number of points (MinPts), where ε and MinPts are input parameters.
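For readers unfamiliar with DBSCAN, a minimal sketch of its core-point test is given below; the function name, toy data, and parameter values are illustrative assumptions, not DBSCAN's reference implementation.

```python
import numpy as np

def is_core_point(data, idx, eps, min_pts):
    """DBSCAN core condition: the eps-neighborhood of point idx must contain
    at least min_pts points (the point itself is counted here)."""
    dists = np.linalg.norm(data - data[idx], axis=1)
    return np.count_nonzero(dists <= eps) >= min_pts

# toy example (values are illustrative only)
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(is_core_point(data, 0, eps=0.5, min_pts=3))  # True: dense region
print(is_core_point(data, 3, eps=0.5, min_pts=3))  # False: isolated point
```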
2.2 Hierarchical clustering algorithms
Hierarchical algorithms create a hierarchical
decomposition of a dataset X. The hierarchical
decomposition is represented by a dendrogram, a tree
that iteratively splits X into smaller subsets until each
subset consists of only one object. In such a hierarchy,
each level of the tree represents a clustering of X.
Most hierarchical clustering algorithms are variants of the single-link and complete-link approaches. In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of points drawn one from each of the two clusters. In the complete-link algorithm, the distance between two clusters is the maximum of all pair-wise distances between points in the two clusters. In either case, two clusters are merged to form a larger cluster based on the minimum-distance (or maximum-similarity) criterion. The complete-link algorithm produces tightly bound or compact clusters, while the single-link algorithm suffers when there is a chain of noise points between two clusters.
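The two linkage rules can be stated in a few lines. The following sketch (illustrative names and toy clusters, not taken from any of the cited systems) shows how the same pair of clusters yields different inter-cluster distances under each rule.

```python
import numpy as np

def single_link(c1, c2):
    """Single-link distance: minimum pairwise distance between the clusters."""
    return min(np.linalg.norm(p - q) for p in c1 for q in c2)

def complete_link(c1, c2):
    """Complete-link distance: maximum pairwise distance between the clusters."""
    return max(np.linalg.norm(p - q) for p in c1 for q in c2)

# illustrative clusters
a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[3.0, 0.0], [4.0, 0.0]])
print(single_link(a, b))    # 2.0: closest pair (1,0)-(3,0)
print(complete_link(a, b))  # 4.0: farthest pair (0,0)-(4,0)
```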
CHAMELEON [10] is a variant of the complete-link approach. It operates on a k-nearest-neighbor graph.
The algorithm first uses a graph partitioning approach
to divide a dataset into a set of small clusters. Then the
small clusters are merged based on their similarity
measure. CHAMELEON has been found to be very
effective in clustering convex shapes. However, the
algorithm is not designed for very noisy data sets.
When the dataset size is large, hierarchical clustering
algorithms break down due to their non-linear time
complexity and huge I/O costs. In order to remedy this
problem, BIRCH [16] was developed. BIRCH performs a linear scan of all data points, and the cluster summaries are stored in memory in a data structure called a CF-tree. A nonleaf node represents a cluster
consisting of all the subclusters represented by its
entries. A leaf node has to contain at most L entries and
the diameter of each entry in a leaf node has to be less
than T. A point is inserted by inserting the
corresponding CF-value into the closest leaf of the tree.
If an entry in the leaf can absorb the new point without
violating the threshold condition, the CF-values for this
entry are updated; otherwise a new entry in the leaf
node is created. Once the clusters are generated, each
data point is assigned to the cluster with the closest
centroid. Such label assignment may cause problems when the clusters do not have similar sizes and shapes.
2.3 Clustering methods for gene expression data
There are many newly developed clustering
methods which are dedicated to gene expression data.
These clustering algorithms partition genes into groups
of co-expressed genes.
Eisen et al. [5] adopted a hierarchical approach using UPGMA (Unweighted Pair Group Method with Arithmetic Mean) to group the closest gene pairs. This
method displays the clustering results in a colored graph
pattern. In this method, the gene expression data is
colored according to the measured fluorescence ratio,
and genes are re-ordered based on the hierarchical
dendrogram structure.
Ben-Dor et al. [2] proposed a graph-based
algorithm CAST (Cluster Affinity Search Techniques)
to improve clustering results. Two points are linked in
the graph if they are similar. The problem of clustering a set of genes is then converted into a classical graph-theoretical problem. CAST takes as input a parameter
called the affinity threshold t, where 0 < t < 1, and tries
to guarantee that the average similarity in each
generated cluster is higher than the threshold t.
However, this method has a high complexity.
To improve the performance, Hartuv et al. [9] presented a polynomial algorithm, HCS (Highly Connected Subgraph). HCS recursively splits the weighted graph into a set of highly connected sub-graphs along the minimum cut. Each highly connected sub-graph is called a cluster. Later on, the same research group developed another algorithm, CLICK (Cluster Identification via Connectivity Kernels) [13]. CLICK builds a statistical framework to measure the coherence within a subset of genes and determines the criterion to stop the recursive splitting process.
3 Agglomerative Hierarchical Clustering Using Attractor Trees
In this section, we propose an efficient comprehensive agglomerative hierarchical clustering using attractor trees, CAMP. CAMP consists of two processes: (1) clustering by local attractor trees (CLA) and (2) cluster merging based on similarity (MP). The final clustering result is an attractor tree and a set of P-tree indexes to clusters corresponding to each level of the attractor tree. The attractor tree is composed of leaf nodes, which are the local attractors of the attractor sub-trees constructed in the CLA process, and interior nodes, which are virtual attractors resulting from the MP process. Figure 1 is an example of an attractor tree.
Figure 1. The attractor tree, with local attractors as leaves and virtual attractors as interior nodes.
The data set is first grouped into local attractor trees by means of a density-based approach in the CLA process. Each local attractor tree represents a preliminary cluster, the root of which is the density attractor of the cluster. Then the small clusters are merged level-by-level in the MP process according to their similarity until the whole data set becomes one cluster.
In this section, we first define the density function of data points and describe the detailed clustering process of CLA. Then we define the similarity function between clusters and propose the algorithm of the cluster merging process (MP). Finally, we discuss our noise-handling technique.
3.1 Density Function
Given a data point x, the density function of x is defined as the sum of the influence functions of all data points in the data space X. There are many ways to calculate the influence function. The influence of a data point on x is inversely proportional to the distance between that point and x. If we divide the neighborhood of x into neighborhood rings, then points within smaller rings have more influence on x than those in bigger rings. We define the neighborhood ring as follows:
Definition 1. The Neighborhood Ring of a data point c with radii r1 and r2 is defined as the set R(c, r1, r2) = {x ∈ X | r1 < |c − x| ≤ r2}, where |c − x| is the distance between x and c. The number of neighbors falling in R(c, r1, r2) is denoted as N = ||R(c, r1, r2)||.
Definition 2. The Equal Interval Neighborhood Ring (EINring) of a data point c with radii r1 = kε and r2 = (k+1)ε is defined as the kth neighborhood ring EINring(c, k, ε) = R(c, r1, r2) = R(c, kε, (k+1)ε), where ε is a constant. Figure 2 shows 2-D EINrings with k = 1, 2, and 3. The number of neighbors falling within the kth EINring is denoted as ||EINring(c, k, ε)||.
Let y be a data point within the kth EINring of x. The EINring-based influence function of y on x is defined as:

f(y, x) = f_k(x) = \frac{1}{\log k}    (1)

The density function of x is then defined as the summation of the influence over every EINring neighborhood of x:

DF(x) = \sum_{k=1}^{\infty} f_k(x) \, \|\mathrm{EINring}(x, k, \varepsilon)\|    (2)

Figure 2. Diagram of EINrings (k = 1, 2, 3).
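To make Definitions 1-2 and equations (1)-(2) concrete, here is a brute-force sketch (not the paper's P-tree implementation); eps, max_k, and the choice to start the sum at k = 2 (since 1/log k is undefined at k = 1) are assumptions made for illustration.

```python
import numpy as np

def einring_count(data, x, k, eps):
    """Number of points in EINring(x, k, eps): k*eps < |x - p| <= (k+1)*eps."""
    d = np.linalg.norm(data - x, axis=1)
    return int(np.count_nonzero((d > k * eps) & (d <= (k + 1) * eps)))

def density(data, x, eps, max_k=10):
    """DF(x) = sum_k f_k(x) * ||EINring(x, k, eps)||, with f_k(x) = 1/log(k)
    as in equation (1); k starts at 2 here to keep the weight finite."""
    total = 0.0
    for k in range(2, max_k + 1):
        total += einring_count(data, x, k, eps) / np.log(k)
    return total

# illustrative usage
data = np.random.rand(100, 12)   # e.g., 100 genes measured at 12 time points
print(density(data, data[0], eps=0.2))
```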
3.2 Clustering by Local Attractor Trees
The basic idea of clustering by local attractor trees (CLA) is to partition the data set into clusters in terms of density attractor trees. The clusters can be density-connected areas of any shape and any size. Given a data point x, if we follow the steepest density-ascending path, the path will finally lead to a local density attractor. If x does not have such a path, it is either a local attractor or a noise point. All points whose steepest ascending paths lead to the same local attractor form a cluster. The resulting graph is a collection of local attractor trees, each with its local attractor as the root. The leaves are the boundary points of the clusters. An example of a dataset and its attractor trees is shown in Figure 3.
Figure 3. A dataset and the attractor trees.
Given the step size s and the EINring interval ε, the CLA clustering proceeds as follows (a short sketch follows these steps):
1. Compute the density function for each point.
2. For an arbitrary point x, find the point with the highest density in the neighborhood R(x, 0, s). If its density is higher than the density of x, build a directed edge from x to that point.
3. If none of the neighbors has a higher density than x, x becomes the root of a local attractor tree and is assigned a new cluster label.
4. Go back to step 2 with the next point.
5. The data points in each attractor tree are assigned the same cluster label as the attractor/root.
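A minimal sketch of this procedure, assuming the densities have already been computed (for example with the density sketch in Section 3.1) and using illustrative names, is given below; it is not the authors' C++ implementation.

```python
import numpy as np

def cla(data, dens, s):
    """Clustering by Local Attractor trees (sketch).
    data: (n, d) array; dens: precomputed density per point; s: step size.
    Returns parent pointers (root = local attractor) and cluster labels."""
    n = len(data)
    parent = np.arange(n)            # each point initially points to itself
    for i in range(n):
        d = np.linalg.norm(data - data[i], axis=1)
        nbrs = np.where((d > 0) & (d <= s))[0]       # neighborhood R(x, 0, s)
        if len(nbrs) and dens[nbrs].max() > dens[i]:
            parent[i] = nbrs[np.argmax(dens[nbrs])]  # edge toward the densest neighbor
    def root(i):
        # follow edges of strictly increasing density up to the local attractor
        while parent[i] != i:
            i = parent[i]
        return i
    labels = np.array([root(i) for i in range(n)])   # label = attractor/root id
    return parent, labels
```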
3.3 Similarity between Clusters
Many similarity measures have been used to find the most similar cluster pairs. CURE uses the similarity between the closest pair of points that belong to different clusters [8]. CHAMELEON uses both relative connectivity and relative closeness through complicated graph implementations. We also consider both relative connectivity and relative closeness, but combine them by means of attractor trees. We define the similarity between clusters i and j as follows:
CS(i, j) = \left( \frac{\bar{h}_i}{\bar{f}_i} + \frac{\bar{h}_j}{\bar{f}_j} \right) \cdot \frac{1}{d(A_i, A_j)}    (3)

where \bar{h}_i is the average height of the ith attractor tree, \bar{f}_i is the average fan-out of the ith attractor tree, and d(A_i, A_j) is the Euclidean distance between the two local attractors A_i and A_j. The calculations of \bar{h}_i and \bar{f}_i are discussed later.

Cluster similarity depends on the following factors: the distance between the attractors of the two clusters, and the average heights and average fan-outs of the two clusters. Our cluster similarity function can distinguish the cases shown in Figure 4. The cluster pairs on the left have higher similarity than those on the right according to our definition.

Figure 4. Examples of cluster similarity: (a) different average heights; (b) different average fan-outs; (c) different d(A_i, A_j).

3.4 Cluster Merging Process

After the local attractor trees (sub-clusters) are built in the CLA process, the cluster merging process (MP) starts combining the most similar sub-cluster pairs level-by-level based on the similarity measure. When two clusters are merged, their two local attractor trees are combined into a new tree, called a virtual local attractor tree. It is called "virtual" because the new root is not a real attractor: it is only a virtual attractor which could attract all the points of the two sub-trees. The merging process is shown in Figure 5. The cluster merging proceeds recursively by combining (virtual) attractor trees.

Figure 5. Cluster merging process: (a) before merging (attractors A_i and A_j); (b) after merging (virtual attractor A_v).

After merging, we need to compute the new root/attractor A_v, the average height \bar{h}_v, and the average fan-out \bar{f}_v of the new virtual attractor tree. Take two clusters C_i and C_j, for example, and assume the size of C_j is greater than or equal to that of C_i, i.e. ||C_j|| \geq ||C_i||. We have the following equations:

A_{vl} = A_{il} + \frac{\|C_j\|}{\|C_i\| + \|C_j\|} \, (A_{jl} - A_{il}), \quad l = 1, 2, \ldots, d    (4)

\bar{h}_v = \max\{\bar{h}_i, \bar{h}_j\} + \frac{\|C_j\|}{\|C_i\| + \|C_j\|} \, d(A_i, A_j)    (5)

\bar{f}_v = \frac{\|C_i\| \, \bar{f}_i + \|C_j\| \, \bar{f}_j}{\|C_i\| + \|C_j\|}    (6)

where A_{il} is the lth attribute of the attractor A_i, ||C_i|| is the size of cluster C_i, d(A_i, A_j) is the distance between the two local attractors A_i and A_j, and \bar{h}_i and \bar{f}_i are the average height and the average fan-out of the ith attractor tree, respectively.
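The similarity and merge-summary computations can be sketched directly from equations (3)-(6) as reconstructed above; the ClusterSummary fields and function names below are illustrative assumptions, not the authors' data structures.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ClusterSummary:
    size: int              # ||C||
    attractor: np.ndarray  # A, the (virtual) attractor
    avg_height: float      # h-bar
    avg_fanout: float      # f-bar

def similarity(ci, cj):
    """CS(i, j) = (h_i/f_i + h_j/f_j) / d(A_i, A_j), as in equation (3)."""
    d = np.linalg.norm(ci.attractor - cj.attractor)
    return (ci.avg_height / ci.avg_fanout + cj.avg_height / cj.avg_fanout) / d

def merge(ci, cj):
    """Summary of the virtual attractor tree after merging, following (4)-(6);
    assumes ||Cj|| >= ||Ci|| as in the text (clusters are swapped if needed)."""
    if cj.size < ci.size:
        ci, cj = cj, ci
    total = ci.size + cj.size
    w = cj.size / total
    attractor = ci.attractor + w * (cj.attractor - ci.attractor)          # (4)
    d = np.linalg.norm(ci.attractor - cj.attractor)
    height = max(ci.avg_height, cj.avg_height) + w * d                    # (5)
    fanout = (ci.size * ci.avg_fanout + cj.size * cj.avg_fanout) / total  # (6)
    return ClusterSummary(total, attractor, height, fanout)
```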
3.5 Delayed Noise Eliminating Process
Since gene expression data is highly noisy, it is important to have a proper noise-eliminating process. Naively, the points which stand alone after the CLA process should be noise. However, some sparse clusters might be mistakenly eliminated if the step size s is small. Therefore, we delay the noise handling until a later stage. The neighborhoods of noise points are generally sparser than those of points in clusters. In the cluster merging process, noise points tend to merge with other points far less often and grow much more slowly. Therefore, the cluster merging process is tracked to capture those clusters which grow very slowly. For a slowly growing cluster, if the cluster is small, the whole cluster is eliminated as noise; otherwise, if the cluster is large, the points which were recently merged into it at a low rate are peeled off.
4 Implementation of CAMP in P-trees
CAMP is implemented using a data-mining-ready vertical bitwise data structure, the P-tree, to make the clustering process much more efficient and scalable. The P-tree technology was initially developed by the DataSURG research group for spatial data [12][4]. P-trees provide rich information and are structured to facilitate data mining processes. In this section, we first briefly discuss the representation of a gene dataset in the P-tree structure and the P-tree based neighborhood computation. Then we detail the implementation of CAMP using P-trees.
4.1 Data Representation
We organize the gene expression data as a relational table with rows of genes and columns of experiments, or time series. Instead of using double-precision floating-point numbers with a mantissa and an exponent represented in two's complement, we partition the data space of gene expression as follows. First, we need to decide the number of intervals and specify the range of each interval. For example, we could partition the gene expression data space into 256 equal intervals along each dimension. After that, we replace each gene value within an interval by a bit string, using the strings 00000000 to 11111111 to represent the 256 intervals. The length of the bit string is the base-two logarithm of the number of intervals. The optimal number of intervals and their ranges depend on the size of the dataset and the accuracy requirements.
Given a gene table G = (E1, E2, ..., Ed) and the binary representation of the jth attribute Ej as b_{j,m} b_{j,m-1} ... b_{j,i} ... b_{j,1} b_{j,0}, the table is projected into columns, one for each attribute. Then each attribute column is further decomposed into separate bit vectors, one for each bit position of the values in that attribute. Figure 6 shows a relational table with three attributes. Figure 7 shows the decomposition process from the gene table G to a set of bit vectors.
Figure 6. An example of a gene table G = (E1, E2, E3):

E1  E2  E3
5   2   7
2   3   2
7   2   2
7   2   5
2   5   5
4   7   1
3   2   1
1   3   4

Figure 7. Decomposition of the gene table into bit vectors (each value in 3-bit binary):

E1 = 101, 010, 111, 111, 010, 100, 011, 001
E2 = 010, 011, 010, 010, 101, 111, 010, 011
E3 = 111, 010, 010, 101, 101, 001, 001, 100

E13 = 10110100   E23 = 00001100   E33 = 10011001
E12 = 01111010   E22 = 11110111   E32 = 11100000
E11 = 10110011   E21 = 01001101   E31 = 10011110
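A short sketch of the quantization and bit-vector decomposition described above, assuming equal-width intervals and the 3-bit toy table of Figures 6-7; the function names are illustrative.

```python
import numpy as np

def quantize(column, n_intervals=256):
    """Map raw expression values to equal-width interval codes 0..n_intervals-1."""
    lo, hi = column.min(), column.max()
    codes = np.floor((column - lo) / (hi - lo + 1e-12) * n_intervals).astype(int)
    return np.clip(codes, 0, n_intervals - 1)

def bit_vectors(codes, n_bits):
    """Decompose interval codes into one boolean vector per bit position
    (most significant bit first), as in Figure 7."""
    return [((codes >> b) & 1).astype(bool) for b in range(n_bits - 1, -1, -1)]

raw = np.array([0.02, -1.3, 2.5, 2.4, -1.2, 0.9, 0.1, -2.0])
print(quantize(raw, 8))                      # interval codes in 0..7

e1 = np.array([5, 2, 7, 7, 2, 4, 3, 1])      # attribute E1 from Figure 6
for bv in bit_vectors(e1, 3):                # prints E13, E12, E11
    print(bv.astype(int))
```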
After the decomposition process, each bit vector is converted into a P-tree. A P-tree is built by recording the truth of the predicate "purely 1-bits" recursively on halves of the bit vector until purity is reached. Three P-tree examples are illustrated in Figure 8.
Figure 8. P-trees of the attributes E21, E22, and E23.
The P-tree logic operations are pruned bit-by-bit operations, performed level-by-level starting from the root level. For instance, ANDing a pure-0 node with any node results in a pure-0 node, and ORing a pure-1 node with any node results in a pure-1 node.
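The following sketch illustrates the idea of a tree built recursively over halves of a bit vector together with the pruned AND logic described above; the node encoding (0, 1, or a pair of children) and function names are assumptions for illustration, not the patented P-tree format.

```python
def build_ptree(bits):
    """Recursively record whether a segment is pure-1 (1), pure-0 (0),
    or mixed (a pair of child subtrees over its two halves)."""
    if all(bits):
        return 1
    if not any(bits):
        return 0
    mid = len(bits) // 2
    return (build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def ptree_and(a, b):
    """AND two trees; a pure-0 node short-circuits the result to pure-0,
    a pure-1 node passes the other operand through unchanged."""
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    return (ptree_and(a[0], b[0]), ptree_and(a[1], b[1]))

def count_ones(tree, length):
    """Root count ||P||: number of 1-bits represented by the tree."""
    if tree == 1:
        return length
    if tree == 0:
        return 0
    half = length // 2
    return count_ones(tree[0], half) + count_ones(tree[1], length - half)

e21 = [0, 1, 0, 0, 1, 1, 0, 1]   # bit vector E21 from Figure 7
e22 = [1, 1, 1, 1, 0, 1, 1, 1]   # bit vector E22 from Figure 7
p21, p22 = build_ptree(e21), build_ptree(e22)
print(count_ones(ptree_and(p21, p22), 8))   # positions where both bits are 1 -> 3
```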
4.2 P-tree based neighborhood computation
The major computational cost of CAMP lies in the preliminary clustering process, CLA, which mainly involves the computation of densities. To improve the efficiency of the density computation, we adopt a P-tree based neighborhood computation by means of optimized P-tree operations. In this section, we first review the optimized P-tree operations. Then we present the P-tree based neighborhood computation.
P-tree predicate operations: Let A be the jth dimension of the data set X, m be its bit-width, and P_m, P_{m-1}, ..., P_0 be the P-trees for the vertical bit files of A. Let c = b_m ... b_i ... b_0, where b_i is the ith binary bit value of c. Let P_{A>c} and P_{A \le c} be the P-trees representing the data points satisfying the predicates A > c and A \le c, respectively. Then we have

P_{A > c} = P_m \; op_m \; \cdots \; P_i \; op_i \; P_{i-1} \; \cdots \; op_{k+1} \; P_k, \quad k \le i \le m,    (7)

where op_i is AND (\wedge) if b_i = 1 and OR (\vee) otherwise;

P_{A \le c} = P'_m \; op_m \; \cdots \; P'_i \; op_i \; P'_{i-1} \; \cdots \; op_{k+1} \; P'_k, \quad k \le i \le m,    (8)

where op_i is AND (\wedge) if b_i = 0 and OR (\vee) otherwise. In the equations above, k is the rightmost bit position with value 0, and the operators are right binding.
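As a cross-check on equations (7)-(8), the sketch below evaluates the predicate A > c directly on the vertical bit columns with standard most-significant-bit-first comparison logic; it reproduces the result of the predicate but is not claimed to be the authors' exact operator chain, and all names are illustrative.

```python
import numpy as np

def mask_greater_than(bit_cols, c, n_bits):
    """Row mask for the predicate A > c, computed from vertical bit columns
    (bit_cols[0] is the most significant bit), scanning MSB to LSB."""
    n = len(bit_cols[0])
    greater = np.zeros(n, dtype=bool)
    tied = np.ones(n, dtype=bool)            # rows equal to c on all higher bits
    for i, col in enumerate(bit_cols):
        c_bit = (c >> (n_bits - 1 - i)) & 1
        if c_bit == 0:
            greater |= tied & col            # a 1-bit where c has 0 decides A > c
        tied &= (col == bool(c_bit))         # keep only rows still matching c
    return greater

e1 = np.array([5, 2, 7, 7, 2, 4, 3, 1])                   # attribute E1 (Figure 6)
cols = [((e1 >> b) & 1).astype(bool) for b in (2, 1, 0)]  # E13, E12, E11
print(mask_greater_than(cols, 4, 3))    # rows with E1 > 4: the values 5, 7, 7
print(~mask_greater_than(cols, 4, 3))   # the complement gives the mask for E1 <= 4
```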
Calculation of neighborhood: Let P_{c,r} be the P-tree representing the data points within the neighborhood R(c, 0, r) = {x ∈ X | 0 < |c − x| ≤ r}. Note that P_{c,r} is simply the P-tree representing the data points satisfying the predicate c − r < x ≤ c + r. Therefore

P_{c,r} = P_{c-r < x \le c+r} = P_{x > c-r} \wedge P_{x \le c+r}    (9)

where P_{x > c-r} and P_{x \le c+r} are calculated by means of the P-tree predicate operations above.

Calculation of the EINring neighborhood: Let P_{c,kε} be the P-tree representing the data points within EINring(c, k, ε) = {x ∈ X | kε < |c − x| ≤ (k+1)ε}. In fact, the EINring(c, k, ε) neighborhood is the intersection of R(c, 0, (k+1)ε) and the complement of R(c, 0, kε). Hence

P_{c,k\varepsilon} = P_{c,(k+1)\varepsilon} \wedge P'_{c,k\varepsilon}    (10)

where P'_{c,kε} is the complement of P_{c,kε}. The count of 1's in P_{c,kε}, denoted ||P_{c,kε}||, is the number of data points within the EINring neighborhood, i.e. ||EINring(c, k, ε)|| = ||P_{c,kε}||. Each 1 in P_{c,kε} identifies a specific neighbor point.
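Equation (10) can be mirrored with ordinary boolean index masks standing in for the P-trees; the sketch below is an illustration only, and eps and the random data are assumptions.

```python
import numpy as np

def ball_mask(data, c, r):
    """Index mask for R(c, 0, r): points with 0 < |c - x| <= r."""
    d = np.linalg.norm(data - c, axis=1)
    return (d > 0) & (d <= r)

def einring_mask(data, c, k, eps):
    """P_{c,k*eps} = P_{c,(k+1)*eps} AND NOT P_{c,k*eps}, as in equation (10)."""
    return ball_mask(data, c, (k + 1) * eps) & ~ball_mask(data, c, k * eps)

data = np.random.rand(200, 12)               # e.g., 200 genes, 12 conditions
ring = einring_mask(data, data[0], k=2, eps=0.1)
print(int(ring.sum()))                       # ||EINring(c, 2, eps)||
```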
4.3 Implementation of CAMP using P-trees

CAMP consists of two steps: clustering using local attractor trees (CLA) and the cluster merging process (MP). The critical issues are the computation of the density function and the similarity function, and the manipulation of the (virtual) attractor trees during the clustering process. In fact, the similarity function is easy to compute given the summary information of two attractor trees. In this section, we therefore focus on the density function and the attractor trees.

Computation of the density function (in CLA process): According to equations (2) and (10), the density function is calculated using P-trees as follows:

DF(x) = \sum_{k=1}^{\infty} f_k(x) \, \|P_{x,k\varepsilon}\|    (11)

Structure of a (virtual) attractor tree: An attractor tree consists of two parts: (1) a collection of summary data, such as the size of the tree, the attractor, the average height, the average fan-out, etc.; and (2) a P-tree used as an index to the points in the attractor tree. In the merging process, only the first part needs to be in memory; the second part is needed only at the time of merging. In addition, a lookup table is used to record the level of each point in the attractor tree. The lookup table is used only in the initial clustering process.

Here is an example of an index P-tree. Assume the dataset size is 8 and an attractor tree contains the first four points and the sixth point of the data set. The corresponding bit index is 11110100, which is converted into a P-tree as shown in Figure 9.

Figure 9. The P-tree for an attractor tree, built from the bit index 11110100.

Creating an attractor tree (in CLA process): When a steepest ascending path (SAP) from a point stops at a new local maximum, we need to create a new attractor tree. The stop point is the attractor. The average height is \bar{h} = N_s / 2, where N_s is the number of steps in the path. The average fan-out is \bar{f} = 1. The corresponding index P-tree is built.

Updating an attractor tree (in CLA process): If the SAP encounters a point in an existing attractor tree, the whole SAP is inserted into that attractor tree. As a result, the attractor tree needs to be updated. The attractor does not change. The new average height \bar{h}_{new} and the new average fan-out \bar{f}_{new} are calculated as follows:

\bar{h}_{new} = \frac{\bar{h}_{old} \cdot N_{old} + \frac{m}{2} + m \cdot l}{N_{old} + m}    (12)

\bar{f}_{new} = \frac{N^i_{old} \cdot \bar{f}_{old} + m}{N^i_{old} + m - 1}    (13)

where N_{old} is the size of the old attractor tree, m is the number of points added to the tree, l is the level of the insertion point, and N^i_{old} is the number of interior nodes of the old attractor tree.

Merging attractor trees (in MP process): When two attractor trees are combined into a new virtual attractor tree, the summary data are computed using equations (4)-(6). A new P-tree is formed simply by ORing the two old P-trees, i.e. P_v = P_i \vee P_j.
5 Performance Study
To evaluate the efficiency and accuracy of CAMP, we used three microarray expression datasets: DS1, DS2, and DS3. DS1 is the dataset used by CLICK [13]; it contains the expression levels of 8,613 human genes measured at 12 time points. DS2 and DS3 were obtained from Michael Eisen's lab [11]. DS2 is a gene expression matrix of 6,221 × 80. DS3 is the largest dataset, with 13,413 genes under 36 experimental conditions. The raw expression data was first normalized [2], and the datasets were then decomposed and converted to P-trees. We implemented the k-means [14], BIRCH [16], CAST [2], and CAMP algorithms in C++ on a Debian Linux 3.0 PC with a 1 GHz Pentium CPU and 1 GB of main memory. To make the algorithms comparable, we ran BIRCH and CAMP up to the level where the number of clusters equals the number of clusters CAST finds.
The total run times of the different algorithms on DS1, DS2, and DS3 are shown in Figure 10. Note that our approach outperformed k-means, BIRCH, and CAST substantially when the dataset is large. In particular, our approach performed almost 4 times faster than k-means for DS3.
Figure 10. Run time comparisons of k-means, BIRCH, CAST, and CAMP on DS1, DS2, and DS3.

The clustering results are evaluated by means of Hubert's Γ statistic [15]. Given two matrices X = [X(i, j)] and Y = [Y(i, j)], where X(i, j) is a similarity matrix between every pair of genes and Y(i, j) = 1 if genes i and j are in the same cluster and 0 otherwise, Hubert's Γ statistic indicates the point serial correlation between the two matrices X and Y and is computed as

\Gamma = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{X(i,j) - \bar{X}}{\sigma_X} \cdot \frac{Y(i,j) - \bar{Y}}{\sigma_Y}

where M = n(n-1)/2 and Γ lies in [-1, 1]. Γ is used to measure the correlation between the similarity matrix X and the adjacency matrix of the clustering results.

The best clustering qualities for the different methods on DS1, DS2, and DS3 are shown in Figure 11. From Figure 11, it is clear that our approach and CAST produce better clustering results than the other two methods. For DS1, our approach has better results than CAST.
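Hubert's Γ can be computed directly from the definition above; the sketch below is a plain NumPy version, and the way X and Y are built from a toy clustering is illustrative only.

```python
import numpy as np

def hubert_gamma(X, Y):
    """Hubert's Gamma: normalized correlation between the upper triangles of
    the similarity matrix X and the cluster co-membership matrix Y."""
    n = X.shape[0]
    iu = np.triu_indices(n, k=1)          # the M = n(n-1)/2 pairs with i < j
    x, y = X[iu], Y[iu]
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return float(np.mean(x * y))          # (1/M) * sum of standardized products

# illustrative usage: similarity from correlations, Y from cluster labels
expr = np.random.rand(50, 12)
labels = np.random.randint(0, 3, size=50)
X = np.corrcoef(expr)
Y = (labels[:, None] == labels[None, :]).astype(float)
print(hubert_gamma(X, Y))
```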
Figure 11. Γ value comparisons:

Method    DS1    DS2    DS3
K-means   0.446  0.302  0.285
BIRCH     0.510  0.504  0.435
CAST      0.678  0.685  0.727
CAMP      0.785  0.625  0.716
In summary, CAMP outperforms the other three methods in terms of execution time, with high scalability. Our clustering results are almost as good as those of CAST, which, however, is not scalable to large datasets because of its high complexity.
6 Conclusion
In this paper, we have proposed an efficient comprehensive hierarchical clustering method using attractor trees, CAMP, which combines the features of both the density-based and the similarity-based clustering approaches. The combination of the density-based approach and the similarity-based approach takes into consideration clusters with diverse shapes, densities, and sizes, and is capable of dealing with noise. A vertical data structure, the P-tree, and optimized P-tree operations are used to make the algorithm more efficient by accelerating the calculation of the density function. Experiments on common gene expression datasets demonstrated that our approach is more efficient and scalable with competitive accuracy. As a result, our approach can be a powerful tool for gene expression data analysis.
In the future, we will apply our approach to large-scale time series gene expression data, where efficient and scalable analysis approaches are in demand. We will also work on post-cluster analysis and result interpretation. For example, we will explore building Bayesian networks to model the potential pathway for each discovered cluster and subcluster.
References
1. Arima, C. and Hanai, T. "Gene Expression Analysis Using Fuzzy K-Means Clustering", Genome Informatics 14, pp. 334-335, 2003.
2. Ben-Dor, A., Shamir, R. and Yakhini, Z. "Clustering gene expression patterns," Journal of Computational Biology, Vol. 6, pp. 281-297, 1999.
3. Cho, R. J. et al. "A Genome-Wide Transcriptional Analysis of the Mitotic Cell Cycle." Molecular Cell, 2:65-73, 1998.
4. Ding, Q., Khan, M., Roy, A. and Perrizo, W. "The P-Tree Algebra", ACM SAC, 2002.
5. Eisen, M. B., Spellman, P. T., et al. "Cluster analysis and display of genome-wide expression patterns". Proceedings of the National Academy of Sciences USA, pp. 14863-14868, 1998.
6. Ester, M., Kriegel, H.-P., Sander, J. and Xu, X. "A density-based algorithm for discovering clusters in large spatial databases with noise". In Proceedings of the 2nd ACM SIGKDD, Portland, Oregon, pp. 226-231, 1996.
7. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M. et al. "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring". Science 286, pp. 531-537, 1999.
8. Guha, S., Rastogi, R. and Shim, K. CURE: An efficient clustering algorithm for large databases. In SIGMOD'98, Seattle, Washington, 1998.
9. Hartuv, E. and Shamir, R. A clustering algorithm based on graph connectivity. Information Processing Letters, 76(4-6):175-181, 2000.
10. Karypis, G., Han, E.-H. and Kumar, V. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8):68-75, August 1999.
11. Michael Eisen's gene expression data, available at http://rana.lbl.gov/EisenData.htm
12. Perrizo, W. "Peano Count Tree Technology". Technical Report NDSU-CSOR-TR-01-1, 2001.
13. Shamir, R. and Sharan, R. CLICK: A clustering algorithm for gene expression analysis. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB '00), AAAI Press, 2000.
14. Tavazoie, S., Hughes, J. D., et al. "Systematic determination of genetic network architecture". Nature Genetics, 22, pp. 281-285, 1999.
15. Tseng, V. S. and Kao, C. "An Efficient Approach to Identifying and Validating Clusters in Multivariate Datasets with Applications in Gene Expression Analysis," Journal of Information Science and Engineering, Vol. 20, No. 4, pp. 665-677, 2004.
16. Zhang, T., Ramakrishnan, R. and Livny, M. BIRCH: an efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD Int'l Conference on Management of Data, 1996.