A Comprehensive Hierarchical Clustering Method for Gene Expression Data
Baoying Wang, Honglin Li, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105
Tel: (701) 231-6257
Fax: (701) 231-8255
{baoying.wang, honglin.li, william.perrizo}@ndsu.nodak.edu
Abstract. Data clustering has proven to be a successful data mining technique in the analysis of gene expression data. However, some concerns and challenges still remain in gene expression clustering. For example, many traditional clustering methods originating from non-biological fields may break down if the choice of parameters is incorrect, or if the model is not sufficient to capture the genuine clusters among noisy data. In this paper, we propose an efficient comprehensive hierarchical clustering method using attractor trees (CAMP) based on both density and similarity factors. The combination of the density-based approach and the similarity-based approach accommodates clusters with diverse shapes, densities, and sizes. A vertical data structure, the P-tree*, is used to make the clustering process even more efficient by accelerating the calculation of the density function using neighborhood rings. Experiments on common gene expression datasets demonstrate that our approach is more efficient and scalable while achieving competitive accuracy.
Keywords: gene expression data, hierarchical
clustering, P-trees, microarray.
1 Introduction
Clustering in data mining is a discovery process that partitions the data set into groups such that the data points in the same group are more similar to each other than to the data points in other groups. Clustering analysis of microarray gene expression data, which discovers groups that are homogeneous and well separated, has been recognized as an effective method for gene expression analysis.
Eisen et al. first applied a hierarchical linkage clustering approach that groups the closest pairs into a hierarchy of nested subsets based on similarity [5]. Golub et al. also successfully discovered tumor classes based on the simultaneous expression profiles of thousands of genes from acute leukemia patients' test samples, using a self-organizing map clustering approach [7]. Some other clustering approaches, such as k-means [14], fuzzy k-means [1], CAST [2], etc., have also been proven to be valuable clustering methods for gene expression data analysis. However, some concerns and challenges still remain in gene expression clustering. For example, many traditional clustering methods originating from non-biological fields may break down if the choice of parameters is incorrect, or if the model is not sufficient to capture the genuine clusters among noisy data.

* Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308.
There are mainly two kinds of clustering methods: similarity-based partitioning methods and density-based clustering methods. A similarity-based partitioning algorithm breaks a dataset into k subsets, which are assumed to be convex and of similar size. Density-based clustering assumes that all points within each cluster are density reachable and that all clusters have similar densities. Most hierarchical clustering methods are based on similarity-based partitioning algorithms. As a result, they can only handle clusters of convex shape and similar size. In this paper, we propose an efficient comprehensive agglomerative hierarchical clustering method using attractor trees (CAMP). CAMP combines the features of both the density-based clustering approach and the similarity-based clustering approach, which accommodates clusters with diverse shapes, densities, and sizes.
A vertical data structure, the P-tree, is used to make the algorithm more efficient by accelerating the calculation of the density function. P-trees are also used as bit indexes to clusters. In the merging process, only summary information of the attractor sub-trees is used to find the closest cluster pair. When two clusters need to be merged, their P-tree indexes are retrieved for merging. The clustering results are an attractor tree and a collection of P-tree indexes to clusters corresponding to each level of the attractor tree. Experiments on common gene expression datasets demonstrate that our approach is more efficient and scalable with competitive accuracy.
This paper is organized as follows. In section 2 we
give an overview of the related work. We present our
new clustering method, CAMP, in section 3. Section 4
discusses the implementation of CAMP using P-trees.
An experimental performance study is described in
section 5. Finally we conclude the paper in section 6.
2 Related Work

2.1 Similarity-based clustering vs. density-based clustering
There are two major clustering categories: similarity-based partitioning methods and density-based clustering methods. A similarity-based partitioning algorithm breaks a dataset into k subsets, called clusters. The major problems with similarity-based partitioning methods are: (1) k has to be predetermined; (2) it is difficult to identify clusters with different sizes; (3) only convex clusters are found. Density-based clustering methods have been developed to discover clusters with arbitrary shapes. The most typical algorithm is DBSCAN [6]. The basic idea of DBSCAN is that, for each point of a cluster, the neighborhood of a given radius (ε) has to contain at least a minimum number of points (MinPts), where ε and MinPts are input parameters.
2.2 Hierarchical clustering algorithms
Hierarchical algorithms create a hierarchical
decomposition of a dataset X. The hierarchical
decomposition is represented by a dendrogram, a tree
that iteratively splits X into smaller subsets until each
subset consists of only one object. In such a hierarchy,
each level of the tree represents a clustering of X.
Most hierarchical clustering algorithms are variants of the single-link and complete-link approaches. In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of points drawn from the two clusters. In the complete-link algorithm, the distance between two clusters is the maximum of all pairwise distances between points in the two clusters. In either case, two clusters are merged to form a larger cluster based on a minimum distance (or maximum similarity) criterion. The complete-link algorithm produces tightly bound or compact clusters, while the single-link algorithm suffers when there is a chain of noise points between two clusters.
CHAMELEON [10] is a variant of the complete-link approach. It operates on a k-nearest-neighbor graph.
The algorithm first uses a graph partitioning approach
to divide the dataset into a set of small clusters. Then
the small clusters are merged based on their similarity
measure. CHAMELEON has been found to be very
effective in clustering convex shapes. However, the
algorithm is not designed for very noisy data sets.
When the dataset size is large, hierarchical clustering
algorithms break down due to their non-linear time
complexity and huge I/O costs. In order to remedy this
problem, BIRCH [16] was developed. BIRCH performs
a linear scan of all data points and the cluster
summaries are stored in memory in the data structure
called a CF-tree. A nonleaf node represents a cluster
consisting of all the subclusters represented by its
entries. A leaf node has to contain at most L entries and
the diameter of each entry in a leaf node has to be less
than T. A point is inserted by inserting the
corresponding CF-value into the closest leaf of the tree.
If an entry in the leaf can absorb the new point without
violating the threshold condition, the CF-values for this
entry are updated; otherwise a new entry in the leaf
node is created. Once the clusters are generated, each data point is assigned to the cluster with the closest centroid. Such label assignment may cause problems when the clusters do not have similar sizes and shapes.
2.3 Clustering methods of gene expression data
There are many newly developed clustering
methods which are dedicated to gene expression data.
These clustering algorithms partition genes into groups
of co-expressed genes.
Eisen et al. [5] adopted a hierarchical approach using UPGMA (Unweighted Pair Group Method with Arithmetic Mean) to group the closest gene pairs. This method displays the clustering results in a colored graph pattern: the gene expression data is colored according to the measured fluorescence ratio, and the genes are re-ordered based on the hierarchical dendrogram structure.
Ben-Dor et al. [2] proposed a graph-based
algorithm CAST (Cluster Affinity Search Techniques).
Two points are linked in the graph if they are similar.
The problem of clustering a set of genes is then
converted to a classical graph-theoretical problem.
CAST takes as input a parameter called the affinity
threshold t, where 0 < t < 1, and tries to guarantee that
the average similarity in each generated cluster is higher
than the threshold t.
Hartuv et al. [9] presented an algorithm HCS
(Highly Connected Subgraph). HCS recursively splits
the weighted graph into a set of highly connected subgraphs along the minimum cut. Each highly connected
sub-graph is called a cluster. Later on, the same
research group developed another algorithm, CLICK
(Cluster Identification via Connectivity Kernels) [13].
CLICK builds a statistical framework to measure the coherence within a subset of genes and determines the criterion to stop the recursive splitting process.
3 Agglomerative Hierarchical Clustering Using Attractor Trees
In this section, we propose a comprehensive agglomerative hierarchical clustering method using attractor trees (CAMP). CAMP consists of two processes: (1) clustering by local attractor trees (CLA) and (2) cluster merging based on similarity (MP). The final clustering result is an attraction tree and a set of P-tree indexes to clusters corresponding to each level of the attractor tree. The attraction tree is composed of leaf nodes, which are the local attractors of the attractor sub-trees constructed in the CLA process, and interior nodes, which are virtual attractors resulting from the MP process. Figure 1 is an example of an attraction tree.

Figure 1. The attraction tree: local attractors at the leaves and virtual attractors at the interior nodes.
The data set is first grouped into local attractor trees by means of a density-based approach in the CLA process. Each local attractor tree represents a preliminary cluster, the root of which is a density attractor of the cluster. Then the small clusters are merged level-by-level in the MP process according to their similarity until the whole data set becomes one cluster.
In this section, we first define the density function of data points and describe the detailed clustering process of CLA. Then we define the similarity function between clusters and propose the algorithm of the cluster merging process (MP). Finally we discuss our noise handling technique.

3.1 Density Function

Given a data point x, the density function of x is defined as the sum of the influence functions of all data points in the data space X. There are many ways to calculate the influence function. The influence of a data point on x is inversely proportional to the distance between the point and x. If we divide the neighborhood of x into neighborhood rings, then points within smaller rings have more influence on x than those in bigger rings. We define the neighborhood ring as follows:

Definition 1. The Neighborhood Ring of a data point c with radii r1 and r2 is defined as the set R(c, r1, r2) = {x ∈ X | r1 < |c−x| ≤ r2}, where |c−x| is the distance between x and c. The number of neighbors falling in R(c, r1, r2) is denoted as N = ||R(c, r1, r2)||.

Definition 2. The Equal Interval Neighborhood Ring (EINring) of a data point c with radii r1 = kξ and r2 = (k+1)ξ is defined as the kth neighborhood ring EINring(c, k, ξ) = R(c, r1, r2) = R(c, kξ, (k+1)ξ), where ξ is a constant. Figure 2 shows 2-D EINrings with k = 1, 2, and 3. The number of neighbors falling within the kth EINring is denoted as ||EINring(c, k, ξ)||.

Let y be a data point within the kth EINring of x. The EINring-based influence function of y on x is defined as:

f(y, x) = f_k(x) = \frac{1}{\log k}    (1)

The density function of x is defined as the summation of the influence within every EINring neighborhood of x:

DF(x) = \sum_{k \ge 1} f_k(x) \cdot \| EINring(x, k, \xi) \|    (2)

Figure 2. Diagram of EINrings.
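As a concrete illustration of equations (1)-(2), the following is a minimal C++ sketch that computes an EINring-based density for a single point by counting neighbors per ring and weighting each ring count by a decreasing function of the ring index. It assumes Euclidean distance and a finite cap on the number of rings, and uses 1/log(k+1) as a stand-in for the influence weight f_k; the function and variable names are illustrative, and the P-tree-accelerated version is described in Section 4.

#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

static double dist(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// DF(x) = sum_{k>=1} f_k(x) * ||EINring(x, k, xi)||, where ring k covers
// distances in (k*xi, (k+1)*xi] as in Definition 2.
double einring_density(const Point& x, const std::vector<Point>& data,
                       double xi, int max_ring) {
    std::vector<int> ring_count(max_ring + 1, 0);
    for (const Point& y : data) {
        const double d = dist(x, y);
        if (d <= 0.0) continue;                           // skip x itself
        const int k = static_cast<int>(std::ceil(d / xi)) - 1;  // ring index of y
        if (k >= 1 && k <= max_ring) ++ring_count[k];
    }
    double df = 0.0;
    for (int k = 1; k <= max_ring; ++k)                   // rings k = 1, 2, ...
        df += ring_count[k] / std::log(k + 1.0);          // stand-in for f_k
    return df;
}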
3.2 Clustering by Local Attractor Trees
The basic idea of clustering by local attractor trees (CLA) is to partition the data set into clusters in terms of density attractor trees. The clusters can be density-connected areas of any shape and any size. Given a data point x, if we follow the steepest density ascending path, the path will finally lead to a local density attractor. If x does not have such a path, it is either a local attractor or a noise point. All points whose steepest ascending paths lead to the same local attractor form a cluster. The resulting graph is a collection of local attractor trees with the local attractors as the roots. The leaves are the boundary points of the clusters. An example of a dataset and its attractor trees is shown in Figure 3.

Figure 3. A dataset and the attractor trees.
Given the step size s and the EINring interval ξ, the CLA clustering proceeds as follows (a minimal sketch of this procedure is given after the list):
1. Compute the density function for each point.
2. For an arbitrary point x, find the point with the highest density in the neighborhood R(x, 0, s). If its density is higher than the density of x, build a directed edge from x to that point.
3. If none of the neighbors has a higher density than x, x becomes the root of a local attractor tree and is assigned a new cluster label.
4. Go back to step 2 with the next point.
5. The data points in each attractor tree are assigned the same cluster label as their attractor/root.
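The following is a minimal C++ sketch of the CLA procedure above. It assumes the densities have already been computed (for example with the EINring density) and uses a plain O(n^2) neighbor scan instead of the paper's P-tree machinery; the struct and function names are illustrative.

#include <cstddef>
#include <vector>

struct ClaResult {
    std::vector<int> parent;  // denser point this point ascends to (-1 for roots)
    std::vector<int> label;   // cluster label, inherited from the root/attractor
};

ClaResult cluster_by_local_attractors(const std::vector<std::vector<double>>& data,
                                      const std::vector<double>& density,
                                      double s) {
    const int n = static_cast<int>(data.size());
    ClaResult r{std::vector<int>(n, -1), std::vector<int>(n, -1)};

    // Steps 2-3: link each point to the densest denser point within R(x, 0, s).
    for (int i = 0; i < n; ++i) {
        int best = -1;
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            double d2 = 0.0;
            for (std::size_t a = 0; a < data[i].size(); ++a) {
                const double t = data[i][a] - data[j][a];
                d2 += t * t;
            }
            if (d2 <= s * s && density[j] > density[i] &&
                (best == -1 || density[j] > density[best]))
                best = j;
        }
        r.parent[i] = best;  // -1 means i is a local attractor (tree root)
    }

    // Step 5: propagate the root's cluster label down each attractor tree.
    int next_label = 0;
    for (int i = 0; i < n; ++i) {
        int root = i;
        while (r.parent[root] != -1) root = r.parent[root];
        if (r.label[root] == -1) r.label[root] = next_label++;
        r.label[i] = r.label[root];
    }
    return r;
}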
3.3 Similarity between Clusters
There have been many similarity measures for finding the most similar cluster pairs. CURE uses the similarity between the closest pair of points that belong to different clusters [8]. CHAMELEON uses both relative connectivity and relative closeness, by means of complicated graph implementations. We also consider both relative connectivity and relative closeness, but combine them by means of attractor trees. We define the similarity between clusters i and j as follows:

CS(i, j) = \left( \frac{\bar{h}_i}{\bar{f}_i} + \frac{\bar{h}_j}{\bar{f}_j} \right) \cdot \frac{1}{d(A_i, A_j)}    (3)
where \bar{h}_i is the average height of the ith attractor tree, \bar{f}_i is the average fan-out of the ith attractor tree, and d(A_i, A_j) is the distance between the two local attractors A_i and A_j. The calculations of \bar{h}_i and \bar{f}_i are discussed later.
Cluster similarity thus depends on the following factors: the distance between the attractors of the two clusters, and the average heights and average fan-outs of the two clusters. Our cluster similarity function can distinguish the cases shown in Figure 4. The cluster pairs on the left have higher similarity than those on the right according to our definition.

Figure 4. Examples of cluster similarity: (a) different average heights; (b) different average fan-outs; (c) different d(A_i, A_j).

3.4 Cluster Merging Process

After the local attractor trees (sub-clusters) are built in the CLA process, the cluster merging process (MP) starts combining the most similar sub-cluster pairs level-by-level based on the similarity measure. When two clusters are merged, their two local attractor trees are combined into a new tree, called a virtual local attractor tree. It is called "virtual" because the new root is not a real attractor; it is only a virtual attractor which could attract all points of the two sub-trees. The merging process is shown in Figure 5. The cluster merging is processed recursively by combining (virtual) attractor trees.

Figure 5. Cluster merging process: (a) before merging; (b) after merging.

After merging, we need to compute the new root/attractor A_v, the average height \bar{h}_v, and the average fan-out \bar{f}_v of the new virtual attractor tree. Take two clusters C_i and C_j, for example, and assume that the size of C_j is greater than or equal to that of C_i, i.e. ||C_j|| ≥ ||C_i||. We have the following equations:

A_{vl} = A_{il} - \frac{\|C_j\| \, (A_{il} - A_{jl})}{\|C_i\| + \|C_j\|}, \quad l = 1, 2, \ldots, d    (4)

\bar{h}_v = \max\{\bar{h}_i, \bar{h}_j\} + \frac{\|C_j\| \, d(A_i, A_j)}{\|C_i\| + \|C_j\|}    (5)

\bar{f}_v = \frac{\|C_i\| \cdot \bar{f}_i + \|C_j\| \cdot \bar{f}_j}{\|C_i\| + \|C_j\|}    (6)

where A_{il} is the lth attribute of the attractor A_i, ||C_i|| is the size of cluster C_i, d(A_i, A_j) is the distance between the two local attractors A_i and A_j, and \bar{h}_i and \bar{f}_i are the average height and the average fan-out of the ith attractor tree, respectively.
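Below is a small C++ sketch of the merge bookkeeping in equations (4)-(6) as reconstructed above; the ClusterSummary type and its field names are illustrative, not the paper's implementation.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct ClusterSummary {
    std::vector<double> attractor;  // A_i, the coordinates of the tree root
    double size = 0.0;              // ||C_i||
    double avg_height = 0.0;        // average height of the attractor tree
    double avg_fanout = 1.0;        // average fan-out of the attractor tree
};

// The paper assumes ||C_j|| >= ||C_i||; callers can swap arguments to enforce this.
ClusterSummary merge_summaries(const ClusterSummary& ci, const ClusterSummary& cj) {
    const double total = ci.size + cj.size;
    ClusterSummary v;

    // (4): the virtual attractor A_v lies between A_i and A_j, weighted by ||C_j||.
    v.attractor.resize(ci.attractor.size());
    double d = 0.0;
    for (std::size_t l = 0; l < ci.attractor.size(); ++l) {
        const double diff = ci.attractor[l] - cj.attractor[l];
        v.attractor[l] = ci.attractor[l] - cj.size * diff / total;
        d += diff * diff;
    }
    d = std::sqrt(d);  // d(A_i, A_j)

    // (5): the new average height grows by the shift of the virtual attractor.
    v.avg_height = std::max(ci.avg_height, cj.avg_height) + cj.size * d / total;

    // (6): the new average fan-out is the size-weighted average of the old ones.
    v.avg_fanout = (ci.size * ci.avg_fanout + cj.size * cj.avg_fanout) / total;

    v.size = total;
    return v;
}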
3.5 Delayed Noise Eliminating Process
Since gene expression data is highly noisy, it is important to have a proper noise eliminating process. Naively, the points which stand alone after the CLA process should be treated as noise. However, some sparse clusters might be mistakenly eliminated if the step size s is small. Therefore, we delay the noise handling until a later stage. The neighborhoods of noise points are generally sparser than those of points in clusters. In the cluster merging process, noise points tend to merge with other points far less often, and grow much more slowly. Therefore, the cluster merging process is tracked to capture those clusters which are growing very slowly. For a slowly growing cluster, if the cluster is small, the whole cluster is eliminated as noise; otherwise, if the cluster is large, the points which were recently merged into it at a low rate are peeled off.
4 Implementation of CAMP in P-trees
CAMP is implemented using the data-mining-ready vertical bitwise data structures, P-trees, to make the clustering process much more efficient and scalable. The P-tree technology was initially developed by the DataSURG research group for spatial data [12] [4]. P-trees provide a lot of information and are structured to facilitate data mining processes. In this section, we first briefly discuss the representation of a gene dataset in the P-tree structure and P-tree-based neighborhood computation. Then we detail the implementation of CAMP using P-trees.
4.1 Data Representation
We organize the gene expression data as a relational table with rows of genes and columns of experiments, or time series. Instead of using double-precision floating-point numbers with a mantissa and an exponent represented in two's complement, we partition the data space of gene expression as follows. First, we decide the number of intervals and specify the range of each interval. For example, we could partition the gene expression data space into 256 equal intervals along each dimension. After that, we replace each gene value within an interval by a bit string, using the strings 00000000 to 11111111 to represent the 256 intervals. The length of the bit string is the base-two logarithm of the number of intervals. The optimal number of intervals and their ranges depend on the size of the dataset and the accuracy requirements.
Given a gene table G = (E1, E2, ..., Ed), and the binary representation of the jth attribute Ej as b_{j,m} b_{j,m-1} ... b_{j,i} ... b_{j,1} b_{j,0}, the table is projected into columns, one for each attribute. Then each attribute column is further decomposed into separate bit vectors, one for each bit position of the values in that attribute. Figure 6 shows a relational table with three attributes. Figure 7 shows the decomposition process from the gene table G to a set of bit vectors.
Figure 6. An example of a gene table G(E1, E2, E3):
E1: 5 2 7 7 2 4 3 1
E2: 2 3 2 2 5 7 2 3
E3: 7 2 2 5 5 1 1 4

Figure 7. Decomposition of the gene table: each attribute value is written in binary (E1: 101, 010, 111, 111, 010, 100, 011, 001; E2: 010, 011, 010, 010, 101, 111, 010, 011; E3: 111, 010, 010, 101, 101, 001, 001, 100), and the binary columns are split into the bit vectors E13 E12 E11, E23 E22 E21, and E33 E32 E31.

After the decomposition process, each bit vector is converted into a P-tree. A P-tree is built by recording the truth of the predicate "purely 1-bits" recursively on halves of the bit vector until purity is reached. Three P-tree examples are illustrated in Figure 8.

Figure 8. P-trees of the bit vectors E21, E22 and E23: (a) P21; (b) P22; (c) P23.
The P-tree logic operations are pruned bit-by-bit operations performed level-by-level, starting from the root level. For instance, ANDing a pure-0 node with any node results in a pure-0 node, and ORing a pure-1 node with any node results in a pure-1 node.
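The sketch below illustrates the vertical decomposition of Figures 6-7 and the logical operations just described, using plain uncompressed bit columns (std::vector<bool>) as a stand-in for compressed P-trees; the function names are illustrative and the pruning of pure nodes is omitted.

#include <cstddef>
#include <cstdint>
#include <vector>

using BitColumn = std::vector<bool>;

// Decompose one attribute (values already quantized to `bits` bits, as in
// Section 4.1) into `bits` vertical bit columns, most significant bit first.
std::vector<BitColumn> decompose(const std::vector<uint32_t>& column, int bits) {
    std::vector<BitColumn> out(bits, BitColumn(column.size(), false));
    for (std::size_t row = 0; row < column.size(); ++row)
        for (int b = 0; b < bits; ++b)
            out[bits - 1 - b][row] = ((column[row] >> b) & 1u) != 0;
    return out;
}

BitColumn bit_and(const BitColumn& a, const BitColumn& b) {
    BitColumn r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] && b[i];
    return r;
}

BitColumn bit_or(const BitColumn& a, const BitColumn& b) {
    BitColumn r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] || b[i];
    return r;
}

BitColumn bit_not(const BitColumn& a) {
    BitColumn r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) r[i] = !a[i];
    return r;
}

// ||P||: the root count, i.e. the number of points satisfying the predicate.
std::size_t root_count(const BitColumn& a) {
    std::size_t c = 0;
    for (bool bit : a) c += bit ? 1 : 0;
    return c;
}

For example, decomposing the E2 column of Figure 6 (2, 3, 2, 2, 5, 7, 2, 3) with bits = 3 yields the bit vectors E23, E22, E21 whose P-trees are shown in Figure 8.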
4.2 P-tree based neighborhood computation
The major computational cost of CAMP is in the
preliminary clustering process, CLA, which mainly
involves computation of densities. To improve the
efficiency of density computation, we adopt the P-tree
based neighborhood computation by means of the
optimized P-tree operations. In this section, we first
review the optimized P-tree operations. Then we
present the P-tree based neighborhood computation.
P-tree predicate operations: Let A be the jth dimension (attribute) of the data set X, let m be its bit-width, and let P_m, P_{m-1}, ..., P_0 be the P-trees for the vertical bit files of A. Let c = b_m ... b_i ... b_0, where b_i is the ith binary bit value of c. Let P_{A>c} and P_{A≤c} be the P-trees representing the data points satisfying the predicates A > c and A ≤ c respectively; then we have

P_{A>c} = P_m \, op_m \, \cdots \, P_i \, op_i \, P_{i-1} \, \cdots \, op_{k+1} \, P_k, \quad k \le i \le m    (7)

where op_i is ∧ (AND) if b_i = 1 and op_i is ∨ (OR) otherwise, and

P_{A \le c} = P'_m \, op_m \, \cdots \, P'_i \, op_i \, P'_{i-1} \, \cdots \, op_{k+1} \, P'_k, \quad k \le i \le m    (8)

where op_i is ∧ (AND) if b_i = 0 and op_i is ∨ (OR) otherwise. In the equations above, k is the rightmost bit position with value 0, and the operators are right binding.
Calculation of the neighborhood: Let P_{c,r} be the P-tree representing the data points within the neighborhood R(c, 0, r) = {x ∈ X | 0 < |c−x| ≤ r}. Note that P_{c,r} is just the P-tree representing the data points satisfying the predicate c−r < x ≤ c+r. Therefore

P_{c,r} = P_{c-r < x \le c+r} = P_{x > c-r} \wedge P_{x \le c+r}    (9)

where P_{x>c-r} and P_{x≤c+r} are calculated by means of the P-tree predicate operations above.
Calculation of the EINring neighborhood: Let P_{c,k} be the P-tree representing the data points within EINring(c, k, ξ) = {x ∈ X | kξ < |c−x| ≤ (k+1)ξ}. In fact, the EINring(c, k, ξ) neighborhood is the intersection of R(c, 0, (k+1)ξ) with the complement of R(c, 0, kξ). Hence

P_{c,k} = P_{c,(k+1)\xi} \wedge P'_{c,k\xi}    (10)

where P'_{c,kξ} is the complement of the neighborhood P-tree P_{c,kξ}. The count of 1's in P_{c,k}, ||P_{c,k}||, represents the number of data points within the EINring neighborhood, i.e. ||EINring(c, k, ξ)|| = ||P_{c,k}||. Each 1 in P_{c,k} indicates a specific neighbor point.
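The following C++ sketch shows how predicates (7)-(10) can be evaluated on such vertical bit columns, again with plain uncompressed std::vector<bool> columns (most significant bit first) standing in for P-trees. Equation (8) is realized simply as the complement of (7), of which it is the De Morgan expansion, and a single attribute is assumed (neighborhoods over several attributes would AND the per-attribute results); the names are illustrative.

#include <cstddef>
#include <cstdint>
#include <vector>

using Bits = std::vector<bool>;

static Bits combine(const Bits& p, const Bits& acc, bool use_and) {
    Bits r(p.size());
    for (std::size_t i = 0; i < p.size(); ++i)
        r[i] = use_and ? (p[i] && acc[i]) : (p[i] || acc[i]);
    return r;
}

// Equation (7): P_{A > c}.  cols[0] is the most significant bit column P_m,
// cols.back() is P_0; the fold is right-binding, AND where b_i = 1, OR otherwise.
Bits greater_than(const std::vector<Bits>& cols, uint32_t c) {
    const int m = static_cast<int>(cols.size()) - 1;   // highest bit index
    int k = -1;
    for (int i = 0; i <= m; ++i)                       // rightmost 0-bit of c
        if (((c >> i) & 1u) == 0) { k = i; break; }
    if (k < 0) return Bits(cols[0].size(), false);     // c is all 1s: A > c is empty
    Bits acc = cols[m - k];                            // start from P_k
    for (int i = k + 1; i <= m; ++i)
        acc = combine(cols[m - i], acc, ((c >> i) & 1u) != 0);
    return acc;
}

// Equation (8): P_{A <= c} is the complement of P_{A > c}.
Bits at_most(const std::vector<Bits>& cols, uint32_t c) {
    Bits r = greater_than(cols, c);
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = !r[i];
    return r;
}

// Equation (9) for one attribute: points with c - r < x <= c + r.
Bits neighborhood(const std::vector<Bits>& cols, uint32_t c, uint32_t r) {
    const uint32_t maxval =
        (cols.size() >= 32) ? 0xFFFFFFFFu : ((1u << cols.size()) - 1u);
    const uint64_t sum = static_cast<uint64_t>(c) + r;
    const uint32_t upper = sum > maxval ? maxval : static_cast<uint32_t>(sum);
    Bits lo = (c >= r) ? greater_than(cols, c - r) : Bits(cols[0].size(), true);
    return combine(lo, at_most(cols, upper), /*use_and=*/true);
}

// Equation (10): the k-th EINring is the (k+1)*xi neighborhood intersected with
// the complement of the k*xi neighborhood.
Bits einring(const std::vector<Bits>& cols, uint32_t c, uint32_t xi, int k) {
    Bits outer = neighborhood(cols, c, (k + 1) * xi);
    Bits inner = neighborhood(cols, c, k * xi);
    Bits r(outer.size());
    for (std::size_t i = 0; i < outer.size(); ++i) r[i] = outer[i] && !inner[i];
    return r;
}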
4.3 Implementation of CAMP using P-trees

CAMP consists of two steps: clustering using local attractor trees (CLA) and the cluster merging process (MP). The critical issues are the computation of the density function and the similarity function and the manipulation of the (virtual) attractor trees during the clustering process. In fact, the similarity function is easy to compute given the summary information of two attractor trees. In this section, we therefore focus on the density function and the attractor trees.

Computation of the density function (in the CLA process): According to equations (2) and (10), the density function is calculated using P-trees as follows:

DF(x) = \sum_{k \ge 1} f_k(x) \cdot \| P_{x,k} \|    (11)

Structure of a (virtual) attractor tree: An attractor tree consists of two parts: (1) a collection of summary data, such as the size of the tree, the attractor, the average height, the average fan-out, etc.; and (2) a P-tree used as an index to the points in the attractor tree. In the merging process, only the first part needs to be in memory; the second part is only needed at the time of merging. Besides, a lookup table is used to record the level of each point in the attractor tree. The lookup table is only used in the initial clustering process.
Here is an example of an index P-tree. Assume the dataset size is 8, and an attractor tree contains the first four points and the sixth point in the data set. The corresponding bit index is (11110100), which is converted into a P-tree as shown in Figure 9.

Figure 9. The P-tree for an attractor tree (bit index 11110100).

Creating an attractor tree (in the CLA process): When a steepest ascending path (SAP) from a point stops at a new local maximum, we need to create a new attractor tree. The stop point is the attractor. The average height is \bar{h} = N_s / 2, where N_s is the number of steps in the path, and the average fan-out is \bar{f} = 1. The corresponding index P-tree is built.

Updating an attractor tree (in the CLA process): If the SAP encounters a point in an existing attractor tree, the whole SAP is inserted into that attractor tree. As a result, the attractor tree needs to be updated. The attractor does not change. The new average height \bar{h}_{new} and the new average fan-out \bar{f}_{new} are calculated as follows:

\bar{h}_{new} = \frac{\bar{h}_{old} \cdot N_{old} + \frac{m}{2} + m \cdot l}{N_{old} + m}    (12)

\bar{f}_{new} = \frac{N^i_{old} \cdot \bar{f}_{old} + m}{N^i_{old} + m - 1}    (13)

where N_{old} is the size of the old attractor tree, m is the number of points added to the tree, l represents the level of the insertion point, and N^i_{old} is the number of interior nodes of the old attractor tree.

Merging two attractor trees (in the MP process): When two attractor trees are combined into a new virtual attractor tree, the summary data are computed using equations (4)-(6). The new P-tree is formed simply by ORing the two old P-trees, i.e. P_v = P_i ∨ P_j.
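As a tiny illustration of the cluster index bookkeeping above, the sketch below keeps, for each (virtual) attractor tree, a bit index over the data set, merges two trees by ORing their indexes (P_v = P_i ∨ P_j), and obtains the cluster size as the root count. A plain std::vector<bool> stands in for the compressed index P-tree; the names are illustrative.

#include <cstddef>
#include <vector>

using BitIndex = std::vector<bool>;

// Example from the text: a data set of size 8 and an attractor tree holding
// the first four points and the sixth point, i.e. bit index 11110100.
BitIndex example_index() {
    return {true, true, true, true, false, true, false, false};
}

// Merging two attractor trees: P_v = P_i OR P_j.
BitIndex merge_indexes(const BitIndex& pi, const BitIndex& pj) {
    BitIndex pv(pi.size());
    for (std::size_t t = 0; t < pi.size(); ++t) pv[t] = pi[t] || pj[t];
    return pv;
}

// ||C||: the number of 1-bits, i.e. the number of points in the cluster.
std::size_t cluster_size(const BitIndex& p) {
    std::size_t n = 0;
    for (bool b : p) n += b ? 1 : 0;
    return n;
}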
5 Performance Study
To evaluate the efficiency, accuracy, and robustness of our approach (CAMP), we used three microarray expression datasets: DS1, DS2, and DS3. DS1 is the dataset used by CLICK [13]. It contains expression levels of 8,613 human genes measured at 12 time points. DS2 and DS3 were obtained from Michael Eisen's lab [11]. DS2 is a gene expression matrix of 6,221 × 80. DS3 is the largest dataset, with 13,413 genes under 36 experimental conditions. The raw expression data was first normalized [2], and the datasets were then decomposed and converted to P-trees. We implemented the k-means [14], BIRCH [16], CAST [2], and CAMP algorithms in C++ on a Debian Linux 3.0 PC with a 1 GHz Pentium CPU and 1 GB of main memory.
The total run times for the different algorithms on DS1, DS2 and DS3 are shown in Figure 10. Note that our approach outperformed k-means, BIRCH and CAST substantially when the dataset is large. In particular, our approach ran almost 4 times faster than k-means on DS3.
Figure 10. Run time comparisons (run time in seconds) on DS1, DS2 and DS3.

The clustering results are evaluated by means of Hubert's Γ statistic [15]. Given two matrices X = [X(i, j)] and Y = [Y(i, j)], where X(i, j) is a similarity matrix between every pair of genes, Y(i, j) is defined as 1 if genes i and j are clustered in the same cluster and 0 otherwise. Hubert's Γ statistic indicates the point serial correlation between the two matrices X and Y, and is computed as

\Gamma = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{(X(i,j) - \bar{X}) (Y(i,j) - \bar{Y})}{\sigma_X \, \sigma_Y}

where M = n(n-1)/2, \bar{X} and \bar{Y} are the means and \sigma_X and \sigma_Y the standard deviations of the entries of X and Y, and Γ lies in [-1, 1]. Γ is used to measure the correlation between the similarity matrix X and the adjacency matrix of the clustering results.
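For reference, the following C++ sketch computes the Γ statistic above from a similarity matrix X and a vector of cluster labels (from which Y(i, j) is derived on the fly), using only the M = n(n-1)/2 upper-triangle pairs; the function and variable names are illustrative.

#include <cmath>
#include <cstddef>
#include <vector>

double hubert_gamma(const std::vector<std::vector<double>>& X,
                    const std::vector<int>& label) {
    const std::size_t n = label.size();
    if (n < 2) return 0.0;
    const double M = 0.5 * n * (n - 1);

    // Means of X(i,j) and Y(i,j) over the upper triangle.
    double mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i + 1 < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j) {
            mx += X[i][j];
            my += (label[i] == label[j]) ? 1.0 : 0.0;
        }
    mx /= M;
    my /= M;

    // Standard deviations and the cross term.
    double sx = 0.0, sy = 0.0, cross = 0.0;
    for (std::size_t i = 0; i + 1 < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j) {
            const double dx = X[i][j] - mx;
            const double dy = ((label[i] == label[j]) ? 1.0 : 0.0) - my;
            sx += dx * dx;
            sy += dy * dy;
            cross += dx * dy;
        }
    sx = std::sqrt(sx / M);
    sy = std::sqrt(sy / M);
    if (sx == 0.0 || sy == 0.0) return 0.0;  // degenerate X or single cluster
    return cross / (M * sx * sy);            // Gamma, in [-1, 1]
}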
The best clustering qualities for different methods
on DS1, DS2 and DS3 are shown in Figure 11. From
Figure 11, it is obvious that our approach and CAST
have better clustering results than the other two
methods. For DS1, our approach has better results than
CAST.
Figure 11. Γ value comparisons on DS1, DS2 and DS3:

             DS1     DS2     DS3
  K-means    0.446   0.302   0.285
  BIRCH      0.51    0.504   0.435
  CAST       0.678   0.685   0.727
  CAMP       0.785   0.625   0.667

In summary, CAMP outperforms the other methods in terms of execution time, with high scalability. Our clustering results are almost as good as those of CAST, which, however, is not scalable to large datasets.
6 Conclusion
In this paper, we have proposed an efficient comprehensive hierarchical clustering method using attractor trees, CAMP, which combines the features of both the density-based clustering approach and the similarity-based clustering approach. The combination of the two approaches accommodates clusters with diverse shapes, densities, and sizes. A vertical data structure, the P-tree, and optimized P-tree operations are used to make the algorithm more efficient by accelerating the calculation of the density function. Experiments on common gene expression datasets demonstrated that our approach is more efficient and scalable with competitive accuracy.
As a result, our approach can be a powerful tool for
gene expression data analysis.
In the future, we will apply our approach to large-scale time series gene expression data, where efficient and scalable analysis approaches are in demand. We will also work on post-clustering analysis and result interpretation. For example, we will explore building Bayesian networks to model the potential pathway for each discovered cluster and subcluster.
References

1. Arima, C. and Hanai, T. "Gene Expression Analysis Using Fuzzy K-Means Clustering", Genome Informatics 14, pp. 334-335, 2003.
2. Ben-Dor, A., Shamir, R. and Yakhini, Z. "Clustering gene expression patterns", Journal of Computational Biology, Vol. 6, pp. 281-297, 1999.
3. Cho, R. J. et al. "A Genome-Wide Transcriptional Analysis of the Mitotic Cell Cycle", Molecular Cell, 2:65-73, 1998.
4. Ding, Q., Khan, M., Roy, A. and Perrizo, W. "The P-Tree Algebra", ACM SAC, 2002.
5. Eisen, M. B., Spellman, P. T. et al. "Cluster analysis and display of genome-wide expression patterns", Proceedings of the National Academy of Sciences USA, pp. 14863-14868, 1998.
6. Ester, M., Kriegel, H.-P., Sander, J. and Xu, X. "A density-based algorithm for discovering clusters in large spatial databases with noise", In Proceedings of the 2nd ACM SIGKDD, Portland, Oregon, pp. 226-231, 1996.
7. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M. et al. "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring", Science 286, pp. 531-537, 1999.
8. Guha, S., Rastogi, R. and Shim, K. "CURE: An efficient clustering algorithm for large databases", In SIGMOD '98, Seattle, Washington, 1998.
9. Hartuv, E. and Shamir, R. "A clustering algorithm based on graph connectivity", Information Processing Letters, 76(4-6):175-181, 2000.
10. Karypis, G., Han, E.-H. and Kumar, V. "CHAMELEON: A hierarchical clustering algorithm using dynamic modeling", IEEE Computer, 32(8):68-75, August 1999.
11. Michael Eisen's gene expression data, available at http://rana.lbl.gov/EisenData.htm
12. Perrizo, W. "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.
13. Shamir, R. and Sharan, R. "CLICK: A clustering algorithm for gene expression analysis", In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB '00), AAAI Press, 2000.
14. Tavazoie, S., Hughes, J. D. et al. "Systematic determination of genetic network architecture", Nature Genetics, 22, pp. 281-285, 1999.
15. Tseng, V. S. and Kao, C. "An Efficient Approach to Identifying and Validating Clusters in Multivariate Datasets with Applications in Gene Expression Analysis", Journal of Information Science and Engineering, Vol. 20, No. 4, pp. 665-677, 2004.
16. Zhang, T., Ramakrishnan, R. and Livny, M. "BIRCH: an efficient data clustering method for very large databases", In Proceedings of the ACM SIGMOD Int'l Conf. on Management of Data, 1996.