Mining Gene Expression Datasets using Density-based Clustering

Seokkyung Chung, Jongeun Jun, Dennis McLeod
Department of Computer Science
and Integrated Media Systems Center
University of Southern California
Los Angeles, California 90089-0781, USA
{seokkyuc, jongeunj, mcleod}@usc.edu
ABSTRACT
We propose a mining framework that supports the identification of useful patterns based on data clustering. Given the recent advancement of microarray technologies, we focus our attention on mining gene expression datasets. In particular, we are interested in mining a yeast cell cycle dataset. In molecular biology, a set of co-expressed genes tends to share a common biological function. Moreover, co-expressed genes can be further used to identify mechanisms of gene regulation and interaction. Thus, it is essential to develop an effective clustering algorithm that identifies sets of co-expressed genes.
Toward this end, we propose genome-wide expression clustering based on a k-nearest neighbor search. By addressing the strengths and limitations of previous density-based clustering approaches, we present a novel density-based clustering algorithm that utilizes a neighborhood defined by k-nearest neighbors. Experimental results indicate that the proposed method successfully identifies co-expressed gene clusters in a yeast cell cycle dataset.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Data mining;
I.5.3 [Pattern Recognition]: Clustering
General Terms
Algorithms
Keywords
Clustering, Bioinformatics, Density Estimation, Gene Expression Analysis
1. INTRODUCTION
With the recent advancement of DNA microarray technologies, the expression levels of thousands of genes can be measured simultaneously [9]. The obtained data are usually organized as a matrix (also known as a gene expression profile), which consists of n columns and m rows. The columns represent genes (usually the genes of the whole genome), and the rows correspond to samples (e.g., various tissues, experimental conditions, or time points).
Given this rich amount of gene expression data, the goal of microarray analysis is to extract hidden knowledge (e.g., similarity or dependency between genes) from this matrix. The analysis of gene expression may identify mechanisms of gene regulation and interaction, which can be used to understand the functioning of a cell [11]. Moreover, comparing expression in diseased and normal tissues will further enhance our understanding of disease pathology [13]. Therefore, data mining, which transforms a raw dataset into useful higher-level knowledge, has become essential in the life sciences [21].
One of the key steps in gene expression analysis is to cluster genes that show similar patterns. By identifying a set of gene clusters, we can hypothesize that the genes clustered together tend to be functionally related.
With the abundance of microarray data, genome-wide expression data clustering has received significant attention in the bioinformatics research community during the past few years; approaches include hierarchical clustering [9, 22], self-organizing maps [25], neural networks [14], algorithms based on Principal Components Analysis [31] or Singular Value Decomposition [6, 9, 15], subspace clustering [27, 30], and graph-based approaches [29]. However, relatively little clustering research has been based on k-nearest neighbor density estimation. In this paper, we propose a density-based clustering algorithm that utilizes the density of a neighborhood defined by k-nearest neighbors. In addition, we explore optimization methods for fast KNN (k-nearest neighbor) density estimation.
1.1 Goal
Since gene expression datasets consist of measurements across various conditions (or time points), they are high-dimensional, large in volume, and noisy. Thus, clustering algorithms must be able to address and exploit these characteristics. Although many clustering algorithms have been studied in statistics, data mining, and machine learning over the past few decades, to address the special constraints of gene expression datasets, we propose a new clustering algorithm that satisfies the following requirements.
1. Many clustering algorithms require the user to provide the number of clusters, which is hard to determine beforehand. Thus, the algorithm should be able to identify the number of clusters automatically.

2. Due to the high dimensionality and extremely large volume of gene expression datasets, successful clustering algorithms should scale with the large number of genes and dimensions (e.g., conditions).

3. Many clustering algorithms are very sensitive to noise or outliers. Since microarray data contains a significant amount of noise, clustering algorithms must be able to identify noise and remove it if necessary.

4. As discussed in Jiang et al. [17], co-expressed gene clusters may be highly connected by a large number of intermediate genes (i.e., genes located between one cluster and another). Clustering algorithms should not be confused by genes in such a transition region; that is, simply merging two clusters connected by a set of intermediate genes should be avoided. Thus, the ability to detect "genes in the transition region" would be helpful.
To address the above requirements, we propose a novel clustering algorithm that is well suited to gene expression datasets. Our clustering algorithm exploits density-based clustering, utilizing a neighborhood defined by k-nearest neighbors. The proposed algorithm first efficiently identifies a k-nearest neighbor list for each point. Next, the KNN density of each point is defined by utilizing that point's k-nearest neighbor list. Based on the density of each point, core points (genes with high density), border points (genes with medium density), and noise points (genes with low density) are identified. Since a core point has high KNN density, it is expected to lie well inside a cluster (i.e., to be a representative of a cluster). Thus, instead of clustering the whole dataset, clustering the set of core points can produce a rough cluster structure. After that, border points are used to refine the cluster structure by assigning them to the most relevant cluster.
Note that we do not aim to cluster all genes (i.e., noise points or points in a transition region may not be assigned to any cluster). Since our goal is to identify sets of genes with strongly coherent patterns, it may be necessary to remove many of the genes during the clustering process. While this approach does not provide a complete organization of all genes, it can extract the "essentials" of the information in genome-wide expression data.
In this paper, we mainly focus on time-course gene expression data (i.e., expression levels of genes monitored over some time interval), and in particular on a yeast cell cycle dataset. However, the proposed algorithm can easily be extended to other kinds of microarray datasets.

Notation    Meaning
n           Total number of genes
m           Total number of time points
X           The m x n gene expression profile matrix
x_i         The i-th gene
x_ij        The j-th feature of the i-th gene
N_k(x_i)    The k-nearest neighbor list for x_i (excluding x_i)
P           The set of core points
|C_i|       The size of cluster C_i
CP          The set of core clusters (before refinement)
C           The set of clusters (after refinement)
K           The number of clusters
λ           Threshold that determines a core point
τ           Threshold that determines a noise point

Table 1: Summary of notations
1.2 Our Contributions
In this paper, we present a clustering algorithm that addresses the constraints discussed in Section 1.1. Recent database mining research has proposed density-based clustering algorithms such as DBSCAN [10] and Shared Nearest Neighbors (SNN) clustering [8]. In addition to incorporating ideas from these approaches (e.g., core points, border points, noise points), and by addressing the limitations of previous density-based clustering methods, we present a novel KNN-density estimation clustering algorithm that is well suited to producing co-expressed gene clusters.
One of the key limitations of the proposed method is the high computational complexity of KNN density estimation. That is, since a k-nearest neighbor list needs to be constructed for each gene, the time complexity of our approach is O(n²), where n is the number of genes. However, the complexity can be reduced to O(n log n) by utilizing a dimensionality reduction scheme. We explore the details of these optimizations in Section 4.
1.3 Organization
The remainder of this paper is structured as follows. We present the background of this paper in Section 2. In Section 3, we explain the proposed clustering algorithm. Section 4 explores the dimensionality reduction step for an efficient neighborhood search. In Section 5, we briefly review related work, and highlight the strengths and weaknesses of previous approaches in comparison with ours. Finally, we conclude the paper and present our future plans in Section 6. Table 1 summarizes the notation used throughout this paper.
2. BACKGROUND
Throughout this paper, we explain our methodology based on Spellman et al.'s yeast cell cycle dataset (a.k.a. Spellman's dataset) [22]. Using cDNA arrays, Spellman et al. measured the genome-wide mRNA levels for 6,108 yeast ORFs simultaneously over approximately two cell cycle periods, in a yeast culture synchronized by α factor, relative to a reference mRNA from an asynchronous yeast culture. The yeast cells were sampled at 7-minute intervals for 119 minutes, for a total of 18 time points after synchronization.

Among the 6,108 genes, Spellman et al. identified 800 genes whose expression is cell-cycle regulated. To find a threshold value that determines the significance of cell-cycle regulation, they utilized previously known gene sets and published datasets. Table 2 illustrates the cell cycle as defined by Spellman et al. After the 800 yeast genes were identified as cell-cycle regulated, clustering was performed to classify the genes into different clusters (according to similarity of expression).

Sampling interval (min):  0-7   14-21  28-35  42   49-56  63-70  77-84  91-98  105  112-119
Cell cycle phase:         M/G1  G1     S      G2   M      M/G1   G1     S      G2   M

Table 2: Illustration of the cell cycle in Spellman's dataset

Among the 6,108 genes, we removed the genes with missing values, and obtained 4,418 genes. Thus, the dataset is organized as an 18 x 4,418 matrix with equally spaced sampling time points. In this paper, rather than trying to identify cell-cycle regulated gene clusters (by relying on external knowledge), we perform unsupervised clustering on the 4,418 genes. Thus, non-cell-cycle regulated gene clusters as well as cell-cycle regulated gene clusters are expected to be discovered.
3. PROPOSED ALGORITHM
In Section 3.1, similarity metrics for density estimation are described. Section 3.2 introduces how to define the density of each gene. Section 3.3 explains how to identify a rough cluster structure, and Section 3.4 illustrates how to refine the cluster structure. In Section 3.5, we discuss the input parameters. Finally, in Section 3.6, we present experimental results.
3.1 Similarity Metric

The first step in KNN density estimation is to choose a distance metric (or similarity metric). One of the most commonly used metrics for measuring the distance between two data items is the Euclidean distance. The distance between x_i and x_j in m-dimensional space is defined as follows:

d(x_i, x_j) = \mathrm{Euclidean}(x_i, x_j) = \sqrt{\sum_{d=1}^{m} (x_{id} - x_{jd})^2}    (1)

Since Euclidean distance emphasizes the individual magnitudes of each feature, it does not account for shifting or scaling patterns very well. In gene expression datasets, the overall shape of a gene's expression pattern is more important than its magnitude. To address the shifting and scaling problem, each gene can be standardized as follows:

\bar{x}_{id} = \frac{x_{id} - \mu_i}{\sigma_i}    (2)

where \bar{x}_i is the standardized vector of x_i, \mu_i is the mean of x_i, and \sigma_i is the standard deviation of x_i.
Figure 1: Plot of top k-nearest neighbors for a high density gene (a) and a low density gene (b), when k = 30
Another widely used metric for time-series similarity is Pearson's correlation coefficient. Given two genes x_i and x_j, Pearson's correlation coefficient r(x_i, x_j) is defined as follows:

r(x_i, x_j) = \frac{\sum_{d=1}^{m} (x_{id} - \mu_i)(x_{jd} - \mu_j)}{\sqrt{\sum_{d=1}^{m} (x_{id} - \mu_i)^2} \sqrt{\sum_{d=1}^{m} (x_{jd} - \mu_j)^2}}    (3)

Note that r(x_i, x_j) takes values between 1 (perfect positive linear correlation) and -1 (perfect negative linear correlation); a value of 0 indicates no linear correlation. If the data is standardized by subtracting the mean and dividing by the standard deviation, then Euclidean distance is related to the Pearson correlation coefficient as follows:

r(x_i, x_j) = 1 - \frac{d^2(x_i, x_j)}{2m}    (4)
Based on the above relation, the effectiveness of a clustering algorithm is expected to be similar regardless of which of the two metrics is used. Thus, throughout this paper, we explain our methodology using the Pearson correlation coefficient; similarity and correlation are used interchangeably.
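As a quick numerical check of the relation in Equation (4), the following sketch (illustrative, not from the paper) standardizes two random expression vectors as in Equation (2) and confirms that the Pearson coefficient equals 1 - d²/(2m):

```python
import numpy as np

def standardize(x):
    # Equation (2): subtract the mean, divide by the standard deviation
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(0)
m = 18                                   # number of time points, as in Spellman's dataset
xi, xj = rng.normal(size=m), rng.normal(size=m)

r = np.corrcoef(xi, xj)[0, 1]            # Pearson correlation, Equation (3)
d2 = np.sum((standardize(xi) - standardize(xj)) ** 2)
assert np.isclose(r, 1 - d2 / (2 * m))   # Equation (4) holds on standardized data
```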
3.2 Density Estimation

One of the important steps in density-based clustering is estimating the density of each point. In DBSCAN [10], the density of an object is defined as the number of objects within a region of specified radius around the point. This approach is similar to a histogram-based method.

Another approach is the kernel-based method, which assigns a weight to each point [7, 18]. That is, points at the edge of the search area influence the density estimator less than the other points. A Gaussian kernel function is normally used for this purpose.

In this paper, we mainly focus on KNN density estimation.
Figure 2: Sample examples of core clusters (a) and the corresponding coherent expression patterns (b)
In the histogram-based and kernel-based approaches, the volume around a point x is fixed. In contrast, KNN density estimation fixes the number of points k in advance, so the size of the volume around a point x is adjusted to include the k-nearest neighbors of x. Based on this, a probability density function for x can be defined as follows:

p(x) = \frac{k/n}{V}    (5)

where n is the total number of points, and V is the size of the volume that includes the k nearest neighbors of x. Hence, in high density regions the volume is expected to be small, while in low density regions it is expected to be large. Another approach to defining KNN density is to utilize the sum of the distances from the k-nearest neighbors to x. In this paper, we use this second notion of KNN density.

Figure 1 illustrates the intuition behind this approach. As shown, for a high density gene, the sum of the distances from its nearest neighbors (or the size of the volume) is relatively smaller than for a low density gene.
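This density estimator can be sketched as follows (assuming scikit-learn for the neighbor search; the paper does not fix the exact mapping from summed distance to density score, so the reciprocal is used here as one monotone choice):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_density(X, k):
    """KNN density per gene: points whose k nearest neighbors are close
    (small summed distance) receive a high density score.
    X is an (n_genes, m_timepoints) matrix of standardized profiles."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own neighbor
    dist, _ = nn.kneighbors(X)
    return 1.0 / dist[:, 1:].sum(axis=1)             # drop the self-distance column
```

Core, border, and noise points then follow from simple thresholding of this score, as described next.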
3.3 Rough Cluster Identification based on Core Points

In density-based clustering, clusters are defined as dense regions (i.e., sets of core points), and each dense region is separated from the others by low density regions (i.e., sets of border points). Thus, once the density of each point is estimated, the next step is to identify the core, border, and noise points.

A core point is a point whose KNN density is greater than a user-defined threshold (λ). Similarly, a noise point is a point whose KNN density is less than a user-defined threshold (τ). Noise points are discarded in the clustering process, since we are mainly concerned with highly co-expressed patterns. A non-core, non-noise point is considered a border point. We discuss how to determine appropriate values for λ and τ in Section 3.5.

A rough cluster structure can then be derived by performing clustering on the core points. This step is based on the following two observations:

- Since border and noise points are excluded in the rough cluster identification step, the clusters are expected to be well separated from each other. That is, if two core points belong to different clusters, they are expected to be far apart. Moreover, if two points belong to the same cluster, they are expected to be close to each other, since the densities of both points are high.

- Although x_i and x_j may be similar, and x_j and x_k may be similar, x_i and x_k can still be dissimilar, since the similarity relation does not satisfy transitivity. This can be partially addressed by adjusting x_k using the k-nearest neighbors of x_k; that is, the representative of the k-nearest neighbors of x_k can be used for the similarity computation.

Based on the above observations, the algorithm for rough cluster identification is outlined as follows:

Input: A set of core points (P)
Output: A set of core point clusters (CP)

1. Initially, the highest density gene x_0 forms a singleton cluster C_0, and x_0 is removed from P.
2. The next highest density gene x_i in P is chosen and removed from P. The similarity between x_i and each previously generated cluster is computed as the similarity between x_i and the representative (e.g., center) of that cluster.
3. The cluster (C_i) that has the maximum proximity to x_i is identified.
4. If the similarity between x_i and C_i exceeds a predefined threshold (α), then x_i is assigned to C_i.
5. If not, x_i is adjusted using the k-nearest neighbors of x_i, and the similarity between the adjusted x_i and C_i is recomputed. If this similarity exceeds α, then x_i is assigned to C_i. Otherwise, x_i forms a new singleton cluster (C_j), which is added to CP.
6. Repeat steps 2-5 until P becomes empty.
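A minimal sketch of this procedure is given below. Pearson similarity, a running-mean cluster representative, and the neighborhood average as the "adjusted" point are assumptions, since the paper leaves these details open; the threshold symbol α follows the notation above:

```python
import numpy as np

def rough_clusters(core, density, knn_idx, alpha):
    """Greedy rough clustering of core points (steps 1-6 above).

    core:    (n, m) standardized profiles of the core points
    density: (n,) KNN density of each core point
    knn_idx: (n, k) neighbor indices, here assumed to index into `core`
    alpha:   similarity threshold for joining an existing cluster
    """
    def sim(a, b):                       # Pearson correlation between two profiles
        return np.corrcoef(a, b)[0, 1]

    clusters, centers = [], []           # member-index lists and mean profiles
    for i in np.argsort(-density):       # visit core points by decreasing density
        if clusters:
            sims = [sim(core[i], c) for c in centers]
            best = int(np.argmax(sims))
            if sims[best] < alpha:       # step 5: retry with the neighborhood average
                sims[best] = sim(core[knn_idx[i]].mean(axis=0), centers[best])
            if sims[best] >= alpha:      # step 4: join the most similar cluster
                clusters[best].append(i)
                centers[best] = core[clusters[best]].mean(axis=0)
                continue
        clusters.append([i])             # steps 1 and 5: start a singleton cluster
        centers.append(core[i].copy())
    return clusters
```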
Figure 2 plots sample clusters: Figure 2(a) shows sample core clusters, and Figure 2(b) illustrates the corresponding coherent patterns that characterize the trend of the expression levels of the genes within each cluster. The coherent pattern of a cluster is defined by the medoid of the cluster. As illustrated, the top two plots show clear cell-cycle regulated patterns. That is, the first cluster contains a set of genes whose expression values peak in the G1 phase, and the second cluster contains a set of genes whose expression values peak in the G2 phase. These clusters were also identified as cell-cycle regulated patterns by Spellman et al.
Figure 3: Plot of the 1,000 lowest density genes (when k = 60)

Figure 4: Plot of the 600 highest density genes (when k = 60)
On the other hand, as illustrated by the third cluster, non-cell-cycle regulated clusters were also identified. This can be explained from several perspectives (e.g., external effects or as-yet-unknown reasons). For example, if DNA is damaged, the cell cycle must be blocked so that the DNA can be repaired. Thus, forming clusters of non-cell-cycle regulated genes (with similar expression patterns), and providing an interpretation of such clusters (based on external knowledge), is an interesting task.
3.4 Cluster Refinement based on Border Points
Once a rough cluster structure is obtained, the next step is to identify the relevant cluster that can host each border point. The proposed clustering algorithm exploits a characteristic of neighborhoods: the label of an object is influenced by the attributes of its neighbors. Examples of such attributes are the labels of the neighbors, or the percentage of neighbors that fulfill a certain constraint. This idea can be translated into the clustering perspective as follows: the cluster label of an object depends on the cluster labels of its neighbors.

To assign a gene (x_i) to an existing cluster, the cluster that can host x_i is identified using the neighborhood of x_i. If such a cluster exists, then x_i is assigned to that cluster. Otherwise, x_i is identified as a transition point, and is not assigned to any cluster.

Toward this end, the set of candidate clusters (C_{x_i}) is identified by selecting every cluster that contains a gene belonging to N_k(x_i). Subsequently, the cluster that can host x_i is identified using one of the following two methods (a sketch of both appears after the list).
1. M1: Considering the size of the overlapped region. Select the cluster that has the largest number of its members in N_k^C(x_i), where N_k^C(x_i) is defined as follows:

N_k^C(x_i) = N_k(x_i) \cap P    (6)

This approach considers only the number of genes in the overlapped region, and ignores the proximity between the neighbors and x_i.

2. M2: Exploiting weighted voting. The similarity between each neighbor of x_i and the candidate clusters is measured. The similarity values are then aggregated using weighted voting, so that each neighbor votes for its cluster with a weight proportional to its proximity to x_i. Let w_{ij} be the weight representing the proximity of neighbor x_j to x_i. Then the most relevant cluster (C_l) is selected based on the following formula:

C_l = \arg\max_{C_k \in C_{x_i}} \sum_{x_j \in N_k^C(x_i)} w_{ij} \, r(x_j, C_k)    (7)
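A sketch of the two rules follows; the function names and the data layout are illustrative, since the paper does not prescribe them:

```python
import numpy as np

def assign_border_point(nbr_idx, weights, labels, profiles, centroids, method="M2"):
    """Assign one border point given its k nearest neighbors.

    nbr_idx:   indices of the border point's k nearest neighbors
    weights:   proximity weights w_ij, aligned with nbr_idx
    labels:    cluster label per point (-1 for points outside any core cluster)
    profiles:  (n, m) matrix of all standardized profiles
    centroids: dict mapping cluster label -> centroid profile
    """
    core_nbrs = [j for j in nbr_idx if labels[j] >= 0]   # N_k^C: neighbors that are clustered
    candidates = {labels[j] for j in core_nbrs}          # C_{x_i}: clusters in the neighborhood
    if not candidates:
        return None                                      # transition point, left unassigned

    if method == "M1":                                   # Eq. 6: vote by overlap size
        return max(candidates, key=lambda c: sum(labels[j] == c for j in core_nbrs))

    def r(a, b):                                         # Pearson similarity
        return np.corrcoef(a, b)[0, 1]
    score = {c: sum(w * r(profiles[j], centroids[c])     # Eq. 7: weighted voting
                    for j, w in zip(nbr_idx, weights) if labels[j] >= 0)
             for c in candidates}
    return max(score, key=score.get)
```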
3.5 Parameterization

One of the key weaknesses of density-based clustering is the need to determine user-defined parameters, and many previously proposed density-based algorithms are known to be sensitive to their input parameters. For instance, in DBSCAN [10], SNN [8], or DHC [17], MinPts (the minimum number of points within a neighborhood) must be set in order to decide whether a given point is a core point.

In our approach, there are three important parameters that must be determined beforehand: λ (the user-defined threshold that separates core points from border points), τ (the user-defined threshold that separates border points from noise points), and k (the length of the nearest neighbor list). In this section, we discuss how to decide the values of these parameters.
The neighborhood list size (k) determines the granularity of the clusters. If k is too small, the algorithm tends to identify a large number of small clusters; if k is too large, a few large clusters are formed. In what follows, we explain how to decide the values of λ and τ when k is fixed.

A sharp change in the slope of the density values can be used to identify the number of core and noise points. To this end, we plot the density of the 1,000 genes with the lowest density values (Figure 3), and the density of the 600 genes with the highest density values (Figure 4). As shown in Figure 3, the density decreases sharply over the range 850-1,000. Thus, we can determine the number of noise points based on Figure 3. However, the slope in Figure 4 decreases rather smoothly; consequently, it is difficult to identify the number of core points based on this plot alone.
        C1     C2     C3     C4     C5     C6     C7
C1    1.00  -0.44  -0.70   0.36   0.16   0.67  -0.03
C2           1.00   0.32   0.02   0.26  -0.05  -0.08
C3                  1.00   0.27  -0.27  -0.60  -0.41
C4                         1.00   0.01   0.40  -0.67
C5                                1.00   0.16  -0.07
C6                                       1.00  -0.08
C7                                              1.00

Table 3: Illustration of between-cluster similarity before refinement

        C1     C2     C3     C4     C5     C6     C7
C1    1.00  -0.38  -0.63   0.34   0.14   0.57  -0.02
C2           1.00   0.28   0.12   0.25   0.12  -0.18
C3                  1.00   0.36  -0.26  -0.56  -0.37
C4                         1.00  -0.02   0.31  -0.63
C5                                1.00   0.16  -0.09
C6                                       1.00  -0.05
C7                                              1.00

Table 4: Illustration of between-cluster similarity after refinement
To address this problem, we first estimate the following probability:

\mathrm{Prob}_c(x_i) = \frac{|N_k(x_i) \cap P|}{|N_k(x_i)|}    (8)

Assuming that the core point set (P) is fixed, Equation 8 measures the fraction of the k-nearest neighbors of x_i that are core points. For a point that lies well inside a cluster, Prob_c(x_i) is expected to be high; for other points (e.g., border points), it is expected to be relatively low. Thus, Prob_c(x_i) measures the actual "coreness" of each point based on its neighborhood. This probability can be used as a guideline when determining the value of λ. For example, to avoid producing singleton clusters in the rough cluster identification step (because a core point should not form a singleton cluster in a high-density region), we choose a moderate value of λ such that Prob_c(x_i) ≠ 0 for all x_i ∈ P.
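This guideline can be sketched as follows; the descending search over candidate thresholds is an assumption, since the paper only states the Prob_c ≠ 0 condition:

```python
import numpy as np

def prob_core(knn_idx, is_core):
    """Equation 8: fraction of each point's k nearest neighbors that are core."""
    return is_core[knn_idx].mean(axis=1)

def choose_lambda(density, knn_idx, candidates):
    """Return the largest candidate threshold for which every core point
    still has at least one core point among its neighbors."""
    for lam in sorted(candidates, reverse=True):
        is_core = density >= lam
        if is_core.any() and prob_core(knn_idx, is_core)[is_core].min() > 0:
            return lam
    return None
```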
          K-means    M1       M2
k = 20    0.4027     0.6176   0.6191
k = 30    -          0.5859   0.5862
k = 40    -          0.6318   0.6328
k = 50    -          0.5824   0.5828
k = 60    -          0.5682   0.5692
k = 70    -          0.5997   0.6007

Table 5: Evaluation of refined cluster structure based on δ1

          K-means    M1       M2
k = 20    0.2213     0.2323   0.2320
k = 30    -          0.2544   0.2542
k = 40    -          0.2307   0.2307
k = 50    -          0.2708   0.2702
k = 60    -          0.2673   0.2668
k = 70    -          0.2678   0.2674

Table 6: Evaluation of refined cluster structure based on δ2
3.6 Experimental Results

For the empirical evaluation of the proposed clustering algorithm, we first describe the evaluation criteria. To measure the within-cluster similarity of a cluster structure, the average pairwise similarity between genes assigned to the same cluster is computed as follows:

\delta_1 = \frac{1}{K} \sum_{r=1}^{K} \left( \frac{1}{|C_r|^2} \sum_{x_i, x_j \in C_r} r(x_i, x_j) \right)    (9)

In non-uniformly distributed datasets, δ1 favors a large number of small clusters. To compensate for this characteristic of δ1, we also measure the average between-cluster similarity of a cluster structure:

\delta_2 = \frac{1}{K^2} \sum_{i=1}^{K} \sum_{j=1}^{K} |R(C_i, C_j)|    (10)

Note that R(C_i, C_j) is defined as the similarity between the centroid vectors of C_i and C_j. In addition, since identifying anti-correlated genes is not our goal, the absolute value of R is taken. In sum, we favor a clustering solution with the largest value of δ1 and the smallest value of δ2 (i.e., we maximize within-cluster similarity and minimize between-cluster similarity).

Table 3 and Table 4 illustrate sample results on between-cluster similarity before and after refinement, respectively (due to space limitations, we show only 7 clusters). As shown in Table 3, since only core points are considered, the similarity between different clusters is low. As illustrated in Table 4, since the centroids of two clusters are used to compute the similarity, the between-cluster similarity values do not necessarily increase when border points are added to the clusters. That is, if the border points added to C_i lie in the direction opposite to C_j, they move the centroid of C_i away from C_j; in contrast, if they lie in C_j's direction, they move the centroid of C_i toward C_j, and the similarity value increases. In any case, we observed that the between-cluster similarity does not change significantly before and after the refinement step, which supports the effectiveness of our refinement step.
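Both criteria are straightforward to compute. The following sketch follows Equations 9 and 10 literally, including the self-similarity terms in the sums:

```python
import numpy as np

def delta1(clusters, profiles):
    """Equation 9: average within-cluster pairwise similarity."""
    total = 0.0
    for members in clusters:                  # each cluster is a list of gene indices
        R = np.corrcoef(profiles[members])    # pairwise Pearson similarities
        total += R.sum() / len(members) ** 2
    return total / len(clusters)

def delta2(clusters, profiles):
    """Equation 10: average absolute similarity between cluster centroids."""
    cents = np.array([profiles[m].mean(axis=0) for m in clusters])
    return np.abs(np.corrcoef(cents)).sum() / len(clusters) ** 2
```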
Figure 5: Comparison of δ = δ1/δ2 for M1, M2, and K-means
We evaluated our algorithm in terms of δ1 and δ2, using K-means clustering [7] as a baseline. Since the number of clusters in K-means must be determined beforehand, we tried different values of K, and chose the smallest K such that increasing K did not much decrease the average distance of points to their cluster centroids. To be fair, we ran K-means multiple times and chose the best result.

We fixed k = 40, and obtained the value of λ using the method discussed in Section 3.5. We then performed clustering while varying the value of k (with λ fixed).

Table 5 compares the results based on δ1. As discussed in Section 3.4, M1 is the method that considers the size of the overlapped region, and M2 is the method that utilizes weighted voting in the refinement step. As shown, M1 and M2 outperform K-means clustering. This is because K-means clustering is sensitive to noise (i.e., a small amount of noise can significantly influence the cluster centroids). Moreover, K-means can only identify spherical clusters, while the proposed method can detect clusters of different shapes. In addition, as the value of k increases, δ1 tends to decrease, since δ1 favors small clusters. However, we observed that the value of δ1 increases at k = 40. This supports our argument that the relationship between λ and k is correctly determined.

Table 6 compares the results based on δ2. Since small local variations in similarity are ignored when k is large, δ2 for M1 and M2 tends to increase as k increases (and the number of clusters decreases). However, at k = 40, we observed that δ2 drops. This also supports the effectiveness of our threshold strategy.

Figure 5 illustrates the overall performance of the algorithm. The x-axis represents the value of k, and the y-axis represents δ = δ1/δ2; a larger value of δ implies a better clustering solution. As depicted, the graph has a peak at k = 40. Therefore, based on the above empirical observations, determining λ (when k is fixed) using the method discussed in Section 3.5 is shown to be effective. This is a significant improvement over previous density-based clustering approaches, which are known to be sensitive to their input parameters (e.g., MinPts).
4. EFFICIENT DENSITY ESTIMATION

For each gene, all pairwise distances need to be computed to find its k nearest neighbors, so the worst-case time complexity of our clustering algorithm is O(n²), where n is the number of genes. For low-dimensional datasets, the time complexity can be reduced to O(n log n) by utilizing spatial data structures [2, 4, 19]. However, as discussed in Weber et al. [28], even at moderate dimensionality (e.g., 10), a sequential scan outperforms the best known indexing structures such as X-trees [4] or SR-trees [19], and the dimensionality of gene expression datasets is higher still. Therefore, dimensionality reduction needs to be performed on gene expression datasets.
Section 4.1 presents a dimensionality reduction algorithm based on Singular Value Decomposition. In Section 4.2, we discuss why dimensionality reduction using Singular Value Decomposition is effective for gene expression datasets.
4.1 Singular Value Decomposition Approach

SVD (Singular Value Decomposition) has been widely used in time-series databases [20] and information retrieval [3]. The basic intuition behind SVD is to examine the entire dataset and rotate the original axes to maximize the variance along the first few dimensions. The dimensionality reduction effect is thus achieved by keeping the first few dimensions while losing the least information. The following theorem provides the mathematical background of SVD [12].
Theorem 1. Given an m × n matrix X, we can decompose X as follows:

X = U \Sigma V^t    (11)

where U is a column-orthonormal m × r matrix (of left singular vectors), r is the rank of X, \Sigma is an r × r diagonal matrix of the singular values of X, and V is a column-orthonormal n × r matrix (of right singular vectors).

Proof. Refer to [12].
Without loss of generality, we can assume that the diagonal components of \Sigma (the singular values \sigma_i of X) are arranged in decreasing order. The beauty of SVD lies in the fact that the number of dimensions can be reduced by discarding the insignificant dimensions (i.e., the smallest singular values). Hence, X_k can be obtained by keeping the first k singular values and discarding the remaining r - k singular values and the corresponding left and right singular vectors of X. The reduced matrices are denoted X_k, U_k, \Sigma_k, and V_k, respectively.
Theorem 2. The matrix C = X X^t is a symmetric matrix, which can be decomposed as follows:

C = U \Sigma^2 U^t    (12)

The left and right singular vectors of C both correspond to the left singular vectors of X (i.e., U). In addition, the eigenvalues of C correspond to the squares of the singular values of X.

Proof. Refer to [12].
Figure 6: Comparison of reconstruction error for DWT and SVD

Figure 7: Illustration of reconstructed SVD (original vs. 7 coefficients)

Note that the time complexity of a naive SVD computation is O(nm² + mn² + n³) = O(n³) (since m ≪ n). However, we can reduce the computational complexity when dealing with gene expression datasets.
Given that the number of genes (n can be more than 10,000) is much larger than the number of conditions or time points (m is usually less than 100), instead of computing the SVD of the m × n matrix directly, U and \Sigma are first obtained by computing the decomposition of X X^t, based on Theorem 2. Then, based on Theorem 1, V can be constructed as follows:

V = X^t U \Sigma^{-1}    (13)

Since matrix multiplication between an l × m matrix and an m × n matrix takes O(lmn), computing X X^t and constructing V take O(m²n) and O(nmr), respectively. Thus, the complexity of the SVD computation for X can be reduced to O(m²n + r³ + nmr). This is a significant improvement over O(n³), since r, m ≪ n.
Once the matrix is decomposed, each gene (x) can be projected onto a point x̂ in k-dimensional space as follows:

\hat{x} = x^t U_k \Sigma_k^{-1}    (14)

Thus, we can build a multi-dimensional index structure over x̂.
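A NumPy sketch of this pipeline is given below; the eigendecomposition of X Xᵗ plays the role of Theorem 2, and the final line applies Equation 14 to every gene at once:

```python
import numpy as np

def svd_project(X, k):
    """Project each gene onto k dimensions (Equations 12-14).
    X is the m x n expression matrix (m time points, n genes), with m << n."""
    C = X @ X.T                                 # m x m, cheap because m is small
    evals, U = np.linalg.eigh(C)                # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]         # keep the k largest
    Uk = U[:, order]
    sk = np.sqrt(np.maximum(evals[order], 0.0)) # singular values of X (Theorem 2)
    return (X.T @ Uk) / sk                      # Equation 14: n x k projected genes
```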
4.2 Discussion

Although SVD has been utilized in gene expression clustering research, the main purpose of previous approaches was to preprocess the data before clustering [6, 9, 15]. In contrast, our main aim here is to efficiently support similarity search in the truncated SVD space.

One of the most widely used techniques for dimensionality reduction in time-series datasets is the DWT (Discrete Wavelet Transform) [5]. By transforming time-series data into the time-frequency domain, the first few DWT coefficients can be indexed through a multi-dimensional index structure. The basic motivation behind this approach is that DWT preserves the essentials of the data in the first few coefficients.

However, DWT is not effective for dimensionality reduction of gene expression datasets. To observe this, we randomly selected 3,000 genes (that had no missing values) from Spellman's dataset. Since the length of a time series needs to be an integral power of 2 for the Haar wavelet, for simplicity we removed the last two time points of each series instead of padding with zeros. Thus, the size of the gene expression profile (X) becomes 16 × 3,000.
Figure 8: Illustration of reconstructed DWT when the largest 7 coefficients (in absolute value) are retained
Figure 6 compares the average reconstruction error of X. The x-axis represents the number of coefficients retained, and the y-axis represents the relative error, computed as (||X - X'|| / ||X||) × 100, where X is the original time-series data and X' is the data reconstructed from the compressed representation. Although no false dismissals are guaranteed (since DWT and SVD are orthonormal transforms), a large reconstruction error increases the number of false hits. Thus, the reconstruction error needs to be minimized in order to reduce the cost of post-processing. As shown in Figure 6, SVD outperforms DWT when the number of retained dimensions is between 1 and 11. Figure 7 also illustrates how well SVD retains the basic shapes of the time series (up/down peaks).
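For reference, the SVD side of the y-axis in Figure 6 can be reproduced in a few lines (a sketch; the DWT side would be analogous with a Haar transform):

```python
import numpy as np

def svd_relative_error(X, c):
    """Percent reconstruction error when only c singular values are kept."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_rec = (U[:, :c] * s[:c]) @ Vt[:c]          # rank-c reconstruction of X
    return np.linalg.norm(X - X_rec) / np.linalg.norm(X) * 100
```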
Assuming that the transformation is orthonormal, in order to minimize the reconstruction error, keeping the largest coefficients (in terms of absolute value) is better than keeping the first few coefficients; Figure 8 illustrates this. However, since the indices of the retained coefficients need to be stored, keeping the largest coefficients requires additional indexing structures beyond the R-tree family. In addition, the distance computation between two coefficient sets becomes expensive, since the coefficient sets do not align with each other.
Furthermore, a conventional DWT-based approach cannot handle unevenly spaced sampling time points. For instance, using cDNA microarrays, Iyer et al. [16] measured the physiological response of fibroblasts to serum at 12 time points for 8,613 genes over 24 hours; the sampling times were 0, 0.15, 0.3, 1, 2, 4, 6, 8, 12, 16, 20, and 24 hours after serum stimulation. In this situation, unless we rely on interpolation or lifting [24], DWT is not directly applicable.
5. RELATED WORK
In this section, we briefly review previous gene expression clustering approaches. Note that this section is not intended as a comprehensive survey of all published gene expression clustering algorithms; it only aims to provide a concise overview of the algorithms that are directly related to our approach. Jiang et al. [18] provide a comprehensive review of gene expression clustering; for details, refer to that paper.
Partition-based clustering decomposes a collection of genes into groups that are optimal with respect to some predefined objective function; the center-based approach is a representative example [11, 26]. Center-based algorithms find the clusters by partitioning the entire dataset into a pre-determined number of clusters [7, 11, 26]. Although center-based clustering algorithms have been widely used in gene expression clustering, they have the following drawbacks. First, they are sensitive to the initial seed selection; depending on the initial points, they are susceptible to local optima. Second, as discussed in Section 3.6, they are sensitive to noise. Third, the number of clusters must be determined beforehand.
Hierarchical agglomerative clustering (HAC) finds clusters by initially assigning each gene to its own cluster and then repeatedly merging pairs of clusters until a certain stopping condition is met [7, 9, 22]. Its result thus takes the form of a tree, referred to as a dendrogram. The advantage of HAC lies in its ability to provide a view of the data at multiple levels of abstraction. However, the user must determine where to cut the dendrogram to produce the actual clusters. This step is usually done by human visual inspection, which is a time-consuming and subjective process. Moreover, the computational complexity of HAC is high: HAC takes O(n³) if the pairwise similarities between clusters are recomputed whenever two clusters are merged, although the complexity can be reduced to O(n² log n) if a priority queue is utilized.
The graph-based approach [29] utilizes graph algorithms (e.g., minimum spanning tree or minimum cut) to partition a graph into connected subgraphs. However, due to points in the transition region, this approach may end up with one highly connected set of genes.
To better explain time-course gene expression datasets, new models have been proposed to capture the relationships between time points [1]. However, they assume that the data fits a certain distribution, an assumption which does not hold for gene expression datasets, as discussed in Yeung et al. [32].
As we have discussed, our work is motivated by previous density-based clustering approaches such as DBSCAN [10] and SNN [8]. However, since these approaches utilize a notion of connectivity to build clusters, they might not be suitable for datasets with moderately dense transition regions. In addition, both approaches require user-defined parameters (e.g., MinPts), which are difficult to determine in advance. Recently, density-based clustering algorithms have been applied to gene expression datasets [17, 18]. Both approaches are promising in that a meaningful hierarchical cluster structure (rather than a dendrogram) can be built. However, they also have drawbacks: MinPts must be determined beforehand [17], or the method is computationally expensive due to kernel density estimation [18].
6. CONCLUSION AND FUTURE WORK
We presented a mining framework for microarray data analysis. An experimental prototype system has been developed, implemented, and tested to demonstrate the effectiveness of the proposed model. In order to identify co-expressed genes in a yeast cell cycle dataset, we developed a clustering algorithm based on KNN density estimation. For an efficient k-nearest neighbor search, we also explored dimensionality reduction methods that are relevant for gene expression data.
We intend to extend this work in three directions. First, besides evaluating our approach against K-means clustering, we plan to use other gene expression clustering methods for a more comprehensive comparison. Second, as discussed by Shatkay et al. [23], two genes with strongly anti-correlated expression levels may be functionally related; that is, a gene may be strongly suppressed to allow another gene to be expressed. Since such anti-correlated genes may be involved in the same biological pathway, rather than separating them into different clusters, it is essential to detect anti-correlated gene clusters and investigate the functional similarity between those clusters. Finally, in order to interpret the obtained gene clusters, external knowledge needs to be incorporated. Toward this end, we plan to explore the relationship between clusters and known biological knowledge by utilizing gene ontologies (e.g., Gene Ontology or MIPS) or published biomedical literature (e.g., PubMed).
7. ACKNOWLEDGMENTS
This research has been funded in part by the Integrated Media Systems Center, a National Science Foundation Engineering Research Center, Cooperative Agreement No. EEC-9529152.
8. REFERENCES
[1] Z. Bar-Joseph et al. A new approach to analyzing gene
expression time series data. In Proceedings of Annual
Conference on Research in Computational Molecular
Biology, 2002.
[2] N. Beckmann, H. P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. ACM SIGMOD Record, 19(2):322-331, 1990.
[3] M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using
linear algebra for intelligent information retrieval.
SIAM Review, 37(4):573-595, 1995.
[4] S. Berchtold, D. A. Keim, and H. P. Kriegel. The X-tree: an index structure for high dimensional data. In Proceedings of the 22nd International Conference on Very Large Data Bases, 1996.
[5] K. Chan and A. W. Fu. Efficient time series matching by wavelets. In Proceedings of the IEEE International Conference on Data Engineering, 1999.
[6] C. H. Q. Ding, X. He, H. Zha, and H. D. Simon.
Adaptive dimension reduction for clustering high
dimensional data. In Proceedings of the 2002 IEEE
International Conference on Data Mining, 2002.
[7] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd Ed.). Wiley, New York, 2001.
[8] L. Ertoz, M. Steinbach, and V. Kumar. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.
[9] M. B. Eisen et al. Cluster analysis and display of
genome-wide expression patterns. In Proceedings of
National Academy of Science, 95(25):14863-14868,
1998.
[10] M. Ester, H. Kriegel, J. Sander, and X. Xu. A
density-based algorithm for discovering clusters in large
spatial databases with noise. In Proceedings of the 2nd
ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 1996.
[11] A. Gasch, and M. Eisen. Exploring the conditional
coregulation of yeast gene expression through fuzzy
k-means clustering. Genome Biology, 3(11):1-22, 2002.
[12] G. H. Golub et al. Matrix computations. North Oxford
Academic, Oxford, UK, 1996.
[13] T. R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(15):531-537, 1999.
[14] J. Herrero et al. A hierarchical unsupervised growing
neural network for clustering gene expression patterns.
Bioinformatics, 17(2):126-136, 2001.
[15] D. Horn and I. Axel. Novel clustering algorithm for microarray expression data in a truncated SVD space. Bioinformatics, 19(9):1110-1115, 2003.
[16] V. R. Iyer et al. The transcriptional program in the response of human fibroblasts to serum. Science, 283(5398):83-87, 1999.
[17] D. Jiang, J. Pei, and A. Zhang. DHC: a density-based
hierarchical clustering method for time series gene
expression data. In Proceedings of the 3rd IEEE
International Symposium on BioInformatics and
BioEngineering, 2003.
[18] D. Jiang, J. Pei, and A. Zhang. Towards interactive
exploration of gene expression patterns. ACM SIGKDD
Explorations, 6(1):79-90, 2004.
[19] N. Katayama, and S. Satoh. The SR-tree: an index
structure for high-dimensional nearest neighbor queries.
In Proceedings of ACM SIGMOD International
Conference on Management of Data, 1997.
[20] F. Korn, H. V. Jagadish, and C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1997.
[21] S. Morishita, T. Hishiki, and K. Okubo. Towards
mining gene expression database. In Proceedings of
ACM SIGMOD Workshop on Research Issues in Data
Mining and Knowledge Discovery, 1999.
[22] P. T. Spellman et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9(12):3273-3297, 1998.
[23] H. Shatkay, S. Edwards, and M. Boguski. Information
retrieval meets gene analysis. IEEE Intelligent Systems,
17(2):45-53, 2002.
[24] W. Sweldens, and P. Schroder. Building your own
wavelets at home. Wavelets in Computer Graphics,
ACM SIGGRAPH Course Notes, 1996.
[25] P. Tamayo et al. Interpreting patterns of gene expression with self-organizing maps. In Proceedings of the National Academy of Sciences, 96(6):2907-2912, 1999.
[26] S. Tavazoie et al. Systematic determination of genetic
network architecture. Nature Genetics, 22(3):281-285,
1999.
[27] H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering
by pattern similarity in large data sets. In Proceedings
of ACM SIGMOD International Conference on
Management of Data, 2002.
[28] R. Weber, H. J. Schek, and S. Blott. Quantitative
analysis and performance study for similarity-search
methods in high-dimensional spaces. In Proceedings of
the 24th International Conference on Very Large Data
Bases, 1998.
[29] Y. Xu, V. Olman, and D. Xu. Clustering gene
expression data using a graph-theoretic approach: an
application of minimum spanning trees. Bioinformatics,
18(4):536-545, 2002.
[30] J. Yang, H. Wang, W. Wang, and P. S. Yu. Enhanced
biclustering on expression data. In Proceedings of IEEE
International Symposium on BioInformatics and
BioEngineering, 2003.
[31] K. Y. Yeung, and W. Ruzzo. An empirical study on
principal component analysis for clustering gene
expression data. Bioinformatics, 17(9):763-774, 2001.
[32] K.Y. Yeung et al. Model-based clustering and data
transformations for gene expression data.
Bioinformatics, 17(10):977-987, 2001.