Vol. 5, No. 1, January-June 2012, pp. 13-18,
Published by Serials Publications, ISSN: 0973-7413
Mining Gene Expression Data Using PCA Based Clustering
N.P. Gopalan1 and B. Sathiyabhama2*
1 Department of Computer Applications, National Institute of Technology, Tiruchirappalli, 627015,
Tamilnadu, India, E-mail: [email protected]
2 Department of Computer Science and Engineering, National Institute of Technology, Tiruchirappalli, 627015,
Tamilnadu, India, E-mail: [email protected]
ABSTRACT: As the amount of laboratory data in molecular biology and bioinformatics grows exponentially each year due
to advanced technologies such as DNA microarrays, new efficient and effective clustering methods must be developed to process
this fast-growing body of biological data. Numerous clustering techniques have been applied to the analysis of gene expression
data to extract biologically significant patterns. However, issues such as clustering quality, the high dimensionality of the input
data and computational efficiency still need to be addressed. A novel hybrid clustering algorithm is proposed, which blends
Principal Component Analysis (PCA) with enhanced correlation based clustering. PCA is a classical statistical technique for
finding patterns in high-dimensional data. The empirical results show that this approach provides more stable clustering
performance in terms of quality and efficiency. The resulting clusters offer potential insight into gene function, molecular
biological processes and regulatory mechanisms.
Keywords: Clustering analysis; Bioinformatics; Gene expression data; Principal Component Analysis
1. INTRODUCTION
DNA microarray technology has made it possible to simultaneously monitor the expression levels of
thousands of genes during important biological processes across collections of related samples. It
holds enormous promise in areas such as revealing the function of genes in various cell populations,
tumor classification, drug target identification, understanding cellular pathways, and prediction
of outcome to therapy [1], [2]. A major application of microarray technology is gene expression
profiling to predict outcome in multiple tumor types [3].
Data mining methods can be applied to various gene expression data sets, including cancer data
sets, in order to identify distinct genes that classify tumors. Cluster analysis, one of the data
mining techniques, seeks to partition a given data set into groups based on specified features so
that the data points within a group are more similar to each other than to the points in different
groups [2]. Clustering techniques are useful in identifying (as yet unknown) subclasses of tumors,
or in identifying clusters of genes that are co-regulated or share the same function [4]. These
methods have been successful in separating certain types of genes associated with different types
of leukemia and lymphoma [3]. Genes that fall into the same biologically relevant cluster have
similar expression patterns and are called co-expressed genes.
Clustering has become an efficient and mandatory tool for in-silico analysis of gene expression
data [5], [6], [7], [8], [9]. A variant of the hierarchical clustering algorithm is used by Eisen
et al. [7] to identify groups of co-expressed yeast genes. A two-way clustering technique [5] is
used to detect clusters of correlated genes and tissues. Self-Organizing Maps (SOM) [9] are used
to identify clusters in the yeast cell cycle data set and the human hematopoietic differentiation
data set. Biologically meaningful clusters of the yeast Cho data have been determined by using a
genetic enhanced K-Means clustering method [10]. A variety of clustering validation measures are
used in the literature to evaluate the validity of clustering results [11], [12]. Numerous
validation indexes, such as the Jaccard coefficient, the simple matching coefficient and Hubert's
Γ (gamma) statistic (HGS) [13], are used in practice to evaluate the stability of parameters and
the reliability of clustering algorithms, but they are applied only in the post-validation phase.
* Corresponding Author: [email protected]
Clustering techniques have the drawbacks of poor clustering quality and destabilization of
clusters [4], [14], [15], [16]. Vincent Tseng et al. have used a correlation based clustering
algorithm for partitioning co-regulated genes. To improve the quality of clustering, a validation
technique is integrated into the clustering process [13]. In the initial stage of clustering this
algorithm adds highly negatively correlated elements in addition to positively correlated
elements. In the later phase, it removes the cluster members that were inaccurately added. Hence
it consumes more computational resources. Recently the authors have developed a variant of sparse
matrices to represent the gene expression similarity matrix [17], [18]. A sparse matrix is a
suitable data structure for effective memory utilization. The authors also improved the
validation statistic by replacing the basic Hubert's statistic with a fast heuristic, namely
Enhanced HGS (EHGS).
Computational intelligence [18], [19] is generally accepted to include evolutionary computation
and is used to increase the precision of the resolved structure. The genetic algorithm (GA) has
proven to be a robust and effective search method that requires very little information about the
problem to explore a large search space. A blend of computational intelligence and clustering
approaches provides rapid, automated feature selection and pattern recognition for a wide
assortment of gene expression profiles [19].
Most clustering algorithms suffer from the high dimensionality and huge size of the data. A good
clustering algorithm is required to analyze these fast-growing gene expression data sets
efficiently and effectively, but the dimensionality and size of the data pose challenging problems
in both computational and biomedical research, and the difficult task ahead is transforming gene
expression data into subject-specific knowledge. Various methods have been developed to reduce the
size of gene expression data [20], [21], [22]. In the proposed work, the clustering algorithm is
integrated with a dimensionality reduction technique, namely Principal Component Analysis (PCA),
whose goal is to reduce the dimensionality of the data to facilitate visualization and additional
analysis. PCA is often used as a pre-processing step to the clustering analysis of large data sets
and is widely used on gene expression data.
2. RESEARCH METHODOLOGY
The high dimensionality of gene expression data sets and the high percentage of irrelevant or
redundant genes make it very difficult either to classify samples or to pick out the important
genes in a context where little domain knowledge is available. To address this problem, PCA has
been applied to analyze gene expression data. PCA is a classical statistical technique that
reduces the dimensionality of the data by transforming it to a set of variables that summarize the
features of the data without much loss of information [22]. Principal Components (PCs) are
uncorrelated and ordered. PCA is closely related to a mathematical technique called Singular Value
Decomposition (SVD), which is applied before the clustering process. Hence, only the relevant data
is passed to the clustering step. SVD takes a gene expression data matrix A of order n X p, where
the n rows represent the genes and the p columns (p is approximately equal to n) represent the
experimental conditions. The SVD theorem is as follows:
$A_{n \times p} = U_{n \times n} \, S_{n \times p} \, V^{T}_{p \times p}$   (1)
$U^{T} U = I_{n \times n}, \quad V^{T} V = I_{p \times p}$   (2)
U and V are orthogonal. The columns of U are the left singular vectors (the gene coefficient
vectors), and S has the same dimensions as A, with the singular values on its diagonal. SVD thus
represents an expansion of the original data in a coordinate system where the covariance matrix is
diagonal. Computing the SVD consists of finding the eigenvalues and eigenvectors of the following:
$AA^{T}$ and $A^{T}A$   (3)
The components are selected according to the eigenvalues of their eigenvectors. The selected
eigenvectors form a feature vector, which is the notion of data compression. The eigenvector with
the highest eigenvalue is the first PC of the data set; it is the one that points down the middle
of the data and captures the most significant relationship between the data dimensions. The least
significant components are ignored.
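The PCA step described here can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `pca_svd`, the random test matrix and the choice of three components are assumptions made only for illustration. The sketch centres each condition, takes the SVD, and keeps the leading right singular vectors as the feature vector.

```python
import numpy as np

def pca_svd(A, n_components):
    """Project the rows of A (genes x conditions) onto the leading PCs."""
    # Centre each condition (column) so that the SVD diagonalises the covariance.
    centred = A - A.mean(axis=0)
    # Thin SVD: centred = U @ diag(s) @ Vt, singular values in decreasing order.
    U, s, Vt = np.linalg.svd(centred, full_matrices=False)
    # Eigenvalues of the covariance matrix centred.T @ centred / (n - 1).
    eigenvalues = s ** 2 / (A.shape[0] - 1)
    explained = eigenvalues / eigenvalues.sum()
    # Feature vector: the leading right singular vectors (rows of Vt).
    components = Vt[:n_components]
    scores = centred @ components.T
    return scores, components, explained[:n_components]

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))                           # hypothetical: 50 genes, 8 conditions
A[:, 1] = 2.0 * A[:, 0] + 0.01 * rng.normal(size=50)   # two nearly redundant conditions
scores, components, explained = pca_svd(A, n_components=3)
print(scores.shape, explained)
```

The rows of `scores` are the reduced gene profiles that would be handed to the clustering step; how many components to keep is left as a free parameter here.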
The clustering algorithm then forms the similarity matrix from the PCs, using only the relevant,
biologically significant data. Unlike traditional clustering algorithms, the proposed approach
uses a constraint based addition procedure to add elements to the clusters. It never removes an
element from a cluster once it has been added, and outliers are filtered out during the initial
phase of the clustering process. Consequently, the stability and quality of the clustering process
are improved. To assess the predictive power of the clustering algorithm and the quality of the
clustering results, a combination of EHGS [17] and the figure of merit (FOM) [11] is used.
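As an illustration of building a similarity matrix from reduced gene profiles, the following sketch uses the Pearson correlation between rows. The function name `correlation_similarity`, the `min_abs` cut-off that zeroes weak entries (to mimic a sparse representation) and the toy data are assumptions, not details taken from the paper.

```python
import numpy as np

def correlation_similarity(scores, min_abs=0.0):
    """Pearson correlation between gene profiles (rows of scores)."""
    S = np.corrcoef(scores)                  # n x n, symmetric, unit diagonal
    # Zero out weak correlations so that the matrix can be stored sparsely.
    return np.where(np.abs(S) >= min_abs, S, 0.0)

# Hypothetical reduced profiles for three genes.
profiles = np.array([[1.0, 2.0, 3.0],
                     [2.0, 4.0, 6.1],
                     [3.0, 1.0, -2.0]])
S = correlation_similarity(profiles, min_abs=0.5)
print(np.round(S, 3))
```

Genes 0 and 1 rise together and receive a similarity close to 1, while gene 2 moves in the opposite direction and receives a strongly negative value.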
A typical gene expression data set contains the expression levels of 'n' genes measured under 'n'
experimental conditions. The expression levels of co-regulated genes are expected to vary
similarly across the 'n' conditions. Consequently, clustering the genes based on similarities
among these expression level measurements should isolate clusters of biologically related genes.
The EHGS is as follows:
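$\Gamma = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{(A(i,j) - \mu_A)(B(i,j) - \mu_B)}{\sigma_A \, \sigma_B}, \qquad M = \frac{n(n-1)}{2}$   (4)
where A is the similarity matrix, B is the matrix that records whether two genes are placed in the
same cluster, and $\mu$ and $\sigma$ denote the mean and standard deviation of the entries of the
corresponding matrix.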
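For illustration, the normalized Hubert statistic on which EHGS builds can be computed as in the sketch below. It shows only the basic statistic, not the fast EHGS heuristic of [17]; the function name `hubert_gamma`, the convention of returning 0 when the statistic is undefined, and the toy similarity matrix are assumptions.

```python
import numpy as np

def hubert_gamma(A, labels):
    """Normalized Hubert statistic between a similarity matrix and a clustering."""
    labels = np.asarray(labels)
    n = len(labels)
    iu = np.triu_indices(n, k=1)         # all pairs i < j, M = n(n-1)/2 of them
    a = A[iu]
    # b is 1 when both genes of a pair share a cluster, 0 otherwise.
    b = (labels[:, None] == labels[None, :])[iu].astype(float)
    if a.std() == 0 or b.std() == 0:
        return 0.0                       # undefined case; convention assumed here
    return float(np.mean((a - a.mean()) * (b - b.mean())) / (a.std() * b.std()))

# Hypothetical similarity matrix with two obvious groups: {0, 1} and {2, 3}.
A = np.full((4, 4), 0.1)
A[0, 1] = A[1, 0] = 0.9
A[2, 3] = A[3, 2] = 0.8
np.fill_diagonal(A, 1.0)
good = hubert_gamma(A, [0, 0, 1, 1])
bad = hubert_gamma(A, [0, 1, 0, 1])
print(good, bad)
```

A clustering that agrees with the similarity structure scores close to 1, while one that cuts across it scores below 0.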
A clustering algorithm is said to have good predictive power if genes in the same cluster tend to
have similar expression levels. Among the experimental conditions, the condition that is not used
to produce the clusters serves as the leave-one-out condition and is assumed to be the least
significant constraint [11]. The clustering process is evaluated with reference to this left-out
condition. This procedure gives no guarantee that the left-out condition is the appropriate one
for determining the predictive power of the clustering algorithm. There is no proof that the
left-out condition is not biologically significant, because every condition is equally likely to
be left out. Hence the proposed algorithm uses a variant of FOM, a set of scalar quantities that
determine the predictive power of clusters. A threshold parameter is attached to every cluster
produced for each pre-defined biological condition, i.e. Ti is the threshold parameter for the
ith cluster (where $1 \le i \le k$) and k is the number of clusters. This heuristic helps to
reduce redundant computations. The idea behind the FOM is that the data from conditions 0, 1,
2, ..., (m-1) are used to estimate the predictive power of the algorithm. Suppose 'k' clusters
C1, C2, ..., Ck are obtained, with cluster sizes s1, s2, ..., sk, such that
$\sum_{i=1}^{k} s_i = n$. Let R(i, j) be the expression level of gene 'i' under condition 'j' in
the similarity matrix, and let FOM(i, k) be the FOM for k clusters using condition 'i' as
validation along with a threshold value Ti. Thus FOM is defined as
$\mathrm{FOM}(i, k) = \frac{1}{T_i} \, n(R(X, e))$   (5)
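For reference, the standard figure of merit of Yeung et al. [11], on which the thresholded variant builds, can be sketched as follows. The sketch does not include the threshold Ti, and the helper names `fom`, `aggregate_fom` and `sign_clusters` (a toy stand-in for a real clustering algorithm) are assumptions made for illustration.

```python
import numpy as np

def fom(R, labels, e):
    """Within-cluster RMS deviation of the left-out condition e."""
    labels = np.asarray(labels)
    n = R.shape[0]
    total = 0.0
    for c in np.unique(labels):
        values = R[labels == c, e]
        total += np.sum((values - values.mean()) ** 2)
    return float(np.sqrt(total / n))

def aggregate_fom(R, cluster_fn):
    """Sum of FOM over all conditions, clustering each time without condition e."""
    m = R.shape[1]
    return sum(fom(R, cluster_fn(np.delete(R, e, axis=1)), e) for e in range(m))

def sign_clusters(X):
    # Hypothetical stand-in clusterer: split genes by the sign of their mean level.
    return (X.mean(axis=1) > 0).astype(int)

# Hypothetical data: two up-regulated and two down-regulated genes, three conditions.
R = np.array([[ 1.0,  1.1,  0.9],
              [ 1.2,  0.9,  1.0],
              [-1.0, -1.1, -0.9],
              [-0.8, -1.0, -1.2]])
good = aggregate_fom(R, sign_clusters)
bad = aggregate_fom(R, lambda X: np.array([0, 1, 0, 1]))
print(good, bad)
```

A lower aggregate FOM indicates that genes grouped together without seeing a condition also behave alike under that condition, which is the sense of predictive power used in the text.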
Numerous strategies are available in the literature for setting the threshold value [23]. There
is no theoretical proof that the chosen value is appropriate for clustering. In the proposed
approach, a metric suited to gene expression data clustering is assigned. The threshold value Ti
is the expected average distance (according to the distance metric) of the objects assigned to
the cluster Ci. A mixture of the enhanced correlation based variant and the FOM operators is used
to select the appropriate cluster for the gene expression data and to validate the clusters
simultaneously. Hence the proposed algorithm, Gene clustering using Correlation Search Technique
(G-CST), is scalable, efficient and resilient in determining biologically significant patterns.
3. THE PROPOSED (G-CST) ALGORITHM
The algorithm consists of an initialisation step and an iterative step. In the initialisation
step, the algorithm first computes the PCA by determining the eigenvectors. The similarity matrix
is constructed from the feature vectors. After that the appropriate population is generated. The
iterative step successively selects elements and allocates them to the appropriate cluster.

Input: Gene expression data (n X n matrix).
Output: Biologically significant clusters.
begin
1. Initialization:
   A: input gene expression similarity matrix A of n X n
   Perform PCA: $AA^{T}$ and $A^{T}A$
   M = n(n - 1) / 2
   SA = $\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} A(i, j)$
   SB = 0
   C = 0
   G = {1, 2, 3, ..., n}
   CGST = 0
   Max_EHGS = $\Gamma_{max}$;
2. while (G is not empty) do
   begin
      Copen = 0;
      for i := 1 to n do
         a(i) = 0;
      pick an element c from G with maximum neighbours;
      remove c from G;
      for i := 1 to n do
         a(i) = A(c, i);
      Copen = {c};
      FOM(i, k) = (1 / Ti) n(R(X, e))
      // Adding elements to the clusters
      repeat
         while Max_EHGS and FOM(i, k) do
         begin
            select c from G with highest a(i)
            remove c from G;
            SB = SB + |Copen|
            SAB = SAB + a(c);
            for i := 1 to n do
               a(i) = a(i) + A(c, i);
         end
      until all elements in A(i, j) are assigned;
   end
3. Return the collection of clusters;
end
Figure 1: Pseudo Code for G-CST Clustering

3.1 Clustering and Validation
The input for the algorithm is a raw gene expression data matrix. This is converted into a sparse
symmetric similarity matrix of the gene expression data set. The algorithm constructs clusters
one at a time. The current cluster is denoted by Copen. Each cluster is started from a seed value
and constructed incrementally by adding items to Copen. The addition of data items is computed
using EHGS [17] and is defined as add(k). The current maximum is represented as $\Gamma_{max}$.
An element k is added if it has a high positive correlation, i.e. high similarity. The algorithm
also places gene data items of low similarity in different clusters according to the $\Gamma$
value. The value of $\Gamma$ lies in (-1, 1), and a higher value of $\Gamma$ represents better
clustering quality. A data item is added to the cluster if it satisfies the maximum neighbours
criterion and a threshold value. In general, the threshold value depends on the number of
patterns and the number of features in the data set. The set of clusters is stabilized by
consecutive addition operations. To start a new cluster, the data item with the maximum number of
neighbours, or closest data items, is used. A threshold value is also used while adding an
element. This automatically filters out the outlier data items, while the remaining items are
inserted into their respective clusters. The mixture of validation measures provides increased
predictive performance relative to other methods of pattern recognition. These are the principal
heuristics attached to this algorithm, and they are responsible for assigning clusters to all the
valid items. Unlike in the correlation based clustering algorithm devised by Vincent Tseng et al.
[13], items need not be removed from a cluster once they have been added.
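The seed-and-grow construction described in this section can be sketched in simplified form. This is not G-CST itself: the sketch replaces the EHGS and FOM criteria with a single mean-similarity threshold, and the function name `seed_and_grow`, the threshold of 0.5 and the toy matrix are assumptions. It keeps the three properties stressed in the text: the seed is the item with the most neighbours, items are never removed once added, and items without close neighbours are filtered out as outliers.

```python
import numpy as np

def seed_and_grow(S, threshold):
    """Greedy clustering on a similarity matrix S; items are added, never removed."""
    unassigned = set(range(S.shape[0]))
    clusters, outliers = [], []
    while unassigned:
        idx = np.array(sorted(unassigned))
        close = S[np.ix_(idx, idx)] >= threshold
        # Neighbour count per unassigned item, excluding the item itself.
        counts = close.sum(axis=1) - close.diagonal().astype(int)
        pos = int(np.argmax(counts))
        seed = int(idx[pos])
        unassigned.remove(seed)
        if counts[pos] == 0:
            outliers.append(seed)        # no close neighbours: treat as an outlier
            continue
        cluster = [seed]
        while unassigned:
            # Candidate with the highest mean similarity to the current members.
            best = max(sorted(unassigned), key=lambda i: S[i, cluster].mean())
            if S[best, cluster].mean() < threshold:
                break
            cluster.append(best)
            unassigned.remove(best)
        clusters.append(cluster)
    return clusters, outliers

# Hypothetical similarity matrix: groups {0, 1, 2} and {3, 4}, item 5 is an outlier.
S = np.full((6, 6), 0.1)
S[:3, :3] = 0.9
S[3:5, 3:5] = 0.8
np.fill_diagonal(S, 1.0)
clusters, outliers = seed_and_grow(S, threshold=0.5)
print(clusters, outliers)
```

Because the number of clusters emerges from the threshold rather than from a user-supplied k, the sketch also illustrates why no cluster count has to be given in advance.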
4. EMPIRICAL RESULTS
To assess the performance of the proposed approach, it is compared with the K-Means, E-CAST [23]
and E-CST algorithms on cancer gene expression data sets. Several other algorithms are available
in the literature for comparison, but these were chosen because they are closely related to the
proposed algorithm. Data sets [24] from breast cell lines are used here to evaluate the proposed
methodology. To estimate the predictive power of the clustering algorithms, a mixture of FOM and
EHGS is used. To obtain reliable clustering results, the proposed approach, K-Means, E-CAST and
the Enhanced Correlation Search Technique (E-CST) are executed 25, 20, 25 and 20 times
respectively. Transfection with a single oncogene is expected to generate similar expression
profiles, presumably because only a few genes are biologically influenced. Therefore, it is
desirable to see whether the profiles of the different phenotypes can be partitioned. Due to the
presence of noise in the data and the similarity between the different samples, common clustering
techniques such as K-Means and E-CAST failed to produce good quality clusters.
Expression levels of the four cell lines were measured in two separate sets of four measurements.
The cluster structures of these data sets are determined in advance. From the given data set,
users can set parameters to generate various kinds of gene expression data sets that vary in the
number of clusters and the number of genes in each cluster. First, seed genes are generated, with
the same number of constraints for all the clusters. If the seed genes and the threshold values
are appropriately incorporated and tested in the algorithm, the resulting clusters have high
intra-cluster similarity and low inter-cluster similarity. During the initial phase of the
clustering process the outliers or noise are purged successfully.
The proposed approach (G-CST) is compared with the other clustering algorithms. Table 1 gives the
details of the data sets, cluster structure and clustering patterns for the proposed approach,
E-CAST, K-Means and E-CST, together with their computational time (running time in Table 1). The
newly designed algorithm outperforms the others quantitatively and qualitatively in computational
time and memory utilization. E-CST comes closest to it in accuracy. In addition, the results
illustrate that the quality of clustering is better in the proposed algorithm. It can provide
more accurate results and insight into molecular processes, morphological characteristics and
gene control functions. Figures 2 and 3 depict a large contiguous group of genes sharing similar
expression patterns over a set of conditions. This type of clustering structure elaborates the
biological significance of the underlying genes.
The curved lines in Figures 2 and 3 represent the sum of the average FOM(i, k) and EHGS measures
over the experimental conditions. The newly designed algorithm and E-CST are very sensitive to
outliers. The number of clusters is a crucial parameter in traditional clustering algorithms,
whereas the proposed approach produces the clusters automatically without any user input. The
result of this clustering analysis may be a group of co-regulated genes (i.e. genes that exhibit
similar experimental behavior) that are placed in the same cluster. They express the
relationships between the clusters and the functional categories in biological activities. The
behavior of the clustering algorithms on the data sets presented here demonstrates a feature of
gene expression that makes this method particularly useful. It is known that genes expressed
together share common functions. Gene expression patterns suffice to separate genes into
functional categories across a relatively small and redundant collection of conditions. It has
been observed that the addition of more, and more diverse, conditions can only enhance these
observations.
The behaviour of the clustering algorithms on gene expression data set 2 is very similar to that
on data set 1, as shown in Table 1. When the number of clusters is small, the E-CAST and K-Means
algorithms have comparable FOM and EHGS values, which are lower than those of the new approach
and E-CST. When the number of clusters is large, the proposed algorithm has comparable FOM and
EHGS values. In Figure 2, a knee-shaped structure in the curves between one and two clusters
indicates that cluster separation is minimal for data set 2. The data sets considered for
evaluation exhibit declining validity measures under all algorithms as the number of clusters
increases. Two factors contribute to this. First, the algorithms may be finding higher quality
clusters as they subdivide large, coarse clusters into smaller, more homogeneous ones. Second,
simply increasing the number of clusters tends to decrease the validity measures. There is an
obvious negative slope in both figures, showing that clustering results with low values tend to
have high correspondence with the given functional categorization. The mixture of FOM and EHGS
provides a meaningful estimate of cluster quality.
Table 1
Experimental Results for 2 Gene Expression Data Sets

Algorithm        No. of     Enhanced    No. of outliers   Running Time     Patterns
                 Clusters   Statistic   (approx)
E-CST            63, 40     0.72        200, 10           O(n log n)       3000, 200
E-CAST           58, 23     0.71        700, 20           O(n^2)           2500, 180
K-means          27, 16     0.36        900, 12           O(n^2)           2000, 75
Proposed G-CST   67, 51     0.74        < 100, 10         >= O(n log n)    > 3000, 200
The number of outliers shown in Table 1 is the approximate total over all the clusters together.
In G-CST the number of outliers is clearly much lower, since only relevant genes are considered
for building clusters. All the possible biologically significant patterns are extracted from the
clusters, which are formed on the basis of the constraints specified by the proposed algorithm.
The proposed approach is superior to the existing approaches in quality, efficiency, stability
and memory utilization. Figures 2 and 3 emphasize its strength in capturing sharply coherent
tendencies among gene expression data. In addition, the results for functionally enriched
clusters highlight the fact that these clusters carry significant biological meaning.
5. CONCLUSION AND FUTURE WORK
As the number of microarray experiments continues to increase drastically, and as these
techniques become more and more a part of personalized healthcare, computational methods must
expand to support this growth. Most of the clustering algorithms used in practice have certain
inherent difficulties. The novel approach presented here clusters gene expression data sets and
produces good results. This clustering process shows great promise for gleaning information from
gene expression profiles. To evaluate the performance of the method, cancer gene expression data
sets have been used, and it has been compared with the E-CST, E-CAST and K-Means clustering
algorithms. It is clear that the healthcare industry requires methods to translate microarray
data rapidly into practical use. Future work includes application to more real data sets and a
theoretical analysis of how the threshold parameter is determined. A key roadblock remains the
discovery of exact patterns while still retaining high accuracy in clustering. Combinations of
computational intelligence approaches hold promise for rapid, automated pattern recognition for a
wide assortment of data. A blend of parallel approaches such as the genetic algorithm and
characterization guided clustering may improve performance and will play an increasingly
important role in gene expression analysis.
Figure 2: Clustering behaviour of Data Set 1
Figure 3: Clustering behaviour of Data Set 2

References
[1] P.T. Spellman, G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B. Eisen, P.O. Brown, D. Botstein, and B. Futcher,
“Comprehensive Identification of Cell Cycle-Regulated Genes of the Yeast Saccharomyces Cerevisiae by Microarray
Hybridization”, Molecular Biology of the Cell, 9(12), pp. 3273-3297, 1998.
[2] T. Zhang, R. Ramakrishnan, and M. Livny, “Birch: An Efficient Data Clustering Method for Very Large Databases”,
Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 103-114, 1996.
[3] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing,
M. Caligiuri, et al., “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression
Monitoring”, Science, 286, pp. 531-537, 1999.
[4] M.S. Chen, J. Han, and P.S. Yu, “Data Mining: An Overview from a Database Perspective”, IEEE Trans. Knowledge and
Data Eng., 8(6), pp. 866-883, Dec. 1996.
[5] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, and A.J. Levine, “Broad Patterns of Gene Expression
Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays”, Proc. Nat’l
Academy of Sciences, 96, pp. 6745-6750, 1999.
[6] A. Ben-Dor and Z. Yakhini, “Clustering Gene Expression Patterns”, J. Computational Biology, 6, pp. 281-297, 1998.
[7] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, “Clustering Analysis and Display of Genome Wide Expression
Patterns”, Proc. Nat’l Academy of Sciences, 95, pp. 14863-14868, 1998.
[8] M.K. Kerr and G.A. Churchill, “Bootstrapping Cluster Analysis: Assessing the Reliability of Conclusions from
Microarray Experiments”, Proc. Nat’l Academy of Sciences, 98(16), pp. 8961-8965, 2001.
[9] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, and T.R. Golub, “Interpreting
Patterns of Gene Expression with Self-Organizing Maps: Methods and Application to Hematopoietic Differentiation”,
Proc. Nat’l Academy of Sciences, 96(6), pp. 2907-2912, 1999.
[10] N.P. Gopalan and B. Sathiyabhama, “Scalable Biclustering Gene Expression Data using Genetic Enhanced K-Means
Algorithm”, Proc. National Conference on High Performance Computing VISION’06, pp. 494-498.
[11] K.Y. Yeung, D.R. Haynor, and W.L. Ruzzo, “Validating Clustering for Gene Expression Data”, Bioinformatics, 17(4),
pp. 309-318, 2001.
[12] A.K. Jain and R.C. Dubes, “Algorithms for Clustering Data”, Englewood Cliffs N.J.: Prentice Hall, 1988.
[13] Vincent S. Tseng and Ching-Pin Kao, “Efficiently Mining Gene Expression Data via a Novel Parameterless Clustering
Method”, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(4), pp. 355-365, Dec. 2005.
[14] S. Guha, R. Rastogi, and K. Shim, “CURE: An Efficient Clustering Algorithm for Large Databases”, ACM Int’l Conf.
Management of Data, pp. 73-84, 1998.
[15] S. Guha, R. Rastogi, and K. Shim, “ROCK: A Robust Clustering Algorithm for Categorical Attributes”, 15th Int’l
Conf. Data Eng., pp. 512-521, 1999.
[16] T. Kohonen, “The Self-Organizing Map”, Proc. IEEE, 78(9), pp. 1464-1479, 1990.
[17] B. Sathiyabhama and N.P. Gopalan, “Correlation Search Technique for Clustering Cancer Gene Expression Data”, WSEAS
International Conferences Lisbon, Sep’ 2006.
[18] B. Sathiyabhama and N.P. Gopalan, “Enhanced Correlation Search Technique for Clustering Cancer Gene Expression
Data”, WSEAS Transactions on Information Science and Applications 12, 3, 2006, pp. 2477-2484.
[19] G.B. Fogel and D.W. Corne, “Evolutionary Computation in Bioinformatics”, Morgan Kaufmann, San Francisco, 2002.
[20] E.P. Xing and R.M. Karp, “Cliff: Clustering of High-Dimensional Microarray Data via Iterative Feature Filtering
Using Normalized Cuts”, Bioinformatics, 17(4), 309-318, 2001.
[21] K.Y. Yeung and W.L. Ruzzo, “An Empirical Study on Principal Component Analysis for Clustering Gene Expression
Data”, Technical Report UW-CSE-2000-11-03, Department of Computer Science and Engineering, University of Washington,
2000.
[22] N.S. Holter, M. Mitra, A. Maritan, M. Cieplak, J.R. Banavar, and N.V. Fedoroff, “Fundamental Patterns Underlying
Gene Expression Profiles: Simplicity from Complexity”, Proceedings of the National Academy of Science USA, 97,
8409-8414, 2000.
[23] Abdelghani Bellaachia et al., “E-CAST: A Data Mining Algorithm for Gene Expression Data”, Proc. BIOKDD02: Workshop
on Data Mining in Bioinformatics (with SIGKDD02 conference), pp. 49-54.
[24] H. Kluger, B. Kacinski, Y. Kluger, Mironenko, M. Gilmore Hebert, J. Chang, A.S. Perkins, and E. Sapi, “Microarray
Analysis of Invasive and Metastatic in a Breast Cancer Model”, Poster presented at the Gordon Conference on Cancer,
Newport, RI, 2001.