DNA MICROARRAY DATA CLUSTERING USING GROWING SELF
ORGANIZING NETWORKS
Kim Jackson and Irena Koprinska
School of Information Technologies, University of Sydney, Sydney, Australia
e-mail: {kj, irena}@it.usyd.edu.au
ABSTRACT
Recent advances in DNA microarray technology have
allowed biologists to simultaneously monitor the activities
of thousands of genes. To obtain meaning from these large
amounts of complex data, data mining techniques such as
clustering are being applied. This study investigates the
application of some recently developed incremental,
competitive and self-organizing neural networks (Growing
Cell Structures and Growing Neural Gas) for clustering
DNA microarray data, comparing them with traditional
algorithms.
1. INTRODUCTION
The recent advent of microarray technologies has enabled
biologists for the first time to simultaneously monitor the
activities of thousands of genes, producing large quantities
of complex data. Analysis of such data is becoming a key
aspect in the successful utilization of the microarray
technology.
Microarrays are small glass surfaces or chips, onto
which microscopic quantities of DNA are attached in a
grid layout. Each of the tiny spots of DNA relates to a
single gene. One of the most popular microarray
applications is to compare gene expression levels in two
different samples (e.g. healthy and diseased cells). RNA
from the cells in the two different conditions is extracted
and labeled with different fluorescent dyes (e.g. green for
healthy and red for diseased cells). Both RNA samples are washed
over the microarray. Gene sequences preferentially bind to
their complementary sequences. The dyes allow
measurement of the amount bound at each spot, in order to
estimate the presence of genes. The microarray images are
analysed and the intensities measured. Finally, a gene
expression matrix is obtained where rows correspond to
genes and columns represent samples (i.e. different
experimental conditions - stages, treatments, or tissues),
and the numbers are the expression level of the genes in
the respective samples.
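For illustration, such an expression matrix can be represented as a small table of genes by samples; the gene and sample names below are invented for the example only, not taken from any real experiment.

import pandas as pd

# Rows are genes, columns are samples (experimental conditions); each entry
# is the measured expression level of that gene in that sample.
# All names and values here are illustrative placeholders.
expression_matrix = pd.DataFrame(
    [[2.1, -0.4, 1.3],
     [0.2,  1.8, -0.7],
     [1.9, -0.1, 1.1]],
    index=["gene_A", "gene_B", "gene_C"],
    columns=["control", "treatment_1", "treatment_2"],
)

# Clustering genes amounts to comparing the rows of this matrix, e.g. by the
# Euclidean distance or correlation between two expression profiles.
print(expression_matrix)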
In order to extract meaningful information from this
data, data mining techniques are being employed. One
goal in analysing microarray data is to find genes which
behave similarly over the course of an experiment by
comparing rows in the expression matrix. These genes
may be co-regulated or related in their function. Similar
genes can be found by clustering methods. Eisen et al. [1]
have developed a hierarchical clustering and visualization
package, which is frequently used by biologists. Other
techniques such as k-means and self-organizing maps are
emerging [7]. The goal of this paper is to study the
potential of competitive learning methods, particularly
growing, self-organizing neural networks (Growing Cell
Structures and Growing Neural Gas), for clustering of
DNA microarray data and to compare their performance
with the traditional clustering algorithms.
2. CLUSTERING METHODS
2.1. Traditional Clustering Methods
2.1.1. K-means
One of the most commonly used clustering methods is the
k-means algorithm. It starts with k (typically randomly
chosen) cluster centers. At each step, each pattern is
assigned to its nearest cluster center, and then the centers
are recomputed. This is repeated either for a given number
of iterations or until no patterns are reassigned to different
clusters. The k-means algorithm is simple and reasonably
fast, having a time complexity of O(n), where n is the
number of patterns. It can only produce circular, spherical
or hyperspherical clusters, however, and its performance is
very sensitive to the initial seeding. The other major
disadvantage is that the desired number of clusters must be
specified in advance. In most cases, the optimum value for
k is unknown, especially when the data is of high
dimensionality, as in the case of gene expression data.
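As an illustration of this procedure (not the implementation used in the paper), a minimal k-means sketch in Python might look as follows: each pattern is assigned to its nearest centre, the centres are recomputed, and the process repeats until no pattern changes cluster or an iteration limit is reached.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen patterns as the initial cluster centres.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Assign every pattern to its nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no pattern was reassigned: converged
        labels = new_labels
        # Recompute each centre as the mean of its assigned patterns.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels, centers

# Example on random 2-D data (illustrative only).
labels, centers = kmeans(np.random.rand(200, 2), k=3)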
2.1.2. Hierarchical clustering
Hierarchical clustering methods produce a hierarchy of
clusters from the input data typically displayed as a
dendrogram. An agglomerative algorithm begins by
treating each pattern as a cluster, and repetitively merging
the two most similar clusters until all patterns are in one
cluster. The similarity between clusters can be computed
using a number of different methods, the simplest of which
are single linkage (nearest neighbour) and complete
linkage (furthest neighbour): the distance between two
clusters is the minimum or maximum, respectively, of all
pairwise distances between patterns in the two clusters.
Complete linkage clustering tends to produce more
compact clusters than single linkage [4]. Hierarchical
clustering methods are easy to implement, and have an
advantage over 'one-shot' methods like k-means in that
once a hierarchical clustering has been performed,
different granularities can be chosen to yield different
numbers of clusters. However, hierarchical methods are
slow: they have O(n²) time and space requirements, so for
large, high-dimensional datasets it becomes prohibitive to
maintain the similarity matrix.
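For illustration, the sketch below builds an agglomerative hierarchy with single and complete linkage using SciPy and then cuts it at two different granularities; the data and the cut levels are placeholders, not values from the paper.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(100, 10)                 # 100 patterns, 10 features (placeholder data)

# Build the full merge hierarchy once per linkage criterion...
Z_single = linkage(X, method="single")      # nearest-neighbour linkage
Z_complete = linkage(X, method="complete")  # furthest-neighbour linkage

# ...then cut it at different granularities without re-clustering.
labels_3 = fcluster(Z_complete, t=3, criterion="maxclust")
labels_10 = fcluster(Z_complete, t=10, criterion="maxclust")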
2.2. Self Organizing Competitive Neural Networks
To use a competitive neural network for clustering,
generally each neuron is regarded as being responsible for
one cluster, so the number of clusters will be equal to the
number of neurons. Each cluster contains all the input
patterns which are closer to one particular neuron than to
all others. Competitive learning has two conflicting goals:
error minimization (minimizing distance between any
given input pattern and its corresponding neuron) and
entropy maximization (for any given input, each neuron
has an approximately equal probability of being the
winner, i.e. each neuron is responsible for an equal
number of input patterns, rather than an equal area of the
input space). For clustering, it is preferable to aim for
error minimization, since clustering of points should be
based on their spatial proximity, regardless of how many
points are in each cluster. All three of the networks described
below (SOM, GCS and GNG) are able to learn both the topology
and the distribution of the data.
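As a rough illustration of these two objectives (a sketch, not code from the paper), the snippet below computes the quantization error and the entropy of the winner distribution for a given set of neurons; error minimization drives the first quantity down, while entropy maximization drives the second up.

import numpy as np

def quantization_error(X, neurons):
    # Mean distance from each input pattern to its nearest neuron.
    d = np.linalg.norm(X[:, None, :] - neurons[None, :, :], axis=2)
    return d.min(axis=1).mean()

def winner_entropy(X, neurons):
    # Entropy of how often each neuron wins; it is maximal when every neuron
    # is responsible for an equal number of input patterns.
    d = np.linalg.norm(X[:, None, :] - neurons[None, :, :], axis=2)
    counts = np.bincount(d.argmin(axis=1), minlength=len(neurons))
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())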
2.2.1. Self Organizing Maps
In the Self Organizing Map (SOM) [5] the neurons are
connected in a two-dimensional rectangular grid, the
dimensions of which are chosen at the outset. This
topology does not change during the learning of the
network. Both the winner and its neighbourhood are
adapted. The adaptation function ensures that the winner
will be adapted the most, and neurons further away from
the winner will be adapted less. The neighbourhood
shrinks after each iteration until it includes only one
neuron, and the learning rate decreases for stability.
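The following is a minimal, illustrative SOM training loop reflecting this description (fixed two-dimensional grid, shrinking Gaussian neighbourhood, decaying learning rate); the grid size and schedules are placeholder choices, not the settings used in the experiments below.

import numpy as np

def train_som(X, rows=6, cols=10, n_iter=5000, lr0=0.9, lr1=0.02, seed=0):
    rng = np.random.default_rng(seed)
    # One weight vector per neuron on the rows x cols grid.
    weights = rng.random((rows, cols, X.shape[1]))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    sigma0 = max(rows, cols) / 2.0               # initial neighbourhood radius

    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (lr1 / lr0) ** frac           # learning rate decays for stability
        sigma = max(sigma0 * (1.0 - frac), 1.0)  # neighbourhood shrinks over time

        x = X[rng.integers(len(X))]
        # Winner: the neuron whose weight vector is closest to the input.
        dists = np.linalg.norm(weights - x, axis=2)
        wi, wj = np.unravel_index(dists.argmin(), dists.shape)

        # Adapt the winner most and nearby grid neurons less.
        grid_dist = np.linalg.norm(grid - np.array([wi, wj]), axis=2)
        h = np.exp(-(grid_dist ** 2) / (2.0 * sigma ** 2))
        weights += lr * h[:, :, None] * (x - weights)
    return weights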
2.2.2. Growing Cell Structures
The Growing Cell Structures (GCS) network is an extension of the
two-dimensional SOM to arbitrary dimensionality [2]. The cell structure
topology consists of the simplest structures for the given dimensionality m.
For m = 1 these are line segments; for m = 2, triangles; for m = 3,
tetrahedrons; and for m > 3 the structures are hypertetrahedrons. The
network is initialized with just one of these simplex structures
(i.e. m + 1 neurons), then neurons are added to the structure during the
self-organization process, in contrast to SOM, which maintains a fixed
structure. Neurons are inserted in regions of high error, which is
approximated using accumulated local error information.
Another important difference between the two models is that GCS has no
"cooling schedule" as in SOM, where neighbourhood size and learning rate
decrease with time. In GCS, only the winner and its direct topological
neighbours are adapted. There are two learning rates, one for the winning
neuron and another (typically much smaller) for the neighbours, and these
learning rates remain constant over the course of the self-organization
process. The GCS simulation on four-square-shaped data (1000 points) with
uniform probability distribution is given in Figure 1.

Figure 1. The Growing Cell Structures simulation on four-square-shaped data
2.2.3. Growing Neural Gas
The Growing Neural Gas (GNG) network is described in [3]. It is based on
the neural gas model combined with competitive Hebbian learning [6]. GNG
grows the network by inserting neurons in regions of high error, and uses
competitive Hebbian learning to build up a topology between the neurons.
Competitive Hebbian learning inserts an edge connecting the two closest
neurons (measured by Euclidean distance) for each input pattern. Neurons
are inserted in a similar fashion to GCS, using accumulated local error
information to estimate the region of highest error. Figure 2 illustrates
the algorithm.

Figure 2. The Growing Neural Gas simulation on four-square-shaped data
The main advantage of GCS and GNG over SOM is
that there is no need to explicitly specify the number of
neurons in the network or the connections between them;
the network will grow to an optimal size. Also, there is no
decay schedule, so there are no parameters for which we
need to specify initial and final values. Whereas SOM and
GCS have fixed dimensionality, the topology of GNG has
no dimensionality restriction.
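As an illustration of the growing approach, the sketch below is a compact Growing Neural Gas implementation following the published algorithm [3]; it omits some housekeeping (for example, neurons left without edges are not removed) and its parameter values are placeholders rather than the settings used in the experiments reported here.

import numpy as np

def gng(X, max_nodes=60, n_iter=20000, eps_b=0.1, eps_n=0.005,
        lam=100, a_max=50, alpha=0.5, beta=0.005, seed=0):
    rng = np.random.default_rng(seed)
    # Start with two nodes placed at two randomly chosen input patterns.
    W = [X[rng.integers(len(X))].astype(float) for _ in range(2)]
    error = [0.0, 0.0]
    edges = {}                       # (i, j) with i < j  ->  edge age

    def neighbours(i):
        return [b if a == i else a for (a, b) in edges if i in (a, b)]

    for t in range(1, n_iter + 1):
        x = X[rng.integers(len(X))]
        # Find the nearest (s1) and second-nearest (s2) nodes.
        d = [np.linalg.norm(x - w) for w in W]
        s1, s2 = np.argsort(d)[:2]

        # Competitive Hebbian learning: connect s1 and s2 with a zero-age edge,
        # and age every other edge emanating from the winner.
        winner_edge = (min(s1, s2), max(s1, s2))
        edges[winner_edge] = 0
        for e in list(edges):
            if s1 in e and e != winner_edge:
                edges[e] += 1

        # Accumulate local error at the winner and adapt it and its neighbours.
        error[s1] += d[s1] ** 2
        W[s1] = W[s1] + eps_b * (x - W[s1])
        for n in neighbours(s1):
            W[n] = W[n] + eps_n * (x - W[n])

        # Remove edges that have grown too old.
        edges = {e: age for e, age in edges.items() if age <= a_max}

        # Every lam steps, insert a node between the node with the highest
        # accumulated error and its highest-error neighbour.
        if t % lam == 0 and len(W) < max_nodes:
            q = int(np.argmax(error))
            nq = neighbours(q)
            if nq:
                f = max(nq, key=lambda n: error[n])
                W.append(0.5 * (W[q] + W[f]))
                error[q] *= alpha
                error[f] *= alpha
                error.append(error[q])
                r = len(W) - 1
                edges.pop((min(q, f), max(q, f)), None)
                edges[(min(q, r), max(q, r))] = 0
                edges[(min(f, r), max(f, r))] = 0

        # Globally decay all accumulated errors.
        error = [e * (1.0 - beta) for e in error]

    return np.array(W), edges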
3. EXPERIMENTAL EVALUATION
In this section, the performance of various clustering
algorithms is evaluated on benchmark data, and real DNA
microarray data.
3.1. Performance Measures
A classic quantitative measure of the quality of a
clustering is the squared-error measure. This is defined as
the sum of the square of the distance from each point to
the centroid of the cluster to which it belongs. The
squared-error measures the spread of points within clusters
(i.e. intra-cluster variation). Another important factor is
the separation of clusters (i.e. inter-cluster variation). A
simple way to measure the cluster separation is the mean
distance between centroids. In general we would like to
minimize the intra-cluster variation and maximize the
inter-cluster variation.
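For concreteness, the two measures can be computed as follows (an illustrative sketch, not the evaluation code used for the experiments).

import numpy as np
from itertools import combinations

def squared_error(X, labels):
    # Sum of squared distances from each point to the centroid of its cluster.
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        total += np.sum((members - members.mean(axis=0)) ** 2)
    return total

def centroid_separation(X, labels):
    # Mean pairwise Euclidean distance between cluster centroids.
    centroids = [X[labels == c].mean(axis=0) for c in np.unique(labels)]
    if len(centroids) < 2:
        return 0.0
    return float(np.mean([np.linalg.norm(a - b)
                          for a, b in combinations(centroids, 2)]))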
3.2. Performance on Benchmark Data
The first experiment (see Table 1) was conducted using
the well-known Iris dataset of 150 flowers classified into
three different species of iris. All the algorithms were run
100 times, and the results averaged.
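A sketch of this protocol, using scikit-learn's k-means as a stand-in for the implementations actually compared in the paper, might look as follows.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
squared_errors = []
for run in range(100):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=run).fit(X)
    squared_errors.append(km.inertia_)   # sum of squared distances to the centroids

print("mean squared error over 100 runs:", np.mean(squared_errors))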
Table 1. Performance on iris data

Method          Misclassifications   Squared error   Centroid separation
k-means                 23                92.3              3.28
single-link             48               142.6              4.19
complete-link           24                89.6              3.03
SOM                     14                83.9              3.13
GCS                     39               109.6              3.89
GNG                     26                97.4              3.40
SOM performed best in terms of classification and
mean squared error. Single-link has obtained the best
centroid separation score, yet performs worst in terms of
actual classification, suggesting that, for this particular
dataset, the squared-error is a more accurate performance
measure than the centroid separation. The performance of
the growing self-organizing networks is encouraging.
GNG in particular scored only slightly worse than k-means
and complete-link. GCS obtained the second best centroid
separation.
3.3. Performance on Microarray Data
The dataset used for the second experiment was a subset
of measurements from various experiments on the yeast
Saccharomyces cerevisiae gene expression matrices
studied at Stanford University [1]. 10 measurements of
each of 500 genes were used as the dataset, i.e. 500 10-dimensional expression patterns.
Firstly, GCS and GNG were each run 50 times, and
the results averaged. The following parameters were used:
GCS: m=4, εb=0.1, εn=0.005, λ=111, α=0.5, β=0.05,
τins=0.1, η=0.09; GNG: εb=0.1, εn=0.005, λ=111, amax=17,
α=0.5, β=0.05, τins=0.1, η=0.09. Based on the number of
clusters found by the growing algorithms, parameters for
the other methods were chosen. The SOM was a 6 x 10
grid, trained for 4 epochs with a learning rate decreasing
from 0.9 to 0.02. For the k-means and hierarchical
methods, k was 60.
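A rough sketch of this setup, with a random placeholder matrix in place of the yeast expression data and library implementations in place of those used in the paper, is given below.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

X = np.random.rand(500, 10)    # placeholder for the 500 x 10 expression matrix
k = 60                         # number of clusters suggested by the growing networks

kmeans_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
hier_labels = fcluster(linkage(X, method="complete"), t=k, criterion="maxclust")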
The results are summarized in Table 2. In terms of
squared-error, the complete-link hierarchical and k-means
algorithms performed the best in the second experiment.
The single-link algorithm obtained the best centroid
separation, but had a very poor squared-error value. It is
clear from these results that single-linkage and complete-linkage hierarchical algorithms have opposite goals.
According to our performance measures, the competitive
learning methods scored slightly worse than the other
methods, but not by too great a margin. It should be noted,
however, that the growing algorithms performed
significantly better than SOM.
Table 2. Performance on yeast Saccharomyces cerevisiae data

Method          Squared error   Centroid separation   Epochs   Running time
k-means              175.3            2.04               -          1.8
single-link          372.6            3.33               -          8.9
complete-link        161.7            2.93               -          8.7
SOM                  296.4            1.26               4         28.9
GCS                  234.3            1.59              40         23.3
GNG                  211.4            1.49              15         35.1
The running times shown seem to indicate that the competitive learning
methods are much slower than the others. In general this is not strictly
true. While k-means is certainly the fastest (being the simplest) of all
the methods presented, hierarchical methods are considered the most
prohibitive in terms of time and space requirements: because they require
quadratic time and space, they can become very slow for large datasets.
The dataset used for this experiment was not large enough for this
phenomenon to manifest itself.

4. CONCLUSIONS
With the techniques commonly in use for clustering DNA
microarray data, choice of the number of clusters is a
problem with no easy solution other than trial and error.
The GCS and GNG networks provide clustering methods
which do not require the selection of number of clusters,
but converge to a suitable number through a process of
addition and removal of neurons. They also have other
desirable qualities which the non-neural methods do not.
GCS and GNG develop a topology-preserving mapping
from gene expressions to neurons, and the topology built
up between neurons is such that clusters represented by
neighbouring neurons will contain similar genes.
The experimental results show that GCS and GNG
outperform SOM but score slightly worse than k-means and hierarchical clustering. Despite this, GCS and
GNG have properties which give them potential for
fruitful application to clustering DNA microarray data.
5. ACKNOWLEDGMENTS
We are very grateful to Dr. Fred Hamker for providing the
GCS and GNG implementations.
6. REFERENCES
[1] M. Eisen, P. Spellman, P. Brown, and D. Botstein, “Cluster
Analysis and Display of Genome-Wide Expression Patterns”,
Proc. Natl. Acad. Sci. USA, 95:14863-14868, 1998.
[2] B. Fritzke, “Growing Cell Structures - a Self-Organizing
Network for Unsupervised and Supervised Learning”, Neural
Networks, 7(9), pp.1441-1460, 1994.
[3] B. Fritzke. “A Growing Neural Gas Network Learns
Topologies”, Advances in Neural Information Processing
Systems, v. 7, pp. 625-632, 1995.
[4] A. K. Jain, M. N. Murty and P. J. Flynn, “Data Clustering: A
Review”, ACM Computing Surveys, vol. 31 no. 3, pp. 264-323,
1999.
[5] T. Kohonen, Self-Organizing Maps, Springer-Verlag, 2001.
[6] T. Martinetz and K. Schulten, “Topology Representing
Networks”, Neural Networks, 7, pp. 507-522, 1994.
[7] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E.
Dmitrovsky, E. Lander, and T. Golub, “Interpreting Patterns of
Gene Expression With Self-Organizing Maps”, Proc. Natl.
Acad. Sci. USA, 96:2907-2912, 1999.