DNA MICROARRAY DATA CLUSTERING USING GROWING SELF ORGANIZING NETWORKS

Kim Jackson and Irena Koprinska
School of Information Technologies, University of Sydney, Sydney, Australia
e-mail: {kj, irena}@it.usyd.edu.au

ABSTRACT

Recent advances in DNA microarray technology have allowed biologists to simultaneously monitor the activities of thousands of genes. To obtain meaning from these large amounts of complex data, data mining techniques such as clustering are being applied. This study investigates the application of some recently developed incremental, competitive and self-organizing neural networks (Growing Cell Structures and Growing Neural Gas) for clustering DNA microarray data, comparing them with traditional algorithms.

1. INTRODUCTION

The recent advent of microarray technologies has enabled biologists for the first time to simultaneously monitor the activities of thousands of genes, producing large quantities of complex data. Analysis of such data is becoming a key aspect in the successful utilization of microarray technology.

Microarrays are small glass surfaces or chips onto which microscopic quantities of DNA are attached in a grid layout. Each tiny spot of DNA corresponds to a single gene. One of the most popular microarray applications is to compare gene expression levels in two different samples (e.g. healthy and diseased cells). RNA from the cells in the two conditions is extracted and labeled with different fluorescent dyes (e.g. green for healthy and red for diseased cells). Both RNA samples are washed over the microarray, where gene sequences preferentially bind to their complementary sequences. The dyes allow the amount bound at each spot to be measured, in order to estimate the expression of the corresponding genes. The microarray images are analysed and the intensities measured. Finally, a gene expression matrix is obtained in which rows correspond to genes, columns represent samples (i.e. different experimental conditions: stages, treatments, or tissues), and the entries are the expression levels of the genes in the respective samples.

In order to extract meaningful information from this data, data mining techniques are being employed. One goal in analysing microarray data is to find genes which behave similarly over the course of an experiment by comparing rows in the expression matrix. Such genes may be co-regulated or related in their function, and they can be found by clustering methods. Eisen et al. [1] have developed a hierarchical clustering and visualization package which is frequently used by biologists. Other techniques such as k-means and self-organizing maps are emerging [7].

The goal of this paper is to study the potential of competitive learning methods, particularly growing self-organizing neural networks (Growing Cell Structures and Growing Neural Gas), for clustering DNA microarray data, and to compare their performance with traditional clustering algorithms.

2. CLUSTERING METHODS

2.1. Traditional Clustering Methods

2.1.1. K-means

One of the most commonly used clustering methods is the k-means algorithm. It starts with k (typically randomly chosen) cluster centers. At each step, each pattern is assigned to its nearest cluster center, and then the centers are recomputed. This is repeated either for a given number of iterations or until no patterns are reassigned to different clusters. The k-means algorithm is simple and reasonably fast, having a time complexity of O(n), where n is the number of patterns. It can only produce circular, spherical or hyperspherical clusters, however, and its performance is very sensitive to the initial seeding. The other major disadvantage is that the desired number of clusters must be specified in advance. In most cases the optimum value for k is unknown, especially when the data is of high dimensionality, as is the case for gene expression data.
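As a concrete illustration of the loop just described, the following is a minimal k-means sketch in Python/NumPy; the function name, the random seeding strategy and the stopping test are our own illustrative choices, not taken from the paper.

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        """Minimal k-means: X is an (n, d) array of patterns, k the desired number of clusters."""
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()  # random initial seeding
        labels = None
        for _ in range(max_iter):
            # assign every pattern to its nearest cluster center (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break  # no pattern changed cluster, so the algorithm has converged
            labels = new_labels
            # recompute each center as the mean of the patterns assigned to it
            for j in range(k):
                members = X[labels == j]
                if len(members) > 0:
                    centers[j] = members.mean(axis=0)
        return labels, centers

Each iteration makes a single pass over the n patterns, which is where the linear per-iteration cost comes from; the result still depends heavily on the initial seeding and on the chosen k, as noted above.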
2.1.2. Hierarchical clustering

Hierarchical clustering methods produce a hierarchy of clusters from the input data, typically displayed as a dendrogram. An agglomerative algorithm begins by treating each pattern as a cluster, and repetitively merges the two most similar clusters until all patterns are in one cluster. The similarity between clusters can be computed using a number of different methods, the simplest of which are single linkage (nearest neighbour) and complete linkage (furthest neighbour): the distance between two clusters is the minimum or maximum, respectively, of all pairwise distances between patterns in the two clusters. Complete linkage clustering tends to produce more compact clusters than single linkage [4]. Hierarchical clustering methods are easy to implement, and have an advantage over "one-shot" methods like k-means in that once a hierarchical clustering has been performed, different granularities can be chosen to yield different numbers of clusters. However, hierarchical methods are very slow: they have O(n²) time and space requirements. For large, high-dimensional datasets it becomes prohibitive to maintain the similarity matrix.

2.2. Self Organizing Competitive Neural Networks

To use a competitive neural network for clustering, each neuron is generally regarded as being responsible for one cluster, so the number of clusters equals the number of neurons. Each cluster contains all the input patterns which are closer to one particular neuron than to all others. Competitive learning has two conflicting goals: error minimization (minimizing the distance between any given input pattern and its corresponding neuron) and entropy maximization (for any given input, each neuron has an approximately equal probability of being the winner, i.e. each neuron is responsible for an equal number of input patterns rather than an equal area of the input space). For clustering, it is preferable to aim for error minimization, since the clustering of points should be based on their spatial proximity, regardless of how many points are in each cluster. All three of the networks described below (SOM, GCS and GNG) are able to learn both the topology and the distribution of the data.

2.2.1. Self Organizing Maps

In the Self Organizing Map (SOM) [5] the neurons are connected in a two-dimensional rectangular grid, the dimensions of which are chosen at the outset. This topology does not change during the learning of the network. Both the winner and its neighbourhood are adapted. The adaptation function ensures that the winner is adapted the most, and that neurons further away from the winner are adapted less. The neighbourhood shrinks after each iteration until it includes only one neuron, and the learning rate decreases for stability.
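To make the SOM adaptation rule concrete, here is a small, hypothetical Python/NumPy sketch of one training pass with a Gaussian neighbourhood around the winner. The particular decay factors for the learning rate and neighbourhood radius, and the random placeholder data, are illustrative assumptions rather than the settings used in [5] or in the experiments below.

    import numpy as np

    def som_epoch(X, weights, grid_pos, lr, radius):
        """One SOM pass: weights is (n_neurons, d); grid_pos holds each neuron's 2-D grid coordinates."""
        for x in X:
            # competition: the neuron with the closest weight vector wins
            winner = np.argmin(np.linalg.norm(weights - x, axis=1))
            # cooperation: Gaussian neighbourhood on the grid, centred on the winner
            grid_dist = np.linalg.norm(grid_pos - grid_pos[winner], axis=1)
            h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            # adaptation: the winner moves the most, distant neurons hardly at all
            weights += lr * h[:, None] * (x - weights)
        return weights

    # illustrative usage with a 6 x 10 grid and 10-dimensional patterns
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(500, 10))  # placeholder for 500 expression patterns
    rows, cols = 6, 10
    grid_pos = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    weights = rng.normal(size=(rows * cols, 10))
    lr, radius = 0.9, 3.0
    for epoch in range(4):
        weights = som_epoch(X_train, weights, grid_pos, lr, radius)
        lr, radius = lr * 0.4, max(radius * 0.5, 0.5)  # cooling schedule: decay learning rate, shrink neighbourhood

The final two lines are the "cooling schedule" referred to below, which GCS and GNG dispense with.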
2.2.2. Growing Cell Structures

The Growing Cell Structures (GCS) network is an extension of the two-dimensional SOM to arbitrary dimensionality [2]. The cell structure topology consists of the simplest structures for the given dimensionality m: for m = 1 these are line segments; for m = 2, triangles; for m = 3, tetrahedrons; and for m > 3, hypertetrahedrons. The network is initialized with just one of these simplex structures (i.e. m + 1 neurons), and neurons are then added to the structure during the self-organization process, in contrast to SOM, which maintains a fixed structure. Neurons are inserted in regions of high error, which is approximated using accumulated local error information.

Another important difference between the two models is that GCS has no "cooling schedule" as in SOM, where the neighbourhood size and learning rate decrease with time. In GCS, only the winner and its direct topological neighbours are adapted. There are two learning rates, one for the winning neuron and another (typically much smaller) for the neighbours, and these learning rates remain constant over the course of the self-organization process. The GCS simulation on four-square shaped data (1000 points) with uniform probability distribution is shown in Figure 1.

Figure 1. The Growing Cell Structures simulation on four-square shaped data

2.2.3. Growing Neural Gas

The Growing Neural Gas (GNG) network is described in [3]. It is based on the neural gas model combined with competitive Hebbian learning [6]. GNG grows the network by inserting neurons in regions of high error, and uses competitive Hebbian learning to build up a topology between the neurons. Competitive Hebbian learning inserts an edge connecting the two closest neurons (measured by Euclidean distance) for each input pattern. Neurons are inserted in a similar fashion to GCS, using accumulated local error information to estimate the region of highest error. Figure 2 illustrates the algorithm.

Figure 2. The Growing Neural Gas simulation on four-square shaped data

The main advantage of GCS and GNG over SOM is that there is no need to explicitly specify the number of neurons in the network or the connections between them; the network grows to an optimal size. Also, there is no decay schedule, so there are no parameters for which initial and final values must be specified. Whereas SOM and GCS have fixed dimensionality, the topology of GNG has no dimensionality restriction.

3. EXPERIMENTAL EVALUATION

In this section the performance of the various clustering algorithms is evaluated on benchmark data and on real DNA microarray data.

3.1. Performance Measures

A classic quantitative measure of the quality of a clustering is the squared-error measure. This is defined as the sum of the squared distances from each point to the centroid of the cluster to which it belongs. The squared error measures the spread of points within clusters (i.e. intra-cluster variation). Another important factor is the separation of clusters (i.e. inter-cluster variation). A simple way to measure cluster separation is the mean distance between centroids. In general we would like to minimize the intra-cluster variation and maximize the inter-cluster variation.
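Both measures are easy to compute from a set of patterns and their cluster labels; the sketch below (in Python/NumPy, with a function name of our own choosing) evaluates the squared error and the mean distance between centroids for a given clustering.

    import numpy as np

    def clustering_quality(X, labels):
        """Return (squared_error, centroid_separation) for patterns X and cluster labels."""
        X = np.asarray(X, dtype=float)
        ids = np.unique(labels)
        centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
        # intra-cluster variation: sum of squared distances from each point to its own centroid
        squared_error = sum(((X[labels == c] - centroids[i]) ** 2).sum()
                            for i, c in enumerate(ids))
        # inter-cluster variation: mean pairwise distance between cluster centroids
        pair_dists = [np.linalg.norm(a - b)
                      for i, a in enumerate(centroids) for b in centroids[i + 1:]]
        centroid_separation = float(np.mean(pair_dists))
        return squared_error, centroid_separation

By these criteria a good clustering has a low squared error and a high centroid separation, as stated above.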
3.2. Performance on Benchmark Data

The first experiment (see Table 1) was conducted using the well-known Iris dataset of 150 flowers classified into three different species of iris. All the algorithms were run 100 times, and the results averaged.

Table 1. Performance on iris data

                        k-means  single-link  complete-link  SOM    GCS    GNG
  misclassifications    23       48           24             14     39     26
  squared error         92.3     142.6        89.6           83.9   109.6  97.4
  centroid separation   3.28     4.19         3.03           3.13   3.89   3.40

SOM performed best in terms of classification and mean squared error. Single-link obtained the best centroid separation score, yet performed worst in terms of actual classification, suggesting that, for this particular dataset, the squared error is a more accurate performance measure than the centroid separation. The performance of the growing self-organizing networks is encouraging: GNG in particular scored only slightly lower than k-means and complete-link, and GCS obtained the second best centroid separation.

3.3. Performance on Microarray Data

The dataset used for the second experiment was a subset of measurements from various experiments on the yeast Saccharomyces cerevisiae gene expression matrices studied at Stanford University [1]. Ten measurements of each of 500 genes were used, i.e. 500 ten-dimensional expression patterns. Firstly, GCS and GNG were each run 50 times, and the results averaged. The following parameters were used: GCS: m=4, εb=0.1, εn=0.005, λ=111, α=0.5, β=0.05, τins=0.1, η=0.09; GNG: εb=0.1, εn=0.005, λ=111, amax=17, α=0.5, β=0.05, τins=0.1, η=0.09. Based on the number of clusters found by the growing algorithms, the parameters for the other methods were chosen: the SOM was a 6 x 10 grid, trained for 4 epochs with a learning rate decreasing from 0.9 to 0.02, and for the k-means and hierarchical methods k was 60. The results are summarized in Table 2.

Table 2. Performance on yeast Saccharomyces cerevisiae data

                        k-means  single-link  complete-link  SOM    GCS    GNG
  squared error         175.3    372.6        161.7          296.4  234.3  211.4
  centroid separation   2.04     3.33         2.93           1.26   1.59   1.49
  epochs                -        -            -              4      40     15
  running time          1.8      8.9          8.7            28.9   23.3   35.1

In terms of squared error, the complete-link hierarchical and k-means algorithms performed the best in the second experiment. The single-link algorithm obtained the best centroid separation, but had a very poor squared-error value. It is clear from these results that the single-linkage and complete-linkage hierarchical algorithms have opposite goals. According to our performance measures, the competitive learning methods scored slightly worse than the other methods, but not by too great a margin. It should be noted, however, that the growing algorithms performed significantly better than SOM.

The running times shown seem to indicate that the competitive learning methods are much slower than the others. In general this is not strictly true. While k-means is certainly the fastest (being the simplest) of all the methods presented, hierarchical methods are considered the most prohibitive in terms of time and space requirements. Because they require quadratic time and space, hierarchical methods can become very slow for large datasets; the dataset used for this experiment was simply not large enough for this phenomenon to manifest itself.
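For readers who wish to re-run the non-neural baselines of this experiment with off-the-shelf tools, a hypothetical sketch using scikit-learn and SciPy is given below. The random matrix stands in for the yeast expression data of [1], and k = 60 follows the setting above; this is not the implementation used in the paper.

    import numpy as np
    from sklearn.cluster import KMeans
    from scipy.cluster.hierarchy import linkage, fcluster

    # Placeholder data: 500 genes x 10 measurements. Substitute the yeast
    # Saccharomyces cerevisiae expression matrix of [1] here.
    X = np.random.default_rng(0).normal(size=(500, 10))
    k = 60  # number of clusters, matched to the size the growing networks converged to

    kmeans_labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    single_labels = fcluster(linkage(X, method="single"), t=k, criterion="maxclust")
    complete_labels = fcluster(linkage(X, method="complete"), t=k, criterion="maxclust")

    # Intra-cluster variation (squared error) for each result, as defined in Section 3.1.
    for name, labels in [("k-means", kmeans_labels),
                         ("single-link", single_labels),
                         ("complete-link", complete_labels)]:
        se = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                 for c in np.unique(labels))
        print(f"{name:>13}: squared error {se:.1f} over {len(np.unique(labels))} clusters")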
4. CONCLUSIONS

With the techniques commonly in use for clustering DNA microarray data, the choice of the number of clusters is a problem with no easy solution other than trial and error. The GCS and GNG networks provide clustering methods which do not require the number of clusters to be selected in advance, but converge to a suitable number through a process of addition and removal of neurons. They also have other desirable qualities which the non-neural methods lack: GCS and GNG develop a topology-preserving mapping from gene expressions to neurons, and the topology built up between neurons is such that clusters represented by neighbouring neurons contain similar genes. The experimental results show that GCS and GNG outperform SOM but score slightly worse than k-means and hierarchical clustering. Despite this, GCS and GNG have properties which give them potential for fruitful application to clustering DNA microarray data.

5. ACKNOWLEDGMENTS

We are very grateful to Dr. Fred Hamker for providing the GCS and GNG implementations.

6. REFERENCES

[1] M. Eisen, P. Spellman, P. Brown, and D. Botstein, "Cluster Analysis and Display of Genome-Wide Expression Patterns", Proc. Natl. Acad. Sci. USA, 95:14863-14868, 1998.
[2] B. Fritzke, "Growing Cell Structures - a Self-Organizing Network for Unsupervised and Supervised Learning", Neural Networks, 7(9), pp. 1441-1460, 1994.
[3] B. Fritzke, "A Growing Neural Gas Network Learns Topologies", Advances in Neural Information Processing Systems, vol. 7, pp. 625-632, 1995.
[4] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, 31(3), pp. 264-323, 1999.
[5] T. Kohonen, Self-Organizing Maps, Springer-Verlag, 2001.
[6] T. Martinetz and K. Schulten, "Topology Representing Networks", Neural Networks, 7, pp. 507-522, 1994.
[7] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. Lander, and T. Golub, "Interpreting Patterns of Gene Expression With Self-Organizing Maps", Proc. Natl. Acad. Sci. USA, 96:2907-2912, 1999.