Clustering of High-Dimensional Gene Expression Data with Feature Filtering Methods and Diffusion Maps

Rui Xu (1), Steven Damelin (2), Boaz Nadler (3), and Donald C. Wunsch II (1)

(1) Applied Computational Intelligence Laboratory, Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65409-0249, USA
(2) Department of Mathematical Sciences, Georgia Southern University, Statesboro, GA 30460-8093, USA
(3) Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 76100, Israel

[email protected], [email protected], [email protected], [email protected]

Abstract – The importance of gene expression data in cancer diagnosis and treatment has become widely recognized by cancer researchers in recent years. However, one of the major challenges in the computational analysis of such data is the curse of dimensionality, owing to the overwhelming number of measured variables (genes) versus the small number of samples. Here, we use a two-step method to reduce the dimension of gene expression data. First, we extract a subset of genes based on statistical characteristics of their corresponding gene expression levels. Then, for further dimensionality reduction, we apply diffusion maps, which interpret the eigenfunctions of Markov matrices as a system of coordinates on the original data set in order to obtain an efficient representation of the data's geometry. Finally, a neural network clustering algorithm, Fuzzy ART, is applied to the resulting data to generate clusters of cancer samples. Experimental results on the small round blue-cell tumor (SRBCT) data set, compared with other widely used clustering algorithms, demonstrate the effectiveness of our proposed method in addressing multidimensional gene expression data.

Keywords – Clustering, Diffusion maps, Feature filtering, Fuzzy ART, Gene expression data

List of Figures and Tables
Fig. 1. Expression levels of four pairs of the most correlated genes.
Fig. 2. Expression levels of four pairs of genes with the highest variance.
Fig. 3. Histogram of four genes that show double-hump density.
Fig. 4. Topological structure of Fuzzy ART.
Fig. 5. Diffusion map results when working with all 2,308 variables.
Fig. 6. Diffusion map results when working only with the genes selected in the array idx.
Fig. 7. The best clustering scores of the Rand index for the SRBCT data set.
Fig. 8. The vigilance parameter vs. the number of clusters and the Rand index score for the reduced SRBCT data set with the thirty selected genes.
Table 1. Performance results of diffusion maps and Fuzzy ART on the entire SRBCT data set.
Table 2. Performance results of diffusion maps and Fuzzy ART on the reduced SRBCT data set with 30 genes.

I. Introduction

Cancers of various types have been a leading cause of death in the world for many decades. According to the report released by the National Center for Health Statistics, cancer accounted for 22.9% (550,270) of deaths in the United States in 2004, second only to heart disease [1]. Given the tremendous complexity of the various types of cancer, it is believed that the single most important factor in surviving cancer is, and will remain, early diagnosis and treatment. In recent years, the importance of gene expression data from DNA microarrays [2-3] in cancer diagnosis, together with their advantage over traditional, morphological, appearance-based cancer classification methods, has become widely recognized by cancer researchers [4-6].
In this context, different cancer types or subtypes are discriminated through their corresponding gene expression profiles [4-11]. Computational methods, such as multi-layer perceptrons [12], naïve Bayes [13], support vector machines [13-14], semi-supervised Ellipsoid ARTMAP [15], k-Top Scoring Pairs [16], and fuzzy neural networks [17], to name a few, have also been applied to cancer diagnosis-oriented gene expression data analysis [18]. Here, we consider the situation in which the labels of the cancer samples are not available. This assumption is reasonable given the need to discover unknown and novel cancer types or subtypes. In this case, unsupervised learning or cluster analysis, which attempts to explore the underlying structure of the data [18], provides an effective way for cancer researchers to gain meaningful insights into the classification of cancer samples. Given $N$ tumor samples measured over $D$ genes, the corresponding microarray data matrix is usually written as $X = \{x_{ij}\}$, $1 \le i \le N$, $1 \le j \le D$, where $x_{ij}$ represents the expression level of gene $j$ in tissue sample $i$. The goal of our work is to generate a $K$-partition $C = \{C_k\}$, $1 \le k \le K$, such that $C_k \neq \emptyset$, $\bigcup_{k=1}^{K} C_k = X$, and $C_k \cap C_l = \emptyset$ for all $k, l = 1, \ldots, K$ with $k \neq l$.

One of the major challenges of microarray data analysis is the overwhelming number of measured variables (gene expression levels) compared with the small number of cancer samples, the latter due to factors such as the difficulty of sample collection and the cost of experiments. Learning in such settings typically proves difficult due to the curse of dimensionality [19], which has two different aspects. The first is an exponential increase, as a function of the number of variables, in computational complexity (for example, in finding an optimal group of variables of size k). The second aspect, directly relevant to our work here, is the almost non-informative nature of global distances between samples, which may lead to severe deterioration in the performance of clustering, classification, and regression algorithms. Specifically, in our setting, while different cancer classes may be quite easily distinguished on a (small) subset of genes, on the vast majority of genes the behavior of the different classes is roughly the same; thus, these features can be regarded as noise. From a statistical point of view, the existence of numerous irrelevant and redundant features, or non-informative genes, may even totally impair the effective discovery of the cancer clusters. Thus, feature selection or extraction is critically important for dimensionality reduction and further analysis. Here, feature selection refers to the process of choosing distinguishing features from a set of candidates, while feature extraction applies transformations to generate useful and novel features from the original ones. Note that in the literature these two terms are sometimes used interchangeably without further distinction. Many methods have been proposed to address this problem, such as principal component analysis [20], signal-to-noise ratios [4], t-statistic scores [21], TNoM scores [22], eigengene-based linear discriminant analysis [23], sparse logistic regression [24], and support vector machine-recursive feature elimination algorithms [25]. However, most of these methods work in a supervised way, and the lack of labeled training data makes the problem more difficult.
We remark that in recent years various unsupervised methods to detect bi-clusters have also been developed (see [35-36] and references therein). In our previous research [26-27], we used diffusion maps to address the high-dimensionality problem. Diffusion maps consider the eigenfunctions of Markov matrices as a system of coordinates on the original data set in order to obtain an efficient representation of the data's geometry [28-30]. Here, we show that the performance of diffusion maps can be further significantly improved by removing non-informative genes based on statistical characteristics of their corresponding gene expression measurements, such as high correlation coefficients with other genes, large variance, and a bimodal probability density. The reduced data are then clustered with a neural network clustering algorithm, Fuzzy ART (FA) [31], to generate a partition of the cancer samples. FA can learn arbitrary input patterns in a stable, fast, and self-organizing way, thus overcoming the learning instability that plagues many other competitive networks. We compare the performance of the proposed method with that of diffusion maps followed by FA, hierarchical clustering, or K-means on the small round blue-cell tumor (SRBCT) data set. The experimental results demonstrate the effectiveness of our proposed method in addressing multidimensional gene expression data and, ultimately, identifying the corresponding cancer types.

The remainder of this paper is organized as follows. Sections II and III discuss diffusion maps and the feature selection methods, respectively. Section IV presents an introduction to FA. The experimental results are presented and discussed in Section V, and Section VI concludes the paper.

II. Diffusion maps

Given a data set $X = \{x_i, i = 1, \ldots, N\}$ in a $d$-dimensional data space, a finite graph with $N$ nodes corresponding to the $N$ data points can be constructed on $X$ as follows. Every two nodes in the graph are connected by an edge weighted through a non-negative, symmetric, and positive definite kernel $w: X \times X \to (0, \infty)$. Typically, a Gaussian kernel of the form

$$w(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right), \qquad (1)$$

where $\sigma$ is the kernel width parameter, is used to construct the similarity matrix. The kernel reflects the degree of similarity between $x_i$ and $x_j$, and $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^d$. Let

$$d(x_i) = \sum_{x_j \in X} w(x_i, x_j) \qquad (2)$$

be the degree of $x_i$; the Markov or affinity matrix $P$ is then constructed by calculating each entry as

$$p(x_i, x_j) = \frac{w(x_i, x_j)}{d(x_i)}. \qquad (3)$$

From the definition of the weight function, $p(x_i, x_j)$ can be interpreted as the transition probability from $x_i$ to $x_j$ in one time step. This idea can be extended further by considering $p_t(x_i, x_j)$, the entries of the $t$-th power $P^t$ of $P$, as the probability of transition from $x_i$ to $x_j$ in $t$ time steps [28]. Therefore, the parameter $t$ defines the granularity of the analysis, with larger values of $t$ revealing larger-scale structures in the data. Considering different values of $t$ thus makes it possible to control the generation of more specific or broader clusters. Since the matrix $P$ is adjoint to a symmetric matrix, its spectrum consists of real eigenvalues, and the corresponding right and left eigenvectors form a basis of $\mathbb{R}^N$. Assuming the Markov matrix $P$ is irreducible (i.e., the graph is connected), its largest eigenvalue is 1 with multiplicity one. We denote the eigenvalues of $P$ by $1 = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N$ and the corresponding right eigenvectors by $\{\varphi_j, j = 1, \ldots, N\}$, so that

$$P^t \varphi_j = \lambda_j^t \varphi_j. \qquad (4)$$
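To make the construction concrete, the following is a minimal NumPy sketch of Eqs. (1)-(4), together with the diffusion map coordinates defined in Eq. (7) of the next passage. It is an illustration under our own naming (diffusion_map, sigma, L, t), not the authors' implementation; it exploits the fact that P is adjoint to a symmetric matrix in order to use a stable symmetric eigensolver.

```python
import numpy as np

def diffusion_map(X, sigma, L, t=1):
    """Sketch of Eqs. (1)-(4) and (7): Gaussian kernel, Markov
    normalization, eigendecomposition, diffusion coordinates."""
    # Pairwise squared Euclidean distances between samples
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Eq. (1): Gaussian kernel
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Eq. (2): node degrees
    d = W.sum(axis=1)
    # Eq. (3): row-stochastic Markov matrix (kept for reference)
    P = W / d[:, None]
    # P is adjoint to the symmetric matrix D^{-1/2} W D^{-1/2},
    # so its spectrum is real; symmetrize for a stable eigensolve.
    S = W / np.sqrt(np.outer(d, d))
    eigvals, V = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]            # descending eigenvalues
    eigvals, V = eigvals[order], V[:, order]
    # Recover the right eigenvectors of P from those of S
    phi = V / np.sqrt(d)[:, None]
    # Eq. (7): diffusion coordinates, skipping the trivial constant eigenvector
    return (eigvals[1:L + 1] ** t) * phi[:, 1:L + 1]
```

For the SRBCT data considered below, X would be the 83-by-2,308 (or 83-by-30) expression matrix, and the returned array provides the low-dimensional coordinates that are subsequently clustered.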
As described in [37], given the definition of a random walk on the graph of the data, one can define two key concepts. The first is a diffusion distance between the nodes of the graph that captures their dynamic proximity,

$$D_t(x_i, x_j) = \left\| p_t(x_i, \cdot) - p_t(x_j, \cdot) \right\|_{1/\varphi_0}, \qquad (5)$$

where $\varphi_0$ is the unique stationary distribution,

$$\varphi_0(x) = \frac{d(x)}{\sum_{x_i \in X} d(x_i)}, \qquad x \in X. \qquad (6)$$

The second is a diffusion map, which maps the nodes of the graph from the original data space into an $L$-dimensional Euclidean space $\mathbb{R}^L$. This is done via the eigenvectors of $P$, viewed as a new set of coordinates on the data set:

$$\Psi_t : x_i \mapsto \left( \lambda_1^t \varphi_1(x_i), \ldots, \lambda_L^t \varphi_L(x_i) \right)^T. \qquad (7)$$

The relationship between the two concepts is that the Euclidean distance between all $N$ eigenvector coordinates is equal to the diffusion distance,

$$D_t(x_i, x_j) = \left\| \Psi_t(x_i) - \Psi_t(x_j) \right\|, \qquad (8)$$

where $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^L$. The diffusion distance captures the dynamic proximity of nodes on the graph: the more paths that connect two nodes, the smaller their diffusion distance. Further, since the eigenvalues of $P$ decay to zero, the diffusion distance can be approximated with relatively few eigenvector coordinates ($L \ll N$).

The kernel width parameter $\sigma$ represents the rate at which the similarity between two points decays. There is no definitive theory to guide the choice of $\sigma$. Several heuristics have been proposed, but they boil down to trading off the sparseness of the kernel matrix (small $\sigma$) against an adequate characterization of the true affinity between two points. One of the main reasons for using spectral clustering methods is that, with sparse kernel matrices, long-range affinities are accommodated through the chaining of many local interactions, as opposed to standard Euclidean distance methods (e.g., correlation) that impute global influence into each pairwise affinity metric, causing long-range interactions to dominate local interactions.

III. Gene filtering

As mentioned above, we consider a setting where the vast majority of the original variables (genes) are non-informative for the discrimination of cancer types. Removing these genes before applying diffusion maps is important because distances computed over all features contain a great deal of noise, which renders them almost entirely uninformative. Even in the supervised case, noise and high dimensionality may have a detrimental effect on standard statistical methods of classification and regression, leading to error terms of the form (variance) × d/N; thus, when d/N is large, errors can be quite large [33]. In the rest of this section, we discuss three methods for selecting informative genes, using the SRBCT data set (introduced in Section V) as an example.

Figure 1 here

The first is the correlation coefficient of a single variable with other variables. The correlation coefficient captures the statistical strength of a linear relationship between variables. In gene expression data sets, there are usually quite a few genes that are highly correlated. In our analysis, we assume that highly correlated genes can be useful for distinguishing different types of cancer. Fig. 1 depicts four of the most correlated pairs of genes in the SRBCT data set: (4, 58), (187, 509), (1915, 1916), and (1237, 2273), with correlation coefficients 0.9507, 0.9455, 0.9035, and 0.8933, respectively. It can be seen that all of these gene pairs can be beneficial for separating one group of samples from the others. This also makes sense because we expect one class of samples to behave differently on more than a single feature; therefore, it should not be surprising that such genes are correlated.
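A filter of this kind is straightforward to express. The sketch below is our own illustration (the function name and the threshold value are hypothetical, not taken from the paper): it retains every gene whose strongest absolute correlation with any other gene exceeds a cutoff.

```python
import numpy as np

def correlation_filter(X, threshold=0.85):
    """Keep genes highly correlated with at least one other gene.

    X: (n_samples, n_genes) expression matrix; the threshold is a
    hypothetical tuning parameter, not a value from the paper."""
    corr = np.corrcoef(X, rowvar=False)        # gene-by-gene correlations
    np.fill_diagonal(corr, 0.0)                # ignore self-correlation
    max_abs_corr = np.abs(corr).max(axis=0)    # strongest partner per gene
    return np.where(max_abs_corr >= threshold)[0]
```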
Figure 2 here

Large variance is another statistical indicator of a feature's usefulness in discriminating between categories. Discriminating genes exhibit quite different behavior across cancer categories, leading to large variance. Fig. 2 shows eight of the features with the highest variance, grouped in pairs for visualization purposes. Even though these variables were chosen without knowledge of the cancer labels, it is clear that they are very informative in disclosing the category structure.

Figure 3 here

A third statistical method is to select variables with a bimodal (double-hump) probability density. Fig. 3 illustrates the histograms of four features that display double-hump densities, which resemble the sum of two well-separated Gaussians. As with the previous two methods, such features are also useful in clustering.
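The variance criterion is a one-liner, and the double-hump criterion can be approximated in several ways; the sketch below uses a two-component Gaussian mixture as one plausible stand-in, since the paper does not specify a formal bimodality test. The function names and the separation threshold are our own assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def variance_filter(X, top_k=10):
    """Indices of the top_k genes by variance across samples."""
    return np.argsort(X.var(axis=0))[::-1][:top_k]

def looks_bimodal(x, min_separation=2.0):
    """Crude double-hump check: fit a 2-component Gaussian mixture and
    ask whether the means are well separated relative to the component
    spreads. The criterion is our heuristic, not the paper's test."""
    gm = GaussianMixture(n_components=2, random_state=0).fit(x.reshape(-1, 1))
    means = gm.means_.ravel()
    stds = np.sqrt(gm.covariances_.ravel())
    return abs(means[0] - means[1]) > min_separation * stds.mean()
```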
IV. Fuzzy ART

FA is based on Adaptive Resonance Theory (ART) [32], which was inspired by neural modeling research and was developed as a solution to the plasticity-stability dilemma: how adaptable (plastic) a learning system should be so that it does not suffer from catastrophic forgetting of previously learned rules (stability). FA incorporates fuzzy set theory into ART and extends the ART family by allowing stable recognition clusters in response to both binary and real-valued input patterns, with either fast or slow learning [31]. FA belongs to the category of constructive clustering algorithms, which have the advantage of determining the number of clusters dynamically during clustering, without requiring this parameter to be specified in advance, as K-means does. FA has many other desirable characteristics, such as fast and stable learning, a transparent learning paradigm, atypical pattern detection, and easy implementation.

Figure 4 here

The basic FA architecture consists of two layers of nodes or neurons, the feature representation field F1 and the category representation field F2, as shown in Fig. 4. The neurons in layer F1 are activated by the input pattern, while the prototypes of the formed clusters are stored in layer F2. The neurons in layer F2 that are already being used as representations of input patterns are said to be committed; correspondingly, an uncommitted neuron encodes no input patterns. The two layers are connected via adaptive weights $w_j$, emanating from node $j$ in layer F2. After an input pattern is presented, the neurons in layer F2 (a certain number of committed neurons and one uncommitted neuron) compete by calculating the category choice function

$$T_j = \frac{|x \wedge w_j|}{\alpha + |w_j|}, \qquad (9)$$

where $\wedge$ is the fuzzy AND operator, defined by

$$(x \wedge y)_i = \min(x_i, y_i), \qquad (10)$$

and $\alpha > 0$ is the choice parameter, which breaks the tie when more than one prototype vector is a fuzzy subset of the input pattern. The winner is selected by the winner-take-all rule,

$$T_J = \max_j \{T_j\}. \qquad (11)$$

The winning neuron $J$ then becomes activated, and an expectation is reflected in layer F1 and compared with the input pattern. The orienting subsystem, with the pre-specified vigilance parameter $\rho$ ($0 \le \rho \le 1$), determines whether the expectation and the input pattern are closely matched. The larger the value of $\rho$, the fewer mismatches will be tolerated, and therefore the more clusters are likely to be generated. If the match meets the vigilance criterion,

$$\frac{|x \wedge w_J|}{|x|} \ge \rho, \qquad (12)$$

weight adaptation occurs: learning starts and the weights are updated using the learning rule

$$w_J^{(\mathrm{new})} = \beta \left( x \wedge w_J^{(\mathrm{old})} \right) + (1 - \beta) \, w_J^{(\mathrm{old})}, \qquad (13)$$

where $\beta \in [0,1]$ is the learning rate parameter. This procedure is called resonance, which suggests the name ART. On the other hand, if the vigilance criterion is not met, a reset signal is sent back to layer F2 to shut off the current winning neuron, which remains disabled for the entire duration of the presentation of this input pattern, and a new competition is performed among the remaining neurons. The new expectation is then projected into layer F1, and this process repeats until the vigilance criterion is met. In the case that an uncommitted neuron is selected for coding, a new uncommitted neuron is created to represent a potential new cluster.

A practical problem in applying FA is the possibility of cluster proliferation, which occurs as a result of arbitrarily small norms of input patterns [31]. Since the norm of a weight vector does not increase during learning, many low-valued prototypes may be generated and never accessed again. The solution to the cluster proliferation problem is to normalize the input patterns x so that

$$|x| \equiv \gamma, \qquad \gamma > 0. \qquad (14)$$

In FA, a normalization rule known as complement coding is used to normalize input patterns while preserving amplitude information. Specifically, a $d$-dimensional input pattern $x = (x_1, \ldots, x_d)$ is expanded into the $2d$-dimensional vector

$$x^* = (x, x^c) = (x_1, \ldots, x_d, x_1^c, \ldots, x_d^c), \qquad (15)$$

where $x_i^c = 1 - x_i$ for all $i$. A direct calculation shows that input patterns in complement coding form are automatically normalized:

$$|x^*| = |(x, x^c)| = \sum_{i=1}^{d} x_i + \left( d - \sum_{i=1}^{d} x_i \right) = d. \qquad (16)$$
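Putting Eqs. (9)-(16) together, a single-pass Fuzzy ART procedure can be sketched compactly. This is our own minimal rendering, not the authors' code: inputs are assumed pre-scaled to [0, 1], α = 0.1 follows the value used in Section V, and ρ and β are free parameters (β = 1 corresponds to fast learning). Resets are implemented by trying categories in decreasing order of their choice values, and a new neuron is committed when no existing prototype passes vigilance.

```python
import numpy as np

def fuzzy_art(X, rho=0.5, alpha=0.1, beta=1.0):
    """Single-pass Fuzzy ART following Eqs. (9)-(16).

    X: (n_samples, d) array with entries scaled to [0, 1]; rho and
    beta are assumed parameter choices, alpha = 0.1 as in Section V."""
    X = np.hstack([X, 1.0 - X])              # Eq. (15): complement coding
    weights, labels = [], []
    for x in X:
        order = []
        if weights:
            W = np.array(weights)
            # Eq. (9): category choice function with fuzzy AND, Eq. (10)
            T = np.minimum(x, W).sum(axis=1) / (alpha + W.sum(axis=1))
            order = np.argsort(T)[::-1]      # Eq. (11): winner-take-all, with resets
        for j in order:
            # Eq. (12): vigilance test (|x| = d by Eq. (16))
            if np.minimum(x, weights[j]).sum() / x.sum() >= rho:
                # Eq. (13): resonance, update the winning prototype
                weights[j] = beta * np.minimum(x, weights[j]) + (1 - beta) * weights[j]
                labels.append(j)
                break
        else:
            weights.append(x.copy())         # commit a new (uncommitted) neuron
            labels.append(len(weights) - 1)
    return np.array(labels)
```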
V. Experimental results

We applied the proposed method to the data set from diagnostic research on the small round blue-cell tumors (SRBCTs) of childhood. The SRBCT data set consists of 83 samples from four categories, known as Burkitt lymphoma (BL), the Ewing family of tumors (EWS), neuroblastoma (NB), and rhabdomyosarcoma (RMS) [13]. Expression levels of 6,567 genes were measured with cDNA microarrays for each sample; 2,308 of these genes passed the filter requiring the red intensity of a gene to be greater than 20 and were kept for further analysis. The relative red intensity (RRI) of a gene is defined as the ratio between the mean intensity of that particular spot and the mean intensity of all filtered genes, and the ultimate expression level measure is the natural logarithm of the RRI. The data are expressed as a matrix $X = \{x_{ij}\}_{83 \times 2308}$. In our further analysis, an additional logarithm was taken to linearize the relations between different genes and to compress very high expression levels.

We compare the results of a diffusion map analysis on all features (genes) with a diffusion map analysis only on those features suspected to be informative (in our case, those highly correlated with others, having large variance, or showing a double-hump density). Thirty genes were selected based on these criteria, with the following indexes: idx = [4, 33, 58, 107, 129, 187, 509, 672, 735, 819, 989, 1237, 1263, 1594, 1750, 1769, 1781, 1803, 1834, 1890, 1915, 1916, 2046, 2050, 2060, 2086, 2211, 2214, 2273, 2290]. The projections onto the resulting diffusion map coordinates for the two cases are presented in Figs. 5 and 6. When all 2,308 variables are used, the first two coordinates capture variability in the data that has nothing to do with the classes that we seek; only diffusion map coordinates 3-4 and 5-6 are somewhat useful.

Figure 5 here

In the case of the reduced data set with thirty genes, the leading diffusion map coordinates ψ1,…,ψ4 are very informative for class separation, while ψ5 and ψ6 are not. In addition, the cluster represented in red is now clearly separated from the others, and subsequent clustering algorithms should be able to identify it as a coherent cluster of its own.

Figure 6 here

In our experiments, we found that the clustering results are not sensitive to the category choice parameter α, which is therefore set to 0.1 in the remainder of this study. We adjusted the kernel width parameter σ and the vigilance parameter ρ, and observed the performance of the proposed method. Because we already have a pre-specified partition H of the data set, which is independent of the clustering structure C produced by a clustering algorithm, performance can be evaluated by comparing C to H in terms of external criteria, such as the Rand index [34]. Considering a pair of tissue samples x_i and x_j, there are four cases, based on how x_i and x_j are placed in C and H:
Case 1: x_i and x_j belong to the same cluster of C and the same category of H.
Case 2: x_i and x_j belong to the same cluster of C but different categories of H.
Case 3: x_i and x_j belong to different clusters of C but the same category of H.
Case 4: x_i and x_j belong to different clusters of C and different categories of H.
The numbers of sample pairs in the four cases are denoted a, b, c, and d, respectively. The Rand index used in our analysis is then defined as

$$R = \frac{a + d}{a + b + c + d}. \qquad (17)$$

As can be seen from the definition, the larger the value of R, the more similar C and H are (a small computational sketch of this index is given after the discussion of Fig. 7 below).

Fig. 7 shows the best Rand index scores for diffusion maps and FA on the entire data set and on the reduced data set with the features selected by the methods discussed in Section III. For comparison, we also show the best results of a hierarchical clustering (HC) algorithm (single linkage) and of the K-means (KM) algorithm with random initialization; for both, the Euclidean distance is used as the distance function. HC algorithms with other between-cluster distance functions, such as complete linkage, group average linkage, and centroid linkage, were also tried; their results show no significant improvement or deterioration over single linkage and are not reported here. The figure shows that filtering the features in advance consistently improves the clustering performance. Diffusion maps are also important in exposing the data structure: without them, the performance of all three clustering algorithms deteriorates dramatically, especially for hierarchical clustering. Another observation from Fig. 7 is that FA achieves better partitions of the given samples than the other two methods.

Figure 7 here
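As a minimal illustration of Eq. (17) (our own sketch; the function name is hypothetical), the index can be computed directly from two label vectors:

```python
from itertools import combinations

def rand_index(labels_c, labels_h):
    """Eq. (17): R = (a + d) / (a + b + c + d) over all sample pairs."""
    agree = total = 0
    for i, j in combinations(range(len(labels_c)), 2):
        same_c = labels_c[i] == labels_c[j]    # same cluster in C?
        same_h = labels_h[i] == labels_h[j]    # same category in H?
        agree += int(same_c == same_h)         # cases 1 (a) and 4 (d)
        total += 1                             # a + b + c + d
    return agree / total
```

For the 83 SRBCT samples, labels_h would hold the four known categories and labels_c the cluster assignments produced by FA, HC, or KM.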
Tables 1 and 2 further summarize the results of diffusion maps and FA on the original and reduced data sets. We investigate the method's performance for transformed-space dimensions of 5, 10, 15, 20, and 50 when the entire data set is used, and of 5, 8, 10, and 15 when the 30-gene subset is used. For each designated dimension, we adjusted the kernel width parameter σ and the vigilance parameter ρ. As shown in the tables, both parameters play an important role in the sample partition. However, there is still no effective criterion for choosing these parameters, and their selection is based on cross-validation. Again, the effectiveness of feature selection before applying diffusion maps is demonstrated.

Table 1 here
Table 2 here

Fig. 8 shows the influence of the vigilance parameter on the number of generated clusters and on the Rand index score. The kernel width parameter σ and the new dimension L are set to 6 and 10, respectively. As noted above, when ρ increases, the number of generated clusters increases as well. On the other hand, the largest value of the Rand index is achieved at ρ = 0.3; with a stricter vigilance test, samples that belong to the same category are divided into more, smaller clusters, causing the Rand index to decrease.

Figure 8 here

VI. Conclusions

Cancer classification based on gene expression profiles provides a promising approach to cancer diagnosis and treatment. Here, we propose the use of feature selection methods and diffusion maps to address the problem of high dimensionality, a major challenge in gene expression data analysis. Fuzzy ART is then used to form the clusters of cancer samples. The experimental results on the SRBCT data set demonstrate the potential of the proposed methods to extract useful information from high-dimensional gene expression data.

Acknowledgment

Partial support for this research from the National Science Foundation and from the M.K. Finley Missouri endowment is gratefully acknowledged.

References

[1] A. Miniño, M. Heron, and B. Smith, Deaths: Preliminary data for 2004, National Vital Statistics Reports 54 (2006).
[2] M. Schena, D. Shalon, R. Davis, and P. Brown, Quantitative monitoring of gene expression patterns with a complementary DNA microarray, Science 270 (1995) 467-470.
[3] R. Lipshutz, S. Fodor, T. Gingeras, and D. Lockhart, High density synthetic oligonucleotide arrays, Nature Genetics 21 (1999) 20-24.
[4] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander, Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring, Science 286 (1999) 531-537.
[5] A. Alizadeh, M. Eisen, R. Davis, C. Ma, I. Lossos, A. Rosenwald, J. Boldrick, H. Sabet, T. Tran, X. Yu, J. Powell, L. Yang, G. Marti, T. Moore, J. Hudson, Jr., L. Lu, D. Lewis, R. Tibshirani, G. Sherlock, W. Chan, T. Greiner, D. Weisenburger, J. Armitage, R. Warnke, R. Levy, W. Wilson, M. Grever, J. Byrd, D. Botstein, P. Brown, and L. Staudt, Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 403 (2000) 503-511.
[6] R. Shyamsundar, Y. Kim, J. Higgins, K. Montgomery, M. Jorden, A. Sethuraman, M. van de Rijn, D. Botstein, P. Brown, and J. Pollack, A DNA microarray survey of gene expression in normal human tissues, Genome Biology 6 (2005) R22.
[7] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad. Sci. USA 96 (1999) 6745-6750.
[8] M. Bittner, P. Meltzer, Y. Chen, Y. Jiang, E. Seftor, M. Hendrix, M. Radmacher, R. Simon, Z. Yakhini, A. Ben-Dor, N. Sampas, E. Dougherty, E. Wang, F. Marincola, C. Gooden, J. Lueders, A. Glatfelter, P. Pollock, J. Carpten, E. Gillanders, D. Leja, K. Dietrich, C. Beaudry, M. Berens, D. Alberts, V. Sondak, N. Hayward, and J. Trent, Molecular classification of cutaneous malignant melanoma by gene expression profiling, Nature 406 (2000) 536-540.
[9] L. Dyrskjøt, T. Thykjaer, M. Kruhøffer, J. Jensen, N. Marcussen, S. Hamilton-Dutoit, H. Wolf, and T. Ørntoft, Identifying distinct classes of bladder carcinoma using microarrays, Nature Genetics 33 (2003) 90-96.
[10] Y. Wang, J. Klijn, Y. Zhang, A. Sieuwerts, M. Look, F. Yang, D. Talantov, M. Timmermans, M. Gelder, J. Yu, T. Jatkoe, E. Berns, D. Atkins, and J. Foekens, Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer, Lancet 365 (2005) 671-679.
[11] M. Garber, O. Troyanskaya, K. Schluens, S. Petersen, Z. Thaesler, M. Pacyna-Gengelbach, M. Rijn, G. Rosen, C. Perou, R. Whyte, R. Altman, P. Brown, D. Botstein, and I. Petersen, Diversity of gene expression in adenocarcinoma of the lung, Proc. Natl. Acad. Sci. USA 98 (2001) 13784-13789.
[12] J. Khan, J. Wei, M. Ringnér, L. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. Antonescu, C. Peterson, and P. Meltzer, Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nature Medicine 7 (2001) 673-679.
[13] T. Li, C. Zhang, and M. Ogihara, A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression, Bioinformatics 20 (2004) 2429-2437.
[14] A. Statnikov, C. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy, A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis, Bioinformatics 21 (2005) 631-643.
[15] R. Xu, G. Anagnostopoulos, and D. Wunsch II, Multi-class cancer classification using semi-supervised Ellipsoid ARTMAP and particle swarm optimization with gene expression data, IEEE/ACM Transactions on Computational Biology and Bioinformatics 4 (2007) 65-77.
[16] A. Tan, D. Naiman, L. Xu, R. Winslow, and D. Geman, Simple decision rules for classifying human cancers from gene expression profiles, Bioinformatics 21 (2005) 3896-3904.
[17] L. Wang, F. Chu, and W. Xie, Accurate cancer classification using expressions of very few genes, IEEE/ACM Transactions on Computational Biology and Bioinformatics 4 (2007) 40-53.
[18] R. Xu and D. Wunsch II, Survey of clustering algorithms, IEEE Transactions on Neural Networks 16 (2005) 645-678.
[19] R. Bellman, Adaptive Control Processes: A Guided Tour (Princeton University Press, Princeton, 1961).
[20] K. Yeung and W. Ruzzo, Principal component analysis for clustering gene expression data, Bioinformatics 17 (2001) 763-774.
[21] D. Nguyen and D. Rocke, Multi-class cancer classification via partial least squares with gene expression profiles, Bioinformatics 18 (2002) 1216-1226.
[22] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini, Tissue classification with gene expression profiles, in Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (2000) 583-598.
[23] R. Shen, D. Ghosh, A. Chinnaiyan, and Z. Meng, Eigengene-based linear discriminant model for tumor classification using gene expression microarray data, Bioinformatics 22 (2006) 2635-2642.
[24] G. Cawley and N. Talbot, Gene selection in cancer classification using sparse logistic regression with Bayesian regularization, Bioinformatics 22 (2006) 2348-2355.
[25] Y. Tang, Y. Zhang, and Z. Huang, Development of two-stage SVM-RFE gene selection strategy for microarray expression data analysis, IEEE/ACM Transactions on Computational Biology and Bioinformatics 4 (2007) 365-381.
[26] R. Xu, S. Damelin, and D. Wunsch II, Applications of diffusion maps in gene expression data-based cancer diagnosis analysis, in Proceedings of the 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (2007) 4613-4616.
[27] R. Xu, S. Damelin, and D. Wunsch II, Clustering of cancer tissues using diffusion maps and fuzzy ART with gene expression data, in Proceedings of the World Congress on Computational Intelligence 2008 (2008) 183-188.
[28] R. Coifman and S. Lafon, Diffusion maps, Applied and Computational Harmonic Analysis 21 (2006) 5-30.
[29] S. Lafon and A. Lee, Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (2006) 1393-1403.
[30] S. Lafon, Y. Keller, and R. Coifman, Data fusion and multicue data matching by diffusion maps, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (2006) 1784-1797.
[31] G. Carpenter, S. Grossberg, and D. Rosen, Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system, Neural Networks 4 (1991) 759-771.
[32] G. Carpenter and S. Grossberg, A massively parallel architecture for a self-organizing neural pattern recognition machine, Computer Vision, Graphics, and Image Processing 37 (1987) 54-115.
[33] B. Nadler and R. Coifman, The prediction error in CLS and PLS: The importance of feature selection prior to multivariate calibration, Journal of Chemometrics 19 (2005) 107-118.
[34] A. Jain and R. Dubes, Algorithms for Clustering Data (Prentice Hall, Englewood Cliffs, 1988).
[35] A. Tanay, R. Sharan, and R. Shamir, Biclustering algorithms: A survey, in: S. Aluru, ed., Handbook of Computational Molecular Biology (Chapman & Hall/CRC, 2006).
[36] S. Madeira and A. Oliveira, Biclustering algorithms for biological data analysis: A survey, IEEE Transactions on Computational Biology and Bioinformatics 1 (2004) 24-45.
[37] B. Nadler, S. Lafon, R. Coifman, and I. Kevrekidis, Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators, in: Y. Weiss, B. Schölkopf, and J. Platt, eds., Advances in Neural Information Processing Systems 18 (MIT Press, Cambridge, 2006) 955-962.

Fig. 1. Expression levels of four pairs of the most correlated genes. The different point types correspond to the four categories.
Fig. 2. Expression levels of four pairs of genes with the highest variance. The different point types correspond to the four categories.
Fig. 3. Histogram of four genes that show double-hump density.
Fig. 4. Topological structure of Fuzzy ART. Layers F1 and F2 are connected via adaptive weights W. The orienting subsystem is controlled by the vigilance parameter ρ.
Fig. 5. Diffusion map results when working with all 2,308 variables. The different point types correspond to the four categories.
Fig. 6. Diffusion map results when working only with the genes selected in the array idx. The different point types correspond to the four categories.
Fig. 7. The best clustering scores of the Rand index for the SRBCT data set. The dimension is set as 2,308, 15, 10, and 5, respectively. The order of the bars, from left to right, is FA with gene selection (GS), FA, HC, and KM. The result for FA+GS is unavailable for the entire data set.

Fig. 8. The vigilance parameter vs. the number of clusters and the Rand index score for the reduced SRBCT data set with the thirty selected genes.

Table 1. Performance results of diffusion maps and Fuzzy ART on the entire SRBCT data set. Entries are Rand index (RI) values, with the corresponding ρ in parentheses.

        L=5            L=10           L=15           L=20           L=50
σ=22    0.7417 (0.5)   0.8569 (0.3)   0.8795 (0.35)  0.8707 (0.25)  0.8437 (0.3)
σ=24    0.7661 (0.5)   0.8260 (0.35)  0.8290 (0.35)  0.8346 (0.4)   0.8149 (0.35)
σ=26    0.7802 (0.5)   0.8187 (0.45)  0.8560 (0.35)  0.8795 (0.35)  0.8175 (0.5)
σ=28    0.7761 (0.5)   0.9019 (0.2)   0.8431 (0.2)   0.8284 (0.25)  0.8137 (0.55)
σ=30    0.7743 (0.45)  0.8601 (0.3)   0.8619 (0.3)   0.8578 (0.4)   0.8354 (0.35)
σ=32    0.7708 (0.6)   0.8760 (0.2)   0.8322 (0.25)  0.8160 (0.45)  0.8196 (0.6)

Table 2. Performance results of diffusion maps and Fuzzy ART on the reduced SRBCT data set with 30 genes. Entries are Rand index (RI) values, with the corresponding ρ in parentheses.

        σ=1            σ=2            σ=4            σ=6            σ=8            σ=10
L=5     0.5498 (0.6)   0.9227 (0.5)   0.8490 (0.55)  0.8554 (0.45)  0.8110 (0.6)   0.8237 (0.35)
L=8     0.6100 (0.2)   0.8751 (0.45)  0.8598 (0.5)   0.8463 (0.2)   0.8516 (0.2)   0.8422 (0.45)
L=10    0.8149 (0.45)  0.9042 (0.2)   0.8554 (0.5)   0.8845 (0.3)   0.8422 (0.45)  0.8287 (0.35)
L=15    0.9124 (0.5)   0.9427 (0.3)   0.8842 (0.4)   0.9042 (0.35)  0.8516 (0.5)   0.7749 (0.4)