Random projections versus random selection of features for classification of high dimensional data

Sachin Mylavarapu, Ata Kaban
School of Computer Science, The University of Birmingham, Edgbaston, B15 2TT Birmingham, UK

Abstract—Random projections and random subspace methods are very simple and computationally efficient techniques to reduce dimensionality for learning from high dimensional data. Since high dimensional data is prevalent in many domains, such techniques are the subject of much recent interest. Random projections (RP) are motivated by their proven ability to preserve inter-point distances. By contrast, the random selection of features (RF) appears to be a heuristic, which nevertheless exhibits good performance in previous studies. In this paper we conduct a thorough empirical comparison between these two approaches on a variety of data sets with different characteristics. We also extend our study to multi-class problems. We find that RP tends to perform better than RF in terms of classification accuracy in small sample settings, although RF is surprisingly good as well in many cases.

Keywords—random projections, random subspace, high dimensions, dimensionality reduction, classification

I. INTRODUCTION

Increasingly there are a number of application domains which emphasize the need for analysing high dimensional data. Bioinformatics and e-commerce are a few domains in which data is very high dimensional. Gene expression microarray data sets can contain tens of thousands of dimensions, each of which corresponds to a gene expression level in some experimental condition. Hundreds of thousands of products may need to be considered in consumer purchase behaviour data sets. There is a growing interest in industry and academia in analysing such data sets [2].

One of the main challenges associated with analysing high dimensional data is the curse of dimensionality: the complexity of many existing data mining algorithms increases exponentially with the number of dimensions. With increasing dimensions the algorithms become computationally intractable and thus inapplicable in many applications.

In this work we focus on supervised learning for classification of high dimensional data. Supervised learning for classification is a powerful data analysis tool. The objective of classification is to build a prediction model that captures the inherent relations between the attributes and the class identity, so that an unknown object can be accurately classified [18]. The classifier is built using a labeled training set and its accuracy may be validated using a test set. There are several algorithms in use for supervised classification: Naïve Bayes, Neural Networks, Decision Trees, Support Vector Machines (SVM), Fisher Linear Discriminant Analysis (LDA) and others [18].

When the data is high dimensional, dimensionality reduction techniques are required. Feature selection is a way to reduce dimensionality so that only a subset of features, deemed to be important, is selected for the classification. Another way of dimensionality reduction is feature construction, e.g. by a projection of the feature space into a lower dimensional space. An example of the latter is to perform Principal Component Analysis (PCA) on the high dimensional data and subsequently perform classification in the space spanned by the prominent principal components identified.
These techniques are computationally expensive in general, and in some cases they may lead to inaccurate classification [1]. Thus there is a need for alternative solutions to efficiently and accurately classify high dimensional data. A computationally inexpensive technique of dimensionality reduction is the so-called Random Projections (RP) method. This is a non-adaptive feature construction approach that has received much attention in recent years, from rigorous theoretical studies to experimental investigations. An experimental study by Fradkin and Madigan [7] compared the performance and efficiency of PCA and RP in conjunction with a number of classifiers for 2-class problems: C4.5, 1-Nearest Neighbor, 5-Nearest Neighbor and SVM. They found that classification performance in a random projection space is not far from that obtained on the full data. The theoretical justification of RP has initially been the Johnson-Lindenstrauss Lemma, which guarantees a global preservation of inter-point distances. Stronger guarantees have been obtained more recently for the Fisher Linear Discriminant classifier (FLD) [14], which more closely reflect the good performance observed in practice. On a more heuristic basis, a random selection of features (RF) is also in use in practice [23][24]. However, not much is known about the comparative strengths and weaknesses of these two seemingly analogous methods. In this paper we aim to fill this gap.

II. BACKGROUND AND RELATED WORK

A. Dimensionality Reduction

Dimensionality reduction techniques preserve the structure of the data as much as possible while reducing the number of dimensions. Thus in the reduced space the execution time of the algorithms is reduced, since we work with fewer dimensions. Also, since the structure is preserved, the results obtained can be a reliable approximation of those in the original data space.

1) Principal Component Analysis

Principal Component Analysis (PCA) is a widely used and popular technique for dimensionality reduction. PCA finds the directions of maximal variance in the data, with the hope that the components with the lowest variation are irrelevant or represent noise and hence are better discarded [10]. One of the drawbacks of PCA is that maximal variation does not always preserve the class structure. Even when it does, PCA requires an eigenvalue decomposition of the covariance matrix, which is a time consuming step. The runtime complexity of PCA is $O(np^2) + O(p^3)$ for a $p$-dimensional data set with $n$ points, which may make it infeasible for data with very high dimensionality [7].

2) Random Projections

Another method for dimensionality reduction is Random Projections, a very simple yet powerful technique. In this method the data is projected onto a random subspace, which approximately preserves the Euclidean distances between all pairs of points after the projection [5]. The Johnson-Lindenstrauss lemma (JLL) guarantees that for a set of $N$ points in $p$ dimensions there is a linear transformation to a $q$-dimensional random subspace that preserves the Euclidean distance between any two data points up to a factor of $(1 \pm \epsilon)$, provided the number of projected dimensions is of the order $q = O(\epsilon^{-2}\log N)$, where $\epsilon$ is a small constant such that $0 < \epsilon < 1$ [10]. This result implies that the original dimensionality is irrelevant as far as the distance preservation is concerned.
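For reference, a standard statement of the lemma reads as follows (the absolute constant $C$ varies between sources and is not specified in this paper):

Lemma (Johnson-Lindenstrauss). Let $0 < \epsilon < 1$ and let $X$ be a set of $N$ points in $\mathbb{R}^p$. If $q \ge C\,\epsilon^{-2}\log N$ for a suitable absolute constant $C$, then there exists a linear map $f: \mathbb{R}^p \to \mathbb{R}^q$ such that, for all $u, v \in X$,
$$(1-\epsilon)\,\|u-v\|^2 \;\le\; \|f(u)-f(v)\|^2 \;\le\; (1+\epsilon)\,\|u-v\|^2.$$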
What matters is the number of points that get projected and the accuracy with which we want to preserve the distances. An important thing to note is that the bounds provided by Johnson-Lindenstrauss are rather loose, and in practice the number of dimensions to project to in order to preserve the relevant distances may be much lower [14,7]. The original result of Johnson and Lindenstrauss was an existence result that did not say how to obtain the linear transform. Later work by Dasgupta [6] has shown that certain random matrices fulfil the JLL guarantee with high probability, and there are several ways to generate a random projection matrix [10]. The method used in this paper is to generate a random matrix with i.i.d. Gaussian entries. Two properties of such a matrix help build intuition: any two rows of the random projection matrix are approximately orthogonal to each other, and have approximately the same length. In essence the random projection is an approximate isometry. Thus, there is no need to normalize the rows to unit length or to orthogonalise the random projection matrix in practice [10].

The following are the steps to reduce the dimensionality of the data by random projections. Suppose that we have a data set $X = \{x_1, \ldots, x_n\}$, where each data point $x_i \in \mathbb{R}^p$ is a $p$-dimensional vector, and we need to reduce the data to a $q$-dimensional space with $1 \le q \ll p$.

1) Arrange the data into a $p \times n$ matrix, where $p$ is the dimensionality of the data and $n$ is the number of data points.
2) Generate a $q \times p$ random projection matrix $R$, e.g. using the MATLAB randn(q, p) function.
3) Multiply the random projection matrix with the original data in order to project the data down into the $q$-dimensional random projection space.

Thus we can see that transforming the data to a random projection space is a simple matrix multiplication, with the guarantee of distance preservation. Hence, the random projection technique is much more efficient than PCA, with a runtime complexity of only $O(pqn)$.

3) Random Selection of Features

Another technique explored in this work is random feature selection (RF). Here the features are selected uniformly at random out of the original features, and the classification algorithm is run in the resulting smaller feature space. This method has also been used in the literature as a means of building an ensemble of classifiers [23][24], each of which receives a random subset of the original features. However, in this work we assess RF with a single classifier in order to compare it to the RP approach.

B. Supervised learning

Supervised learning is the task of learning to predict the class labels of unclassified objects based on their attributes (features). Usually a supervised learning algorithm is trained with a finite number of labeled examples, and it is expected to generalize to the underlying distribution of the data such that it can make predictions on previously unseen data. Formally, given a set of training examples of the form $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, a supervised algorithm seeks a hypothesis $h: X \to Y$ by training the classifier, where $X$ is the input space and $Y$ the output space. For an unlabeled example $x_k$, the prediction of the class label is obtained by presenting the input vector $x_k$ to the hypothesis, yielding $h(x_k)$.
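To make the two reduction schemes concrete before turning to the classifier, the following is a minimal sketch in Python/NumPy. The paper's own experiments use MATLAB; the function names and the fixed seed here are illustrative assumptions, not part of the original study.

```python
import numpy as np

def random_projection(X, q, seed=0):
    """Project the rows of X (an n x p data matrix) into a q-dimensional random subspace.

    The projection matrix has i.i.d. Gaussian entries, as described above;
    no normalisation or orthogonalisation is applied.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    R = rng.standard_normal((q, p))   # q x p random matrix, analogous to MATLAB randn(q, p)
    return X @ R.T                    # n x q projected data

def random_feature_selection(X, q, seed=0):
    """Keep q of the original features, chosen uniformly at random without replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[1], size=q, replace=False)
    return X[:, idx], idx
```

Both reductions cost at most the $O(pqn)$ matrix multiplication noted above; RF is even cheaper, since it only indexes columns.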
a) The Fisher Linear Discriminant Analysis classifier

Fisher's linear discriminant analysis (FLD, LDA), first introduced by Fisher, is a machine-learning algorithm which aims at finding the linear combination of features that best separates two or more classes of objects [16]. LDA is simple, easy to implement and efficient. We have chosen to employ it in this work, a decision driven by its simplicity and by the detailed theoretical understanding of its behaviour in the RP space [14]. FLD was found to give results comparable to those of Support Vector Machines (SVM) in some cases [15]. In what follows we review the basics of Fisher linear discriminant analysis for binary and multi-class problems.

Two-Class LDA. Suppose that we have a set of $n$ $p$-dimensional vectors $x_1, x_2, \ldots, x_n$, where $x_i = (x_{i1}, \ldots, x_{ip})$, and each point belongs to one of two classes, $c_1$ or $c_2$. The scatter matrices of the two classes are $S_i = \sum_{x \in c_i} (x - \mu_i)(x - \mu_i)^T$, $i = 1, 2$, where $\mu_i$ is the mean of class $c_i$. Thus the total intra-class scatter is given by

$S_W = \sum_{i=1}^{2} \sum_{x \in c_i} (x - \mu_i)(x - \mu_i)^T$   (1)

The inter-class scatter matrix is given by

$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$   (2)

The objective is to obtain a scalar $y = w^T x$ by projecting the vectors onto a direction $w$ [16],[4]. Fisher's criterion is to choose the linear transformation $w$ that maximizes the ratio of the inter-class scatter of the projected samples to the intra-class scatter of the projected samples [15][16]:

$J(w) = \dfrac{w^T S_B w}{w^T S_W w}$   (3)

Maximising equation (3) can be posed as a generalized eigenvalue problem, and $w$ is given by the leading eigenvector of the matrix $S_W^{-1} S_B$ [15]. Taking the derivative of (3) and equating it to zero leads to [4]

$w \propto S_W^{-1}(\mu_1 - \mu_2)$.

Classification with two-class LDA is then a decision against a threshold constant $c$: a vector $x_k$ is assigned to class $c_1$ if $w^T x_k > c$, and to class $c_2$ otherwise. Under the assumption that both classes have the same covariance shape, the separating hyperplane is placed halfway between the projections of the two class means $\mu_1$ and $\mu_2$ [9], so the parameter $c$ is given by

$c = w^T(\mu_1 + \mu_2)/2$.

Multi-Class LDA. Multiple discriminant analysis is a natural extension of Fisher linear discriminant analysis to the case where the number of classes is more than two. The data is projected from a high dimensional space to a lower dimensional space by a transformation that maximizes the ratio of the inter-class scatter to the intra-class scatter; the maximization now leads to a decision among several competing classes. When there are $c$ classes, the intra-class scatter matrix is calculated similarly to the two-class case [15]:

$S_W = \sum_{i=1}^{c} \sum_{x \in c_i} (x - \mu_i)(x - \mu_i)^T$,

where class $c_i$ contains $n_i$ points and its centre (mean) is $\mu_i = \frac{1}{n_i}\sum_{x \in c_i} x$. The inter-class scatter matrix is given by

$S_B = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^T$,

where $n_i$ is the number of training samples of each class, $\mu_i$ is the mean of each class and $\mu$ is the total mean vector, i.e. the average of the entire data set, $\mu = \frac{1}{n}\sum_{k=1}^{n} x_k$. Once the intra-class and inter-class scatter matrices have been obtained, the task is again to maximize the Rayleigh quotient [15]

$J(W) = \dfrac{|W^T S_B W|}{|W^T S_W W|}$.

The transformation $W$ can be obtained by solving the generalized eigenvalue problem [15]

$S_B w = \lambda S_W w$   (4)

The solution [4][16] to the generalized eigenvalue problem of equation (4) is given by the eigenvectors of $S_W^{-1} S_B$ associated with the largest eigenvalues. Classification with multi-class LDA is then performed in the transformed space based on the cosine measure [15]

$\cos(u, v) = \dfrac{\sum_j u_j v_j}{\sqrt{\sum_j u_j^2}\,\sqrt{\sum_j v_j^2}}$,

and a new instance $x_k$ is classified as

$\arg\min_i \left(1 - \cos(W^T x_k,\, W^T \mu_i)\right)$,

where $\mu_i$ is the mean of the $i$-th class.
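To make the two-class rule concrete, here is a minimal sketch in Python/NumPy under the same illustrative assumptions as the earlier snippet (the paper's own implementation is in MATLAB). The small ridge term added to $S_W$ anticipates the regularization discussed in the next subsection; its default value is illustrative only.

```python
import numpy as np

def fit_two_class_fld(X, y, reg=0.01):
    """Two-class Fisher linear discriminant with a regularized intra-class scatter.

    X : (n, p) data matrix, y : (n,) labels in {0, 1}.
    Returns the discriminant direction w and the threshold c.
    """
    mu = [X[y == k].mean(axis=0) for k in (0, 1)]
    # Total intra-class scatter S_W = S_0 + S_1, as in equation (1)
    Sw = sum(np.cov(X[y == k], rowvar=False, bias=True) * (y == k).sum() for k in (0, 1))
    Sw += reg * np.eye(X.shape[1])           # small ridge term keeps S_W invertible
    w = np.linalg.solve(Sw, mu[1] - mu[0])   # w proportional to S_W^{-1}(mu_1 - mu_0)
    c = w @ (mu[0] + mu[1]) / 2              # midpoint threshold between projected means
    return w, c

def predict_two_class_fld(X, w, c):
    """Assign class 1 when the projection exceeds the threshold, class 0 otherwise."""
    return (X @ w > c).astype(int)
```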
Handling the Singularity of the Intra-Class Scatter Matrix. In situations where the number of available points is much smaller than the number of dimensions, the intra-class scatter matrix becomes singular [15]. There are several applications where the number of data points is much less than the number of dimensions; examples include e-commerce data, gene expression data and text categorization data. In these applications the intra-class scatter matrix is usually singular. There are various methods to deal with the singularity; regularization is the method we used in this work [15]. Regularization is achieved by biasing the diagonal elements of the intra-class scatter matrix by a small parameter $\lambda$:

$S_W \leftarrow S_W + \lambda I$,

where $\lambda$ is a relatively small constant which makes the scatter matrix positive definite [15]. In this work $\lambda$ was chosen in two ways:
1) For binary classification, $\lambda$ was a small constant chosen empirically (e.g. 0.01), based on observations from the experiments when varying the regularization parameter.
2) For multi-class classification, $\lambda$ was chosen as the average of the diagonal elements of $S_W$ multiplied by a small constant, e.g. 0.01, as suggested in [15].

Covariance Structures for LDA. As already mentioned, LDA makes the assumption that the classes have the same shape, and the shape of a class may be characterised by a covariance matrix. There are three types of covariance in use: full, diagonal and spherical. The full covariance is of course the least restrictive; however, in order to estimate it accurately we would need more points than are available, hence a more restrictive form is sometimes preferable.

C. Related Work

We have already mentioned the experimental comparison of classifiers working in RP spaces by Fradkin and Madigan [7], who compare PCA and RP using the C4.5, 1NN, 5NN and SVM classifiers. We will use the data sets from this work in our study. The outcome of their experiments was that PCA tends to perform better than RP when the target dimension is chosen to be small. But it was also observed that PCA is computationally expensive as compared to RP.

Deegalla and Bostrom [12] compared PCA and RP with respect to the performance of the nearest neighbor classifier on five image data sets and five microarray data sets, to investigate the effectiveness of either technique with that classifier. Their experimental results indicate that PCA performed better than RP on all the data sets used in the study. Their experiments also show that PCA is sensitive to the number of reduced dimensions chosen: as the number of dimensions is increased, the accuracy of PCA increases, but after a certain point the performance begins to deteriorate. In contrast, they observed that in the RP space the accuracy increased with an increase in dimensions.

Bingham and Mannila [5] compared RP with several other dimensionality reduction techniques such as PCA, Singular Value Decomposition (SVD), Latent Semantic Indexing (LSI) and the Discrete Cosine Transform (DCT). The data sets used were image and text data. They also experimented to determine the effects of noisy and noiseless data, and found that RP is not sensitive to noisy data. The criterion was the amount of distortion caused by each method, and they found that the distortion caused by RP and PCA was nearly the same on the data sets tested.

Fern and Brodley [3] investigated how RP can be used in conjunction with clustering of high dimensional data. In their approach, a single run of clustering is performed by projecting the data into a random subspace and clustering it there using EM.
Multiple such runs are performed and the results are aggregated; clustering is then performed on the aggregated results to produce the final clusters. This is a cluster ensemble framework. The results show that the cluster ensemble approach outperformed both individual runs of random projection/clustering and EM clustering with PCA reduction, for all three data sets used in the study.

Several other works have used the RP technique. Kaski [11] employed RP in the context of WEBSOM. Dasgupta [6] has shown that data from a mixture of Gaussians can be projected into just O(log #clusters) dimensions while still maintaining the approximate level of separation between the clusters. Dasgupta also shows that when the data clusters are eccentric in nature, RP works consistently better than PCA, which is less reliable in that setting. Likewise, [14] have shown that O(log #classes) dimensions are sufficient for FLD to achieve good classification.

III. EXPERIMENTAL SETUP

A. Data Sets

We use eight data sets. Five of these represent binary classification problems and the three others are multi-class problems. The data sets we use for binary classification are taken from [7]. For the multi-class problems we employ the data sets in Dettling [17], who used different classifiers on the same data sets in the data space. A brief description of these data sets is given below. We have chosen them so that they cover a range of different settings: the first two data sets have fewer points than dimensions, the next two have more points than dimensions, and one other has a large number of points and a large number of dimensions. The first of these categories turns out to be the hardest, and the one where RP and RF can help substantially. The multi-class data sets were therefore chosen from this category.

Leukemia. The leukemia data set contains expression levels of 7129 genes taken over 72 samples. Labels indicate which of two variants of leukemia is present in the sample (AML, 25 samples, or ALL, 47 samples).

Colon. The Colon data set contains expression levels of 2000 genes taken in 62 different samples. For each sample it is indicated whether it came from a tumor biopsy or not. The data set contains 40 cancerous and 22 non-cancerous colon tissue samples [19].

Spambase. Spam email is the well-known phenomenon of unsolicited emails carrying advertisements for products or web sites, make-money-fast schemes, chain letters, pornography, etc. The data set is a collection of such spam and non-spam emails. It contains 4601 instances with 57 attributes, the last of which is the class attribute. Of the 4601 samples, 1813 are spam emails and 2788 are non-spam emails [19].

Ionosphere. The Ionosphere data set is a binary classification problem in which radar returns from the ionosphere are labelled: those showing evidence of some structure are classified as good and those that do not are classified as bad. The data set contains 351 instances with 34 attributes. Of the 351 instances, 126 are bad and 225 are good [19].

Internet Ads. This data set represents a set of possible advertisements on Internet pages. The attributes are the geometry of the image as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text.
There are two class labels: advertisement ("ad") and not advertisement ("nonad"). The data set contains 3279 instances and 1558 attributes; 2821 instances are non-ads and 458 are ads [19].

Lymphoma. The lymphoma data set consists of 62 samples of 4206 gene expressions. Of the 62 samples, 42 are of diffuse large B-cell lymphoma (DLBCL), 9 are of follicular lymphoma (FL), and 11 are of chronic lymphocytic leukemia (CLL) [22].

Brain Tumor. The Brain Tumor data set contains 42 tissue samples divided into 5 classes: primitive neuroectodermal tumors (PNETs), 8 samples; atypical teratoid/rhabdoid tumors (Rhad), 10 samples; malignant gliomas (Mglio), 10 samples; medulloblastomas (MD), 10 samples; and normal tissues (Ncer), 4 samples. There are a total of 5597 genes in the data set [20].

Srbct. The Small Round Blue Cell Tumors data set contains 63 samples and 2308 genes. The samples are distributed in four classes as follows: 8 Burkitt lymphoma (BL), 23 Ewing sarcoma (EWS), 12 neuroblastoma (NB), and 20 rhabdomyosarcoma (RMS) [21].

B. Data pre-processing

For the binary classification data sets no pre-processing was done (as the ranges of the features appeared to be fairly similar), with the exception of the Internet Ads data set, which had 4 attributes with missing values; these attributes were removed from the data set. For the multi-class data, standardisation was done by subtracting the mean and dividing by the standard deviation of each feature.

C. Experimental protocol

We split the data randomly into a training set and a test set, in the same way as was done in [7]. For each experiment we repeat this 100 times in order to get reliable results. The algorithms we used are detailed below: Algorithms 1 and 2 give the full procedure for RP-based and RF-based classification respectively, for the sake of completeness and reproducibility of our results. An illustrative code sketch follows the algorithm boxes below.

Algorithm 1 (Evaluating RP-based classification)
Requires: Dataset D, original dimensionality d, projection dimension k
1. For i = 1:100 (splits)
2. Split D into a training set and a test set at random
3. Train and test in the data space and record the accuracy
4. Generate a (k, d) random matrix R using the MATLAB randn(k, d) function
5. Project the training and test data into the random projection space as R*D'
6. Train and test in the reduced space and record the accuracy
7. End for

Algorithm 2 (Evaluating RF-based classification)
Requires: Dataset D, original dimensionality d, number of random features k
1. For i = 1:100 (splits)
2. Split D into a training set and a test set at random
3. Train and test in the data space and record the accuracy
4. Select a random set of k features from the feature space
5. Train and test in the reduced space and record the accuracy
6. End for
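To illustrate the protocol of Algorithms 1 and 2, the following is a minimal sketch in Python/NumPy under the same illustrative assumptions as the earlier snippets. The actual experiments were run in MATLAB with the FLD classifier described above; the 30% test fraction and the helper functions `fit_two_class_fld` and `predict_two_class_fld` from the earlier sketch are assumptions for illustration only.

```python
import numpy as np

def evaluate(X, y, k, n_splits=100, test_fraction=0.3, seed=0):
    """Repeat random train/test splits; compare data-space, RP and RF accuracies."""
    rng = np.random.default_rng(seed)
    acc = {"data": [], "rp": [], "rf": []}
    n, p = X.shape
    for _ in range(n_splits):
        perm = rng.permutation(n)
        n_test = int(test_fraction * n)
        test, train = perm[:n_test], perm[n_test:]

        # Accuracy in the full data space (step 3 of Algorithms 1 and 2)
        w, c = fit_two_class_fld(X[train], y[train])
        acc["data"].append(np.mean(predict_two_class_fld(X[test], w, c) == y[test]))

        # Accuracy after random projection to k dimensions (Algorithm 1, steps 4-6)
        R = rng.standard_normal((k, p))
        Xp = X @ R.T
        w, c = fit_two_class_fld(Xp[train], y[train])
        acc["rp"].append(np.mean(predict_two_class_fld(Xp[test], w, c) == y[test]))

        # Accuracy after random selection of k features (Algorithm 2, steps 4-5)
        idx = rng.choice(p, size=k, replace=False)
        w, c = fit_two_class_fld(X[train][:, idx], y[train])
        acc["rf"].append(np.mean(predict_two_class_fld(X[test][:, idx], w, c) == y[test]))

    # Mean accuracy and standard error over the splits
    return {m: (np.mean(v), np.std(v) / np.sqrt(n_splits)) for m, v in acc.items()}
```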
The next section summarises and discusses the results obtained, followed by the conclusions of our study.

IV. RESULTS

We are interested in answering the following questions:
• How is the classification performance affected by the choice of the target dimension (k)?
• To what extent does the regularization parameter affect the performance, as the target dimension (k) varies?
• How does the form of covariance (i.e. full, diagonal, spherical) affect the performance in the reduced space?
• How do the two dimensionality reduction techniques, i.e. RP and RF, compare in terms of the achieved classification accuracy?

To answer these questions we carried out a set of experiments, and the results are summarised in Figures 1-3.

Figure 1 shows results on the high dimensional, small sample size data sets 'leukaemia' and 'colon'. The leftmost column shows the sensitivity of full-covariance FLD to the regularization parameter for RP-space classifiers at various target dimensions (k). We see that working in lower dimensional projections makes FLD less sensitive to the choice of this parameter. This is not surprising if we think about the effect of RP geometrically: the smaller the target dimension, the more spherical the covariance becomes in the projected space. Therefore the projected covariance has a good chance of being non-singular, and hence insensitive to the regularisation parameter. In turn, the full-covariance classifier in the high dimensional data space is extremely sensitive, and a bad choice of this parameter value can lead to poor performance, close to random guessing.

The rightmost three columns show the classification performance (average and one standard error over 100 independent splits of the data into training and testing sets) comparatively for data-space FLD, RP-FLD and RF-FLD for the three covariance types (full, diagonal, spherical). We see that RP tends to outperform RF in most cases, and RP-FLD can achieve performance comparable to a full data-space FLD, despite working in a much reduced dimensional space. The type of covariance has relatively little effect on these results. This is most likely because the number of data points is so small that an accurate estimation of covariances between the original features is hardly possible, and it is either the explicit regularisation that keeps the accuracy from falling, or the implicit regularisation produced by the random projections. The random feature approach also seems to have a regularisation effect, albeit with fewer guarantees.

We should comment on the drop and subsequent rise in performance at around k=50 dimensions with the full covariance approach (second column of Fig. 1). This is not an artefact but a systematic effect: it happens exactly when k equals the rank of the maximum likelihood covariance estimate. Although a full explanation of why this happens is beyond the scope of this paper, the theoretical studies of [8] and [13] imply that there is a systematic underestimation of the smallest eigenvalue of this sample covariance, which is worst when the dimensionality approaches the sample size. Once this point is passed and we increase k further, the underestimation of the smallest eigenvalue remains unchanged, but the distance between class centers gets larger, and this explains why the performance rises back at larger values of k.

Figure 2 presents results for data sets that contain more points than dimensions. Since we now have a sufficient number of points, we can use a full covariance model more safely, and we see very little sensitivity to the choice of the regularisation parameter. More interestingly, on these data sets the random selection of features turns out to be no worse than random projection.

Finally, Figure 3 gives classification results for the three multi-class data sets. Again, with a number of dimensions that is sufficient yet still significantly smaller than the original dimensionality, the classification performance approaches that of the full data-space classifier. Interestingly, and perhaps contrary to expectation, the RF approach performs just as well as RP.
The reasons for this remain for further investigation, although we suspect it could be due to our use of relatively large values of the regularisation parameter.

V. CONCLUSIONS

We empirically investigated random projections and random feature selection for their ability to serve as a means of dimensionality reduction for classification. These techniques are computationally efficient and hold the promise of enabling learning from high dimensional data sets. We found that RP tends to perform better than RF in terms of classification accuracy in settings where the dimensionality far exceeds the number of data points. Quite interestingly, RF was also quite competitive, and perhaps surprisingly good, in some cases.

ACKNOWLEDGMENT

This work was performed as part of the first author's MSc summer project at the School of Computer Science at Birmingham.

REFERENCES

[1] T. Birch, Supervised classification of high-dimensional gene expression data via random projection and boosting. MSc Project Report, University of Birmingham, 2009, unpublished.
[2] W. Wang, J. Yang, Mining high dimensional data. In Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, Kluwer Academic Publishers, 2005.
[3] X.Z. Fern, C. Brodley, Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
[4] K. Chen, Lecture notes, unpublished. Retrieved June 25, 2012, from http://www.cs.manchester.ac.uk/pgt/COMP61021/lectures/LDA.pdf
[5] E. Bingham, H. Mannila, Random projection in dimensionality reduction: applications to image and text data. In KDD'01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 245-250, 2001.
[6] S. Dasgupta, Experiments with random projection. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pp. 143-151, 2000.
[7] D. Fradkin, D. Madigan, Experiments with random projections for machine learning. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 517-522, 2003.
[8] D. Hoyle, Accuracy of pseudo-inverse covariance learning: A random matrix theory analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1470-1481, 2011.
[9] T. Hofmann, Introduction to machine learning: lecture notes, 2003, unpublished. Retrieved June 21, 2012, from http://users.cecs.anu.edu.au/~daa/courses/GSAC6017/CS195-6a.pdf
[10] A.K. Menon, Random projections and applications to dimensionality reduction. Honours Thesis, University of Sydney, 2007.
[11] S. Kaski, Dimensionality reduction by random mapping: Fast similarity computation for clustering. In Proceedings of IJCNN'98, International Joint Conference on Neural Networks, vol. 1, pp. 413-418, 1998.
[12] S. Deegalla, H. Bostrom, Reducing high-dimensional data by principal component analysis versus random projection for nearest neighbor classification. In ICMLA'06: Proceedings of the 5th International Conference on Machine Learning and Applications, pp. 245-250, 2006.
[13] S.R. Raudys, Expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix. Pattern Recognition Letters, pp. 385-392, 1998.
[14] R.J. Durrant, A. Kaban, Compressed Fisher linear discriminant analysis: classification of randomly projected data. In Proceedings of KDD'10, pp. 1119-1128, 2010.
[15] T. Li, S. Zhu, M. Ogihara, Using discriminant analysis for multi-class classification. In Proceedings of the Third IEEE International Conference on Data Mining (IEEE ICDM 2003), pp. 589-592.
[16] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2011.
[17] M. Dettling, BagBoosting for tumor classification with gene expression data. Bioinformatics, pp. 3583-3593, 2005.
[18] T. Mitchell, Machine Learning. WCB McGraw-Hill, 1997.
[19] K. Bache, M. Lichman, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, 2013.
[20] J. Wang, T.H. Bo, I. Jonassen, O. Myklebost, E. Hovig, Tumor classification and marker gene prediction by feature selection and fuzzy c-means clustering using microarray data. BMC Bioinformatics 4:60, 2003.
[21] Khan et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7(6).
[22] D. Chung, S. Keles, Sparse partial least squares classification for high dimensional data. Statistical Applications in Genetics and Molecular Biology, vol. 9, article 17, 2010. http://artax.karlin.mff.cuni.cz/r-help/library/spls/html/lymphoma.html
[23] C.P. Kamath, L.O. Hall, T. Yeatman, S. Eschrich, Multivariate feature selection using random subspace classifiers for gene expression data. In Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering (BIBE 2007), October 14-17, 2007.
[24] G. Armano, C. Chira, B. Hatami, A new gene selection method based on random subspace ensemble for microarray cancer classification. Pattern Recognition in Bioinformatics, LNCS vol. 7036, pp. 191-201, 2011.

Fig. 1. Results on high dimensional, small sample size data. With a lower target dimension (k) the classifier becomes less sensitive to the regularization parameter. RP-FLD tends to outperform RF-FLD.
Fig. 2. Results on data with more points than dimensions. In the first column of plots, the legend is the same as in Figure 1. Interestingly, RP-FLD and RF-FLD perform similarly.
Fig. 3. Results on multi-class data sets.