Proceedings of the Second European Workshop on Data Mining and Text Mining in Bioinformatics
An Evaluation of Gene Selection Methods for Multi-class
Microarray Data Classification
Hong Chai and Carlotta Domeniconi
Information and Software Engineering Department
George Mason University
Fairfax, VA 22030
Abstract
The fundamental power of microarrays lies in the ability to conduct parallel surveys of gene
expression patterns for tens of thousands of genes across a wide range of cellular responses,
phenotypes and conditions. Thus microarray data contain an overwhelming number of genes
relative to the number of samples, presenting challenges for meaningful pattern discovery.
This paper provides a comparative study of gene selection methods for multi-class classification of microarray data. We compare several feature ranking techniques, including new variants of correlation coefficients, and the Support Vector Machine (SVM) method based on Recursive Feature Elimination (RFE). The results show that feature selection methods improve SVM classification accuracy in different kernel settings. The performance of feature selection techniques is problem-dependent. SVM-RFE shows excellent performance in general, but often gives lower accuracy than correlation coefficients in low dimensions.
1 Introduction
Gene expression profiling studies via DNA microarrays offer unprecedented opportunities for advancing
fundamental biological research and clinical practice. Microarray technology allows researchers to simultaneously measure the expression level of tens of thousands of genes, creating a comprehensive overview
of exactly which genes are being expressed in a specific tissue under various conditions. These studies
produce a massive amount of data which poses challenging problems for the discovery of informative
patterns.
A microarray can contain up to 20,000 features, each of which recognizes mRNA from a single
gene, and a relatively small number of experiments or samples (on the order of hundreds or fewer). As a consequence, the identification of discriminant genes for classifying tissue types, e.g., presence of cancer, is of fundamental and practical interest. Such genetic markers, in fact, may prove valuable in further investigation of the disease and in future therapies. From a machine learning standpoint, reducing the dimensionality of the data lowers the risk of overfitting. The classification of a few dozen points lying in a space with thousands of dimensions is a hopeless task due to the curse of dimensionality.
One can easily find a (linear) decision boundary that perfectly separates the training data. However,
such a classifier will perform poorly on previously unseen data. In other words, it won’t generalize well.
Regularization techniques can prevent the overfitting of the data, without performing dimensionality
reduction [20]. Support Vector Machines (SVMs) are one such example [6]. Nevertheless, our experimental results show that SVMs, too, can benefit from the reduction of dimensionality due to feature selection. Another solution for high-dimensional settings consists in projecting the data onto the principal components, obtained as linear combinations of the input features [9]. A disadvantage is that none of the original features can be discarded. Moreover, the new dimensions can be difficult to interpret, making it
hard to understand the groups of data in relation to the original space. SVMs themselves have been used
successfully for gene selection (SVM-RFE) [12]. The weights that multiply the inputs in the solution
boundary are used as feature ranking coefficients. In general, feature ranking techniques via correlation
coefficients can be particularly useful for gene selection. One can consider only top ranked genes (above
a certain score threshold, or a fixed number) for further analysis, or to train a classifier.
While binary (two-class) classification has been extensively studied over the past few years [1, 3,
4, 8, 12], the multi-class classification case has received little attention [13, 16, 5]. In this paper, we
focus on multi-class classification, and compare several gene ranking methods, including new variants
of correlation coefficients, using different microarray datasets. In analogy with the Structural Risk
Minimization principle [20], for each ranking method, we construct nested subsets of features F1 ⊂ F2 ⊂ . . . ⊂ F. Given a classification model (SVMs with different kernel functions in our experiments), one can
select the subset of features that gives the best cross-validated accuracy. Our main findings are:
1. SVMs classification benefits from gene selection;
2. Gene ranking with correlation coefficients gives higher accuracy than SVM-RFE in low dimensions in most
data sets. The best performing correlation score varies from problem to problem;
3. Although SVM-RFE shows excellent performance in general, there is no clear winner. The performance of feature selection methods seems to be problem-dependent;
4. For a given classification model, different gene selection methods reach the best performance for different
feature set sizes;
5. Very high accuracy was achieved on all the data sets studied here. In many cases perfect accuracy (based
on leave-one-out error) was achieved;
6. The NCI60 dataset [17] shows lower accuracy values. This dataset has the largest number of classes (eight),
and smaller sample sizes per class. SVM-RFE handles this case well, achieving 96.72% accuracy with 100
selected genes and a linear kernel. The gap in accuracy between SVM-RFE and the other gene ranking
methods is highest for this dataset (ca. 11.5%).
2 Problem Description and Related Work
In a classification problem, we are given C classes and m training observations. The training observations consist of n feature measurements x = (x1, . . . , xn)^t ∈ R^n, and the known class labels y = 1, . . . , C. The
goal is to predict the class label of a given query q. For the problem we consider here, the features are
gene expression coefficients, and the observations correspond to samples or patients. Thus, n >> m.
Traditional gene selection methods keep the genes that individually best discriminate between training data of different classes. Such methods make use of correlation coefficients and expression ratios,
and usually are defined for two-class classification problems. Let us consider class labels y ∈ {−1, +1}.
The correlation coefficient used in [11] as ranking criterion is defined as
\[ w_j = \frac{\mu_j^+ - \mu_j^-}{\sigma_j^+ + \sigma_j^-}, \qquad j = 1, \ldots, n, \]
where µj+ (µj−) is the mean value of gene j for the + (−) class. Similarly, σj+ and σj− are the respective standard deviations. Large positive wj values indicate a strong correlation with the positive class. Large negative wj values indicate a strong correlation with the negative class. The objective in [11] is to select an equal number of genes j with large positive and large negative correlation coefficient. In [10], the authors consider |wj| as ranking criterion. In [15], the Fisher criterion score [7] is used:
\[ w_j = \frac{(\mu_j^+ - \mu_j^-)^2}{(\sigma_j^+)^2 + (\sigma_j^-)^2}, \qquad j = 1, \ldots, n, \]
which gives higher scores to genes whose means differ greatly between the two classes, relative to their variances.
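As an illustration, the following Python sketch (using only NumPy; the array names X and y are ours) computes both two-class criteria for every gene at once.

```python
import numpy as np

def two_class_scores(X, y):
    """Golub-style correlation and Fisher scores for a two-class problem.

    X : (m, n) array of expression values (samples x genes)
    y : (m,) array of labels in {-1, +1}
    Returns (w_corr, w_fisher), each of shape (n,).
    """
    pos, neg = X[y == +1], X[y == -1]
    mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
    sd_p, sd_n = pos.std(axis=0), neg.std(axis=0)
    w_corr = (mu_p - mu_n) / (sd_p + sd_n)                    # criterion of [11]
    w_fisher = (mu_p - mu_n) ** 2 / (sd_p ** 2 + sd_n ** 2)   # Fisher criterion of [15]
    return w_corr, w_fisher

# Ranking by |w_corr|, as in [10]:
# w_corr, _ = two_class_scores(X, y)
# top_genes = np.argsort(-np.abs(w_corr))[:50]   # indices of the 50 top-ranked genes
```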
Correlation scores assume that features are independent. Each feature is analyzed in isolation, without taking into consideration the mutual information across features. As a consequence, redundant genes may be selected, while genes that are individually uninformative, but complementary to each other, may not be selected.
Another technique for feature ranking uses the concept of Shannon entropy. Given entropy E as a
measure of impurity in a set S of training examples, it is possible to define a measure of the effectiveness
of a feature (or attribute) A in classifying the training data. The measure, called Information Gain,
is simply the expected reduction in entropy caused by partitioning the data according to this feature
[14]. More precisely, the information gain I(S, A) of a feature A, relative to a set of data S, is defined as
\[ I(S, A) = E(S) - \sum_{v \in V(A)} \frac{|S_v|}{|S|} E(S_v), \]
where V(A) is the set of all possible values of feature A, and Sv is the subset of S for which feature A has value v. The first term is the entropy of the entire set S:
\[ E(S) = \sum_{i=1}^{C} -\frac{|C_i|}{|S|} \log_2 \frac{|C_i|}{|S|}, \]
where |Ci| is the number of training data in class Ci, and |S| is the cardinality of the entire set S. The definition of information gain can be extended to handle continuous-valued features.
This is achieved by searching for candidate thresholds: the points are sorted according to the continuous feature, and adjacent points that differ in their class label are identified [14].
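A minimal Python sketch of information gain for a continuous-valued feature is given below; it follows the threshold search described above and keeps the best binary split (the function names and the single-split simplification are ours).

```python
import numpy as np

def entropy(labels):
    """Shannon entropy E(S) of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain_continuous(x, y):
    """Information gain of one continuous feature x with respect to labels y.

    Candidate thresholds are midpoints between adjacent sorted values whose
    class labels differ [14]; the best resulting binary split is returned.
    """
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    base, best = entropy(ys), 0.0
    for i in range(len(xs) - 1):
        if ys[i] != ys[i + 1] and xs[i] != xs[i + 1]:
            t = (xs[i] + xs[i + 1]) / 2.0
            left, right = ys[xs <= t], ys[xs > t]
            split = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
            best = max(best, base - split)
    return best
```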
The Chi-squared statistic is another method that evaluates features individually with respect to the classes. The range of continuous-valued features needs to be discretized into intervals. A matrix A is then formed, where Aij is the number of samples of class Ci within the jth interval. Let CIj be the number of samples in the jth interval. The expected frequency of Aij is Eij = CIj |Ci|/m. The Chi-squared statistic of a feature is then defined as
\[ \chi^2 = \sum_{i=1}^{C} \sum_{j=1}^{I} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}, \]
where I is the number of intervals. The larger the χ2 value, the more informative the corresponding feature is.
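The statistic can be sketched in Python as follows; equal-width discretization is assumed here for simplicity, and other discretization schemes (such as the one used by Weka) are equally possible.

```python
import numpy as np

def chi2_score(x, y, n_intervals=10):
    """Chi-squared statistic of one feature x against class labels y.

    x is discretized into equal-width intervals. A[i, j] counts the samples
    of class i falling in interval j, and E[i, j] = CI_j * |C_i| / m is the
    expected frequency under independence.
    """
    m = len(y)
    classes = np.unique(y)
    edges = np.linspace(x.min(), x.max(), n_intervals + 1)[1:-1]  # interior edges
    idx = np.digitize(x, edges)                                   # interval index 0..I-1
    A = np.zeros((len(classes), n_intervals))
    for i, c in enumerate(classes):
        for j in range(n_intervals):
            A[i, j] = np.sum((y == c) & (idx == j))
    E = np.outer(A.sum(axis=1), A.sum(axis=0)) / m                # expected frequencies
    mask = E > 0                                                  # skip empty intervals
    return float(((A[mask] - E[mask]) ** 2 / E[mask]).sum())
```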
In another effort, the authors in [5] evaluated the discriminatory power of a gene with test statistics
such as the ANOVA F , the Brown-Forsythe, the Cochran, and the Welch test statistics. These are
extensions of the t-statistic used in the two-class classification problem.
Linear SVMs have been used successfully for gene selection [12]. The squared weights, wj^2, that multiply the inputs in the solution boundary f(x) = w · x + b are used as feature ranking coefficients. The underlying idea is that the features with the largest weights influence the classification decision the most. Thus, if the classifier performs well, those features with the largest weights are the most informative. As a result, an iterative procedure (Recursive Feature Elimination, or RFE) trains the SVM classifier, computes the ranking wj^2 for all features, and removes the feature with the smallest ranking criterion. The procedure is
iterated until a certain number of selected features is obtained. To speed up the process, several features
may be removed at each iteration. In contrast to feature ranking using correlation coefficients, the RFE
method is a multivariate approach which evaluates the relevance of multiple features simultaneously.
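The following sketch illustrates the RFE loop with a linear SVM in scikit-learn. It is not the exact implementation of [12], nor the Weka procedure used in our experiments, and the aggregation of the one-vs-one weight vectors of the multi-class SVM (summing squared weights across class pairs) is our own choice.

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_select, step=1):
    """Recursive Feature Elimination with a linear SVM.

    Repeatedly trains the SVM, ranks the surviving features by the squared
    weights of the decision boundary, and removes the `step` lowest-ranked
    features until `n_select` remain. Returns the selected column indices.
    """
    surviving = np.arange(X.shape[1])
    while len(surviving) > n_select:
        clf = SVC(kernel="linear", C=1.0).fit(X[:, surviving], y)
        # coef_ holds one weight vector per one-vs-one class pair;
        # aggregate by summing squared weights across the pairs.
        ranking = (clf.coef_ ** 2).sum(axis=0)
        drop = min(step, len(surviving) - n_select)
        surviving = np.delete(surviving, np.argsort(ranking)[:drop])
    return surviving
```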
3 Feature Correlation Scores for Multiclass Problems
In this section we introduce several correlation scores for feature ranking that handle multi-class classification problems. For each class i and each feature j, we define
\[ \mu_{j,i} = \frac{1}{|C_i|} \sum_{x \in C_i} x_j . \qquad (1) \]
µj,i represents the mean value of feature j for class Ci. We also define the total mean along feature j:
\[ \mu_j = \frac{1}{m} \sum_{x} x_j . \qquad (2) \]
Using equations (1) and (2), we provide a measure of the between-class scatter along feature j:
\[ B_j = \sum_{i=1}^{C} |C_i| (\mu_{j,i} - \mu_j)^2 . \qquad (3) \]
This leads to the following score function:
\[ \mathrm{BScatter}_j = \frac{B_j}{\sum_{i=1}^{C} \sigma_{j,i}}, \qquad j = 1, \ldots, n, \qquad (4) \]
where σj,i is the standard deviation of class i along feature j. This score is related to Fisher discriminant analysis for multiple classes [7] under the feature independence assumption. It credits the largest score to
the feature that maximizes the ratio of the between-class scatter to the within-class scatter.
Let us consider µj,max = maxl µj,l and µj,min = minl µj,l. The second score function we define is
\[ \mathrm{MinMax}_j = \frac{\mu_{j,\max} - \mu_{j,\min}}{\sum_{i=1}^{C} \sigma_{j,i}}, \qquad j = 1, \ldots, n. \qquad (5) \]
This score function favors features along which the farthest mean-class difference is large, and the within
class variance is small.
For each feature j, we sort the C values µj,i in non-decreasing order: µj,1 ≤ µj,2 ≤ . . . ≤ µj,C. Let us now define bj,l = |µj,l − µj,l+1|, 1 ≤ l ≤ C − 1. bj,l measures the distance between adjacent mean class values along feature j. The third score function rewards the features with large distances between adjacent mean class values:
\[ \mathrm{bSum}_j = \frac{\sum_{l=1}^{C-1} b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}, \qquad j = 1, \ldots, n. \qquad (6) \]
In addition, we consider two variants of the previous function. The following one rewards features j with a large between-neighbor-class mean difference:
\[ \mathrm{bMax}_j = \frac{\max_l b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}, \qquad j = 1, \ldots, n. \qquad (7) \]
Alternatively, we can favor the features whose smallest between-neighbor-class mean difference is large:
\[ \mathrm{bMin}_j = \frac{\min_l b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}, \qquad j = 1, \ldots, n. \qquad (8) \]
Finally, we consider a score function which combines MinMax and bMin:
\[ \mathrm{Comb}_j = \frac{\min_l (b_{j,l}) \, (\mu_{j,\max} - \mu_{j,\min})}{\sum_{i=1}^{C} \sigma_{j,i}}, \qquad j = 1, \ldots, n. \qquad (9) \]
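The paper's original Python implementation of these scores is not reproduced here; the sketch below reconstructs the six score functions directly from equations (1)-(9).

```python
import numpy as np

def multiclass_scores(X, y):
    """BScatter, MinMax, bSum, bMax, bMin and Comb (equations (1)-(9))
    for every feature. X is (m, n); y holds the C class labels."""
    classes = np.unique(y)
    mu = np.array([X[y == c].mean(axis=0) for c in classes])    # (C, n), eq. (1)
    sigma = np.array([X[y == c].std(axis=0) for c in classes])  # (C, n)
    sizes = np.array([np.sum(y == c) for c in classes])
    mu_all = X.mean(axis=0)                                     # eq. (2)
    denom = sigma.sum(axis=0)                                   # sum_i sigma_{j,i}
    B = (sizes[:, None] * (mu - mu_all) ** 2).sum(axis=0)       # eq. (3)
    gaps = np.abs(np.diff(np.sort(mu, axis=0), axis=0))         # b_{j,l}, (C-1, n)
    spread = mu.max(axis=0) - mu.min(axis=0)                    # mu_{j,max} - mu_{j,min}
    return {
        "BScatter": B / denom,                                  # eq. (4)
        "MinMax": spread / denom,                               # eq. (5)
        "bSum": gaps.sum(axis=0) / denom,                       # eq. (6)
        "bMax": gaps.max(axis=0) / denom,                       # eq. (7)
        "bMin": gaps.min(axis=0) / denom,                       # eq. (8)
        "Comb": gaps.min(axis=0) * spread / denom,              # eq. (9)
    }
```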
4 Experimental Analysis
The Datasets. We used the following four datasets. The MLL dataset consists of gene expression profiles of three classes of leukemia and is available at http://research.nhgri.nih.gov/microarray/Supplement.
It was first studied by Armstrong et al. [18], who proposed that a distinct disease type, MLL, can be clearly separated from conventional acute lymphoblastic leukemia (ALL) and acute myelogenous leukemia (AML). The dataset includes 12582 probe sets from the Affymetrix chip, and contains 72 samples. The numbers of samples in the three classes are fairly balanced: 24 in ALL, 20 in MLL, and 28 in AML. There are no missing
values in this dataset.
The Lymphoma dataset covers 9 classes and 96 malignant and normal lymphocyte samples. There are 4026 genes. It was published in [1] and is available at http://llmpp.nih.gov/lymphoma. Classes that contain fewer than 5 samples were removed in our experiments, and hence 6 classes remained. The numbers of samples in each class are: 46 in DLBCL, 11 in CLL, 9 in FL (malignant classes), 11 in ABB, 6 in RAT, and 6 in TCL (normal samples).
The expression data from the budding yeast Saccharomyces cerevisiae consist of 80-sample gene expression vectors for 6221 genes. The samples belong to 3 subtypes. The dataset is available at http://rana.lbl.gov/EisenData.htm.
The NCI60 dataset was first studied in [17]. cDNA microarrays were used to examine the variation in gene expression among the 60 cell lines from the National Cancer Institute's anticancer drug screen. The dataset contains 9 classes and can be downloaded from http://genome-www.stanford.edu/nci60.
There are missing values in the Lymphoma, Yeast, and NCI60 datasets. We deleted genes in the Yeast dataset that have more than 20 missing expression values, and genes in the NCI60 dataset that have more than 50 missing values. Then, we used the K-nearest neighbor method to impute the remaining
missing values in each dataset. For a gene with missing values, the K nearest neighbors are identified
from the subset of genes that have complete expression values (K = 7 in our experiments). The average
of the neighbors’ values is used to substitute a missing value [19]. Table 1 summarizes the characteristics
of the four datasets used in our experiments.
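A sketch of this imputation step is shown below; the use of Euclidean distance over a gene's observed samples to find its K nearest complete genes is our assumption.

```python
import numpy as np

def knn_impute_genes(E, k=7):
    """Impute missing values (NaN) in a genes x samples expression matrix E.

    For each gene with missing entries, the k nearest genes with complete
    profiles (Euclidean distance over the gene's observed samples) are found,
    and each missing value is replaced by the neighbors' average [19].
    """
    E = E.copy()
    complete = E[~np.isnan(E).any(axis=1)]                 # genes with no missing values
    for g in np.where(np.isnan(E).any(axis=1))[0]:
        observed = ~np.isnan(E[g])
        d = np.sqrt(((complete[:, observed] - E[g, observed]) ** 2).sum(axis=1))
        neighbors = complete[np.argsort(d)[:k]]
        E[g, ~observed] = neighbors[:, ~observed].mean(axis=0)
    return E
```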
4.1 Experimental Design
We scaled each dataset so that every gene's expression values lie in the range [-1, 1]. Feature
selection methods are performed on each dataset to obtain subsets of top-ranked genes. These include our six variants of correlation scores (BScatter, MinMax, bSum, bMax, bMin, Comb), as well
Table 1: The characteristics of the four datasets.

Dataset     #samples   #genes   #classes
MLL         72         12582    3
Lymphoma    88         4026     6
Yeast       80         5775     3
NCI60       61         1155     8
as the Chi-squared, Information Gain, and SVM-RFE methods. The six correlation scores are implemented in Python, while the latter three methods are performed using the Weka software available at http://www.cs.waikato.ac.nz/ml/weka. In SVM-RFE, we set the percentage of features to drop at each iteration to 10%, until 20% of the total number of features is eliminated. After that, a single feature is dropped at each iteration. Weka handles multi-class problems by ranking attributes for each class separately using
a one-vs-all method, and then dealing from the top of each pile to give a final ranking.
We compared the above nine feature selection methods by considering different feature set sizes.
Each of the resulting feature sets was used to train an SVM classifier with different kernel functions: linear, degree-2 polynomial, and Gaussian. Due to the small number of samples available, leave-one-out cross-validation was performed to assess classification performance, using a fixed set of features previously selected on the whole training set. We cross-validated the soft-margin parameter and the width of the Gaussian kernel, testing values in [1, 100] and [0.001, 2], respectively. The SVM classifications were
performed using LIBSVM, which implements the one-vs-one method when classifying multiple categories.
The LIBSVM software can be downloaded from http://www.csie.ntu.edu.tw/∼cjlin/libsvm.
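The evaluation loop can be sketched as follows, using scikit-learn (whose SVC class wraps LIBSVM and handles multiple classes with the one-vs-one method) instead of the LIBSVM command-line tools; the parameter values shown are placeholders for the cross-validated choices described above.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def scale_genes(X):
    """Linearly scale each gene (column) of X to the range [-1, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2.0 * (X - lo) / np.where(hi > lo, hi - lo, 1.0) - 1.0

def loo_accuracy(X, y, genes, kernel="linear", C=1.0, gamma=0.01):
    """Leave-one-out accuracy of an SVM trained on a fixed subset of genes.

    `genes` holds the column indices returned by a ranking method; X and y
    are NumPy arrays, and C and gamma stand in for the cross-validated values.
    """
    Xs = X[:, genes]
    correct = 0
    for train, test in LeaveOneOut().split(Xs):
        clf = SVC(kernel=kernel, C=C, gamma=gamma)   # one-vs-one for multi-class
        clf.fit(Xs[train], y[train])
        correct += int(clf.predict(Xs[test])[0] == y[test][0])
    return correct / len(y)
```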
4.2 Results
The performance of the nine ranking methods is shown in Tables 2-5. In the tables, the first row represents the number of selected features. Each cell contains the leave-one-out classification accuracy of the SVM, achieved using the corresponding number of top genes ranked by one of the feature selection methods. The experiments show that:
1. SVM classification benefits from feature selection methods. In almost all situations, the highest prediction rates are achieved for each selection method when using a number of top-ranked genes much smaller than the full dimensionality. The accuracy of classification does depend on the use of feature selection methods.
2. The best performing correlation score varies from problem to problem. Among all feature ranking methods, our proposed function MinMax gives the highest accuracy using 2 through 20 or 100 top-selected genes when classifying the Lymphoma dataset. Overall, gene ranking with correlation coefficients gives higher accuracy than SVM-RFE in low dimensions in most datasets. In higher dimensions, the correlation coefficients may select redundant genes. This is likely the reason why SVM-RFE shows a more robust behavior, in general, in higher dimensions.
3. SVM-RFE yields perfect prediction on the MLL, Lymphoma, and Yeast datasets. Correlation coefficient scores and information gain, however, also achieve 100% accuracy on these datasets. Chi-squared is the only method that never achieves perfect prediction. Although SVM-RFE shows excellent performance in general, there is no clear winner. The performance of feature selection methods seems to be problem-dependent.
4. For a given classification model, different gene selection methods reach the best performance at
different feature set sizes (as highlighted by the values in boldface in the tables). Very high accuracy
was achieved on all the datasets studied here. Perfect accuracy (based on leave-one-out error) was
achieved in many cases.
5. The NCI60 dataset shows lower accuracy values. This dataset has the largest number of classes
(eight), and smaller sample sizes per class. SVM-RFE handles this case well, achieving 96.72%
accuracy with 100 selected genes and a linear kernel. The gap in accuracy between SVM-RFE and
the other gene ranking methods is highest for this dataset (ca. 11.5%).
6. For the NCI60 dataset, each of our proposed methods gives above 80% prediction accuracy, except in two cases (73.77% and 75.41%). This compares favorably with the results in [13], where the best reported performance was 67%, obtained from SVM classification using the gene selection program Rankgene.
Table 2: Classification accuracy (%) for MLL data: SVM with linear, polynomial, and Gaussian kernels.

SVM Linear       2      5      10     20     50     100    200    500    1000   5000
BScatter         86.11  91.67  95.83  93.05  95.83  100    100    97.22  97.22  97.22
MinMax           84.42  91.67  93.05  95.83  95.83  97.22  98.61  97.22  97.22  97.22
bSum             86.11  91.67  93.05  94.44  95.83  97.22  98.61  97.22  97.22  97.22
bMax             77.78  87.5   95.83  98.61  95.83  100    100    100    98.61  97.22
bMin             83.33  90.28  88.89  94.44  91.66  95.83  97.22  97.22  95.83  97.22
Comb             83.33  91.67  93.05  93.05  93.05  93.05  95.83  95.83  95.83  97.22
Chi-squared      91.67  93.05  90.28  97.22  97.22  97.22  97.22  97.22  97.22  97.22
Info Gain        91.67  97.22  91.67  95.83  95.83  97.22  100    97.22  97.22  97.22
SVM-RFE          86.11  94.44  100    100    100    100    100    100    100    100

SVM Polynomial   2      5      10     20     50     100    200    500    1000   5000
BScatter         84.72  93.05  94.44  91.67  95.83  98.61  98.61  98.61  95.83  95.83
MinMax           87.5   91.66  93.05  93.06  97.22  95.83  98.61  97.22  97.22  97.22
bSum             84.72  91.67  93.05  91.67  97.22  95.83  98.61  97.22  95.83  97.22
bMax             79.17  87.5   95.83  98.61  98.61  98.61  100    100    98.61  97.22
bMin             83.33  90.28  91.67  94.44  95.83  95.83  95.83  94.44  94.44  97.22
Comb             84.72  91.67  93.05  93.05  93.05  94.44  94.44  94.44  95.83  97.22
Chi-squared      91.67  91.67  91.67  95.83  95.83  97.22  97.22  95.83  95.83  97.22
Info Gain        91.67  97.22  93.05  94.44  93.05  98.61  98.61  98.61  95.83  95.83
SVM-RFE          87.5   95.83  98.61  100    100    100    100    100    100    100

SVM Gaussian     2      5      10     20     50     100    200    500    1000   5000
BScatter         87.5   93.05  94.44  94.44  97.22  98.61  100    97.22  97.22  90.28
MinMax           87.5   90.28  93.05  95.83  97.22  97.22  98.61  97.22  97.22  93.05
bSum             87.5   90.28  93.05  94.44  97.22  95.83  97.22  97.22  97.22  93.05
bMax             76.38  88.89  95.83  98.61  98.61  100    100    97.22  97.22  90.28
bMin             84.72  90.28  91.67  95.83  93.05  95.83  95.83  97.22  95.83  90.28
Comb             84.72  93.05  93.05  94.44  93.05  93.05  97.22  90.28  79.16  70.83
Chi-squared      91.67  91.67  91.67  95.83  95.83  97.22  97.22  94.44  80.55  70.83
Info Gain        91.67  97.22  93.05  95.83  93.05  97.22  100    93.05  88.89  86.11
SVM-RFE          87.5   95.83  100    100    100    100    100    100    100    100
5 Conclusions and Future Work
This paper provides a comparative study on feature selection for multi-class classification of microarray
data. The results show that feature selection methods improve SVM classification accuracy in different
kernel settings. In many datasets perfect prediction was achieved using one or more subsets of top features. SVM-RFE shows an excellent performance in general, but it gives lower accuracy than correlation
coefficients, information gain, and Chi-squared method in low dimensions.
The selection of a fixed set of features over the whole training set induces a bias in the results.
Valuable suggestions on how to assess and correct the selection bias are discussed in [2], and will be
considered in our future experiments. Our proposed score functions did not take into consideration the
fact that the correlation between any pair of selected features should be low. In fact, there may exist
redundant genes in a given subset. In future work, our ranking method will be modified so that selected
genes have correlation no larger than a certain threshold. In addition, top-ranked genes selected by the
proposed score functions will be compared to marker genes identified in other benchmark studies.
Acknowledgements. This research is in part supported by a 2004 R. E. Powe Junior Faculty Award.
References
[1] Alizadeh, A., Eisen, M., et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression
profiling, Nature, 403, 2000.
[2] Ambroise, C., McLachlan, G. J., Selection bias in gene extraction on the basis of microarray gene-expression
data, National Academy of Sciences, 99(10):6562-6566, 2002.
[3] Ben-Dor, A., Friedman, N., Yakhini, Z., Scoring Genes for Relevance, Technical Report of the Leibniz Center,
2000-38, 2000.
Table 3: Classification accuracy (%) for Lymphoma data: SVM with linear, polynomial, and Gaussian kernels.

SVM Linear       2      5      10     20     50     100    200    500    1000   4026
BScatter         60.23  77.27  92.04  95.45  96.59  98.86  97.72  97.72  97.72  97.72
MinMax           84.09  90.91  97.72  98.86  98.86  100    98.86  98.86  97.72  97.72
bSum             82.95  90.91  96.59  98.86  98.86  100    98.86  98.86  97.72  97.72
bMax             59.09  71.59  84.09  94.32  96.59  97.72  97.72  97.72  97.72  97.72
bMin             71.59  76.12  94.32  97.72  97.72  98.86  98.86  98.86  98.86  97.72
Comb             65.91  81.82  95.45  97.72  98.86  100    100    97.72  97.72  97.72
Chi-squared      73.86  90.91  94.32  97.72  96.59  98.86  97.72  97.72  97.72  97.72
Info Gain        76.13  88.63  92.04  97.72  96.59  96.59  97.72  97.72  97.72  97.72
SVM-RFE          73.50  82.95  93.18  96.59  100    100    100    100    100    97.72

SVM Polynomial   2      5      10     20     50     100    200    500    1000   4026
BScatter         60.23  78.41  92.04  96.59  96.59  98.86  97.72  97.72  97.72  97.72
MinMax           84.09  92.04  97.72  98.86  98.86  100    98.86  98.86  98.86  97.72
bSum             84.09  92.04  95.45  97.72  98.86  100    98.86  98.86  98.86  97.72
bMax             62.5   72.72  82.89  89.77  95.45  97.72  97.72  98.86  97.72  97.72
bMin             71.09  77.27  94.32  95.45  96.59  98.86  98.86  98.86  98.86  97.72
Comb             67.04  81.82  94.32  95.45  98.86  100    100    97.72  98.86  97.72
Chi-squared      77.27  90.91  94.32  96.59  97.72  98.86  97.72  97.72  97.72  97.72
Info Gain        75.0   87.5   90.91  97.72  95.45  95.45  96.59  97.72  97.72  97.72
SVM-RFE          77.27  81.82  90.91  94.32  100    100    100    100    100    97.72

SVM Gaussian     2      5      10     20     50     100    200    500    1000   4026
BScatter         60.23  82.95  90.91  97.72  97.72  97.72  97.72  97.72  98.86  86.36
MinMax           84.09  92.04  96.59  98.86  98.86  100    98.86  96.59  98.86  86.36
bSum             84.09  90.91  95.45  98.86  98.86  100    97.72  98.86  97.72  86.36
bMax             61.36  72.73  82.95  94.32  95.45  97.72  97.72  92.05  98.86  86.36
bMin             72.72  78.41  89.77  95.45  93.18  98.86  100    88.63  96.59  86.36
Comb             65.91  80.68  95.45  96.59  96.59  98.86  98.86  92.04  98.86  86.36
Chi-squared      77.27  90.91  96.59  97.72  96.59  97.72  98.86  92.04  98.86  86.36
Info Gain        77.27  86.36  93.18  97.72  96.59  97.72  98.86  83.18  98.86  86.36
SVM-RFE          77.27  84.09  92.04  97.72  98.86  100    98.86  98.86  98.86  86.36
[4] Brown, M., Grundy, W., Lin, D., Cristianini, N., Sugnet, C., Haussler, D., Knowledge-based Analysis of Microarray Gene Expression Data By Using Support Vector Machines, National Academy of Sciences, 97:262-267, 2000.
[5] Chen, D., Hua, D., Reifman, J., Cheng, X., Gene Selection for Multi-class Prediction of Microarray Data,
Computational Systems Bioinformatics, 2003.
[6] Cristianini, N., Shawe-Taylor, J., An Introduction to Support Vector Machines and Other Kernel-based
Methods, Cambridge University Press, 2000.
[7] Duda, R. O., Hart, P., Pattern classification and scene analysis, Wiley, 1973.
[8] Eisen, M., et al, Cluster analysis and display of genome-wide expression patterns, National Academy of
Sciences, 95, 1998.
[9] Fukunaga, K., Introduction to Statistical Pattern Recognition, Academic Press, 1990.
[10] Furey, T., Cristianini, N., Duffy, N., Bednarski, D. W., Schummer, M., Haussler, D., Support vector machine
classification and validation of cancer tissue samples using microarray expression data, Bioinformatics, 16,
2000.
[11] Golub, T.R., Slonim, D.K., Tamayo, P., Molecular Classification of Cancer: Class Discovery and Class
Prediction by Gene Expression Monitoring, Science, 286, 1999.
[12] Guyon, I., Weston, J., Barnhill, S., Gene Selection for Cancer Classification Using Support Vector Machines,
Machine Learning, 46:389-422, 2002.
[13] Li, T., Zhang, C., Ogihara, M., A Comparative Study of Feature Selection and Multiclassification Methods
for Tissue Classification Based on Gene Expression, Bioinformatics, 2004 (In Press).
[14] Mitchell, T. M., Machine Learning, McGraw-Hill, 1997.
[15] Pavlidis, P., Weston, J., Cai, J., Grundy, W. N., Gene Functional Analysis from Heterogeneous Data,
Research in Computational Molecular Biology (RECOMB), 2001.
Table 4: Classification accuracy (%) for Yeast data: SVM with linear, polynomial, and Gaussian kernels.

SVM Linear       2      5      10     20     50     100    200    500    1000   5775
BScatter         87.5   86.25  90.0   96.25  92.5   93.75  95.0   95.0   95.0   96.25
MinMax           90.0   91.25  95.0   96.25  96.25  96.25  96.25  97.5   97.5   96.25
bSum             88.75  91.25  95.0   95.0   95.0   95.0   96.25  96.25  97.5   96.25
bMax             86.25  86.25  87.5   86.25  88.75  96.25  96.25  96.25  96.25  96.25
bMin             81.25  85.0   92.5   95.0   95.0   96.25  96.25  97.5   97.5   96.25
Comb             82.5   88.75  91.25  93.75  95.0   96.25  96.25  96.25  96.25  96.25
Chi-squared      80.0   88.75  92.5   95.0   92.5   95.0   95.0   96.25  95.0   96.25
Info Gain        87.5   91.25  93.75  95.0   96.25  96.25  95.0   96.25  96.25  96.25
SVM-RFE          78.75  95.0   96.25  98.25  100    100    100    100    100    96.25

SVM Polynomial   2      5      10     20     50     100    200    500    1000   5775
BScatter         87.5   86.25  90.0   90.0   95.0   95.0   96.25  96.25  96.25  96.25
MinMax           90.0   92.5   95.0   95.0   96.25  96.25  96.25  96.25  96.25  96.25
bSum             92.5   91.25  95.0   95.0   96.25  96.25  96.25  9.25   96.25  96.25
bMax             86.25  96.25  86.25  86.25  87.5   96.25  95.0   96.25  96.25  96.25
bMin             83.75  85.0   88.75  93.75  95.0   96.25  96.25  96.25  96.25  96.25
Comb             82.5   87.5   90.0   93.75  93.75  96.25  96.25  97.5   96.25  96.25
Chi-squared      91.25  90.0   93.75  95.0   96.25  95.0   96.25  96.25  96.25  96.25
Info Gain        91.25  92.5   96.25  93.75  96.25  96.25  96.25  96.25  97.5   96.25
SVM-RFE          81.25  96.25  96.25  98.75  98.25  100    100    97.5   96.25  96.25

SVM Gaussian     2      5      10     20     50     100    200    500    1000   5775
BScatter         87.5   86.25  90.0   88.75  95.0   96.25  96.25  96.25  93.75  85.0
MinMax           91.25  92.5   96.25  95.0   96.25  96.25  96.25  96.25  93.75  85.0
bSum             88.75  92.5   95.0   95.0   96.25  96.25  96.25  96.25  93.75  85.0
bMax             86.25  86.25  86.25  88.75  95.0   96.25  96.25  96.25  93.75  85.0
bMin             82.5   83.75  87.5   93.75  96.25  93.75  96.25  95.0   93.75  85.0
Comb             82.5   90.0   91.25  93.75  95.0   95.0   96.25  95.0   93.75  85.0
Chi-squared      90.0   92.5   93.75  95.0   95.0   96.25  92.5   86.25  85.0   85.0
Info Gain        91.25  93.75  95.0   95.0   96.25  96.25  93.75  86.25  86.25  85.0
SVM-RFE          80.0   95.0   95.0   98.75  100    95.0   76.25  87.5   85.0   85.0
Table 5: Classification accuracy (%) for NCI60 data: SVM with linear, polynomial, and Gaussian kernels.

SVM Linear       2      5      10     20     50     100    200    500    800    1155
BScatter         29.51  29.51  52.46  77.04  78.68  81.97  81.97  81.97  80.33  75.41
MinMax           24.59  22.95  44.26  70.49  81.97  78.68  81.97  81.96  78.69  75.41
bSum             24.59  19.67  45.90  65.57  81.97  78.68  81.97  81.97  78.68  75.41
bMax             24.59  19.67  52.46  57.38  73.77  81.97  81.97  80.33  78.69  75.41
bMin             32.78  34.42  62.29  65.57  68.85  80.33  78.69  78.69  75.41  75.41
Comb             29.51  65.57  65.57  72.13  77.05  81.97  83.60  77.05  75.41  75.41
Chi-squared      42.62  62.29  68.85  81.97  81.97  85.24  83.61  78.68  80.33  75.41
Info Gain        37.70  63.93  78.69  70.49  81.97  83.61  83.61  78.69  80.33  75.41
SVM-RFE          34.43  59.01  68.85  78.69  91.80  96.72  96.72  93.44  83.60  75.41

SVM Polynomial   2      5      10     20     50     100    200    500    800    1155
BScatter         29.50  34.46  57.37  77.05  81.96  78.68  81.96  81.96  81.96  73.77
MinMax           21.31  22.95  42.62  68.85  85.25  80.33  80.33  83.16  81.97  73.77
bSum             22.95  26.23  44.26  67.21  83.16  80.33  80.33  83.61  81.96  73.77
bMax             21.31  19.62  49.18  68.85  68.85  81.97  81.97  81.97  77.05  73.77
bMin             31.14  42.62  55.74  60.65  70.49  80.33  78.69  80.33  78.69  73.77
Comb             39.34  75.41  67.21  75.41  80.33  78.69  78.69  80.33  77.05  73.77
Chi-squared      49.18  67.21  67.21  78.58  81.96  85.25  85.25  83.61  77.05  73.77
Info Gain        45.90  65.57  78.69  75.41  85.25  85.25  83.61  80.33  77.05  73.77
SVM-RFE          34.43  52.46  65.57  83.60  93.44  95.08  95.08  91.80  86.88  73.77

SVM Gaussian     2      5      10     20     50     100    200    500    800    1155
BScatter         27.89  29.50  54.09  77.05  80.33  77.05  80.33  81.97  68.85  60.65
MinMax           26.23  22.95  37.70  72.13  80.32  83.60  77.05  78.69  70.49  60.65
bSum             24.59  22.95  44.26  72.13  83.16  83.16  83.16  78.68  70.49  60.65
bMax             24.59  21.31  52.46  59.01  52.46  67.21  73.77  70.49  67.21  60.65
bMin             36.06  37.71  50.82  59.02  70.49  75.41  73.77  70.49  63.93  60.65
Comb             31.14  47.54  50.82  68.85  73.77  81.96  77.05  68.85  60.65  60.65
Chi-squared      49.18  62.29  73.77  73.77  80.33  83.16  81.96  70.49  63.93  60.65
Info Gain        47.54  70.49  73.77  77.04  80.33  83.60  78.68  62.29  65.57  60.65
SVM-RFE          34.42  57.37  72.13  80.33  93.44  95.08  77.05  77.05  72.13  60.65
[16] Ramaswamy, S., et al., Multiclass cancer diagnosis using tumor gene expression signatures, National
Academy of Sciences, 98:26, 2001.
[17] Ross, D. T., et al., Systematic variation in gene expression patterns in human cancer cell lines, Nature
Genetics, 24(3):227-235, 2000.
[18] Armstrong, S. A., et al., MLL Translocations Specify a Distinct Gene Expression Profile that Distinguishes a Unique Leukemia, Nature Genetics, 30:41-47, January 2002.
[19] Szymkowiak, A. et al., Imputing Missing Values in Diary Records in Sun-Exposure Study, IEEE Workshop
on Neural Networks for Signal Processing, 489-498, 2001.
[20] Vapnik, V. N., Statistical Learning Theory, Wiley Interscience, 1998.