An Evaluation of Gene Selection Methods for Multi-class Microarray Data Classification
by Carlotta Domeniconi and Hong Chai
Outline
• Introduction to microarray data
• Problem description
• Related work
• Our methods
• Experimental analysis
• Results
• Conclusions and future work
Microarray
• Measures gene expression levels across different conditions, times or tissue samples
• Gene expression levels inform cell activity and disease status
• Microarray data distinguish between tumor types, define new subtypes, predict prognostic outcome, identify possible drugs, assess drug toxicity, etc.
Microarray Data
• A matrix of measurements: rows are gene expression levels; columns are samples/conditions.
Example – Lymphoma Dataset [data matrix figure omitted]
Microarray data analysis
• Clustering is applied to genes, to identify genes with similar functions or that participate in similar biological processes, or to samples, to find potential tumor subclasses.
• Classification builds a model to predict diseased samples; it has diagnostic value.
Classification Problem
• Large number of genes (features) – a dataset may contain up to 20,000 features.
• Small number of experiments (samples) – usually fewer than 100.
• The need to identify "marker genes" that classify tissue types (e.g. diagnose cancer) motivates feature selection.
Our Focus
• Binary classification and feature selection methods have been studied extensively; the multi-class case has received little attention.
• In practice, many microarray datasets have more than two categories of samples.
• We focus on multi-class gene ranking and selection.
Related Work
Some criteria used in feature ranking:
• Correlation coefficient
• Information gain
• Chi-squared
• SVM-RFE
Notation
• Given C classes
• m observations (samples or patients)
• n feature measurements (gene expressions): $x = (x_1, \ldots, x_n)^t \in \mathbb{R}^n$
• class labels y = 1, ..., C
Correlation Coefficient
• Two-class problem: y = {-1, +1}
• Ranking criterion defined in Golub:

$$w_j = \frac{\mu_j^{+} - \mu_j^{-}}{\sigma_j^{+} + \sigma_j^{-}}$$

• where $\mu_j^{\pm}$ is the mean and $\sigma_j^{\pm}$ the standard deviation along dimension j in the + and - classes; a large $|w_j|$ indicates a discriminant feature.
Fisher's Score
• Fisher's criterion score in Pavlidis:

$$w_j = \frac{(\mu_j^{+} - \mu_j^{-})^2}{(\sigma_j^{+})^2 + (\sigma_j^{-})^2}$$
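Both scores are simple per-gene computations. Below is a minimal NumPy sketch (our illustration, not the authors' code); `X` is assumed to be an (m, n) expression matrix and `y` a label vector in {-1, +1}.

```python
import numpy as np

def golub_score(X, y):
    """Golub criterion: w_j = (mu_j+ - mu_j-) / (sigma_j+ + sigma_j-)."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(0) - neg.mean(0)) / (pos.std(0) + neg.std(0))

def fisher_score(X, y):
    """Fisher criterion: squared mean gap over summed class variances."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(0) - neg.mean(0)) ** 2 / (pos.var(0) + neg.var(0))

# Rank genes: a large |w_j| marks a discriminant feature.
# top_genes = np.argsort(-np.abs(golub_score(X, y)))
```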
Assumption of above methods
• Features are analyzed in isolation; correlations are not considered.
• Assumption: features are independent of each other.
• Implication: redundant genes may be selected into the top subset.
Information Gain
• A measure of the effectiveness of a feature in classifying the training data.
• Expected reduction in entropy caused by partitioning the data according to this feature:

$$I(S, A) = E(S) - \sum_{v \in V(A)} \frac{|S_v|}{|S|} E(S_v)$$

• V(A) is the set of all possible values of feature A, and $S_v$ is the subset of S for which feature A has value v.
Information Gain
• E(S) is the entropy of the entire set S:

$$E(S) = -\sum_{i=1}^{C} \frac{|C_i|}{|S|} \log_2 \frac{|C_i|}{|S|}$$

• where $|C_i|$ is the number of training data in class $C_i$, and |S| is the cardinality of the entire set S.
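As a concrete illustration, here is a short sketch of the computation, assuming the feature has already been discretized (microarray values are continuous); the function names are ours.

```python
import numpy as np

def entropy(y):
    """E(S) = -sum_i (|C_i|/|S|) * log2(|C_i|/|S|) over class labels y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, y):
    """I(S, A) = E(S) - sum_v (|S_v|/|S|) * E(S_v) for a discrete feature."""
    gain = entropy(y)
    for v in np.unique(feature):
        mask = feature == v
        gain -= mask.mean() * entropy(y[mask])
    return gain
```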
Chi-squared
• Measures features individually.
• Continuous-valued features are discretized into intervals.
• Form a matrix A, where $A_{ij}$ is the number of samples of class $C_i$ within the j-th interval.
• Let $CI_j$ be the number of samples in the j-th interval.
Chi-squared
• The expected frequency of $A_{ij}$ is

$$E_{ij} = CI_j \, |C_i| / m$$

• The Chi-squared statistic of a feature is defined as

$$\chi^2 = \sum_{i=1}^{C} \sum_{j=1}^{I} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}$$

• where I is the number of intervals. The larger the statistic, the more informative the feature is.
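A minimal sketch of the statistic, assuming equal-width discretization (the slides do not specify the discretization scheme):

```python
import numpy as np

def chi2_score(feature, y, n_intervals=10):
    """Chi-squared statistic of one continuous feature after
    equal-width discretization into n_intervals bins."""
    m = len(y)
    edges = np.histogram_bin_edges(feature, bins=n_intervals)
    interval = np.digitize(feature, edges[1:-1])   # bin index in 0..I-1
    stat = 0.0
    for c in np.unique(y):
        for j in range(n_intervals):
            A_ij = np.sum((y == c) & (interval == j))  # observed count
            CI_j = np.sum(interval == j)               # samples in interval j
            E_ij = CI_j * np.sum(y == c) / m           # expected frequency
            if E_ij > 0:
                stat += (A_ij - E_ij) ** 2 / E_ij
    return stat
```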
SVM-RFE
• Recursive Feature Elimination using SVM.
• The linear SVM model on the full feature set is

$$\mathrm{sign}(w \cdot x + b)$$

• where w is a vector of weights (one per feature), x is an input instance, and b a threshold.
• If $w_i = 0$, feature $X_i$ does not influence classification and can be eliminated from the set of features.
SVM-RFE
1. Train a linear SVM and obtain w for the full feature set.
2. Sort features in descending order of weight; eliminate a percentage of the lowest-ranked features.
3. Build a new linear SVM using the reduced feature set. Repeat the process.
4. Choose the best feature subset.
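A sketch of this loop, assuming scikit-learn's LinearSVC as the linear SVM (any linear SVM implementation would do); parameter names are ours.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_keep=100, drop_frac=0.1):
    """Recursively train a linear SVM and drop the lowest-weight features."""
    active = np.arange(X.shape[1])              # surviving feature indices
    while len(active) > n_keep:
        w = LinearSVC().fit(X[:, active], y).coef_
        importance = np.abs(w).sum(axis=0)      # aggregate one-vs-rest weights
        n_drop = max(1, int(drop_frac * len(active)))
        n_drop = min(n_drop, len(active) - n_keep)
        active = active[np.argsort(importance)[n_drop:]]  # drop smallest |w_i|
    return active
```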
Other criteria
• The Brown-Forsythe, Cochran, and Welch test statistics used in Chen et al. (extensions of the t-statistic used in the two-class classification problem).
• PCA (disadvantage: new dimensions are formed, and none of the original features can be discarded; therefore it cannot identify marker genes).
Our Ranking Methods
• BScatter
• MinMax
• bSum
• bMax
• bMin
• Comb (combined score)
Notation
• For each class i and each feature j, we define the mean value of feature j for class $C_i$:

$$\mu_{j,i} = \frac{1}{|C_i|} \sum_{x \in C_i} x_j$$

• Define the total mean along feature j:

$$\mu_j = \frac{1}{m} \sum_{x} x_j$$
Notation
• Define the between-class scatter along feature j:

$$B_j = \sum_{i=1}^{C} |C_i| \, (\mu_{j,i} - \mu_j)^2$$
Function 1: BScatter
• Fisher discriminant analysis for multiple classes under the feature independence assumption. It credits the largest score to the feature that maximizes the ratio of the between-class scatter to the within-class scatter:

$$\mathrm{BScatter}_j = \frac{B_j}{\sum_{i=1}^{C} \sigma_{j,i}}$$

• where $\sigma_{j,i}$ is the standard deviation of class i along feature j.
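A minimal NumPy sketch of $B_j$ and the BScatter score (our own rendering of the formulas above, for all features at once):

```python
import numpy as np

def bscatter(X, y):
    """BScatter_j = B_j / sum_i sigma_{j,i}, computed for every feature j."""
    mu_total = X.mean(axis=0)                 # total mean mu_j
    B = np.zeros(X.shape[1])
    sigma_sum = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        B += len(Xc) * (Xc.mean(axis=0) - mu_total) ** 2  # between-class scatter
        sigma_sum += Xc.std(axis=0)                       # within-class spread
    return B / sigma_sum
```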
Function 2: MinMax
• Favors features along which the farthest mean-class difference is large, and the within-class variance is small:

$$\mathrm{MinMax}_j = \frac{\mu_{j,\max} - \mu_{j,\min}}{\sum_{i=1}^{C} \sigma_{j,i}}$$
Function 3: bSum
• For each feature j, we sort the C values $\mu_{j,i}$ in non-decreasing order: $\mu_{j,1} \le \mu_{j,2} \le \ldots \le \mu_{j,C}$
• Define $b_{j,l} = \mu_{j,l+1} - \mu_{j,l}$
• bSum rewards features with large distances between adjacent mean class values:

$$\mathrm{bSum}_j = \frac{\sum_{l=1}^{C-1} b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}$$
Function 4: bMax
• Rewards features j with a large between-neighbor-class mean difference:

$$\mathrm{bMax}_j = \frac{\max_l b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}$$
Function 5: bMin
• Favors features whose smallest between-neighbor-class mean difference is large:

$$\mathrm{bMin}_j = \frac{\min_l b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}$$
Function 6: Comb
• Considers a score function which combines MinMax and bMin:

$$\mathrm{Comb}_j = \frac{\min_l (b_{j,l}) \, (\mu_{j,\max} - \mu_{j,\min})}{\sum_{i=1}^{C} \sigma_{j,i}}$$
Datasets

Dataset    Samples  Genes  Classes  Comment
MLL        72       12582  3        Available at http://research.nhgri.nih.gov/microarray/Supplement
Lymphoma   88       4026   6        Samples per class: 46 DLBCL, 11 CLL, 9 FL (malignant classes); 11 ABB, 6 RAT, 6 TCL (normal samples). Available at http://llmpp.nih.gov/lymphoma
Yeast      80       5775   3        Available at http://rana.lbl.gov/
NCI60      61       1155   8
Experiment Design
• Gene expression values scaled to [-1, 1]
• Compared 9 feature selection methods (the 6 proposed scores, Chi-squared, Information Gain, and SVM-RFE)
• Subsets of top-ranked genes used to train SVM classifiers (3 kernel functions: linear, 2-degree polynomial, Gaussian; soft-margin parameter in [1, 100]; Gaussian kernel width in [0.001, 2])
• Leave-one-out cross validation due to small sample size
• One-vs-one multi-class classification implemented with LIBSVM
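A sketch of this evaluation loop, using scikit-learn (whose SVC wraps LIBSVM and performs one-vs-one multi-class classification); `rank_features` stands for any of the nine scoring functions above. Note that this sketch re-ranks genes inside each fold, which avoids the selection bias discussed under Limitations below; ranking once on the full data would reproduce the authors' setup.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut

def loocv_accuracy(X, y, rank_features, k=100, C=1.0, kernel="linear"):
    """Leave-one-out accuracy of an SVM trained on the top-k ranked genes."""
    correct = 0
    for train, test in LeaveOneOut().split(X):
        top_k = np.argsort(-rank_features(X[train], y[train]))[:k]
        clf = SVC(C=C, kernel=kernel).fit(X[train][:, top_k], y[train])
        correct += int(clf.predict(X[test][:, top_k])[0] == y[test][0])
    return correct / len(y)
```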
Result – MLL Dataset [results figure omitted]
Result – Lymphoma Dataset [results figure omitted]
Conclusions
• SVM classification benefits from gene selection.
• Gene ranking with correlation coefficients gives higher accuracy than SVM-RFE in low dimensions on most datasets; the best-performing correlation score varies from problem to problem.
• Although SVM-RFE shows excellent performance in general, there is no clear winner: the performance of feature selection methods seems to be problem-dependent.
Conclusions
• For a given classification model, different gene selection methods reach the best performance for different feature set sizes.
• Very high accuracy was achieved on all the data sets studied here. In many cases perfect accuracy (based on leave-one-out error) was achieved.
• The NCI60 dataset [17] shows lower accuracy values. This dataset has the largest number of classes (eight) and smaller sample sizes per class. SVM-RFE handles this case well, achieving 96.72% accuracy with 100 selected genes and a linear kernel. The gap in accuracy between SVM-RFE and the other gene ranking methods is highest for this dataset (ca. 11.5%).
Limitations & Future Work
• Selecting features over the whole training set induces a bias in the results. We will study how to assess and correct this bias in future experiments.
• We will take into consideration the correlation between pairs of selected features; the ranking methods will be modified so that correlations among selected genes stay below a certain threshold.
• We will evaluate the top-ranked genes in our research against marker genes identified in other studies.