An Evaluation of Gene Selection Methods for Multi-class Microarray Data Classification
by Carlotta Domeniconi and Hong Chai

Outline
• Introduction to microarray data
• Problem description
• Related work
• Our methods
• Experimental analysis
• Results
• Conclusions and future work

Microarray
• Measures gene expression levels across different conditions, time points, or tissue samples.
• Gene expression levels inform us about cell activity and disease status.
• Microarray data are used to distinguish between tumor types, define new subtypes, predict prognostic outcome, identify candidate drugs, assess drug toxicity, etc.

Microarray Data
• A matrix of measurements: rows are gene expression levels; columns are samples/conditions.
• Example – Lymphoma Dataset (figure).

Microarray Data Analysis
• Clustering is applied to genes, to identify genes with similar functions or genes participating in similar biological processes, or to samples, to find potential tumor subclasses.
• Classification builds a model to predict diseased samples; it has diagnostic value.

Classification Problem
• Large number of genes (features): a dataset may contain up to 20,000 features.
• Small number of experiments (samples): at most a few hundred, usually fewer than 100.
• The need to identify "marker genes" that classify tissue types (e.g., to diagnose cancer) leads to feature selection.

Our Focus
• Binary classification and its feature selection methods have been studied extensively; the multi-class case has received little attention.
• In practice, many microarray datasets have more than two categories of samples.
• We focus on multi-class gene ranking and selection.

Related Work
Some criteria used in feature ranking:
• Correlation coefficient
• Information gain
• Chi-squared
• SVM-RFE

Notation
• C classes.
• m observations (samples or patients).
• n feature measurements (gene expressions): $x = (x_1, \ldots, x_n)^t \in \mathbb{R}^n$.
• Class labels $y \in \{1, \ldots, C\}$.

Correlation Coefficient
• Two-class problem: $y \in \{-1, +1\}$.
• Ranking criterion defined in Golub et al.:
  $w_j = \frac{\mu_j^+ - \mu_j^-}{\sigma_j^+ + \sigma_j^-}$
• where $\mu_j^\pm$ and $\sigma_j^\pm$ are the mean and standard deviation along dimension j in the + and - classes. A large $|w_j|$ indicates a discriminant feature.

Fisher's Score
• Fisher's criterion score in Pavlidis et al.:
  $w_j = \frac{(\mu_j^+ - \mu_j^-)^2}{(\sigma_j^+)^2 + (\sigma_j^-)^2}$

Assumption of the Above Methods
• Features are analyzed in isolation; correlations between them are not considered.
• Assumption: features are independent of each other.
• Implication: redundant genes may be selected into the top subset.

Information Gain
• A measure of the effectiveness of a feature in classifying the training data: the expected reduction in entropy caused by partitioning the data according to this feature.
  $I(S, A) = E(S) - \sum_{v \in V(A)} \frac{|S_v|}{|S|} E(S_v)$
• where V(A) is the set of all possible values of feature A, and $S_v$ is the subset of S for which feature A has value v.
• E(S) is the entropy of the entire set S:
  $E(S) = -\sum_{i=1}^{C} \frac{|C_i|}{|S|} \log_2 \frac{|C_i|}{|S|}$
• where $|C_i|$ is the number of training samples in class $C_i$, and $|S|$ is the cardinality of the entire set S.

Chi-squared
• Measures features individually.
• Continuous-valued features are discretized into intervals.
• Form a matrix A, where $A_{ij}$ is the number of samples of class $C_i$ within the j-th interval, and let $CI_j$ be the number of samples in the j-th interval.
• The expected frequency of $A_{ij}$ is $E_{ij} = CI_j \, |C_i| / m$.
• The Chi-squared statistic of a feature is
  $\chi^2 = \sum_{i=1}^{C} \sum_{j=1}^{I} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}$
• where I is the number of intervals. The larger the statistic, the more informative the feature.
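The two-class filter scores above reduce to a few lines of NumPy. The following is an illustrative sketch, not code from the slides or the paper: the function names, the toy data, and the choice of np.std/np.var are my own assumptions.

```python
import numpy as np

def golub_score(X, y):
    """Golub correlation score for a two-class problem (y in {-1, +1}):
    w_j = (mu_j^+ - mu_j^-) / (sigma_j^+ + sigma_j^-), per feature."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(axis=0) - neg.mean(axis=0)) / (pos.std(axis=0) + neg.std(axis=0))

def fisher_score(X, y):
    """Fisher criterion score (Pavlidis):
    w_j = (mu_j^+ - mu_j^-)^2 / ((sigma_j^+)^2 + (sigma_j^-)^2)."""
    pos, neg = X[y == 1], X[y == -1]
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d ** 2 / (pos.var(axis=0) + neg.var(axis=0))

# Toy usage: rank 1000 genes over 60 samples; larger |w_j| = more discriminant.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 1000))            # stand-in for an expression matrix
y = np.where(rng.random(60) < 0.5, -1, 1)  # two-class labels
top_genes = np.argsort(-np.abs(golub_score(X, y)))[:50]
```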
SVM-RFE
• Recursive Feature Elimination using an SVM.
• A linear SVM on the full feature set classifies by
  $\mathrm{sign}(w \cdot x + b)$
• where w is a vector of weights (one per feature), x is an input instance, and b is a threshold. If $w_i = 0$, feature $X_i$ does not influence classification and can be eliminated from the set of features.

SVM-RFE (continued)
1. Train a linear SVM and obtain w for the full feature set.
2. Sort the features in descending order of weight; eliminate a percentage of the lowest-weighted features.
3. Build a new linear SVM on the reduced feature set, and repeat the process.
4. Choose the best feature subset.

Other Criteria
• The Brown-Forsythe, Cochran, and Welch test statistics used in Chen et al. (extensions of the t-statistic used in the two-class classification problem).
• PCA. Disadvantage: new dimensions are formed, so none of the original features can be discarded; hence PCA cannot identify marker genes.

Our Ranking Methods
• BScatter
• MinMax
• bSum
• bMax
• bMin
• Comb

Notation
• For each class i and each feature j, define the mean value of feature j for class $C_i$:
  $\mu_{j,i} = \frac{1}{|C_i|} \sum_{x \in C_i} x_j$
• Define the total mean along feature j:
  $\mu_j = \frac{1}{m} \sum_{x} x_j$
• Define the between-class scatter along feature j:
  $B_j = \sum_{i=1}^{C} |C_i| (\mu_{j,i} - \mu_j)^2$

Function 1: BScatter
• Fisher discriminant analysis for multiple classes under the feature-independence assumption. It credits the largest score to the feature that maximizes the ratio of the between-class scatter to the within-class scatter:
  $\mathrm{BScatter}_j = \frac{B_j}{\sum_{i=1}^{C} \sigma_{j,i}}$
• where $\sigma_{j,i}$ is the standard deviation of class i along feature j.

Function 2: MinMax
• Favors features along which the farthest mean-class difference is large and the within-class variance is small:
  $\mathrm{MinMax}_j = \frac{\mu_{j,\max} - \mu_{j,\min}}{\sum_{i=1}^{C} \sigma_{j,i}}$

Function 3: bSum
• For each feature j, sort the C values $\mu_{j,i}$ in non-decreasing order, $\mu_{j,1} \le \mu_{j,2} \le \ldots \le \mu_{j,C}$, and define $b_{j,l} = \mu_{j,l+1} - \mu_{j,l}$ for $l = 1, \ldots, C-1$.
• bSum rewards features with large distances between adjacent mean class values:
  $\mathrm{bSum}_j = \frac{\sum_{l=1}^{C-1} b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}$

Function 4: bMax
• Rewards features j with a large maximum between-neighbor-class mean difference:
  $\mathrm{bMax}_j = \frac{\max_l b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}$

Function 5: bMin
• Favors features with a large smallest between-neighbor-class mean difference:
  $\mathrm{bMin}_j = \frac{\min_l b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}$

Function 6: Comb
• A score function that combines MinMax and bMin:
  $\mathrm{Comb}_j = \frac{\min_l (b_{j,l}) \, (\mu_{j,\max} - \mu_{j,\min})}{\sum_{i=1}^{C} \sigma_{j,i}}$
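All six proposed scores share the same ingredients: per-class means and standard deviations, and the gaps between adjacent sorted class means. They can therefore be computed together. Below is a minimal NumPy sketch following the formulas on the preceding slides; the function and variable names are mine, not the authors', and no guard against zero within-class deviation is included.

```python
import numpy as np

def multiclass_scores(X, y):
    """Compute the six proposed multi-class ranking scores per feature.
    X: (m, n) expression matrix; y: (m,) integer class labels."""
    classes = np.unique(y)
    mu = np.stack([X[y == c].mean(axis=0) for c in classes])    # mu_{j,i}, shape (C, n)
    sigma = np.stack([X[y == c].std(axis=0) for c in classes])  # sigma_{j,i}
    sizes = np.array([(y == c).sum() for c in classes])         # |C_i|
    mu_tot = X.mean(axis=0)                                     # total mean mu_j
    s = sigma.sum(axis=0)                                       # sum_i sigma_{j,i}

    B = (sizes[:, None] * (mu - mu_tot) ** 2).sum(axis=0)       # between-class scatter B_j
    mu_sorted = np.sort(mu, axis=0)                             # class means sorted per feature
    b = np.diff(mu_sorted, axis=0)                              # b_{j,l}: adjacent mean gaps
    spread = mu_sorted[-1] - mu_sorted[0]                       # mu_{j,max} - mu_{j,min}

    return {
        "BScatter": B / s,
        "MinMax":   spread / s,
        "bSum":     b.sum(axis=0) / s,
        "bMax":     b.max(axis=0) / s,
        "bMin":     b.min(axis=0) / s,
        "Comb":     b.min(axis=0) * spread / s,
    }
```

Each returned array has one score per gene; ranking a dataset then amounts to np.argsort(-scores["BScatter"]) and taking the top k indices.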
Datasets
Dataset  | Samples | Genes | Classes | Comment
MLL      | 72      | 12582 | 3       | Available at http://research.nhgri.nih.gov/microarray/Supplement
Lymphoma | 88      | 4026  | 6       | Samples per class: 46 DLBCL, 11 CLL, 9 FL (malignant classes); 11 ABB, 6 RAT, 6 TCL (normal samples). Available at http://llmpp.nih.gov/lymphoma
Yeast    | 80      | 5775  | 3       | Available at http://rana.lbl.gov/
NCI60    | 61      | 1155  | 8       |

Experiment Design
• Gene expression values scaled to [-1, 1].
• Nine feature selection methods compared: the 6 proposed scores, Chi-squared, Information Gain, and SVM-RFE.
• Subsets of top-ranked genes used to train an SVM classifier (three kernel functions: linear, 2-degree polynomial, Gaussian; soft-margin parameter in [1, 100]; Gaussian kernel width in [0.001, 2]).
• Leave-one-out cross-validation, due to the small sample sizes.
• One-vs-one multi-class classification, implemented on LIBSVM. (A sketch of this pipeline appears at the end of these notes.)

Result – MLL Dataset (figure)

Result – Lymphoma Dataset (figure)

Conclusions
• SVM classification benefits from gene selection.
• Gene ranking with correlation coefficients gives higher accuracy than SVM-RFE in low dimensions on most datasets. The best-performing correlation score varies from problem to problem.
• Although SVM-RFE shows excellent performance in general, there is no clear winner: the performance of feature selection methods appears to be problem-dependent.
• For a given classification model, different gene selection methods reach their best performance at different feature set sizes.
• Very high accuracy was achieved on all the datasets studied here; in many cases perfect accuracy (based on leave-one-out error) was achieved.
• The NCI60 dataset [17] shows lower accuracy values. This dataset has the largest number of classes (eight) and smaller sample sizes per class. SVM-RFE handles this case well, achieving 96.72% accuracy with 100 selected genes and a linear kernel. The gap in accuracy between SVM-RFE and the other gene ranking methods is largest for this dataset (ca. 11.5%).

Limitations & Future Work
• Selecting features over the whole training set induces a bias in the results. We will study suggestions on how to assess and correct this bias in future experiments.
• We will take into consideration the correlation between pairs of selected features, modifying the ranking methods so that correlations stay below a certain threshold.
• We will evaluate the top-ranked genes in our research against marker genes identified in other studies.
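As referenced on the Experiment Design slide, here is a rough sketch of the evaluation protocol. The authors used LIBSVM directly; this sketch approximates it with scikit-learn's SVC, which wraps LIBSVM and uses one-vs-one decomposition for multi-class problems by default. The function name, the default parameter values, and the up-front global scaling are my assumptions, and ranking genes on the full dataset reproduces the selection bias acknowledged under Limitations.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.preprocessing import minmax_scale
from sklearn.svm import SVC

def loo_accuracy(X, y, ranking, k, kernel="linear", C=10.0, gamma="scale"):
    """Leave-one-out accuracy of a one-vs-one SVM on the top-k ranked genes.
    ranking: gene indices sorted best-first (from any of the score functions).
    kernel: "linear", "poly" (set degree=2 for the slides' setting), or "rbf"."""
    Xk = minmax_scale(X[:, ranking[:k]], feature_range=(-1, 1))  # scale to [-1, 1]
    clf = SVC(kernel=kernel, C=C, gamma=gamma)  # SVC wraps LIBSVM; one-vs-one multi-class
    return cross_val_score(clf, Xk, y, cv=LeaveOneOut()).mean()
```

In the deck's setup one would sweep k over a grid of feature set sizes, the soft-margin parameter C over [1, 100], and the Gaussian kernel width over [0.001, 2], reporting the best leave-one-out accuracy per selection method.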