Feature Selection and Its Application in Genomic Data Analysis
Lei Yu, Arizona State University
March 9, 2004

Outline
- Introduction to feature selection
  - Motivation
  - Problem statement
  - Key research issues
- Application in genomic data analysis
  - Overview of data mining for microarray data
  - Gene selection
  - A case study
- Current research directions

Motivation
Feature selection is an active field in pattern recognition, machine learning, data mining, and statistics. The input is a data matrix of M instances, each described by N features and a class label:

        F1   F2   ...  FN  | C
  I1    f11  f12  ...  f1N | c1
  I2    f21  f22  ...  f2N | c2
  ...   ...  ...  ...  ... | ...
  IM    fM1  fM2  ...  fMN | cM

Selecting a good subset of the features helps by:
- Reducing dimensionality
- Improving learning efficiency
- Increasing predictive accuracy
- Reducing the complexity of learned results

Problem Statement
- A process of selecting a minimum subset of features that is sufficient to construct a hypothesis consistent with the training examples (Almuallim and Dietterich, 1991)
- Selecting a minimum subset G such that P(C|G) is equal, or as close as possible, to P(C|F) (Koller and Sahami, 1996)

An Example for the Problem
A data set with five Boolean features:

  F1  F2  F3  F4  F5 | C
  0   0   1   0   1  | 0
  0   1   0   0   1  | 1
  1   0   1   0   1  | 1
  1   1   0   0   1  | 1
  0   0   1   1   0  | 0
  0   1   0   1   0  | 1
  1   0   1   1   0  | 1
  1   1   0   1   0  | 1

Here C = F1 ∨ F2, F3 = ¬F2, and F5 = ¬F4, so the optimal subset is {F1, F2} or {F1, F3}. Searching for an optimal subset is combinatorial in nature: N features give 2^N candidate subsets. (A brute-force search over this data set is sketched below.)

Subset Search
(Figure: an example search space over feature subsets, from Kohavi and John, 1997.)

Evaluation Measures
- Wrapper model
  - Relies on a predetermined classification algorithm
  - Uses predictive accuracy as the goodness measure
  - High accuracy, but computationally expensive
- Filter model
  - Separates feature selection from classifier learning
  - Relies on general characteristics of the data (distance, correlation, consistency)
  - No bias toward any learning algorithm; fast

A Framework for Algorithms
Original Set -> Subset Generation -> Candidate Subset -> Subset Evaluation -> Current Best Subset -> Stopping Criterion: if not met, loop back to Subset Generation; if met, output the Selected Subset.

Feature Ranking
- Weight and rank individual features, then select the top-ranked ones (a minimal sketch is given below)
- Advantages
  - Efficient: O(N) in the dimensionality N
  - Easy to implement
- Disadvantages
  - Hard to determine the threshold
  - Unable to consider correlation between features

Applications of Feature Selection
- Text categorization: Yang and Pedersen, 1997 (CMU); Forman, 2003 (HP Labs)
- Image retrieval: Swets and Weng, 1995 (MSU); Dy et al., 2003 (Purdue University)
- Gene expression microarray data analysis: Xing et al., 2001 (UC Berkeley); Lee et al., 2003 (Texas A&M)
- Customer relationship management: Ng and Liu, 2000 (NUS)
- Intrusion detection: Lee et al., 2000 (Columbia University)

Microarray Technology
- Enables measuring the expression levels of thousands, or tens of thousands, of genes simultaneously in a single experiment
- Provides new opportunities and challenges for data mining

  Gene        Value
  M23197_at    261
  U66497_at     88
  M92287_at   4778
  ...          ...

Two Ways to View Microarray Data
Each sample can be viewed as a data point described by its gene expression values and a class label; transposed, each gene is a data point described by its values across samples.

  Sample    M23197_at  U66497_at  M92287_at  ...  Class
  Sample 1        261         88       4778  ...  ALL
  Sample 2        101         74       2700  ...  ALL
  Sample 3       1450         34        498  ...  AML
  ...             ...        ...        ...  ...  ...
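To make the combinatorial search concrete, the following is a minimal brute-force sketch in Python over the eight-example Boolean data set above. It enumerates subsets in order of size and keeps the first consistent ones; the data and the notion of consistency are from the slides, while the function and variable names are my own.

```python
from itertools import combinations

# The eight training examples from "An Example for the Problem":
# five Boolean features (F1..F5) and the class C.
DATA = [
    ((0, 0, 1, 0, 1), 0),
    ((0, 1, 0, 0, 1), 1),
    ((1, 0, 1, 0, 1), 1),
    ((1, 1, 0, 0, 1), 1),
    ((0, 0, 1, 1, 0), 0),
    ((0, 1, 0, 1, 0), 1),
    ((1, 0, 1, 1, 0), 1),
    ((1, 1, 0, 1, 0), 1),
]

def consistent(subset):
    """True if no two examples agree on every feature in `subset`
    while disagreeing on the class label."""
    seen = {}
    for features, label in DATA:
        key = tuple(features[i] for i in subset)
        if seen.setdefault(key, label) != label:
            return False
    return True

# Enumerate subsets by increasing size; the first size with a consistent
# subset yields the minimal (optimal) subsets. With N features there are
# 2^N candidates, which is what makes the general problem combinatorial.
for size in range(1, 6):
    minimal = [s for s in combinations(range(5), size) if consistent(s)]
    if minimal:
        print([["F%d" % (i + 1) for i in s] for s in minimal])
        break
# -> [['F1', 'F2'], ['F1', 'F3']]
```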
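The feature ranking scheme above can be sketched in a few lines. The per-feature score used here, absolute Pearson correlation with the class, is an assumed choice for illustration (the slides do not fix one), and `k` is the user-supplied cutoff, which is exactly the hard-to-determine threshold the slide warns about.

```python
import numpy as np

def rank_features(X, y, k):
    """Feature ranking in the filter style: score every feature
    independently and keep the k top-ranked ones.

    X : (samples, features) array with non-constant columns;
    y : class labels encoded as numbers.
    The scoring pass touches each feature once, hence O(N) in the
    number of features N.
    """
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]  # indices of the k best features
```

Because each feature is scored in isolation, two perfectly correlated features receive identical scores and would both be selected, which illustrates the redundancy problem the deck returns to later.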
Data Mining Tasks
Data points can be either genes or samples.
- Clustering
  - Grouping similar genes together to find co-regulated genes
  - Grouping similar samples together to find classes or subclasses
- Classification
  - Building a classifier to predict the classes of new samples

Gene Selection
- Data characteristics in sample classification
  - High dimensionality (thousands of genes)
  - Small sample size (often fewer than 100 samples)
- Problems
  - Curse of dimensionality
  - Overfitting the training data
- Traditional gene selection methods
  - Within the filter model
  - Gene ranking

A Case Study (Golub et al., 1999)
- Leukemia data: 7129 genes, 72 samples (ALL vs. AML)
  - Training set: 38 samples (27 ALL, 11 AML)
  - Test set: 34 samples (20 ALL, 14 AML)
- Each gene is normalized to mean 0 and standard deviation 1
- Genes are ranked by a correlation measure (sketched after the references)
(Figure: heat map of normalized expression, scale -3 to 3, for selected genes across ALL and AML samples.)

Case Study (continued)
- Performance of the selected genes
  - Accuracy on the training set: 36 out of 38 (94.74%) correctly classified
  - Accuracy on the test set: 29 out of 34 (85.29%) correctly classified
- Limitations
  - Domain knowledge is required to determine the number of genes to select
  - Unable to remove redundant genes

Feature/Gene Redundancy
- Examining redundant genes: two heads are not necessarily better than one
- Effects of redundant genes
- How to handle redundancy remains a challenge; some recent work:
  - MRMR (Maximum Relevance Minimum Redundancy) (Ding and Peng, CSB-2003)
  - FCBF (Fast Correlation-Based Filter) (Yu and Liu, ICML-2003); its correlation measure is sketched after the references

Research Directions
- Feature selection for unlabeled data: shares much with the labeled case, but differs in key ways (e.g., there is no class label to measure relevance against)
- Dealing with different data types: nominal, discrete, and continuous features; discretization
- Dealing with large data sets
- Comparative studies and intelligent selection among feature selection methods

References
G. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. ICML-1994.
L. Yu and H. Liu. Feature selection for high-dimensional data: a fast correlation-based filter solution. ICML-2003.
T. R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 1999.
C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. CSB-2003.
J. Shavlik and D. Page. Machine learning and genetic microarrays. ICML-2003 tutorial. http://www.cs.wisc.edu/~dpage/ICML-2003-Tutorial-ShavlikPage.ppt
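The slides leave the case study's correlation measure unspecified; Golub et al. (1999) ranked genes with a signal-to-noise statistic, and the sketch below assumes that form together with the per-gene normalization to mean 0 and standard deviation 1 described in the case study. The function names and the string class labels are illustrative.

```python
import numpy as np

def normalize(X):
    """Per-gene normalization to mean 0 and standard deviation 1.
    X has shape (samples, genes); each column is one gene."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def signal_to_noise(expr, labels):
    """Score one gene: (mean_1 - mean_2) / (std_1 + std_2) over the two
    classes. Genes with large |score| separate ALL from AML well.
    expr: 1-D expression vector; labels: NumPy array of "ALL"/"AML"."""
    a = expr[labels == "ALL"]
    b = expr[labels == "AML"]
    return (a.mean() - b.mean()) / (a.std() + b.std())

# Usage sketch: rank all genes by |score| on the normalized training set.
# Xn = normalize(X_train)
# ranked = sorted(range(Xn.shape[1]),
#                 key=lambda j: abs(signal_to_noise(Xn[:, j], labels)),
#                 reverse=True)
```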
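FCBF (Yu and Liu, ICML-2003) measures both a feature's relevance to the class and its redundancy with other features using symmetrical uncertainty. The sketch below covers only that measure and assumes discretized feature values; the FCBF selection loop over predominant features is omitted, and the helper names are my own.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of a list of discrete values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)), normalized to [0, 1].
    IG(X | Y) = H(X) + H(Y) - H(X, Y) is the mutual information, so SU
    is 0 for independent variables and 1 when one determines the other."""
    hx, hy = entropy(x), entropy(y)
    ig = hx + hy - entropy(list(zip(x, y)))  # joint entropy via value pairs
    return 2 * ig / (hx + hy) if hx + hy > 0 else 0.0
```

Computed between a feature and the class, SU plays the role of the gene-ranking score; computed between two features, it flags the redundancy that plain ranking cannot see.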