Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
IOSR Journal of Computer Engineering (IOSR-JCE) e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 5, Ver. I (Sep – Oct. 2014), PP 108-111 www.iosrjournals.org Feature subset selection for high dimensional data with domain analysis using Semantic Mining Abdul Majeed K. M, Pallavi K.N, Tanvir Habib Sardar 1,3 Dept of ISE, PACE Mangalore, 2Dept of CSE, NMAMIT Nitte Abstract: Feature subset selection involves identifying a subset of the most useful features that produces compatible results as the original entire set of features. A feature selection algorithm may be evaluated from both the efficiency and effectiveness points of view. While the efficiency concerns the time required to find a subset of features, the effectiveness is related to the quality of the subset of features. Current existing algorithms for feature sub set selection works only based on conducting statistical test like Pearson test or symmetric uncertainty test to find the correlation between the features and apply threshold to filter redundant and irrelevant features. FAST proposed by Qinbao Song [9] uses symmetric uncertainty test for feature subset selection. In this work we extend the FAST algorithm by applying the domain analysis using semantic Mining to improve the relevance of the feature subset selection. Index Terms: Feature subset selection, filter method, feature clustering, graph-based clustering I. Introduction Feature selection is typically a search problem for finding an optimal or suboptimal subset of m features out of original M features. Feature selection is important in many pattern recognition problems for excluding irrelevant and redundant features. It allows reducing system complexity and processing time and often improves the recognition accuracy. With the aim of choosing a subset of good features with respect to the target concepts, feature subset selection is an effective way for reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. Many feature subset selection methods have been proposed and studied for machine learning applications. They can be divided into four broad categories: the Embedded, Wrapper, Filter, and Hybrid approaches. The embedded methods incorporate feature selection as a part of the training process and are usually specific to given learning algorithms, and therefore may be more efficient than the other three categories. Traditional machine learning algorithms like decision trees or artificial neural networks are examples of embedded approaches. The wrapper methods use the predictive accuracy of a predetermined learning algorithm to determine the goodness of the selected subsets, the accuracy of the learning algorithms is usually high. However, the generality of the selected features is limited and the computational complexity is large. The filter methods are independent of learning algorithms, with good generality. Their computational complexity is low, but the accuracy of the learning algorithms is not guaranteed. The hybrid methods are a combination of filter and wrapper methods by using a filter method to reduce search space that will be considered by the subsequent wrapper. They mainly focus on combining filter and wrapper methods to achieve the best possible performance with a particular learning algorithm with similar time complexity of the filter methods. The wrapper methods are computationally expensive and tend to over fit on small training sets .The filter methods, in addition to their generality, are usually a good choice when the number of features is very large. Thus, we will focus on the filter method in this paper. II. Literature Survey Feature subset selection can be viewed as the process of identifying and removing as many irrelevant and redundant features as possible. This is because: (i) irrelevant features do not contribute to the predictive accuracy, and (ii) redundant features do not redound to getting a better predictor for that they provide mostly information which is already present in other feature(s). Of the many feature subset selection algorithms, some can effectively eliminate irrelevant features but fail to handle redundant features, yet some of others can eliminate the irrelevant while taking care of the redundant features. Traditionally, feature subset selection research has focused on searching for relevant features. A well known example is Relief [1], which weighs each feature according to its ability to discriminate instances under different targets based on distance-based criteria function. However, Relief is ineffective at removing redundant features as two predictive but highly correlated features are likely both to be highly weighted [2]. Relief-F [3] extends Relief, enabling this method to work with noisy and incomplete data sets and to deal with multi-class www.iosrjournals.org 108 | Page Feature subset selection for high dimensional data with domain analysis using semantic Mining problems, but still cannot identify redundant features.FCBF [4] is a fast filter method which can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis. CMIM [5] iteratively picks features which maximize their mutual information with the class to predict, conditionally to the response of any feature already picked Butterworth et al. [6] proposed to cluster features using a special metric of Barthelemy-Montjardet distance, and then makes use of the dendrogram of the resulting cluster hierarchy to choose the most relevant attributes. Unfortunately, the cluster evaluation measure based on Barthelemy-Montjardet distance does not identify a feature subset that allows the classifiers to improve their original performance accuracy. Further more, even compared with other feature selection methods, the obtained accuracy is lower Hierarchical clustering also has been used to select features on spectral data. Van Dijk and Van Hullefor [7] proposed a hybrid filter/wrapper feature subset selection algorithm for regression. Krier et al. [8] presented a methodology combining hierarchical constrained clustering of spectral variables and selection of clusters by mutual information. Their feature clustering method is similar to that of Van Dijk and Van Hullefor [7] except that the former forces every cluster to contain consecutive features only. Both methods employed agglomerative hierarchical clustering to remove redundant features. With respect to the filter feature selection methods, the application of cluster analysis has been demonstrated to be more effective than traditional feature selection algorithms. In cluster analysis, graphtheoretic methods have been well studied and used in many applications. Their results have, sometimes, the best agreement with human performance . The general graph-theoretic clustering is simple: Compute a neighborhood graph of instances, then delete any edge in the graph that is much longer/shorter (according to some criterion) than its neighbors. The result is a forest and each tree in the forest represents a cluster. Qinbao Song proposed FAST [9] which is a minimum spanning tree based graph theoretic approach. The FAST algorithm works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to target classes is selected from each cluster to form the final subset of features. Features in different clusters are relatively independent; the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. III. Proposed Solution 3.1 Problem Statement Irrelevant features, along with redundant features, severely affect the accuracy of the learning machines from the data sets. Feature subset selection should be able to identify and remove as much of the irrelevant and redundant information as possible from the dataset. 3.2 Existing Solution FAST is the existing solution shown in Fig 3.1 Fig 3.1 Steps for FAST algorihm The FAST algorithm works in three steps. In the first step, the irrelevant features are removed In the second step, it construct the minimum spanning tree (MST) from a weighted complete graph In the third step, it will do the partitioning of the MST into a forest with each tree representing a cluster; and.the most representative feature that is strongly related to target classes is selected from each cluster to form the final subset of features www.iosrjournals.org 109 | Page Feature subset selection for high dimensional data with domain analysis using semantic Mining 3.3 Proposed Solution The existing solution “FAST “has one problem in the step 1 – irrelevant feature removal. The existing solution for step 1 is based only on entropy [deviation in the values across the dataset]. But entropy measure is not always accurate in identifying the irrelevant feature removal. The feature relevancy to the clustering depends on the domain area of classification. We propose to extract the semantic relation between the domain area of application & the features. Based on the domain analysis, the irrelevant features will be identified & removed in the Step 1. This will result in highly relevant features subsets. Semantic analysis will be drawn based on web mining the domain area concept. We will propose a algorithm to mine the web & provide the relation of any features to the domain area concept. For irrelevant feature removal we use domain analysis with semantic mining. For identifying the redundant features & removing it we use MST based clustering technique Fig 3.2 architecture for the proposed algorithm (semantic mining) 3.3.1Removing irrelevant Features In our proposed algorithm The feature relevancy to the clustering depends on the domain area of classification. We propose to extract the semantic relation between the domain area of application & the features. Based on the domain analysis, the irrelevant features will be identified & removed in the Step 1. This will result in highly relevant features subsets. The following flowchart explains the steps for the proposed algorithm. The user needs to upload the documents for mining and the attribute list for feature selection. Then the documents will be mined one by one for identifying and selecting the terms in the documents. A term is a selected word or item from the documents after mining. It also calculates the frequency of each term in the documents and if the frequencies are greater than 2 then it will be stored in a vector. Then the each attribute will be compared with the term one by one. If the attribute is matching with any of the term in the vector with a given threshold that attribute will be selected as feature. This process will be repeated until the entire attribute in the attribute list will be checked and compared with the list of terms. START Fß doc folder FKß Feature Keyworks file Array(Vect)ß Mine document in F FK(i) is selected feature FI ß FI (Array(vect)) STOP All items in FK checked no FK(i) in FI & rel(F(ki),class)> Thresh yes no Fig 3.3 flowchart for removing irrelevant features www.iosrjournals.org 110 | Page Feature subset selection for high dimensional data with domain analysis using semantic Mining 3.3.2Removing redundant Features Once the irrelevant features are removed, we start to find the redundant features and remove those features. In our algorithm, it involves (i) the construction of the minimum spanning tree (MST) from a weighted complete graph; (ii) the partitioning of the MST into a forest with each tree representing a cluster; and (iii) the selection of representative features from the clusters. IV. Conclusion and Future Work In this paper, we have detailed our solution for feature subset selection. Our mechanism effectively removes the irrelevant & redundant features from the feature set. Our proposed solution is different from the FAST algorithm [9] in following way. FAST uses the entropy measure for relevancy. But in our solution we have proposed domain analysis with semantic mining. As the future work, we have planned to compare the performance of the proposed algorithm with that of the FAST algorithm on the 35 publicly available image, microarray, and text data from the four different aspects of the proportion of selected features, runtime, classification accuracy of a given classifier, and the Win/Draw/Loss record. . References [1]. [2]. [3]. [4]. [5]. [6]. [7]. [8]. [9]. Kira K. and Rendell L.A., The feature selection problem: Traditional methods and a new algorithm, In Proceedings of Nineth National Conference on Artificial Intelligence, pp 129-134, 1992. Koller D. and Sahami M., Toward optimal feature selection, In Proceedings of International Conference on Machine Learning, pp 284-292, 1996. Kononenko I., Estimating Attributes: Analysis and Extensions of RELIEF, In Proceedings of the 1994 European Conference on Machine Learning, pp 171-182, 1994. Yu L. and Liu H., Feature selection for high-dimensional data: a fast correlation-based filter solution, in Proceedings of 20th International Conference on Machine Leaning, 20(2), pp 856-863, 2003. Fleuret F., Fast binary feature selection with conditional mutual Information, Journal of Machine Learning Research, 5, pp 15311555, 2004. Butterworth R., Piatetsky-Shapiro G. and Simovici D.A., On Feature Selection through Clustering, In Proceedings of the Fifth IEEE international Conference on Data Mining, pp 581-584, 2005. Van Dijk G. and Van Hulle M.M., Speeding Up the Wrapper Feature Subset Selection in Regression by Mutual Information Relevance and Redundancy Analysis, International Conference on Artificial Neural Networks, 2006 Krier C., Francois D., Rossi F. and Verleysen M., Feature clustering and mutual information for the selection of variables in spectral data, In Proc European Symposium on Artificial Neural Networks Advances in Computational Intelligence and Learning, pp 157162, 2007. A Fast Clustering-Based Feature Subset Selection Algorithm for High Dimensional Data - QinBao Song in "IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING VOL:25 NO:1 YEAR 2013" www.iosrjournals.org 111 | Page