Mining High Dimensional Data Using Attribute Clustering-Based Feature Subset Selection Algorithm

Vivek Ravindra Prasad Pandey 1, T. Venu 2, N. Subhash Chandra 3
1 Pursuing M.Tech (IT), Holy Mary Institute of Technology and Science, Keesara, Affiliated to JNTU-Hyderabad, A.P, India
2 Assistant Professor, Holy Mary Institute of Technology and Science, Keesara, Affiliated to JNTU-Hyderabad, A.P, India
3 Professor (CSE Department), Holy Mary Institute of Technology and Science, Keesara, Affiliated to JNTU-Hyderabad, A.P, India

Abstract: Attribute-based clustering selects or searches data items based on their attributes. Items with common attributes are clustered and stored together, so items with the same functionality can be retrieved quickly. Many feature selection methods have been proposed and studied for machine learning applications. They can be divided into four broad categories: Embedded, Wrapper, Filter, and Hybrid approaches. Embedded methods incorporate feature selection as part of the training process and are therefore often more efficient than the other approaches. In this paper we propose an attribute clustering-based algorithm for fast retrieval of items. Common attributes of the items are clustered into a single unit, so items sharing the same functionality can be found easily through their common attributes; this reduces search time compared with earlier work that handled each feature separately.

Keywords: attribute clustering, Embedded, Wrapper, Filter, Hybrid.

1. INTRODUCTION

With the aim of selecting a subset of good attributes with respect to the target concept, feature subset selection is an effective approach for reducing dimensionality and removing irrelevant data, i.e., data not related to the target. Many feature subset selection methods have been proposed and studied for machine learning applications. They fall into four broad categories: the Embedded, Wrapper, Filter, and Hybrid approaches. The embedded methods incorporate feature selection as a part of the training process and are usually specific to given learning algorithms, and therefore may be more efficient than the other three categories. Traditional machine learning algorithms such as decision trees or artificial neural networks are examples of embedded approaches. The wrapper methods use the predictive accuracy of a predetermined learning algorithm to evaluate the goodness of the selected subsets; the accuracy of the learning algorithms is usually high, but the generality of the selected features is limited and the computational complexity is large. The filter methods are independent of learning algorithms, with good generality; their computational complexity is low, but the accuracy of the learning algorithms is not guaranteed. The hybrid methods are a combination of filter and wrapper methods: a filter method is used to reduce the search space that will be considered by the subsequent wrapper. They mainly focus on combining filter and wrapper methods to achieve the best possible performance with a particular learning algorithm while keeping a time complexity similar to that of the filter methods. The wrapper methods are computationally expensive and tend to overfit on small training sets. The filter methods, in addition to their generality, are usually a good choice when the number of features is very large.
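To make the distinction between the filter and wrapper categories concrete, the following minimal sketch contrasts a filter-style ranking based on mutual information with a wrapper-style recursive feature elimination. It is purely illustrative and not part of the original study: the synthetic data, the feature counts, and the choice of scikit-learn estimators are our own assumptions.

# Illustrative sketch only: filter vs. wrapper feature selection.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 5 of which are informative.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, n_redundant=5, random_state=0)

# Filter: rank features by mutual information with the class,
# independently of any learning algorithm.
filter_selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_filter = filter_selector.fit_transform(X, y)

# Wrapper: let a specific learner (logistic regression) score candidate
# subsets by repeatedly dropping the weakest feature.
wrapper_selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
X_wrapper = wrapper_selector.fit_transform(X, y)

print("filter keeps features:", filter_selector.get_support(indices=True))
print("wrapper keeps features:", wrapper_selector.get_support(indices=True))

The filter ranking runs in a single pass over the features, while the wrapper refits the learner many times, which matches the cost and accuracy trade-off described above.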
In this paper, we therefore focus on attribute-based clustering. With respect to cluster-based feature selection, the application of cluster analysis has been demonstrated to be more effective than traditional feature selection algorithms; Dhillon et al., for example, employed clustering of words to reduce the dimensionality of text data. In cluster analysis, graph-theoretic methods have been well studied and used in many applications. Their results have, in some cases, the best agreement with human performance. The general graph-theoretic clustering scheme is simple: compute a neighborhood graph of instances, then delete any edge in the graph that is much longer or shorter than its neighbors. The result is a forest, and each tree in the forest represents a cluster. In our study, we apply graph-theoretic clustering methods to features. In particular, we adopt minimum spanning tree (MST) based clustering algorithms, because they do not assume that data points are grouped around centers or separated by a regular geometric curve, and they have been widely used in practice. Based on the MST method, we propose an attribute clustering-based feature subset selection algorithm. The algorithm works in two steps. In the first step, features are divided into clusters using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to the target classes is selected from each cluster to form the final subset of features. Features in different clusters are relatively independent, so the clustering-based strategy has a high probability of producing a subset of useful and independent features. The proposed attribute clustering-based subset selection algorithm was tested on 35 publicly available image, microarray, and text data sets. The experimental results show that, compared with five other representative feature subset selection algorithms, the proposed algorithm not only reduces the number of features, but also improves the performance of four well-known classifiers of different types. The remainder of the paper is organized as follows: in Section 2 we describe the related work; in Section 3 we present the new feature subset selection algorithm; in Section 4 we report extensive experimental results in support of the proposed algorithm; finally, we summarize the study and draw some conclusions.

Distributional clustering has been used to group words based either on their participation in particular grammatical relations with other words, by Pereira et al., or on the distribution of class labels associated with each word, by Baker and McCallum [3]. Since distributional clustering of words is agglomerative in nature and results in sub-optimal word clusters at high computational cost, Dhillon et al. proposed a new information-theoretic divisive algorithm for word clustering and applied it to text classification. Butterworth et al. [7] proposed to cluster features using a special distance metric and then use the dendrogram of the resulting cluster hierarchy to choose the most relevant attributes. Unfortunately, the cluster evaluation measure based on the Barthelemy-Montjardet distance does not identify a feature subset that allows the classifiers to improve their original prediction accuracy.
Furthermore, even compared with other feature selection methods, the accuracy obtained is lower.

2. RELATED WORK

Attribute clustering-based feature subset selection can be viewed as the process of identifying and removing as many irrelevant and redundant features as possible. This is because (i) irrelevant features do not contribute to the predictive accuracy [33], and (ii) redundant features do not help to obtain a better predictor, since they mostly provide information that is already present in other features. Of the many feature subset selection algorithms, some can effectively remove irrelevant features but fail to handle redundant features [23], [31], [37], [34], [45], [59], while others can remove the irrelevant features while also taking care of the redundant ones [5], [29], [42], [68]. Our proposed attribute clustering-based algorithm falls into the second group. Traditionally, feature subset selection research has focused on searching for relevant features. A well-known example is Relief [34], which weights each feature according to its ability to discriminate instances under different targets based on a distance-based criterion function. However, Relief is ineffective at removing redundant features, since two predictive but highly correlated features are likely both to be highly weighted [36]. Relief-F [37] extends Relief, enabling the method to work with noisy and incomplete data sets and to deal with multi-class problems, but it still cannot identify redundant features. Yet, along with irrelevant features, redundant features also affect the speed and accuracy of learning algorithms and thus should be removed as well. A good feature subset is one that contains features highly correlated with the target, yet uncorrelated with each other. FCBF [68], [71] is a fast filter method which can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis. Another approach iteratively picks the feature that maximizes its mutual information with the class, conditioned on the response of any feature already picked. Different from these algorithms, our proposed attribute clustering-based algorithm employs a clustering-based method to select features. Recently, hierarchical clustering has also been adopted for word selection in the context of text classification.

3. ATTRIBUTE CLUSTER BASED ALGORITHM

3.1 Framework and Definition

Irrelevant features, along with redundant features, severely affect the accuracy of the learning machines. Thus, attribute clustering-based subset selection should be able to identify and remove as much of the irrelevant and redundant information as possible. Moreover, a good feature subset contains features highly correlated with (predictive of) the class, yet uncorrelated with each other.

Fig. 1: Framework of the proposed attribute cluster subset selection algorithm

Keeping these in mind, we develop a novel algorithm which can efficiently deal with both irrelevant and redundant features and obtain a good feature subset. We achieve this through a new attribute clustering-based subset selection framework (shown in Fig. 1) composed of two connected components: irrelevant feature removal and redundant feature elimination. The former obtains features relevant to the target concept by eliminating irrelevant ones, and the latter removes redundant features from the relevant ones by selecting representatives from different feature clusters, and thus produces the final subset.
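To make the two components concrete, the short sketch below builds a tiny synthetic data set with one relevant feature, one redundant copy of it, and one irrelevant feature, and scores them with mutual information. It is purely illustrative: the data, the names, and the use of scikit-learn's mutual information estimators (in place of the symmetric uncertainty measure used later) are our own assumptions.

# Illustrative only: what "irrelevant" and "redundant" mean in this framework.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)                 # binary class
relevant = y ^ (rng.random(500) < 0.1)           # noisy copy of the class
redundant = relevant.copy()                      # duplicates the relevant feature
irrelevant = rng.integers(0, 2, size=500)        # unrelated noise
X = np.column_stack([relevant, redundant, irrelevant])

# Relevance: association of each feature with the class.
print("MI with class:", mutual_info_classif(X, y, discrete_features=True))
# Redundancy: association between the two correlated features.
print("MI(relevant, redundant):", mutual_info_score(relevant, redundant))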
The irrelevant feature removal is straightforward once the right relevance measure is defined or selected, while the redundant feature elimination is somewhat more sophisticated. In our proposed attribute clustering-based algorithm, it involves (i) the construction of a minimum spanning tree (MST) from a weighted complete graph; (ii) the partitioning of the MST into a forest with each tree representing a cluster; and (iii) the selection of representative features from the clusters.

3.2 Algorithm and Analysis

The proposed attribute clustering-based algorithm logically consists of three steps: (i) removing irrelevant features, (ii) constructing an MST from the relevant ones, and (iii) partitioning the MST and selecting representative features. For a data set D with features F = {F1, F2, ..., Fm} and class C, we compute the T-Relevance SU(Fi, C) value for each feature Fi (1 ≤ i ≤ m) in the first step. The features whose SU(Fi, C) values are greater than a predefined threshold θ comprise the target-relevant feature subset F' = {F'1, F'2, ..., F'k} (k ≤ m). In the second step, we first calculate the F-Correlation SU(F'i, F'j) value for each pair of features F'i and F'j (F'i, F'j ∈ F', i ≠ j). Then, viewing features F'i and F'j as vertices and SU(F'i, F'j) (i ≠ j) as the weight of the edge between vertices F'i and F'j, a weighted complete graph G = (V, E) is constructed, where V = {F'i | F'i ∈ F', i ∈ [1, k]} and E = {(F'i, F'j) | F'i, F'j ∈ F', i, j ∈ [1, k], i ≠ j}. As symmetric uncertainty is symmetric, the F-Correlation SU(F'i, F'j) is symmetric as well, so G is an undirected graph. The complete graph G reflects the correlations among all the target-relevant features. Unfortunately, graph G has k vertices and k(k-1)/2 edges. For high-dimensional data it is heavily dense, and the edges with different weights are strongly interwoven. Moreover, the decomposition of a complete graph is NP-hard [26]. Thus, for graph G we build an MST, which connects all vertices such that the sum of the weights of the edges is minimal, using the well-known Prim algorithm [54]. The weight of edge (F'i, F'j) is the F-Correlation SU(F'i, F'j). After building the MST, in the third step, we first remove from the MST the edges {(F'i, F'j) | F'i, F'j ∈ F', i, j ∈ [1, k], i ≠ j} whose weights are smaller than both of the T-Relevance values SU(F'i, C) and SU(F'j, C). Each deletion results in two disconnected trees T1 and T2. Taking the set of vertices of any one of the final trees to be V(T), we have the property that for each pair of vertices F'i, F'j ∈ V(T), SU(F'i, F'j) ≥ SU(F'i, C) ∨ SU(F'i, F'j) ≥ SU(F'j, C) always holds.
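Before the formal algorithm listing below, the following Python sketch illustrates the three steps on an already discretized data matrix. It is only an illustration, not the authors' implementation: the symmetric uncertainty formula is the standard definition SU(X, Y) = 2·I(X;Y) / (H(X) + H(Y)), which the paper uses but does not restate; SciPy's generic MST routine stands in for Prim's algorithm; and all function and variable names (attribute_cluster_select, theta, and so on) are our own.

# Minimal, illustrative sketch of the three-step procedure described above.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def entropy(x):
    """Shannon entropy of a discrete variable."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    joint = np.array([f"{a}|{b}" for a, b in zip(x, y)])   # joint distribution labels
    mi = hx + hy - entropy(joint)
    return 0.0 if hx + hy == 0 else 2.0 * mi / (hx + hy)

def attribute_cluster_select(X, y, theta):
    """X: (n_samples, m) discretized feature matrix; y: class labels."""
    m = X.shape[1]
    # Step 1: keep features whose T-Relevance SU(Fi, C) exceeds theta.
    t_rel = np.array([symmetric_uncertainty(X[:, i], y) for i in range(m)])
    relevant = np.where(t_rel > theta)[0]
    k = len(relevant)
    if k <= 1:
        return list(relevant)
    # Step 2: weighted complete graph of F-Correlations, then its MST.
    # A tiny offset keeps zero-SU edges explicit; it does not change the MST.
    W = np.zeros((k, k))
    for a in range(k):
        for b in range(a + 1, k):
            su = symmetric_uncertainty(X[:, relevant[a]], X[:, relevant[b]])
            W[a, b] = W[b, a] = su + 1e-12
    mst = minimum_spanning_tree(csr_matrix(W)).toarray()   # stands in for Prim
    # Step 3: drop MST edges whose F-Correlation is smaller than both
    # T-Relevance values of their endpoints, then pick from each resulting
    # tree the feature with the largest T-Relevance.
    adj = np.zeros((k, k), dtype=bool)
    for a, b in zip(*np.nonzero(mst)):
        keep = not (mst[a, b] < t_rel[relevant[a]] and mst[a, b] < t_rel[relevant[b]])
        adj[a, b] = adj[b, a] = keep
    selected, seen = [], set()
    for start in range(k):                 # connected components of the forest
        if start in seen:
            continue
        comp, stack = [], [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.append(node)
            stack.extend(np.nonzero(adj[node])[0])
        best = max(comp, key=lambda idx: t_rel[relevant[idx]])
        selected.append(relevant[best])
    return selected

A call such as attribute_cluster_select(X_discrete, y, theta=0.1) would return the indices of the representative features; the threshold value here is arbitrary, whereas Section 4.2 describes the ranking-based heuristic used in the experiments.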
Algorithm: Attribute Cluster
Inputs: D(F1, F2, ..., Fm, C) - the given data set; θ - the T-Relevance threshold.
Output: S - the selected feature subset.

//==== Part 1: Irrelevant Feature Removal ====
1  for i = 1 to m do
2      T-Relevance = SU(Fi, C)
3      if T-Relevance > θ then
4          S = S ∪ {Fi};
//==== Part 2: Minimum Spanning Tree Construction ====
5  G = NULL;  // G is a complete graph
6  for each pair of features {F'i, F'j} ⊂ S do
7      F-Correlation = SU(F'i, F'j)
8      add F'i and/or F'j to G with F-Correlation as the weight of the corresponding edge;
9  minSpanTree = Prim(G);  // using Prim's algorithm to generate the minimum spanning tree
//==== Part 3: Tree Partition and Representative Feature Selection ====
10 Forest = minSpanTree
11 for each edge Eij ∈ Forest do
12     if SU(F'i, F'j) < SU(F'i, C) ∧ SU(F'i, F'j) < SU(F'j, C) then
13         Forest = Forest − Eij
14 S = ∅
15 for each tree Ti ∈ Forest do
16     FR = argmax F'k ∈ Ti SU(F'k, C)
17     S = S ∪ {FR};
18 return S

4. EMPIRICAL STUDY

4.1 Data Source

To evaluate the performance and effectiveness of our proposed attribute clustering-based algorithm, to verify whether or not the method is potentially useful in practice, and to allow other researchers to confirm our results, 35 publicly available data sets were used. The numbers of features of the data sets vary from 37 to 49,152 with a mean of 7,874. The dimensionality of 54.3% of the data sets exceeds 5,000, and 28.6% of the data sets have more than 10,000 features. The 35 data sets cover a range of application domains such as text, bio microarray, and image data classification. Table 1 shows the corresponding statistical information. Note that for the data sets with continuous-valued features, the well-known off-the-shelf MDL method [74] was used to discretize the continuous values.

4.2 Experiment Setup

To evaluate the performance of our proposed attribute clustering-based algorithm and compare it with other feature selection algorithms in a fair and reasonable way:

1) The proposed algorithm is compared with five representative feature selection algorithms of different types: (i) FCBF [68], [71], (ii) ReliefF [57], (iii) CFS [29], (iv) Consist [14], and (v) FOCUS-SF [2]. FCBF and ReliefF evaluate features individually. For FCBF, in the experiments we set the relevance threshold to be the SU value of the ⌊m / log m⌋-th ranked feature for each data set (m is the number of features in a given data set), as recommended by Yu and Liu [68], [71]. ReliefF searches for the nearest neighbors of instances of different classes and weights features according to how well they differentiate instances of different classes. The other three feature selection algorithms are based on subset evaluation. CFS exploits best-first search based on the evaluation of a subset that contains features highly correlated with the target concept, yet uncorrelated with each other. The Consist method searches for the minimal subset that separates classes as consistently as the full set can, under a best-first search strategy. FOCUS-SF is a variation of FOCUS [2]. FOCUS has the same evaluation strategy as Consist, but it examines all subsets of features. For time efficiency, FOCUS-SF replaces the exhaustive search in FOCUS with sequential forward selection. For our proposed attribute clustering-based algorithm, we heuristically set θ to be the SU value of the ⌊√m ∗ lg m⌋-th ranked feature for each data set, as sketched below.
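The ranking-based thresholds above can be made concrete in a few lines of code. This is only an illustrative sketch: rank_based_threshold is a hypothetical helper, not part of the published algorithm, and it assumes the per-feature SU scores have already been computed (for example with the symmetric_uncertainty function sketched in Section 3.2).

import numpy as np

def rank_based_threshold(su_scores, count):
    """Return the SU value of the `count`-th ranked feature (1-based)."""
    ranked = np.sort(su_scores)[::-1]               # descending SU ranking
    count = int(np.clip(count, 1, len(ranked)))
    return ranked[count - 1]

def fcbf_threshold(su_scores):
    m = len(su_scores)
    # The paper does not state the logarithm base; natural log is assumed here.
    return rank_based_threshold(su_scores, np.floor(m / np.log(m)))

def proposed_threshold(su_scores):
    m = len(su_scores)
    return rank_based_threshold(su_scores, np.floor(np.sqrt(m) * np.log2(m)))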
2) Four classification algorithms of different types are employed to classify the data sets before and after feature selection. They are (i) the probability-based Naive Bayes (NB), (ii) the tree-based C4.5, (iii) the instance-based lazy learning algorithm IB1, and (iv) the rule-based RIPPER. Naive Bayes uses a probabilistic approach to classification by multiplying the individual probabilities of every feature-value pair. The algorithm assumes independence among the features and even so often yields excellent classification results. The decision tree learning algorithm C4.5 is an extension of ID3 that accounts for missing values, continuous attribute value ranges, pruning of decision trees, rule derivation, and so on. The trees consist of nodes (features) that are selected by information entropy. The instance-based learner IB1 is a single-nearest-neighbor algorithm; it classifies entities by taking the class of the closest associated vector in the training set according to a distance metric. It is the simplest among the algorithms used in our study. The inductive rule learner RIPPER (Repeated Incremental Pruning to Produce Error Reduction) [12] is a propositional rule learner that defines a rule-based detection model and seeks to improve it iteratively by using different heuristic techniques. The constructed rule set is then used to classify new instances.

4.3 Results and Analysis

In this section we present the experimental results in terms of the number of selected features, the time to obtain the feature subset, the classification accuracy, and the Win/Draw/Loss record. To establish the statistical significance of the results, we performed the nonparametric Friedman test [24] followed by the Nemenyi post-hoc test [47], as recommended by Demsar [17] and Garcia and Herrera [25], to statistically compare the algorithms on multiple data sets. The Friedman and Nemenyi test results are therefore reported as well.

5. CONCLUSION

In this paper, we have presented a novel attribute clustering-based feature subset selection algorithm for high-dimensional data. The algorithm involves (i) removing irrelevant features, (ii) constructing a minimum spanning tree from the relevant ones, and (iii) partitioning the MST and selecting representative features. In this algorithm, a cluster consists of features; each cluster is treated as a single feature and thus the dimensionality is drastically reduced. We have compared the performance of the proposed algorithm with that of the five well-known feature selection algorithms FCBF, ReliefF, CFS, Consist, and FOCUS-SF on the publicly available image, microarray, and text data, in terms of the proportion of selected features, the classification accuracy of a given classifier, and the Win/Draw/Loss record. Generally, the proposed algorithm obtained the best proportion of selected features, the best runtime, and the best classification accuracy for Naive Bayes, C4.5, and RIPPER, and the second-best classification accuracy for IB1. The Win/Draw/Loss records confirm these conclusions. We also found that the attribute clustering-based algorithm obtains rank 1 for microarray data, rank 2 for text data, and rank 3 for image data in terms of classification accuracy across the four types of classifiers, and that CFS is a good alternative. At the same time, FCBF is a good alternative for text and image data, and Consist and FOCUS-SF are alternatives for text data. For future work, we plan to explore different types of correlation measures and to study some formal properties of the feature space.

REFERENCES
[1] Almuallim H. and Dietterich T.G., Learning boolean concepts in the presence of many irrelevant features, Artificial Intelligence, 69(1-2), pp 279-305, 1994.
[2] Almuallim H. and Dietterich T.G., Algorithms for Identifying Relevant Features, In Proceedings of the 9th Canadian Conference on AI, pp 38-45, 1992.
[3] Baker L.D. and McCallum A.K., Distributional clustering of words for text classification, In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 96-103, 1998.
[4] Arauzo-Azofra A., Benitez J.M. and Castro J.L., A feature set measure based on Relief, In Proceedings of the Fifth International Conference on Recent Advances in Soft Computing, pp 104-109, 2004.
[5] Bell D.A. and Wang H., A formalism for relevance and its application in feature subset selection, Machine Learning, 41(2), pp 175-195, 2000.
[6] Battiti R., Using mutual information for selecting features in supervised neural net learning, IEEE Transactions on Neural Networks, 5(4), pp 537-550, 1994.
[7] Butterworth R., Piatetsky-Shapiro G. and Simovici D.A., On Feature Selection through Clustering, In Proceedings of the Fifth IEEE International Conference on Data Mining, pp 581-584, 2005.
[8] Biesiada J. and Duch W., Feature selection for high-dimensional data: a Pearson redundancy based filter, Advances in Soft Computing, 45, pp 242-249, 2008.
[9] Chanda P., Cho Y., Zhang A. and Ramanathan M., Mining of Attribute Interactions Using Information Theoretic Metrics, In Proceedings of IEEE International Conference on Data Mining Workshops, pp 350-355, 2009.
[10] Cardie C., Using decision trees to improve case-based learning, In Proceedings of the Tenth International Conference on Machine Learning, pp 25-32, 1993.

AUTHOR PROFILE

Vivek Ravindra Prasad Pandey is pursuing the M.Tech (IT) at Holy Mary Institute of Technology and Science, Keesara, Ranga Reddy Dist., affiliated to JNTU Hyderabad.

T. Venu is an Assistant Professor at Holy Mary Institute of Technology and Science, Keesara, Ranga Reddy Dist., affiliated to JNTU Hyderabad.

Dr. N. Subhash Chandra is a Professor (CSE Department) at Holy Mary Institute of Technology and Science, Keesara, Ranga Reddy Dist., affiliated to JNTU Hyderabad.