International Journal of Computer Trends and Technology (IJCTT) – volume 15 number 2 – Sep 2014
Mining High Dimensional Data Using Attribute Clustering-Based Feature Subset Selection Algorithm
Vivek Ravindra Prasad Pandey 1, T. Venu 2, N. Subhash Chandra 3
1 Pursuing M.Tech (IT), Holy Mary Institute of Technology and Science, Keesara, Affiliated to JNTU-Hyderabad, A.P, India
2 Assistant Professor, Holy Mary Institute of Technology and Science, Keesara, Affiliated to JNTU-Hyderabad, A.P, India
3 Professor (CSE Department), Holy Mary Institute of Technology and Science, Keesara, Affiliated to JNTU-Hyderabad, A.P, India
Abstract: Attribute-based clustering involves selecting or searching data items based on their attributes. Items that share common attributes are clustered and stored in a single place, so the items can be searched easily. Many feature selection methods have been proposed and studied for machine learning applications. They fall into four broad categories: Embedded, Wrapper, Filter, and Hybrid approaches. Among these, the embedded methods incorporate feature selection as part of the training process, which makes them more efficient than the other approaches. In the proposed system we introduce an attribute-based clustering algorithm for quick item search. Common attributes of the items are clustered into a single unit, so items with the same functionality can be found easily through their shared attributes, which reduces search time; earlier work treated each feature separately, whereas here the search is based on attributes.
Keywords: attribute clustering, Embedded, Wrapper, Filter, Hybrid.
1. INTRODUCTION
With the aim of selecting a subset of good attributes with respect to the target concepts, feature subset selection is an effective approach for reducing dimensionality and removing irrelevant data. Many feature subset selection methods have been suggested and studied for machine learning applications. They can be divided into four broad categories: the Embedded, Wrapper, Filter, and Hybrid approaches.
The embedded methods incorporate feature selection as a part of the training process and are usually specific to given learning algorithms, and therefore may be more efficient than the other three categories. Traditional machine learning algorithms such as decision trees or artificial neural networks are examples of embedded approaches. The wrapper methods use the predictive accuracy of a predetermined learning algorithm to evaluate the goodness of the selected subsets; the accuracy of the learning algorithms is usually high. However, the generality of the selected features is limited and the computational complexity is large. The filter methods are independent of learning algorithms, with good generality. Their computational complexity is low, but the accuracy of the learning algorithms is not guaranteed. The hybrid methods are a combination of filter and wrapper methods: a filter method is used to reduce the search space that will be considered by the subsequent wrapper. They mainly focus on combining filter and wrapper methods to achieve the best possible performance with a particular learning algorithm at a time complexity comparable to that of the filter methods. The wrapper methods are computationally expensive and tend to overfit on small training sets. The filter methods, in addition to their generality, are usually a good choice when the number of features is very large. Thus, we focus on attribute-based clustering in this paper.
With respect to attribute cluster-based selection techniques, the application of cluster analysis has been demonstrated to be more effective than traditional feature selection algorithms; Dhillon et al. employed attribute-based clustering of words to reduce the dimensionality of text data. In cluster analysis, graph-theoretic methods have been well studied and used in many applications. Their results have, in some instances, shown good agreement with human performance. The general graph-theoretic clustering is simple: compute a neighborhood graph of instances, and then delete any edge in the graph that is much longer or shorter than its neighbors. The result is treated as a forest, and each tree in the forest denotes a cluster. In our study, we apply graph-theoretic clustering methods to features. In particular, we adopt minimum spanning tree (MST)-based clustering algorithms, because they do not assume that data points are grouped around centers or separated by a regular geometric curve, and they have been widely used in practice. Based on the MST method, we propose an attribute clustering-based feature selection algorithm. The attribute cluster-based algorithm works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to the target classes is selected from each cluster to form the final subset of features. Features in different clusters are relatively independent, so the clustering-based strategy has a high probability of producing a subset of useful and independent features.
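The classic graph-theoretic clustering just described can be illustrated with a minimal sketch (ours, not the paper's code): build a minimum spanning tree over the instances with networkx, drop edges that are much longer than the typical edge, and read the remaining connected components as clusters. The median-based cutoff and the toy points are illustrative assumptions.

# Minimal sketch of graph-theoretic (MST-based) clustering of instances.
# Assumptions: networkx is available; "much longer" is approximated here by a
# simple multiple-of-the-median-edge-length rule chosen for illustration only.
import itertools
import math
import statistics
import networkx as nx

def mst_cluster(points, factor=3.0):
    """Cluster points by cutting unusually long MST edges."""
    g = nx.Graph()
    for (i, p), (j, q) in itertools.combinations(enumerate(points), 2):
        g.add_edge(i, j, weight=math.dist(p, q))          # complete distance graph
    mst = nx.minimum_spanning_tree(g)
    weights = [d["weight"] for _, _, d in mst.edges(data=True)]
    cutoff = factor * statistics.median(weights)          # "much longer than its neighbors"
    mst.remove_edges_from([(u, v) for u, v, d in mst.edges(data=True)
                           if d["weight"] > cutoff])
    return list(nx.connected_components(mst))             # each remaining tree is a cluster

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
    print(mst_cluster(pts))                                # e.g. [{0, 1, 2}, {3, 4}]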
The suggested attribute cluster subset selection algorithm was tested on 35 publicly available image, microarray, and text data sets. The experimental results show that, compared with five other types of feature subset selection algorithms, the proposed algorithm not only reduces the number of features, but also improves the performance of four well-known types of classifiers. The remaining part of the article is organized as follows: In Section 2, we describe the related work. In Section 3, we present the new feature subset selection algorithm. In Section 4, we report extensive experimental results in support of the proposed algorithm. Finally, we summarize the present study and draw some conclusions.
Distributional clustering has been used to group words into clusters based either on their participation in particular grammatical relations with other words, by Pereira et al., or on the distribution of class labels associated with each word, by Baker and McCallum. As the distributional clustering of words is agglomerative in nature and results in sub-optimal word clusters at high computational cost, Dhillon et al. proposed a new information-theoretic divisive algorithm for word clustering and applied it to text classification. Butterworth et al. [7] proposed to cluster features using a special metric of distance and then make use of the dendrogram of the resulting cluster hierarchy to choose the most relevant attributes. Unfortunately, the cluster evaluation measure based on the Barthelemy-Montjardet distance does not identify a feature subset that allows the classifiers to improve their original classification accuracy. Furthermore, even compared with other feature selection techniques, the obtained accuracy is lower.
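A minimal sketch of the dendrogram-style idea attributed to Butterworth et al. above (not their exact method): features are clustered hierarchically by a correlation distance with SciPy, the dendrogram is cut into a fixed number of clusters, and one representative per cluster is kept. The distance, the number of clusters, and the representative rule are simplifying assumptions made here for illustration.

# Sketch: hierarchical clustering of feature columns, one representative kept
# per cluster. X is an instances-by-features NumPy array, y numeric class labels.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_features(X, y, n_clusters=5):
    """Return column indices of one representative feature per cluster."""
    corr = np.corrcoef(X, rowvar=False)                  # feature-feature correlation
    dist = 1.0 - np.abs(corr)                            # correlation distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    selected = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        # keep the member most correlated (in absolute value) with the class
        rel = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in members]
        selected.append(int(members[int(np.argmax(rel))]))
    return sorted(selected)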
2 RELATED WORK
Attribute cluster-based selection can be viewed as the process of identifying and eliminating as many irrelevant and redundant features as possible. This is because (i) irrelevant attributes do not contribute to the predictive accuracy [33], and (ii) redundant attributes do not help in obtaining a better predictor, since they provide mostly information that is already present in other attribute(s). Of the many feature subset selection algorithms, some can effectively eliminate irrelevant attributes but fail to handle redundant features [23], [31], [37], [34], [45], [59], while some others can eliminate the irrelevant attributes while also taking care of the redundant features [5], [29], [42], [68]. Our suggested attribute cluster-based algorithm falls into the second group.
Traditionally, attribute cluster-based selection research has focused on searching for relevant features. A well-known example is Relief [34], which weighs each attribute according to its ability to discriminate instances under different targets based on a distance-based criterion function. However, Relief is ineffective at removing redundant attributes, as two predictive but highly correlated attributes are likely both to be highly weighted [36]. Relief-F [37] extends Relief, allowing the technique to work with noisy and incomplete data sets and to deal with multi-class problems, but it still cannot identify redundant features. However, along with irrelevant features, redundant features also affect the speed and accuracy of learning algorithms, and thus should be removed as well. A good feature subset is one that contains features highly correlated with the target, yet uncorrelated with each other. FCBF ([68], [71]) is a fast filter method which can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis. Another line of work iteratively picks the attribute that maximizes its mutual information with the class to predict, conditionally on the response of any feature already picked. Different from these algorithms, our suggested attribute cluster-based algorithm employs a clustering-based method to select features.
Recently, hierarchical clustering has also been adopted for word selection in the context of text classification.
3 ATTRIBUTE CLUSTER BASED ALGORITHM
3.1 Framework and Definition
Irrelevant features, along with redundant attributes, severely affect the accuracy of the learning machines. Thus, attribute cluster-based subset selection should be able to identify and remove as much of the irrelevant and redundant information as possible. Moreover, "good attribute cluster subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with each other."
Fig. 1: Framework of the proposed attribute cluster subset selection algorithm.
Keeping these in mind, we develop a novel algorithm which can efficiently deal with both irrelevant and redundant features and obtain a good feature subset. We accomplish this through a new attribute cluster-based subset selection framework (shown in Fig. 1) which is composed of two connected components: irrelevant attribute removal and redundant attribute elimination. The former obtains attributes relevant to the target concept by eliminating irrelevant ones, and the latter eliminates redundant attributes from the relevant ones by selecting representatives from different attribute clusters, and thus produces the final subset. The irrelevant attribute removal is straightforward once the right relevance measure is defined or selected, while the redundant attribute elimination is a bit more sophisticated. In our proposed attribute cluster-based algorithm, it involves (i) the construction of the minimum spanning tree (MST) from a weighted complete graph; (ii) the partitioning of the MST into a forest with each tree representing a cluster; and (iii) the selection of representative features from the clusters.
3.2 Algorithm and analysis
The proposed attribute cluster-based algorithm logically consists of three steps: (i) removing irrelevant features, (ii) constructing an MST from the relevant ones, and (iii) partitioning the MST and selecting representative features.
For a data set with m attributes F = {F1, F2, ..., Fm} and class C, we compute the T-Relevance SU(Fi, C) value for each attribute Fi (1 ≤ i ≤ m) in the first step. The attributes whose SU(Fi, C) values are greater than a predefined threshold θ comprise the target-relevant feature subset F′ = {F′1, F′2, ..., F′k} (k ≤ m).
In the second step, we first calculate the F-Correlation SU(F′i, F′j) value for each pair of features F′i and F′j (F′i, F′j ∈ F′ ∧ i ≠ j). Then, viewing features F′i and F′j as vertices and SU(F′i, F′j) (i ≠ j) as the weight of the edge between vertices F′i and F′j, a weighted complete graph G = (V, E) is constructed, where V = {F′i ∣ F′i ∈ F′ ∧ i ∈ [1, k]} and E = {(F′i, F′j) ∣ F′i, F′j ∈ F′ ∧ i, j ∈ [1, k] ∧ i ≠ j}. As symmetric uncertainty is symmetric, the F-Correlation SU(F′i, F′j) is symmetric as well, thus G is an undirected graph. The complete graph G reflects the correlations among all the target-relevant features. Unfortunately, graph G has k vertices and k(k−1)/2 edges. For high-dimensional data, it is heavily dense and the edges with different weights are strongly interwoven. Moreover, the decomposition of a complete graph is NP-hard [26]. Thus for graph G, we build an MST, which connects all vertices such that the sum of the weights of the edges is minimum, using the well-known Prim algorithm [54]. The weight of edge (F′i, F′j) is the F-Correlation SU(F′i, F′j).
After building the MST, in the third step, we first remove from the MST the edges {(F′i, F′j) ∣ F′i, F′j ∈ F′ ∧ i, j ∈ [1, k] ∧ i ≠ j} whose weights are smaller than both of the T-Relevance values SU(F′i, C) and SU(F′j, C). Each deletion results in two disconnected trees T1 and T2. Assuming the set of vertices in any one of the final trees to be V(T), we have the property that for each pair of vertices F′i, F′j ∈ V(T), SU(F′i, F′j) ≥ SU(F′i, C) ∨ SU(F′i, F′j) ≥ SU(F′j, C) always holds.
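The T-Relevance and F-Correlation above are both instances of symmetric uncertainty SU. The paper does not reproduce its computation; a common definition is SU(X, Y) = 2 · IG(X∣Y) / (H(X) + H(Y)), where H is entropy and IG is information gain (mutual information). The sketch below, assuming discretized features, is illustrative rather than the authors' code.

# Minimal sketch of symmetric uncertainty for discrete (discretized) variables.
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def joint_entropy(xs, ys):
    return entropy(list(zip(xs, ys)))

def symmetric_uncertainty(xs, ys):
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:                                # both variables are constant
        return 0.0
    info_gain = hx + hy - joint_entropy(xs, ys)     # mutual information IG(X|Y)
    return 2.0 * info_gain / (hx + hy)

# T-Relevance of a feature column f w.r.t. class labels c: symmetric_uncertainty(f, c)
# F-Correlation between two feature columns f1 and f2:     symmetric_uncertainty(f1, f2)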
Algorithm: Attribute Cluster
inputs: D(F1, F2, ..., Fm, C) - the given data set
        θ - the T-Relevance threshold.
output: S - selected feature subset.
//==== Part 1: Irrelevant Feature Removal ====
1  for i = 1 to m do
2      T-Relevance = SU(Fi, C)
3      if T-Relevance > θ then
4          S = S ∪ {Fi};
//==== Part 2: Minimum Spanning Tree Construction ====
5  G = NULL; // G is a complete graph
6  for each pair of features {F′i, F′j} ⊂ S do
7      F-Correlation = SU(F′i, F′j)
8      add F′i and/or F′j to G with F-Correlation as the weight of the corresponding edge;
9  minSpanTree = Prim(G); // use Prim's algorithm to generate the minimum spanning tree
//==== Part 3: Tree Partition and Representative Feature Selection ====
10 Forest = minSpanTree
11 for each edge Eij ∈ Forest do
12     if SU(F′i, F′j) < SU(F′i, C) ∧ SU(F′i, F′j) < SU(F′j, C) then
13         Forest = Forest − Eij
14 S = ∅
15 for each tree Ti ∈ Forest do
16     F′R = argmax_{F′k ∈ Ti} SU(F′k, C)
17     S = S ∪ {F′R};
18 return S
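The listing can be rendered compactly in Python as the sketch below, which reuses the symmetric_uncertainty helper from the earlier sketch; networkx supplies the complete graph and a Prim-based MST. This is an illustrative reading of the pseudocode, not the authors' implementation.

# Sketch of the Attribute Cluster algorithm.
# X: list of discretized feature columns, y: class labels, theta: T-Relevance threshold.
import itertools
import networkx as nx

def attribute_cluster_select(X, y, theta):
    # Part 1: irrelevant feature removal
    t_rel = {i: symmetric_uncertainty(col, y) for i, col in enumerate(X)}
    relevant = [i for i in t_rel if t_rel[i] > theta]

    # Part 2: minimum spanning tree over the F-Correlation graph
    g = nx.Graph()
    g.add_nodes_from(relevant)
    for i, j in itertools.combinations(relevant, 2):
        g.add_edge(i, j, weight=symmetric_uncertainty(X[i], X[j]))
    forest = nx.minimum_spanning_tree(g, algorithm="prim")

    # Part 3: tree partition and representative feature selection
    forest.remove_edges_from([
        (i, j) for i, j, d in list(forest.edges(data=True))
        if d["weight"] < t_rel[i] and d["weight"] < t_rel[j]])
    selected = [max(tree, key=lambda i: t_rel[i])       # most class-relevant feature per tree
                for tree in nx.connected_components(forest)]
    return sorted(selected)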
4 EMPIRICAL STUDY
4.1 Data Source
For the purposes of evaluating the performance and efficiency of our proposed attribute cluster-based algorithm, verifying whether or not the technique is potentially useful in practice, and allowing other researchers to confirm our results, 35 publicly available data sets were used. The numbers of attributes of the data sets vary from 37 to 49,152 with a mean of 7,874. The dimensionality of 54.3% of the data sets exceeds 5,000, and 28.6% of the data sets have more than 10,000 attributes. The 35 data sets cover a range of application domains such as text, bio-microarray, and image data classification. Table 1 shows the corresponding statistical information.
Note that for the data sets with continuous-valued features, the well-known off-the-shelf MDL method [74] was used to discretize the continuous values.
4.2 Experiment setup
To evaluate the performance of our proposed attribute cluster-based algorithm and compare it with other feature selection algorithms in a fair and reasonable way:
1) The suggested algorithm is compared with five different types of representative feature selection algorithms: (i) FCBF [68], [71], (ii) ReliefF [57], (iii) CFS [29], (iv) Consist [14], and (v) FOCUS-SF [2]. FCBF and ReliefF evaluate features individually. For FCBF, in the experiments, we set the relevance threshold to be the SU value of the ⌊m/log m⌋-th ranked feature for each data set (m is the number of features in a given data set), as recommended by Yu and Liu [68], [71]. ReliefF searches for nearest neighbors of instances of different classes and weights features according to how well they differentiate instances of different classes. The other three feature selection algorithms are based on subset evaluation. CFS exploits best-first search based on the evaluation of a subset that contains features highly correlated with the target concept, yet uncorrelated with each other. The Consist method searches for the minimal subset that separates classes as consistently as the full set can under a best-first search strategy. FOCUS-SF is a variation of FOCUS [2]. FOCUS has the same evaluation strategy as Consist, but it examines all subsets of features. Considering time efficiency, FOCUS-SF replaces the exhaustive search in FOCUS with sequential forward selection. For our proposed attribute cluster-based algorithm, we heuristically set the threshold θ to be the SU value of the ⌊√m ∗ lg m⌋-th ranked attribute for each data set.
2) Four different types of classification algorithms are employed to classify the data sets before and after feature selection. They are (i) the probability-based Naive Bayes (NB), (ii) the tree-based C4.5, (iii) the instance-based lazy learning algorithm IB1, and (iv) the rule-based RIPPER, respectively. Naive Bayes uses a probabilistic technique for classification by multiplying the individual probabilities of every feature-value pair. The algorithm assumes independence among the features and, even so, provides excellent classification results. The decision tree learning algorithm C4.5 is an extension of ID3 that accounts for missing values, continuous attribute value ranges, pruning of decision trees, rule derivation, and so on. The tree consists of nodes (features) that are selected by information entropy. The instance-based learner IB1 is a single-nearest-neighbor algorithm; it classifies instances by taking the class of the closest associated vectors in the training set via distance metrics. It is the simplest among the algorithms used in our study. The inductive rule learner RIPPER (Repeated Incremental Pruning to Produce Error Reduction) [12] is a propositional rule learner that defines a rule-based detection model and seeks to improve it iteratively using different heuristic techniques. The constructed rule set is then used to classify new instances.
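The before/after comparison described here can be sketched with scikit-learn stand-ins: GaussianNB for Naive Bayes, a decision tree in place of C4.5, and a 1-nearest-neighbor classifier for IB1; RIPPER has no standard scikit-learn implementation and is omitted. The cross-validated accuracy comparison below is illustrative, not the paper's experimental code.

# Sketch: compare cross-validated accuracy on all features vs. a selected subset.
# X is an instances-by-features NumPy array, y the class labels,
# selected_idx the column indices returned by a feature selection step.
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def compare_accuracy(X, y, selected_idx, folds=10):
    classifiers = {
        "NB": GaussianNB(),
        "C4.5-like tree": DecisionTreeClassifier(),
        "IB1 (1-NN)": KNeighborsClassifier(n_neighbors=1),
    }
    X_sel = X[:, selected_idx]
    for name, clf in classifiers.items():
        full = cross_val_score(clf, X, y, cv=folds).mean()
        sub = cross_val_score(clf, X_sel, y, cv=folds).mean()
        print(f"{name}: all features = {full:.3f}, selected subset = {sub:.3f}")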
4.3 Results and analysis
In this section we present the experimental results in terms of the number of selected attributes, the time to obtain the feature subset, the classification accuracy, and the Win/Draw/Loss record. To explore the statistical significance of the results, we performed a nonparametric Friedman test [24] followed by the Nemenyi post-hoc test [47], as recommended by Demsar [17] and Garcia and Herrera [25], to statistically compare algorithms on multiple data sets. The Friedman and Nemenyi test results are therefore reported as well.
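SciPy provides the Friedman test directly; the Nemenyi post-hoc step is sketched with the third-party scikit-posthocs package, which is an assumption about available tooling rather than something the paper specifies. Rows of the toy matrix are data sets, columns are algorithms.

# Sketch: Friedman test across algorithms, followed by a Nemenyi post-hoc test.
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp   # assumed third-party package for the post-hoc test

# acc[i, j] = accuracy of algorithm j on data set i (toy numbers for illustration)
acc = np.array([
    [0.91, 0.88, 0.90, 0.85],
    [0.78, 0.75, 0.77, 0.70],
    [0.85, 0.86, 0.84, 0.80],
    [0.66, 0.60, 0.64, 0.58],
    [0.93, 0.90, 0.92, 0.89],
])

stat, p = friedmanchisquare(*[acc[:, j] for j in range(acc.shape[1])])
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")
if p < 0.05:                                   # algorithms differ; locate the differences
    print(sp.posthoc_nemenyi_friedman(acc))    # pairwise p-values between algorithms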
5 CONCLUSION
In this paper, we have presented a novel attribute clustering-based feature subset selection algorithm for high-dimensional data. The algorithm involves (i) removing irrelevant attributes, (ii) constructing a minimum spanning tree from the relevant ones, and (iii) partitioning the MST and selecting representative attributes. In this new algorithm, a cluster consists of features. Each cluster is treated as a single attribute and thus dimensionality is drastically reduced.
We have compared the performance of the proposed algorithm with those of five well-known feature selection algorithms, FCBF, ReliefF, CFS, Consist, and FOCUS-SF, on the publicly available image, microarray, and text data, in terms of the proportion of selected features, runtime, classification accuracy of a given classifier, and the Win/Draw/Loss record. Generally, the suggested algorithm obtained the best proportion of selected features, the best runtime, and the best classification accuracy for Naive Bayes, C4.5, and RIPPER, and the second best classification accuracy for IB1. The Win/Draw/Loss records support these conclusions.
We also found that the attribute cluster-based algorithm obtains the rank of 1 for microarray data, the rank of 2 for text data, and the rank of 3 for image data in terms of the classification accuracy of the four different types of classifiers, and that CFS is a good alternative. At the same time, FCBF is a good alternative for text and image data. Moreover, Consist and FOCUS-SF are alternatives for text data. For future work, we plan to explore different types of correlation measures and to study some formal properties of feature space.
REFERENCES
[1] Almuallim H. and Dietterich T.G., Learning boolean concepts in the presence of many irrelevant features, Artificial Intelligence, 69(1-2), pp 279-305, 1994.
[2] Almuallim H. and Dietterich T.G., Algorithms for Identifying
Relevant Features, In Proceedings of the 9th Canadian Conference on
AI, pp 38-45, 1992.
[3] Baker L.D. and McCallum A.K., Distributional clustering of
words for text classification, In Proceedings of the 21st Annual
international ACM SIGIR Conference on Research and Development
in information Retrieval, pp 96-103, 1998.
[4] Arauzo-Azofra A., Benitez J.M. and Castro J.L., A feature set
measure based on relief, In Proceedings of the fifth international
conference on Recent Advances in Soft Computing, pp 104-109,
2004.
[5] Bell D.A. and Wang, H., A formalism for relevance and its
application in feature subset selection, Machine Learning, 41(2), pp
175-195, 2000.
[6] Battiti R., Using mutual information for selecting features in
supervised neural net learning, IEEE Transactions on Neural
Networks, 5(4), pp 537-550, 1994.
[7] Butterworth R., Piatetsky-Shapiro G. and Simovici D.A., On
Feature Selection through Clustering, In Proceedings of the Fifth
IEEE international Conference on Data Mining, pp 581-584, 2005.
[8] Biesiada J. and Duch W., Feature selection for high-dimensional data: a Pearson redundancy based filter, Advances in Soft Computing, 45, pp 242-249, 2008.
[9] Chanda P., Cho Y., Zhang A. and Ramanathan M., Mining of
Attribute Interactions Using Information Theoretic Metrics, In
Proceedings of IEEE international Conference on Data Mining
Workshops, pp 350-355, 2009.
[10] Cardie, C., Using decision trees to improve case-based learning,
In Proceedings of Tenth International Conference on Machine
Learning, pp 25-32, 1993.
AUTHOR PROFILE
Vivek Ravindra Prasad Pandey is pursuing M.Tech (IT) from Holy Mary Institute of Technology and Science, Keesara, Ranga Reddy Dist., Affiliated to JNTU-Hyderabad.
T. Venu is an Assistant Professor at Holy Mary Institute of Technology and Science, Keesara, Ranga Reddy Dist., Affiliated to JNTU-Hyderabad.
Dr. N. Subhash Chandra is a Professor (CSE Department) at Holy Mary Institute of Technology and Science, Keesara, Ranga Reddy Dist., Affiliated to JNTU-Hyderabad.