ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print)
IJCST Vol. 6, Issue 3, July - Sept 2015
Dimensionality Reduction Using CLIQUE
and Genetic Algorithm
Behera Gayathri, A. Mary Sowjanya
Dept. of CS & SE, Andhra University College of Engineering(A), AP, India
Abstract
Clustering high-dimensional data is an emerging research field, as it has become a major challenge to cluster such data due to the high scattering of data points in the full-dimensional space. Most conventional clustering algorithms consider all the dimensions of the dataset for discovering clusters, whereas only some of the dimensions are relevant. In this paper, we propose “Dimensionality Reduction using CLIQUE and Genetic Algorithm” (DRCGA) for the problem of high-dimensional clustering.
The DRCGA algorithm consists of two phases. The first phase
is the preprocessing phase where CLIQUE is used for subspace
relevance analysis to find the dense subspaces. In the second
phase a genetic algorithm is used for clustering the subspaces
detected in the first phase. This algorithm is advantageous when high-dimensional data is considered: the problem of losing regions that are actually densely populated, due to the high scattering of data points in high-dimensional space, can be overcome. The experiments performed on large, high-dimensional synthetic and real-world datasets show that DRCGA achieves higher efficiency and better resulting cluster accuracy. Moreover, the algorithm not only yields accurate results as the number of dimensions increases but also outperforms the individual algorithms as the size of the dataset increases.
Keywords
CLIQUE, DRCGA, Genetic Algorithm, High Dimensional Data,
Subspace Clustering
I. Introduction
Data mining refers to extracting or “mining” knowledge from
large amounts of data. The main goal of the data mining process
is to extract information from a data set and transform it into an
understandable structure for further use. Data mining is an essential
step in the process of knowledge discovery, which consists of an iterative sequence of steps: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation [1].
Cluster analysis is a very important technique in data mining.
The process of grouping a set of physical or abstract objects
into classes of similar objects is called clustering. A cluster is a
collection of data objects that are similar to one another within
the same cluster and are dissimilar to the objects in other clusters.
Clustering is considered a part of unsupervised learning, and different types of clustering techniques are available, such as hierarchy-based, partition-based, density-based, and grid-based clustering.
Subspace clustering is the task of detecting groups of clusters
within different subspaces of the same dataset. There are several
problems which are encountered in clustering of high-dimensional
data [2]. High-dimensional data are practically impossible to visualize and hard to reason about. When the dimensionality increases, the volume of the space grows exponentially and, as a result, the data points in the full-dimensional space become very sparse. This is called the curse of dimensionality. As the number of dimensions increases, the distance between any two points in a given dataset converges: discriminating the nearest point from the farthest becomes difficult, and hence the concept of distance becomes meaningless. In high-dimensional data, most of the dimensions could be irrelevant and can mask existing clusters in noisy data. Hence subspace clustering algorithms localize the search for relevant dimensions, allowing them to find clusters that exist in multiple subspaces of the dataset.
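The distance-concentration effect described above can be illustrated with a small experiment. This is a hypothetical demonstration on uniformly random data with a fixed seed, not an experiment from the paper:

```python
# As dimensionality d grows, the relative contrast between the
# farthest and nearest neighbour of a query point shrinks,
# so full-dimensional distances stop being discriminative.
import random

def relative_contrast(d, n=500, seed=0):
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(d)] for _ in range(n)]
    query = [0.5] * d  # query at the centre of the unit hypercube
    dists = [sum((p[i] - query[i]) ** 2 for i in range(d)) ** 0.5
             for p in points]
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min  # large => distances discriminate well

print(relative_contrast(2))    # low-dimensional: contrast is large
print(relative_contrast(100))  # high-dimensional: contrast collapses
```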
II. Related Work
Subspace clustering algorithms are of two types based on their search technique: top-down search and bottom-up search methods [3]. The CLIQUE [4] algorithm is a combination of density- and grid-based clustering. CLIQUE uses an Apriori approach to find
clusterable subspaces. Once the dense subspaces are found they
are sorted by coverage, where coverage is defined as the fraction
of the dataset covered by the dense units in the subspace. The
subspaces with the greatest coverage are kept and the rest are
pruned. ENCLUS [5] is a subspace clustering method based on the CLIQUE algorithm. ENCLUS introduces the concept of ‘Subspace Entropy’ and is based on the observation that a subspace with clusters typically has lower entropy than a subspace without clusters. MAFIA [6] is another extension of CLIQUE that uses an adaptive grid based on the distribution of data to improve efficiency and cluster quality; it also introduces parallelism to improve scalability. Cell-based Clustering (CBF)
[7] addresses scalability issues associated with many bottom-up
algorithms. CBF uses a cell creation algorithm that creates optimal
partitions by repeatedly examining minimum and maximum values
on a given dimension which results in the generation of fewer
bins (cells). The CBF algorithm is sensitive to two parameters.
Section threshold determines the bin frequency of a dimension.
The retrieval time is reduced as the threshold value is increased
because the number of records accessed is minimized. The other
parameter is cell threshold which determines the minimum density
of data points in a bin. CLTree [8] uses a modified decision tree
algorithm to select the best cutting planes for a given dataset. It
uses a decision tree algorithm to partition each dimension into
bins, separating areas of high density from areas of low density.
Density-based Optimal projective Clustering (DOC) is a hybrid method combining the grid-based approach used by the bottom-up approaches with the iterative improvement method from the top-down approaches [9]. PROCLUS [10] was the first top-down subspace clustering algorithm. Similar to CLARANS [11],
PROCLUS samples the data, then selects a set of k medoids and
iteratively improves the clustering. The algorithm uses a three
phase approach consisting of initialization, iteration, and cluster
refinement. ORCLUS [12] is an extended version of the algorithm
PROCLUS[10] which looks for non-axis parallel subspaces. This
algorithm arises from the observation that many datasets contain
inter-attribute correlations. The algorithm is divided into three
steps: assign clusters, subspace determination, and merge. FINDIT [13], a Fast and Intelligent Subspace Clustering Algorithm using Dimension Voting, is similar in structure to PROCLUS and the other top-down methods, but uses a unique distance measure called the Dimension Oriented Distance (DOD). The algorithm mainly
consists of three phases, namely sampling phase, cluster forming
phase, and data assignment phase. The δ-clusters algorithm uses a distance measure that attempts to capture the coherence exhibited by a subset of instances on a subset of attributes [14]. The algorithm
takes the number of clusters and the individual cluster size as
parameters. Clustering On Subsets of Attributes (COSA) [15] is
an iterative algorithm that assigns weights to each dimension for
each instance, not each cluster. Starting with equally weighted
dimensions, the algorithm examines the k nearest neighbors (knn)
of each instance. These neighborhoods are used to calculate the
respective dimension weights for each instance.
III. DRCGA Framework
Phase 1: Preprocessing using CLIQUE
CLIQUE (CLustering In QUEst) [16] was the first algorithm proposed for dimension-growth subspace clustering in high-dimensional space. In dimension-growth subspace clustering, the
process of clustering starts from single-dimensional subspaces
and grows upward to higher-dimensional ones. CLIQUE can
be considered as an integration of density-based and grid-based
clustering methods. In the proposed algorithm CLIQUE is used
in the preprocessing phase for subspace relevance analysis to find
the dense subspaces. This process involves the following steps:
Step 1: In the first step, CLIQUE partitions the d-dimensional
data space into non-overlapping rectangular units.
Step 2: Identification of subspace that is dense
A. Finding of Dense Units
1. The zero-density-valued dimensions are discarded.
2. The threshold value is calculated as the smallest density value over all dimensions.
3. The dimensions whose density values are less than the threshold value are removed, since a unit is considered dense if the fraction of total data points contained in it exceeds the threshold value.
B. Finding Subspaces of High Coverage
The dimensions having dense units can be considered as n-dimensional subspaces of high coverage, where n is the number of dense dimensions.
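The Phase 1 steps above can be sketched as follows. This is a minimal illustration: the grid resolution `xi`, the fixed user-supplied threshold `tau` (in place of the data-derived threshold described above), and the helper name are assumptions, not values given in the paper:

```python
# Partition each dimension into xi equal-width units and keep the
# dimensions that contain at least one dense unit.
from collections import Counter

def dense_subspace_dimensions(data, xi=10, tau=0.2):
    """data: list of equal-length numeric records.
    A unit is dense if it holds more than fraction tau of all points."""
    n, d = len(data), len(data[0])
    dense_dims = []
    for j in range(d):
        col = [row[j] for row in data]
        lo, hi = min(col), max(col)
        width = (hi - lo) / xi or 1.0          # avoid zero-width grids
        units = Counter(min(int((v - lo) / width), xi - 1) for v in col)
        if any(count / n > tau for count in units.values()):
            dense_dims.append(j)
    return dense_dims

# Points cluster tightly in dimension 0 and are spread out in dimension 1.
data = [(0.1, i / 10.0) for i in range(10)]
print(dense_subspace_dimensions(data))  # [0]
```

The dimensions returned here correspond to the subspaces of high coverage that are passed on to Phase 2.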
Phase 2: Genetic Algorithm
1. Representation
In a genetic algorithm [17], the representation of a chromosome
is required, so as to describe each individual of the population.
A chromosome consists of a sequence of genes from a certain alphabet, generally the binary digits 0 and 1. A common technique used to represent subspaces is standard binary encoding, where all individuals are represented by binary strings of fixed and equal length 𝜃, where 𝜃 is the number of dimensions of the dataset. Each bit of the individual takes a value from the binary alphabet 𝐵 = {0,1} and indicates whether its corresponding attribute is selected (“1”) or not (“0”). For example, with dataset dimensionality 𝜃 = 6, the individual 110110 represents a 4-dimensional subspace cluster containing the 1st, 2nd, 4th, and 5th attributes of the dataset. In the proposed
algorithm, a binary matrix 𝑀 is used to represent the individuals
of the population, in which the rows represent the individuals of
the population and columns represent the dimensions.
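The binary encoding above can be illustrated with a small sketch; the `decode` helper is hypothetical, not part of the paper:

```python
# Decode one row of the population matrix M into the 1-based indices
# of the selected attributes, matching the worked example (theta = 6).
def decode(individual):
    return [i + 1 for i, bit in enumerate(individual) if bit == "1"]

M = ["110110",   # 4-dimensional subspace on attributes 1, 2, 4, 5
     "000011"]   # 2-dimensional subspace on attributes 5, 6
print(decode(M[0]))  # [1, 2, 4, 5]
```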
Fig. 1: DRCGA Framework
2. Fitness Function
The calculation of fitness function is a very important step in a
genetic algorithm. In DRCGA the fitness function calculates the density of each subspace cluster 𝐶𝑠, so that the dataset can be clustered into as many subspace clusters as possible. Hence a weighted density measure is used as the fitness function to find the fittest individuals. The mathematical definitions of the function are as follows.
Consider 𝐴= {𝐴1,…𝐴𝑑} to be a set of d dimensional attributes.
Then the density of a single attribute 𝐴𝑖 is defined as the fraction
of “1” entries in the column 𝐴𝑖, denoted as dens(𝐴𝑖). The density
of a subspace cluster 𝐶𝑠 denoted as dens(𝐶𝑠), is the ratio between
the number of data points contained in the subspace cluster |𝐶𝑠|
and the total number of data points in the dataset |𝐷|. A weighted
density measure is proposed to find the density of all subspace
clusters. The weighted density of a subspace cluster C𝑠 denoted
as dens𝑤 (𝐶𝑠), is defined as the ratio between dens(𝐶𝑠) and the
average density of all attributes contained in 𝐴; that is,
dens𝑤(𝐶𝑠) = dens(𝐶𝑠) / ((1/|𝐴|) ∑𝐴𝑖∈𝐴 dens(𝐴𝑖)), where |𝐴| is the number of attributes contained in the subspace cluster. The denominator (1/|𝐴|) ∑𝐴𝑖∈𝐴 dens(𝐴𝑖) is called the weight.
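The weighted density fitness can be sketched as follows. This is an illustrative reading of the definitions above; the helper name and the toy binary dataset are assumptions:

```python
# dens(Ai) is the fraction of "1" entries in column Ai of a binary
# dataset D; dens(Cs) = |Cs| / |D|; the weight is the mean attribute
# density over the attributes A selected for the subspace cluster.
def weighted_density(D, subspace, cluster_size):
    """D: list of binary rows; subspace: indices of selected attributes."""
    n = len(D)
    dens_cs = cluster_size / n                               # dens(Cs)
    attr_dens = [sum(row[i] for row in D) / n for i in subspace]
    weight = sum(attr_dens) / len(attr_dens)                 # (1/|A|) sum
    return dens_cs / weight

D = [[1, 0, 1],
     [1, 1, 0],
     [1, 0, 0],
     [0, 0, 1]]
# Subspace {A1}: dens(A1) = 3/4; a cluster of 3 points gives
# dens_w = (3/4) / (3/4) = 1.0
print(weighted_density(D, [0], 3))
```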
3. Selection
The selection operator supports the choice of the individuals
from a population that are allowed to create a new generation
of individuals. Genetic algorithm methods attempt to develop
fitness methods and elitism rules that find a set of optimal values
quickly and reliably. But, ultimately, it is the selection method
that is responsible for the choice of a new generation of parents.
Thus, the selection process is responsible for making the mapping
from the fitness values of the input population to a probability
of selection and thus the probability of an individual’s genetic
material being passed to the next generation. Here we adopt the
tournament selection method because its time complexity is low.
The basic concept of the tournament method is as follows: we first randomly select a positive number Ntour of chromosomes from the population and copy the best-fitted one among them into an intermediate population. This process is repeated P times, where P is the population size. The procedure of tournament selection is shown below.
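Tournament selection as described above can be sketched as follows; the bit-count fitness function and the Ntour value here are assumptions for illustration:

```python
# Repeat P times: draw N_tour chromosomes at random and copy the
# fittest of them into the intermediate population.
import random

def tournament_selection(population, fitness, n_tour, rng=random):
    selected = []
    for _ in range(len(population)):          # P = population size
        contestants = rng.sample(population, n_tour)
        selected.append(max(contestants, key=fitness))
    return selected

rng = random.Random(1)
pop = ["110110", "000011", "111111", "100000"]
parents = tournament_selection(pop, lambda c: c.count("1"), 2, rng)
print(parents)  # P strings, biased toward high-fitness individuals
```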
5. Mutation
Mutation may not be necessarily applied to all individuals, unlike
the crossover and selection operators. There are multiple mutation
specifications available like bit, uniform, and normal. Each type
of mutation has a different probability of being applied to each
individual. Therefore, some individuals may be mutated by a
single mutation method, some individuals may be mutated by
several mutation methods and some may not be mutated at all.
In DRCGA bit mutation is applied to the genes, where the genes of the children are reversed: “1” is changed to “0” and “0” is changed to “1”.
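Bit mutation as described above can be sketched as follows. A per-gene flip probability `p_m` is a common variant and is an assumption here; the paper itself only specifies that each mutation type has a probability of being applied:

```python
# Flip each gene ("1" -> "0", "0" -> "1") with probability p_m.
import random

def bit_mutation(chromosome, p_m=0.1, rng=random):
    return "".join(
        ("1" if g == "0" else "0") if rng.random() < p_m else g
        for g in chromosome
    )

print(bit_mutation("110110", p_m=1.0))  # every bit flips: 001001
```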
6. Elitism
When a new population is created by performing crossover and
mutation, there is a high chance of losing the best subspace clusters
that were found in the previous generation. Hence, an elitism method is used to achieve a constantly improving population of subspace clusters. The elitism strategy implemented in this algorithm directly copies the best one or few subspaces to the new population, together with the other newly generated individuals.
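The elitism step can be sketched as follows; the helper name and the number of elites are assumptions for illustration:

```python
# The best n_elite individuals of the old generation replace the
# weakest of the new one, so the best subspaces found are never lost.
def apply_elitism(old_pop, new_pop, fitness, n_elite=1):
    elites = sorted(old_pop, key=fitness, reverse=True)[:n_elite]
    survivors = sorted(new_pop, key=fitness, reverse=True)[:len(new_pop) - n_elite]
    return elites + survivors

old = ["111111", "100000"]
new = ["000001", "010000"]
best = apply_elitism(old, new, lambda c: c.count("1"), n_elite=1)
print(best)  # ['111111', '000001'] -- the best old individual survives
```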
IV. Results and Discussion
DRCGA has been implemented on the Zoo dataset [18], a real dataset from the UCI machine learning repository, and a supermarket dataset, a synthetic dataset prepared to test this framework.
Validating the results of clustering is very important. Cluster
Accuracy [19] is used to evaluate the obtained clustering results.
The evaluation metric used in DRCGA is given below:
Cluster Accuracy = (1/N) ∑i=1..T Xi
where,
N = Number of data points in the dataset
T = Number of resultant clusters
Xi = Number of data points of majority class in cluster i
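The Cluster Accuracy metric can be computed as in the following sketch; the animal-class labels echo the Zoo dataset but the clusters themselves are hypothetical:

```python
# CA = (1/N) * sum_i X_i, where X_i is the count of the majority
# true class within cluster i.
from collections import Counter

def cluster_accuracy(clusters):
    """clusters: list of clusters, each a list of true class labels."""
    n = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / n

clusters = [["mammal", "mammal", "bird"],       # X_1 = 2
            ["fish", "fish", "fish", "bird"]]   # X_2 = 3
print(cluster_accuracy(clusters))  # (2 + 3) / 7 ~= 0.714
```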
4. Crossover
Crossover is an important feature of genetic algorithms, which is
used to take some attributes from one parent and other attributes
from a second parent. Uniform crossover is used in the proposed
algorithm in which two chromosomes are randomly selected as
parents from the current population. An offspring chromosome is
created by copying each allele from one of the parents according to a probability 𝑝𝑖, a random real number uniformly distributed in the interval [0, 1]. Let P1 and P2 be two parents, and let C1 and C2 be offspring chromosomes; the 𝑖th allele in each offspring is
defined as
𝐶1(𝑖) = 𝑃1(𝑖), 𝐶2(𝑖) = 𝑃2(𝑖) if 𝑝𝑖≥ 0.5,
𝐶1(𝑖) = 𝑃2(𝑖), 𝐶2(𝑖) = 𝑃1(𝑖) if 𝑝𝑖< 0.5.
For example, consider two chromosomes 𝑃𝑖 and 𝑃𝑗 randomly selected for uniform crossover:
𝑃𝑖 = 101010,
𝑃𝑗 = 010101.
The probability values drawn for the six alleles are [0.3, 0.7, 0.8, 0.4, 0.2, 0.5]. Copying each allele from a parent according to these probabilities gives the offspring
𝐶𝑖 = 001100,
𝐶𝑗 = 110011.
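The worked example above can be reproduced with a short uniform-crossover routine; in an actual run the probabilities would be drawn uniformly from [0, 1] rather than fixed:

```python
# Allele i comes from the same parent when p_i >= 0.5 and from the
# other parent when p_i < 0.5, as in the rule stated above.
def uniform_crossover(p1, p2, probs):
    c1 = "".join(a if p >= 0.5 else b for a, b, p in zip(p1, p2, probs))
    c2 = "".join(b if p >= 0.5 else a for a, b, p in zip(p1, p2, probs))
    return c1, c2

probs = [0.3, 0.7, 0.8, 0.4, 0.2, 0.5]
print(uniform_crossover("101010", "010101", probs))  # ('001100', '110011')
```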
Fig. 2: Scalability With Dataset Size
The above chart shows that the cluster accuracy for DRCGA increases when the dataset size grows from 200 to 250, and increases further as the dataset size grows from 250 to 300. This shows that DRCGA is highly scalable with increasing dataset sizes.
Fig. 3: Comparison of Cluster Accuracy for Different Algorithms
The above chart shows that the cluster accuracy for GA is higher than that of CLIQUE, but DRCGA exceeds both GA and CLIQUE in terms of cluster accuracy.
Fig. 4: Comparison of Execution Time for Different Algorithms
The above chart shows that the execution time for CLIQUE is less than that of GA, but the execution time for DRCGA is much lower than that of CLIQUE.
Fig. 5: Comparison of Cluster Accuracy With and Without Preprocessing
The above chart shows that the cluster accuracy increases when the dataset is preprocessed, compared to the cluster accuracy without preprocessing.
Fig. 6: Comparison of Execution Time With and Without Preprocessing
The above chart shows that the execution time reduces when the dataset is preprocessed, compared to the execution time without preprocessing.
V. Conclusion
In this paper we have proposed a “Dimensionality Reduction using
Clique and Genetic Algorithm” (DRCGA) for high-dimensional
clustering. The DRCGA algorithm consists of two phases in which
the first phase performs preprocessing of data using CLIQUE
to find the dense subspaces. In the second phase we implement
a genetic algorithm which clusters the data points existing in
the subspaces of each of the dimensions generated in the first
phase. The experiments performed on large and high dimensional
synthetic and real world data sets show that DRCGA performs
with a higher efficiency and better resulting cluster accuracy.
Moreover, the algorithm not only yields accurate results when
the number of dimensions increases but also outperforms the
individual algorithms when the size of the dataset increases. An interesting direction to explore is performing the preprocessing phase more rigorously to increase the cluster accuracy further, while taking care not to increase the execution time.
References
[1] Jiawei Han, Micheline Kamber,"Data Mining: Concepts and Techniques", Second Edition.
[2] [Online] Available: https://en.wikipedia.org/wiki/Clustering_
high-dimensional_data
[3] Lance Parsons, Ehtesham Haque, Huan Liu,“Subspace
Clustering for High Dimensional Data: A Review”, Sigkdd
Explorations, Vol. 6, Issue 1.
[4] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan,"Automatic subspace clustering of high dimensional data for data mining applications", In Proceedings of the 1998 ACM SIGMOD international conference on Management of data, pp. 94-105. ACM Press, 1998.
[5] C.-H. Cheng, A. W. Fu, Y. Zhang,"Entropy based subspace clustering for mining numerical data", In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 84-93. ACM Press, 1999.
International Journal of Computer Science And Technology 203
IJCST Vol. 6, Issue 3, July - Sept 2015
[6] S. Goil, H. Nagesh, A. Choudhary,"MAFIA: Efficient and scalable subspace clustering for very large data sets", Technical Report CPDC-TR-9906-010, Northwestern University, 2145 Sheridan Road, Evanston IL 60208, June 1999.
[7] J.-W. Chang, D.-S. Jin,"A new cell-based clustering method
for large, high-dimensional data in data mining applications",
In Proceedings of the 2002 ACM symposium on Applied
computing, pp. 503-507. ACM Press, 2002.
[8] B. Liu, Y. Xia, P. S. Yu,"Clustering through decision tree construction", In Proceedings of the ninth international conference on Information and knowledge management, pp. 20-29. ACM Press, 2000.
[9] C. M. Procopiuc, M. Jones, P. K. Agarwal, T. M.Murali,
"A monte carlo algorithm for fast projective clustering",
In Proceedings of the 2002 ACM SIGMOD international
conference on Management of data, pp. 418-427. ACM Press,
2002.
[10] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, J. S. Park,"Fast algorithms for projected clustering", In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pp. 61-72. ACM Press, 1999.
[11] R. Ng, J. Han,"Efficient and effective clustering methods for spatial data mining", In Proceedings of the 20th VLDB Conference, pp. 144-155, 1994.
[12]C. C. Aggarwal, P. S. Yu.,"Finding generalized projected
clusters in high dimensional spaces", In Proceedings of
the 2000 ACM SIGMOD international conference on
Management of data, pp. 70-81. ACM Press, 2000.
[13]K.-G. Woo, J.-H. Lee.,"FINDIT: A Fast and Intelligent
Subspace Clustering Algorithm using Dimension Voting",
Ph.D thesis, Korea Advanced Institute of Science and
Technology, Taejon, Korea, 2002.
[14] J. Yang, W. Wang, H. Wang, P. Yu,"δ-clusters: Capturing subspace correlation in a large data set", In Proceedings of the 18th International Conference on Data Engineering, pp. 517-528, 2002.
[15]J. H. Friedman, J. J. Meulman,"Clustering objects on subsets
of attributes", [Online] Available: http://citeseer.nj.nec.com/
friedman02clustering.html, 2002.
[16]Jyoti Yadav, Dharmender Kumar,“Subspace Clustering using
CLIQUE:An Exploratory Study”, IJARCET International
Journal of Advanced Research in Computer Engineering &
Technology Vol. 3, Issue 2, February 2014.
[17]Singh Vijendra, Sahoo Laxman,“Subspace Clustering
of High-Dimensional Data:An Evolutionary Approach”,
Applied Computational Intelligence and Soft Computing,
Vol. 2013 Article ID 863146
[18]Zoodataset: [Online] Available: https://archive.ics.uci.edu/
ml/datasets/Zoo
[19]A.M.Sowjanya, M.Shashi,“New Proximity Estimate for
Incremental Update of Non-Uniformly Distributed Clusters”,
International Journal of Data Mining & Knowledge
Management Process (IJDKP) Vol. 3, No. 5, September
2013.
Behera Gayathri received her B.Tech
Degree in Computer Science and
Engineering from Gayatri Vidya
Parishad College of Engineering
for Women, Andhra Pradesh, India
and M.Tech Degree in Computer
Science and Technology from Andhra
University College of Engineering(A),
Andhra Pradesh, India. This work is a
part of her M.Tech thesis.
Dr. A. Mary Sowjanya is working as an
Assistant Professor in the department
of Computer Science and Systems
Engineering at Andhra University
College of Engineering(A), Andhra
Pradesh, India. She has completed
her Ph.D in the same department. Her
research interests include Data Mining,
Incremental clustering and Big data
analysis.