ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print)
IJCST Vol. 6, Issue 3, July - Sept 2015

Dimensionality Reduction Using CLIQUE and Genetic Algorithm

1 Behera Gayathri, 2 A. Mary Sowjanya
1,2 Dept. of CS & SE, Andhra University College of Engineering (A), AP, India

Abstract
Clustering high-dimensional data is an emerging research field: it is a major challenge because of the high scattering of data points in the full-dimensional space. Most conventional clustering algorithms consider all the dimensions of the dataset when discovering clusters, whereas only some of the dimensions are relevant. In this paper, we propose "Dimensionality Reduction using CLIQUE and Genetic Algorithm" (DRCGA) for the problem of high-dimensional clustering. The DRCGA algorithm consists of two phases. The first phase is a preprocessing phase in which CLIQUE is used for subspace relevance analysis to find the dense subspaces. In the second phase, a genetic algorithm clusters the subspaces detected in the first phase. This algorithm is advantageous for high-dimensional data: it overcomes the problem of losing regions that are actually densely populated due to the high scattering of data points in high-dimensional space. Experiments performed on large, high-dimensional synthetic and real-world datasets show that DRCGA achieves higher efficiency and better cluster accuracy. Moreover, the algorithm not only yields accurate results as the number of dimensions increases but also outperforms the individual algorithms as the size of the dataset increases.

Keywords
CLIQUE, DRCGA, Genetic Algorithm, High Dimensional Data, Subspace Clustering

I. Introduction
Data mining refers to extracting or "mining" knowledge from large amounts of data.
The main goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Data mining is an essential step in the process of knowledge discovery. Knowledge discovery consists of an iterative sequence of steps: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation [1].

Cluster analysis is a very important technique in data mining. Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Clustering is a form of unsupervised learning, and different types of clustering techniques are available, such as hierarchical, partition-based, density-based, and grid-based clustering. Subspace clustering is the task of detecting groups of clusters within different subspaces of the same dataset.

Several problems are encountered in clustering high-dimensional data [2]. High-dimensional data are practically impossible to visualize and hard to reason about. As the dimensionality increases, the volume of the space grows exponentially, and as a result the data points in the full-dimensional space become very sparse. This is called the curse of dimensionality. As the number of dimensions increases, the distances between any two points in a given dataset converge: discriminating the nearest point from the farthest becomes difficult, and the concept of distance becomes meaningless. In high-dimensional data, most of the dimensions may be irrelevant and can mask existing clusters in noisy data.
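The distance-concentration effect described above can be illustrated with a small simulation (an illustrative sketch, not part of the original experiments; the function name and parameters are our own): as the dimensionality grows, the relative contrast between the nearest and farthest of a set of random points collapses.

```python
import math
import random

def relative_contrast(dim, n_points=500, seed=7):
    """(d_max - d_min) / d_min for Euclidean distances from the origin
    to n_points uniformly random points in [0, 1]^dim."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum(x * x for x in point)))
    return (max(dists) - min(dists)) / min(dists)

# The contrast is large in 2 dimensions but collapses in 100 dimensions,
# which is why full-space distances become uninformative.
print(relative_contrast(2))
print(relative_contrast(100))
```

The shrinking contrast is exactly why the full-space distance measures used by conventional clustering algorithms break down, motivating the subspace search described next.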
Hence subspace clustering algorithms localize the search for relevant dimensions, allowing them to find clusters that exist in multiple subspaces of the dataset.

II. Related Work
Subspace clustering algorithms fall into two types based on the search technique: top-down and bottom-up search methods [3]. The CLIQUE [4] algorithm is a combination of density-based and grid-based clustering. CLIQUE uses an Apriori-style approach to find clusterable subspaces. Once the dense subspaces are found, they are sorted by coverage, where coverage is defined as the fraction of the dataset covered by the dense units in the subspace. The subspaces with the greatest coverage are kept and the rest are pruned. ENCLUS [5] is a subspace clustering method based on the CLIQUE algorithm. ENCLUS introduces the concept of "subspace entropy"; the algorithm is based on the observation that a subspace with clusters typically has lower entropy than a subspace without clusters. MAFIA [6] is another extension of CLIQUE that uses an adaptive grid based on the distribution of data to improve efficiency and cluster quality; it also introduces parallelism to improve scalability. Cell-based Clustering (CBF) [7] addresses scalability issues associated with many bottom-up algorithms. CBF uses a cell-creation algorithm that creates optimal partitions by repeatedly examining minimum and maximum values on a given dimension, which results in the generation of fewer bins (cells). The CBF algorithm is sensitive to two parameters. The section threshold determines the bin frequency of a dimension; retrieval time is reduced as this threshold is increased because the number of records accessed is minimized. The other parameter is the cell threshold, which determines the minimum density of data points in a bin. CLTree [8] uses a modified decision tree algorithm to select the best cutting planes for a given dataset.
It uses a decision tree algorithm to partition each dimension into bins, separating areas of high density from areas of low density. Density-based Optimal projective Clustering (DOC) is a hybrid of the grid-based approach used by the bottom-up methods and the iterative improvement method of the top-down methods [9]. PROCLUS [10] was the first top-down subspace clustering algorithm. Similar to CLARANS [11], PROCLUS samples the data, then selects a set of k medoids and iteratively improves the clustering, using a three-phase approach consisting of initialization, iteration, and cluster refinement. ORCLUS [12] is an extended version of PROCLUS [10] that looks for non-axis-parallel subspaces; it arises from the observation that many datasets contain inter-attribute correlations. The algorithm is divided into three steps: cluster assignment, subspace determination, and merge. FINDIT [13], a Fast and Intelligent Subspace Clustering Algorithm Using Dimension Voting, is similar in structure to PROCLUS and the other top-down methods, but uses a unique distance measure called the Dimension-Oriented Distance (DOD). It consists of three phases: a sampling phase, a cluster-forming phase, and a data-assignment phase. The δ-clusters algorithm uses a distance measure that attempts to capture the coherence exhibited by a subset of instances on a subset of attributes [14]; it takes the number of clusters and the individual cluster size as parameters. Clustering On Subsets of Attributes (COSA) [15] is an iterative algorithm that assigns weights to each dimension for each instance, not for each cluster. Starting with equally weighted dimensions, the algorithm examines the k nearest neighbors (knn) of each instance.
These neighborhoods are used to calculate the respective dimension weights for each instance.

III. DRCGA Framework
Phase 1: Preprocessing Using CLIQUE
CLIQUE (CLustering In QUEst) [16] was the first algorithm proposed for dimension-growth subspace clustering in high-dimensional space. In dimension-growth subspace clustering, the clustering process starts from single-dimensional subspaces and grows upward to higher-dimensional ones. CLIQUE can be considered an integration of density-based and grid-based clustering methods. In the proposed algorithm, CLIQUE is used in the preprocessing phase for subspace relevance analysis to find the dense subspaces. This involves the following steps:

Step 1: CLIQUE partitions the d-dimensional data space into non-overlapping rectangular units.
Step 2: Identification of dense subspaces.

A. Finding Dense Units
1. Dimensions with zero density values are discarded.
2. The threshold value is calculated as the smallest density value over all dimensions.
3. Dimensions whose density values are less than the threshold value are removed, since a unit is considered dense only if the fraction of total data points contained in it exceeds the threshold value.

B. Finding Subspaces of High Coverage
The dimensions having dense units can be considered an n-dimensional subspace of high coverage, where n is the number of dense dimensions.

Phase 2: Genetic Algorithm
1. Representation
In a genetic algorithm [17], a chromosome representation is required to describe each individual of the population. A chromosome consists of a sequence of genes from a certain alphabet, generally the binary digits 0 and 1. A common technique used to represent subspaces is standard binary encoding, where all individuals are represented by binary strings of fixed, equal length θ, where θ is the number of dimensions of the dataset.
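The dense-unit detection of Phase 1 can be sketched as follows (an illustrative simplification: a fixed user-supplied density threshold `tau` is used in place of the paper's data-derived threshold, and the grid resolution `xi` and all function names are our own):

```python
from collections import Counter

def dense_dimensions(data, xi=10, tau=0.2):
    """Return indices of dimensions containing at least one dense 1-D unit.

    Each dimension is partitioned into `xi` equal-width intervals (CLIQUE's
    grid), and a unit is dense if it holds more than a fraction `tau` of
    all data points."""
    n = len(data)
    d = len(data[0])
    dense = []
    for j in range(d):
        column = [row[j] for row in data]
        lo, hi = min(column), max(column)
        width = (hi - lo) / xi or 1.0  # guard against a constant column
        counts = Counter(min(int((v - lo) / width), xi - 1) for v in column)
        if max(counts.values()) / n > tau:
            dense.append(j)
    return dense

# Dimension 0 clusters tightly around 0.5; dimension 1 is spread out.
points = [(0.5, i / 10.0) for i in range(10)]
print(dense_dimensions(points, xi=10, tau=0.2))  # [0]
```

Only the dense dimensions found here are carried forward into the genetic-algorithm phase, which is what shrinks the search space.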
Each bit of the individual takes a value of "0" or "1" from the binary alphabet B = {0, 1}, indicating whether the corresponding attribute is selected ("1") or not ("0"). For example, for a dataset with θ = 6 dimensions, the individual 110110 represents a 4-dimensional subspace cluster containing the 1st, 2nd, 4th, and 5th attributes. In the proposed algorithm, a binary matrix M is used to represent the population: its rows represent the individuals and its columns represent the dimensions.

Fig. 1: DRCGA Framework

2. Fitness Function
The fitness function is a very important part of a genetic algorithm. In DRCGA the fitness function calculates the density of each subspace cluster Cs, so that the dataset can be clustered into as many subspace clusters as possible. A weighted density measure is therefore used as the fitness function to find the fittest individuals. The mathematical definitions are as follows. Let A = {A1, ..., Ad} be a set of d attributes. The density of a single attribute Ai, denoted dens(Ai), is the fraction of "1" entries in column Ai. The density of a subspace cluster Cs, denoted dens(Cs), is the ratio between the number of data points contained in the subspace cluster, |Cs|, and the total number of data points in the dataset, |D|. The weighted density of a subspace cluster Cs, denoted densw(Cs), is the ratio between dens(Cs) and the average density of the attributes contained in A:

densw(Cs) = dens(Cs) / ((1/|A|) Σ_{Ai ∈ A} dens(Ai)),

where |A| is the number of attributes contained in the subspace cluster.
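The weighted-density fitness just defined can be sketched as follows (an illustrative sketch over a binary data matrix; the source is ambiguous about whether the average runs over all attributes or only the subspace's attributes, so we use the subspace's attributes here, and the function names are our own):

```python
def dens_attr(data, j):
    """dens(Aj): fraction of '1' entries in column j of the binary matrix."""
    return sum(row[j] for row in data) / len(data)

def weighted_density(data, subspace, cluster_size):
    """densw(Cs) = dens(Cs) / mean density of the subspace's attributes.

    subspace     -- indices of the attributes set to 1 in the chromosome
    cluster_size -- |Cs|, the number of points in the subspace cluster
    """
    dens_cs = cluster_size / len(data)  # dens(Cs) = |Cs| / |D|
    weight = sum(dens_attr(data, j) for j in subspace) / len(subspace)
    return dens_cs / weight

data = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
]
# Subspace {0, 1}: dens(A0) = 1.0, dens(A1) = 0.5, so the weight is 0.75.
# With |Cs| = 2 of 4 points, densw = 0.5 / 0.75 ≈ 0.667.
print(round(weighted_density(data, [0, 1], 2), 3))  # 0.667
```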
Here the denominator, (1/|A|)(Σ_{Ai ∈ A} dens(Ai)), is called the weight.

3. Selection
The selection operator governs the choice of the individuals from a population that are allowed to create a new generation. Genetic algorithm methods attempt to develop fitness methods and elitism rules that find a set of optimal values quickly and reliably, but ultimately it is the selection method that is responsible for choosing the new generation of parents. The selection process thus maps the fitness values of the input population to a probability of selection, and hence to the probability of an individual's genetic material being passed to the next generation. Here we adopt the tournament selection method because its time complexity is low. The basic concept of the tournament method is as follows: we first randomly select a positive number Ntour of chromosomes from the population and copy the best-fitted one among them into an intermediate population. This process is repeated P times, where P is the population size.

4. Crossover
Crossover is an important feature of genetic algorithms; it takes some attributes from one parent and the remaining attributes from a second parent. Uniform crossover is used in the proposed algorithm: two chromosomes are randomly selected as parents from the current population, and an offspring chromosome is created by copying each allele from one of the parents according to a probability pi, a random real number uniformly distributed in the interval [0, 1]. Let P1 and P2 be the two parents and C1 and C2 the offspring chromosomes; the ith allele in each offspring is defined as

C1(i) = P1(i), C2(i) = P2(i)  if pi ≥ 0.5,
C1(i) = P2(i), C2(i) = P1(i)  if pi < 0.5.

For example, suppose chromosomes Pi = 101010 and Pj = 010101 are randomly selected for uniform crossover, and the probabilities pi are drawn as [0.3, 0.7, 0.8, 0.4, 0.2, 0.5]. Copying each allele accordingly yields the offspring Ci = 001100 and Cj = 110011.

5. Mutation
Unlike the crossover and selection operators, mutation need not be applied to all individuals. Multiple mutation specifications are available, such as bit, uniform, and normal mutation, and each type has a different probability of being applied to each individual. Therefore, some individuals may be mutated by a single mutation method, some by several, and some not at all. In DRCGA, bit mutation is applied to the genes: a mutated gene of a child is reversed, so "1" is changed to "0" and "0" is changed to "1".

6. Elitism
When a new population is created by crossover and mutation, there is a high chance of losing the best subspace clusters found in the previous generation. Hence, elitism is used to achieve a constantly improving population of subspace clusters. The elitism strategy implemented in this algorithm directly copies the best subspace (or a few of the best) into the new population, together with the other newly generated individuals.

IV. Results and Discussion
DRCGA has been implemented on the Zoo dataset [18], a real dataset from the UCI machine learning repository, and on a supermarket dataset, a synthetic dataset prepared to test this framework. Validating the results of clustering is very important; Cluster Accuracy [19] is used to evaluate the obtained clustering results:

Cluster Accuracy = (Σ_{i=1}^{T} Xi) / N,

where
N = number of data points in the dataset,
T = number of resultant clusters,
Xi = number of data points of the majority class in cluster i.

Fig. 2: Scalability With Dataset Size
Fig. 2 shows that the cluster accuracy of DRCGA increases as the dataset size grows from 200 to 250, and increases again as the size grows from 250 to 300. This shows that DRCGA is highly scalable with increasing data sizes.

Fig. 3: Comparison of Cluster Accuracy for Different Algorithms
Fig. 3 shows that the cluster accuracy of GA is higher than that of CLIQUE, but DRCGA exceeds both GA and CLIQUE in cluster accuracy.

Fig. 4: Comparison of Execution Time for Different Algorithms
Fig. 4 shows that the execution time of CLIQUE is lower than that of GA, but the execution time of DRCGA is much lower still.

Fig. 5: Comparison of Cluster Accuracy With and Without Preprocessing
Fig. 5 shows that the cluster accuracy is higher when the dataset is preprocessed than when it is not.

Fig. 6: Comparison of Execution Time With and Without Preprocessing
Fig. 6 shows that the execution time is lower when the dataset is preprocessed than when it is not.

V. Conclusion
In this paper we have proposed "Dimensionality Reduction using CLIQUE and Genetic Algorithm" (DRCGA) for high-dimensional clustering. The DRCGA algorithm consists of two phases: the first performs preprocessing of the data using CLIQUE to find the dense subspaces, and the second applies a genetic algorithm that clusters the data points existing in the subspaces generated in the first phase. The experiments performed on large, high-dimensional synthetic and real-world datasets show that DRCGA achieves higher efficiency and better cluster accuracy.
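The Phase 2 operators described in Section III (tournament selection, uniform crossover, bit mutation, and elitism) can be sketched together as one generation step (an illustrative sketch: the fitness function is passed in as a parameter, and the mutation rate, tournament size default, and all function names are our own assumptions, not values from the paper):

```python
import random

def tournament(population, fitness, n_tour, rng):
    """Pick the fittest of n_tour randomly chosen individuals."""
    return max(rng.sample(population, n_tour), key=fitness)

def uniform_crossover(p1, p2, rng):
    """Swap alleles between the two parents wherever p_i < 0.5."""
    c1, c2 = list(p1), list(p2)
    for i in range(len(p1)):
        if rng.random() < 0.5:
            c1[i], c2[i] = p2[i], p1[i]
    return c1, c2

def bit_mutation(chrom, rate, rng):
    """Flip each gene (0 <-> 1) with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chrom]

def next_generation(population, fitness, n_tour=3, mutation_rate=0.05,
                    seed=None):
    """One DRCGA-style generation: elitism, then tournament selection,
    uniform crossover, and bit mutation to refill the population."""
    rng = random.Random(seed)
    elite = max(population, key=fitness)  # elitism: carry the best over intact
    offspring = [elite]
    while len(offspring) < len(population):
        p1 = tournament(population, fitness, n_tour, rng)
        p2 = tournament(population, fitness, n_tour, rng)
        for child in uniform_crossover(p1, p2, rng):
            offspring.append(bit_mutation(child, mutation_rate, rng))
    return offspring[:len(population)]

# Toy run: chromosomes of length 6, fitness = number of selected attributes.
rng0 = random.Random(1)
pop = [[rng0.randint(0, 1) for _ in range(6)] for _ in range(8)]
new_pop = next_generation(pop, fitness=sum, seed=42)
print(new_pop[0] == max(pop, key=sum))  # True: elitism preserved the best
```

In the real algorithm the toy fitness would be replaced by the weighted density densw(Cs) of Section III.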
Moreover, the algorithm not only yields accurate results as the number of dimensions increases but also outperforms the individual algorithms as the size of the dataset increases. An interesting direction to explore is making the preprocessing phase more rigorous so as to increase the cluster accuracy further, while taking care not to increase the execution time.

References
[1] Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", Second Edition.
[2] [Online] Available: https://en.wikipedia.org/wiki/Clustering_high-dimensional_data
[3] Lance Parsons, Ehtesham Haque, Huan Liu, "Subspace Clustering for High Dimensional Data: A Review", SIGKDD Explorations, Vol. 6, Issue 1.
[4] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, "Automatic subspace clustering of high dimensional data for data mining applications", In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pp. 94-105, ACM Press, 1998.
[5] C.-H. Cheng, A. W. Fu, Y. Zhang, "Entropy-based subspace clustering for mining numerical data", In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 84-93, ACM Press, 1999.
[6] S. Goil, H. Nagesh, A. Choudhary, "MAFIA: Efficient and scalable subspace clustering for very large data sets", Technical Report CPDC-TR-9906-010, Northwestern University, Evanston, IL, June 1999.
[7] J.-W. Chang, D.-S. Jin, "A new cell-based clustering method for large, high-dimensional data in data mining applications", In Proceedings of the 2002 ACM Symposium on Applied Computing, pp. 503-507, ACM Press, 2002.
[8] B. Liu, Y. Xia, P. S. Yu, "Clustering through decision tree construction", In Proceedings of the Ninth International Conference on Information and Knowledge Management, pp. 20-29, ACM Press, 2000.
[9] C. M. Procopiuc, M. Jones, P. K. Agarwal, T. M. Murali, "A Monte Carlo algorithm for fast projective clustering", In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pp. 418-427, ACM Press, 2002.
[10] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, J. S. Park, "Fast algorithms for projected clustering", In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp. 61-72, ACM Press, 1999.
[11] R. Ng, J. Han, "Efficient and effective clustering methods for spatial data mining", In Proceedings of the 20th VLDB Conference, pp. 144-155, 1994.
[12] C. C. Aggarwal, P. S. Yu, "Finding generalized projected clusters in high dimensional spaces", In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 70-81, ACM Press, 2000.
[13] K.-G. Woo, J.-H. Lee, "FINDIT: A Fast and Intelligent Subspace Clustering Algorithm Using Dimension Voting", Ph.D. thesis, Korea Advanced Institute of Science and Technology, Taejon, Korea, 2002.
[14] J. Yang, W. Wang, H. Wang, P. Yu, "δ-clusters: Capturing subspace correlation in a large data set", In Proceedings of the 18th International Conference on Data Engineering, pp. 517-528, 2002.
[15] J. H. Friedman, J. J. Meulman, "Clustering objects on subsets of attributes", [Online] Available: http://citeseer.nj.nec.com/friedman02clustering.html, 2002.
[16] Jyoti Yadav, Dharmender Kumar, "Subspace Clustering Using CLIQUE: An Exploratory Study", International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Vol. 3, Issue 2, February 2014.
[17] Singh Vijendra, Sahoo Laxman, "Subspace Clustering of High-Dimensional Data: An Evolutionary Approach", Applied Computational Intelligence and Soft Computing, Vol. 2013, Article ID 863146.
[18] Zoo dataset: [Online] Available: https://archive.ics.uci.edu/ml/datasets/Zoo
[19] A. M. Sowjanya, M. Shashi, "New Proximity Estimate for Incremental Update of Non-Uniformly Distributed Clusters", International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol. 3, No. 5, September 2013.

Behera Gayathri received her B.Tech degree in Computer Science and Engineering from Gayatri Vidya Parishad College of Engineering for Women, Andhra Pradesh, India, and her M.Tech degree in Computer Science and Technology from Andhra University College of Engineering (A), Andhra Pradesh, India. This work is a part of her M.Tech thesis.

Dr. A. Mary Sowjanya is working as an Assistant Professor in the Department of Computer Science and Systems Engineering at Andhra University College of Engineering (A), Andhra Pradesh, India. She completed her Ph.D. in the same department. Her research interests include data mining, incremental clustering, and big data analysis.