
International Journal of Advances in Engineering Science and Technology
Available online at www.ijaestonline.com
ISSN: 2319-1112
ISSN: 2319-1120 /V4N3: 202-208 © IJAEST

A SURVEY: FUZZY BASED CLUSTERING ALGORITHMS FOR BIG DATA

K. Vidhya (M.E., Ph.D)1, R. Sivaramakrishnan (M.E.)2, P. Sangeetha3
1 Assist. Prof. (Sr.G), 2 Assist. Prof., 3 PG Scholar, Department of Computer Science and Engineering, Tamil Nadu, India

ABSTRACT
With the exponential growth of data from social networks such as Facebook and Twitter, mobile applications, digital cameras, sensor networks and biomedical research, the overall data volume has increased tremendously. Analyzing and extracting useful information from such dynamic data is therefore a challenging task today. Data grouping, or clustering, plays a vital role in handling big data and is a basic step in data mining, pattern recognition and medical prediction. Because of the large size and variety of big data, the classification parameters for grouping the data are not easily known, so it is very difficult to use supervised techniques without knowing the nature of the data. Unsupervised techniques such as clustering are well suited to handling big data; in this case the learning parameters are computed from the learning data. Using a clustering algorithm we can assign labels to unlabeled data and reduce the similarity between different clusters. This paper mainly discusses fuzzy based clustering algorithms, which address two problems of classical clustering methods: their implementation is prohibitive for big data, so they cannot be applied to it directly, and they suffer from time delay in clustering. Accelerated fuzzy based clustering algorithms are therefore needed for better speed and quality. This survey presents the details of different FCM algorithms, including their parameters, performance, execution time and scalability for large data sets.

Keywords: Clustering algorithms, Big data, Fuzzy

I. INTRODUCTION
Big data from various sources needs to be collected and analyzed for decision making based on the needs of the user. The data is characterized by the 3 V's: Volume, Velocity and Variety. The large volume of data needs effective handling techniques for managing and reusing the data for analytical purposes. This massive volume of data can be really useful, but it is problematic in terms of storage and analysis: the volume makes analytical, processing and retrieval operations difficult and time consuming. To overcome these problems, big data needs to be put into clustered form [1]. Fuzzy based clustering techniques improve the accuracy of the clustering process [13].

A clustering technique organizes the data objects into a set of disjoint classes called clusters. It generates good-quality clusters and is an efficient tool for dealing with big data [1], [2]. The major requirements of a clustering algorithm are scalability, the ability to determine clusters of arbitrary shape, handling of high dimensionality, interoperability and usability. Many surveys of clustering algorithms are available in domains such as machine learning, pattern recognition, data mining, signal processing, information retrieval and bio-informatics [1], but users cannot easily tell which algorithm is best suited to their needs. This survey provides sufficient information for choosing the best algorithm for further analysis of big data.

The objectives of a clustering algorithm are as follows [5]:
1) To develop a clustering technique that does not need to know the initial positions of the cluster centers.
2) To develop new clustering techniques which give a minimized execution time and a very low error rate.
3) To establish a clustering technique with fewer iterations and a shorter convergence time.
4) To develop an efficient clustering technique which provides better results on data containing noise and outliers.

II. CLASSIFICATION OF CLUSTERING ALGORITHMS

Clustering algorithms
Partition-based: 1. K-means 2. K-medoids 3. K-modes 4. PAM 5. CLARANS 6. CLARA 7. FCM 8. PSO 9. PFClust 10. CSO
Hierarchical-based: 1. BIRCH 2. CURE 3. ROCK 4. Chameleon 5. Echidna 6. SOHAC 7. ACADTRS 8. HGCUDF 9. SWIFT
Density-based: 1. DBSCAN 2. OPTICS 3. DBCLASD 4. DENCLUE
Grid-based: 1. WaveCluster 2. STING 3. CLIQUE 4. OptiGrid
Model-based: 1. EM 2. COBWEB 3. CLASSIT 4. SOMs
Figure 1: Classification of clustering algorithms

Figure 1 lists the various clustering algorithms. Fraley and Raftery (1998) divide clustering methods into two main groups: hierarchical and partitioning methods. Han and Kamber (2001) put forward three additional main categories: density-based methods, grid-based methods and model-based clustering [12]. In this survey we discuss the partitioning method [13] as a way to improve the accuracy and quality of the clustering process.

A. Partitioning Clustering algorithm
A partitioning clustering algorithm [12] starts from an initial partitioning and uses an iterative relocation technique that moves instances from one cluster to another. Such methods require the number of clusters to be predetermined by the user. They are helpful in many applications where every cluster is represented by a cluster center (prototype) and the other instances in the cluster are similar to this prototype.

a) K-means algorithm
K-means is an unsupervised learning algorithm [12] which solves the well-known clustering problem. The goal of the algorithm is to minimize an objective function, here a squared error function. The objective function is

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \qquad (1)$$

where $\left\| x_i^{(j)} - c_j \right\|^2$ is the distance measure between data point $x_i^{(j)}$ and cluster center $c_j$.

Advantages
1) Relatively scalable and simple.
2) Fast, robust and easy to understand.
3) Suitable for datasets with compact spherical clusters that are well separated [4].

Disadvantages
1) Severe effectiveness degradation in high dimensional spaces.
2) Poor clusters.
3) It does not give effective results when the cluster centers are chosen randomly.
4) High sensitivity to the initialization phase, noise and outliers.

III. FUZZY C-MEANS ALGORITHM
Fuzzy c-means clustering (FCM) [3] allows a piece of data to belong to more than one cluster; associated with each element is a set of membership levels. FCM is an advanced version of the K-means clustering algorithm and is also known as the Soft K-means algorithm. K-means uses only the distance calculation, whereas FCM performs a full inverse-distance weighting [6]. The objective function is extended in two ways:
1) The fuzzy membership degrees in the clusters are incorporated into the formula.
2) An additional parameter 'm' is introduced as a weight exponent on the fuzzy membership.

The extended objective function, denoted $J_m$, is

$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \left\| x_i - c_j \right\|^2, \qquad 1 \le m < \infty \qquad (2)$$

with cluster centers

$$c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{m}} \qquad (3)$$

The membership value is calculated from Equation (4):

$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\left\| x_i - c_j \right\|}{\left\| x_i - c_k \right\|} \right)^{\frac{2}{m-1}}} \qquad (4)$$

Where,
m – level of cluster fuzziness
$u_{ij}$ – membership of the i-th data point in the j-th cluster
$x_i$ – i-th data point of the input dataset X
$c_j$ – center of the j-th cluster
C – number of cluster centers (written c in Table 1)
N – number of data points

Table 1: Comparative study of FCM algorithms

1. RSIO-FCM [9]
   Method: Partitioning method and classification
   Issue: Results in the formation of effective clusters by eliminating the problem of overlapping cluster centers.
   Parameters: Dataset, number of clusters, cluster centers, membership function
   Dataset: Pen-Based Recognition of Handwritten Digits, Page Blocks Classification
   Performance: Better accuracy; it depends greatly on clustering efficiency.
   Execution time: Drastic improvement in performance.
   Pros: It covers the object space. Generates proper cluster centers. Segments accurately and quickly.
   Cons: The location of the cluster centers has a significant impact on the classification results.

2. rseFCM [7]
   Method: Partitioning method
   Issue: To minimize the objective function.
   Parameters: X, c, m
   Dataset: 2D15, MNIST, Forest
   Performance: Low compared to other FCM variants.
   Execution time: Runtime of 170.269 s for the Forest data set [10] and a speedup of 6.291 for the MRI data set [11].
   Pros: Easy to understand. Faster.
   Cons: Suffers from overlapping cluster centers. Does not cover the object space.

3. spFCM [7]
   Method: Partitioning
   Issue: It allows clustering of data sets which are too large for memory, and also allows fast clustering of data that fits in memory.
   Parameters: X, c, m, ns
   Dataset: 2D15, MNIST, Forest
   Performance: Better than rseFCM.
   Execution time: The average speedup was 59 times versus FCM.
   Pros: Faster.
   Cons: Does not support streaming data. Performance drops when processing data in the order it arrives.

4. oFCM [7]
   Method: Partitioning
   Issue: This approach clusters streaming data as well as very large data sets.
   Parameters: X, c, m, ns
   Dataset: 2D15, MNIST, Forest
   Performance: It can produce good segmentation quality without randomly accessing data.
   Execution time: Better than spFCM.
   Pros: Good quality partition. Can be used to cluster streaming data. Accurate.
   Cons: Poor performance for a streaming algorithm.

5. GoFCM [8]
   Method: Partitioning
   Issue: It is a variant of spFCM.
   Parameters: X, c, m, є, α, σ, a, r, fPDA, dPDA
   Dataset: MRI, ART, PLK01
   Performance: It produced partitions within 1% of those of FCM on five datasets.
   Execution time: 4-47 times faster than FCM.
   Pros: It was consistently faster than spFCM.
   Cons: Quality loss.

6. MSERFCM [8]
   Method: Partitioning
   Issue: It is a variant of rseFCM.
   Parameters: X, c, m, є, α, σ, a, r, fPDA, dPDA
   Dataset: MRI, ART, PLK01
   Performance: It produced partitions within 3% of those of FCM on five datasets.
   Execution time: 4-26 times faster than FCM.
   Pros: Highest speedup compared to rseFCM. Better average quality than rseFCM. Better local minima.
   Cons: Low speed.

7. ELM K-means and ELM NMF [2]
   Method: Partitioning
   Issue: To solve the clustering problem by using ELM features with K-means and Fuzzy C-means.
   Parameters: X, β
   Dataset: UCI Machine Learning Repository, Document Corpus
   Performance: High
   Execution time: Very good
   Pros: It is easy to implement, and ELM K-means produces better results than Mercer kernel based methods.
   Cons: The number of nodes should be greater than 300, otherwise performance is not optimal.

RSIO-FCM – Random Sampling Iterative Optimization Fuzzy c-means
rseFCM – Random Sampling plus Extension Fuzzy c-means
spFCM – Single Pass Fuzzy c-means
oFCM – Online Fuzzy c-means
GoFCM – Geometric Progressive Fuzzy c-means
MSERFCM – Minimum sample estimate random Fuzzy c-means
ELM K-means and ELM NMF – Extreme Learning machine and nonnegative matrix factorization

IV. CONCLUSION
In this paper, we compared various FCM techniques based on execution time, cluster quality, and their merits and demerits. MSERFCM and GoFCM are better than rseFCM and spFCM in terms of runtime and performance. spFCM and oFCM have the same runtime complexity, and oFCM is slow compared with the other clustering algorithms. Based on primary factors such as execution time and cluster quality, ELM K-means and ELM NMF are suitable algorithms for efficient clustering of big data.

REFERENCES
[1] Adil Fahad, Najlaa Alshatri, Zahir Tari, Abdullah Alamri, Ibrahim Khalil, Albert Y. Zomaya, Sebti Foufou, and Abdelaziz Bouras, "Survey Of Clustering Algorithms For Big Data: Taxonomy And Empirical Analysis", in IEEE Transactions on Emerging Topics in Computing, vol. 2, September 2014.
[2] Saurabh Arora, Inderveer Chana, "A Survey of clustering techniques for big data analysis", in 5th International Conference – Confluence The Next Generation Information Technology Summit, 2014.
[3] Dharmarajan A, Velmurugan T, "Applications Of Partition Based Clustering Algorithms: A Survey", in International Conference On Computational Intelligence And Computing Research, 2013.
[4] Atiya Kazi, Prof. D.T. Kurian, "A Survey Of Data Clustering Techniques", in International Journal of Engineering Research & Technology, vol. 3, Issue 10, October 2014.
[5] Divya Sivanandini L, Mohan Raj M, "A Survey On Data Clustering Algorithms Based On Fuzzy Techniques", in International Journal of Science and Research (IJSR), vol. 2, Issue 4, April 2013.
[6] P. IndiraPriya, Dr. D. K. Ghosh, "A Survey On Different Clustering Algorithms In Data Mining Technique", in International Journal of Modern Engineering Research, vol. 3, Issue 1, pp. 267-274, Jan - Feb 2013.
[7] Timothy C. Havens, James C. Bezdek, Christopher Leckie, Lawrence O. Hall, and Marimuthu Palaniswami, "Fuzzy c-Means Algorithms for Very Large Data", in IEEE Transactions on Fuzzy Systems, vol. 20, No. 6, December 2012.
[8] Jonathon K. Parker, "Accelerating Fuzzy c-means using an estimated subsample size", in IEEE Transactions on Fuzzy Systems, vol. 22, Issue No. 5, October 2014.
[9] https://books.google.co.in/books?id=4BS6BQAAQBAJ&pg=PA225&lpg=PA225&dq=rsiofcm&source=bl&ots=myd3zm7kPb&sig=aHjFk9QVrBHA_jEwhIUXj6_xpno&hl=en&sa=X&ved=0CCQQ6AEwAWoVChMI96HEuM76xwIVAX4aCh0KSQnO#v=onepage&q=rsio-fcm&f=false
[10] Dhanesh Kothari, S. Thavasi Narayanan, K. Kiruthika Devi, "Extended Fuzzy c-means with Random Sampling Techniques for Clustering Large Data", in International Journal of Innovative Research in Advanced Engineering, vol. 1, Issue 1, March 2014.
[11] http://scholarcommons.usf.edu/cgi/viewcontent.cgi?article=6125&context=etd
[12] Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and Techniques", 3rd ed. Waltham, USA.
[13] Leena H. Patil, Dr. Mohammad Atique, "Candidate Cluster Extraction for Hierarchical Document Clustering", in International Journal of Computer Science and Engineering, vol. 1, Issue 11, December 2011.

Authors
K. Vidhya (Assist. Prof. (Sr.G)) completed her B.E. (Computer Science and Engineering) at Muthayammal Engineering College, Namakkal, and her M.E. at Government College of Technology, Coimbatore. She is pursuing research in the domain of Cloud based Data Analytics. She is presently working as an Assistant Professor (Sr.G) in the Department of Computer Science and Engineering at KPR Institute of Engineering and Technology, Coimbatore. She has 8.7 years of experience in the field of education. Email id: [email protected]

R. Sivaramakrishnan (Assistant Professor) completed his B.E. (Computer Science and Engineering) at Tamilnadu College of Engineering, Coimbatore, and his M.E. (Computer Science and Engineering) at Anna University Regional Centre, Coimbatore, with Distinction. He is presently working as an Assistant Professor in the Department of Computer Science and Engineering at KPR Institute of Engineering and Technology, Coimbatore. He has more than half a decade of experience in the field of education. His areas of interest include Cloud Computing, Programming, Theory of Computation and Compiler Design. Email id: [email protected]

P. Sangeetha completed her B.E. (Computer Science and Engineering) at SNS College of Technology, Coimbatore. She is now pursuing her PG degree in Computer Science and Engineering at K.P.R Institute of Engineering and Technology, Coimbatore, Tamil Nadu. Email id: [email protected]
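
Illustrative sketch of the FCM updates
The alternating updates described in Section III (cluster centers from Equation (3), memberships from Equation (4), objective from Equation (2)) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions and not the implementation used in any of the surveyed papers: the function names, the stopping rule based on how far the centers move, the tolerance, the caller-supplied initial centers and the toy data are all hypothetical choices made for illustration. The code also assumes m > 1, since the exponent in Equation (4) is undefined at m = 1.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def update_memberships(points, centers, m):
    """Membership update, Equation (4). Requires m > 1."""
    # Working with squared distances turns the exponent 2/(m-1) into 1/(m-1).
    exponent = 1.0 / (m - 1.0)
    memberships = []
    for x in points:
        d2 = [squared_distance(x, c) for c in centers]
        if any(d == 0.0 for d in d2):
            # The point coincides with a center: share full membership
            # among the coinciding centers to avoid division by zero.
            hits = [1.0 if d == 0.0 else 0.0 for d in d2]
            memberships.append([h / sum(hits) for h in hits])
            continue
        row = []
        for j in range(len(centers)):
            denom = sum((d2[j] / d2[k]) ** exponent for k in range(len(centers)))
            row.append(1.0 / denom)
        memberships.append(row)
    return memberships


def update_centers(points, memberships, m, n_clusters):
    """Center update, Equation (3): mean of the points weighted by u_ij^m."""
    dim = len(points[0])
    centers = []
    for j in range(n_clusters):
        weights = [u[j] ** m for u in memberships]
        total = sum(weights)
        centers.append(
            [sum(w * x[d] for w, x in zip(weights, points)) / total for d in range(dim)]
        )
    return centers


def objective(points, centers, memberships, m):
    """Objective J_m, Equation (2)."""
    return sum(
        (memberships[i][j] ** m) * squared_distance(points[i], centers[j])
        for i in range(len(points))
        for j in range(len(centers))
    )


def fuzzy_c_means(points, initial_centers, m=2.0, tol=1e-9, max_iter=100):
    """Alternate Equations (3) and (4) until the centers stop moving."""
    centers = [list(c) for c in initial_centers]
    memberships = update_memberships(points, centers, m)
    for _ in range(max_iter):
        new_centers = update_centers(points, memberships, m, len(centers))
        shift = max(squared_distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        memberships = update_memberships(points, centers, m)
        if shift < tol:
            break
    return centers, memberships


# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, memberships = fuzzy_c_means(points, [[0.0, 0.0], [10.0, 10.0]], m=2.0)
print("centers:", centers)
print("J_m:", objective(points, centers, memberships, 2.0))
```

Each row of `memberships` sums to 1, which is what distinguishes the soft assignment of FCM from the hard assignment of K-means. The accelerated variants compared in Table 1 are not reproduced here; the sketch covers only the basic FCM iteration.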
