International Journal of Advances in Engineering Science and Technology
Available online at www.ijaestonline.com
ISSN: 2319-1112
A SURVEY: FUZZY BASED CLUSTERING ALGORITHMS FOR
BIG DATA
K. Vidhya (M.E., Ph.D)¹, R. Sivaramakrishnan (M.E.)², P. Sangeetha³
¹Assist. Prof. (Sr.G), ²Assist. Prof., ³PG Scholar,
Department of Computer Science and Engineering, Tamil Nadu, India
ABSTRACT
With the exponential growth of data from social networks such as Facebook and Twitter, mobile applications, digital
cameras, sensor networks, and biomedical research, the overall data volume has increased tremendously. Analyzing and
extracting useful information from such dynamic data is therefore a challenging task. Data grouping, or clustering,
plays a vital role in handling big data and is a basic step in data mining, pattern recognition, and medical
prediction. Because of the large size and variety of big data, the classification parameters for grouping the data
are not easily known, so it is difficult to use supervised techniques without knowing the nature of the data.
Unsupervised techniques such as clustering are well suited to big data because the learning parameters are computed
from the learning data. Using a clustering algorithm, we can assign labels to unlabeled data and reduce the similarity
between different clusters. This paper mainly discusses fuzzy based clustering algorithms, which address two problems
of classical clustering methods: implementations that are prohibitive (not directly applicable) for mining big data,
and the time delay in clustering. Accelerated fuzzy based clustering algorithms are therefore needed for better speed
and quality. This survey presents the details of different FCM algorithms, including their parameters, performance,
execution time, and scalability for large data sets.
Keywords: Clustering algorithms, Big data, Fuzzy
I. INTRODUCTION
Big data from various sources needs to be collected and analyzed for decision making based on the needs of the user.
Such data is characterized by the 3 V's: Volume, Velocity and Variety. Large volumes of data need effective handling
techniques so that the data can be managed and reused for analysis. This massive volume of data can be really useful,
but it is problematic in terms of storage and analysis: it makes analytical, processing and retrieval operations
difficult and time consuming. To overcome these problems, big data needs to be organized into clusters [1]. Fuzzy
based clustering techniques improve the accuracy of the clustering process [13].
The clustering technique organizes data objects into a set of disjoint classes called clusters. It generates good
quality clusters and serves as an efficient tool for dealing with big data [1], [2]. The major requirements of a
clustering algorithm are scalability, the ability to discover clusters of arbitrary shape, the ability to handle high
dimensionality, interpretability and usability.
A large number of surveys on clustering algorithms are available in domains such as machine learning, pattern
recognition, data mining, signal processing, information retrieval and bioinformatics [1]. However, users cannot
easily tell which algorithm is best suited to their needs. This survey provides sufficient information for choosing
the best algorithm for further analysis of their big data. The objectives of a clustering algorithm are as follows [5]:
1) To develop a clustering technique that does not need to know the initial positions of the cluster centers.
2) To develop a new clustering technique that gives minimal execution time and a very low error rate.
3) To establish a clustering technique with fewer iterations and a shorter convergence time.
4) To develop an efficient clustering technique that provides better results on data with noise and outliers.
ISSN: 2319-1120 /V4N3: 202-208 © IJAEST
II. CLASSIFICATION OF CLUSTERING ALGORITHMS
Clustering algorithm
Partition-based: 1. K-means, 2. K-medoids, 3. K-modes, 4. PAM, 5. CLARANS, 6. CLARA, 7. FCM, 8. PSO, 9. PFClust, 10. CSO
Hierarchical-based: 1. BIRCH, 2. CURE, 3. ROCK, 4. Chameleon, 5. Echidna, 6. SOHAC, 7. ACADTRS, 8. HGCUDF, 9. SWIFT
Density-based: 1. DBSCAN, 2. OPTICS, 3. DBCLASD, 4. DENCLUE
Grid-based: 1. WaveCluster, 2. STING, 3. CLIQUE, 4. OptiGrid
Model-based: 1. EM, 2. COBWEB, 3. CLASSIT, 4. SOMs
Figure 1: Classification of clustering algorithms
Figure 1 lists the various clustering algorithms. Fraley and Raftery (1998) divide clustering methods into two main
groups: hierarchical and partitioning methods. Han and Kamber (2001) propose three additional main categories:
density-based methods, grid-based methods and model-based clustering [12].
In this survey we discuss the partitioning method [13] as a means of improving the accuracy and quality of the
clustering process.
A. Partitioning Clustering algorithm
A partitioning clustering algorithm [12] uses an iterative relocation technique: starting from an initial
partitioning, it moves instances from one cluster to another. Such methods require that the number of clusters be
predetermined by the user. They are helpful in many applications where every cluster is represented by a cluster
center (prototype) and the other instances in the cluster are similar to this prototype.
a) K-means algorithm
K-means is an unsupervised learning algorithm [12] that solves the well known clustering problem. The goal of the
algorithm is to minimize an objective function, here a squared error function. The objective function is
$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \qquad (1)$$
where $\left\| x_i^{(j)} - c_j \right\|^2$ is the distance measure between data point $x_i^{(j)}$ and cluster center $c_j$.
IJAEST, Volume 4, Number 3
K.Vidhya et al.
Advantages
1) Relatively scalable and simple.
2) Fast, robust and easy to understand
3) Suitable for datasets with compact spherical clusters that are well-separated[4]
Disadvantages
1) Severe effectiveness degradation in high dimensional spaces
2) Poor cluster
3) It does not give effective results if the cluster centers are chosen randomly
4) High sensitivity to the initialization phase, noise and outliers
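The relocation loop and the objective in Equation (1) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not code from any of the surveyed papers: the function names `kmeans_objective` and `kmeans`, the fixed iteration count and the `seed` argument are assumptions made for the example. The random choice of initial centers is the step behind the initialization sensitivity listed above, so different seeds may end in different partitions.

```python
import numpy as np

def kmeans_objective(X, centers, labels):
    """Equation (1): sum of squared distances from each point to its own center."""
    return float(np.sum((X - centers[labels]) ** 2))

def kmeans(X, k, n_iter=20, seed=0):
    """Plain relocation loop; returns centers, labels and J after every iteration."""
    rng = np.random.default_rng(seed)
    # Random initial centers drawn from the data (the initialization-sensitive step).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    history = []
    for _ in range(n_iter):
        # Relocation: move every point to the cluster with the nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each center as the mean of its current members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
        history.append(kmeans_objective(X, centers, labels))
    return centers, labels, history
```

Calling `kmeans(X, k)` on an array `X` of shape (n, d) returns the final centers, the hard labels and the value of J after each iteration; J should never increase from one iteration to the next, which gives a simple sanity check on any implementation.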
III. FUZZY C-MEANS ALGORITHM
Fuzzy c-means clustering (FCM) [3] allows a piece of data to belong to two or more clusters; associated with each
element is a set of membership levels. FCM is an advanced version of the K-means clustering algorithm and is also
known as the soft K-means algorithm. K-means uses only the distance calculation, whereas FCM applies a full
inverse-distance weighting [6].
The objective function is extended in two ways:
1) The fuzzy membership degrees in clusters are incorporated into the formula.
2) An additional parameter 'm' is introduced as a weight exponent on the fuzzy membership.
The extended objective function, denoted Jm, is
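$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \left\| x_i - c_j \right\|^2, \quad 1 \le m < \infty \qquad (2)$$
$$c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{m}} \qquad (3)$$
The membership value is calculated from Equation (4):
$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\left\| x_i - c_j \right\|}{\left\| x_i - c_k \right\|} \right)^{\frac{2}{m-1}}} \qquad (4)$$
where
m – level of cluster fuzziness
u_ij – membership of the ith data point in the jth cluster
x_i – ith data point of the input dataset
c_j – center of the jth cluster
C – number of cluster centers
N – number of data points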
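To make the alternating updates concrete, the following is a minimal NumPy sketch of FCM that recomputes the centers with Equation (3), the memberships with Equation (4), and records the objective of Equation (2) after each pass. It is an illustration under stated assumptions, not the implementation used in any surveyed paper: the function name `fcm`, the random initial memberships, the default `m=2.0`, the stopping tolerance `tol` and the small distance floor that avoids division by zero are all choices made for this example. The sketch needs m > 1, since the exponent 2/(m-1) in Equation (4) is undefined at m = 1.

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-means sketch: returns centers, memberships U (N x c) and J_m per pass."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: random memberships, each row normalized to sum to 1.
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    history = []
    for _ in range(n_iter):
        # Equation (3): centers are means weighted by u_ij^m.
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distances ||x_i - c_j||, floored so a point sitting on a center is safe.
        dist = np.fmax(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), 1e-12)
        # Equation (4): u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)).
        ratio = dist[:, :, None] / dist[:, None, :]
        U_new = 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)
        # Equation (2): objective for the current memberships and centers.
        history.append(float(((U_new ** m) * dist ** 2).sum()))
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U, history
```

As a quick hand check of Equation (4) with m = 2 and two clusters, a point at distance 1 from the first center and distance 3 from the second gets memberships 1/(1 + 1/9) = 0.9 and 1/(9 + 1) = 0.1, which sum to 1 as every row of `U` should.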
| S.No | Types | Method | Issue | Parameter | Dataset | Performance | Execution time | Pros | Cons |
|---|---|---|---|---|---|---|---|---|---|
| 1 | RSIO-FCM [9] | Partitioning method and classification | Results in the formation of effective clusters and the elimination of the problem of overlapping cluster centers. | Dataset, number of clusters, cluster centers, membership function | Pen-Based Recognition of Handwritten Digits, Page Blocks Classification | Better accuracy, which depends greatly on clustering efficiency | | It covers the object space. Drastic improvement in performance. Generates proper cluster centers. Segments accurately and quickly. | Cluster center locations have a significant impact on classification results. |
| 2 | rseFCM [7] | Partitioning method | To minimize the objective function | X, c, m | 2D15, MNIST, Forest | Low performance compared to other FCM algorithms | Runtime is 170.269 sec for the Forest data set [10], and the speedup is 6.291 for the MRI data set [11] | Easy to understand. Faster. | Suffers from overlapping cluster centers. Does not cover the object space. Does not support streaming data. |
| 3 | spFCM [7] | Partitioning | Allows clustering of data sets that are too large for memory, and also allows fast clustering of data that fits in memory. | X, c, m, ns | 2D15, MNIST, Forest | Better compared with rseFCM | The average speedup was 59 times versus FCM | Faster | Performance drops when processing data in the order it arrives |
| 4 | oFCM [7] | Partitioning | Clusters streaming data as well as very large data sets. | X, c, m, ns | 2D15, MNIST, Forest | Can produce good segmentation quality without randomly accessing data. | Better than spFCM | Good quality partition. Used to cluster streaming data. Accurate. | Poor performance for a streaming algorithm. |
| 5 | GoFCM [8] | Partitioning | A variant of spFCM | X, c, m, ε, α, σ, a, r, fPDA, dPDA | MRI, ART, PLK01 | Produced partitions within 1% of those of FCM on five datasets. | 4-47 times faster than FCM | Consistently faster than spFCM | Quality loss |
| 6 | MSERFCM [8] | Partitioning | A variant of rseFCM | X, c, m, ε, α, σ, a, r, fPDA, dPDA | MRI, ART, PLK01 | Produced partitions within 3% of those of FCM on five datasets. | 4-26 times faster than FCM | Highest speedup compared to rseFCM. Better average quality than rseFCM. Better local minima. | Low speed. |
| 7 | ELM K-means and ELM NMF [2] | Partitioning | To solve the clustering problem by using ELM features with K-means and fuzzy c-means. | X, β | UCI Machine Learning Repository, Document Corpus | High | Very good | Easy to implement, and ELM K-means produces better results than Mercer kernel based methods. | The number of nodes should be greater than 300; otherwise performance is not optimal. |
Table 1: Comparative study of FCM algorithms
RSIO-FCM – Random Sampling Iterative Optimization Fuzzy c-means
rseFCM – Random Sampling plus Extension Fuzzy c-means
spFCM – Single Pass Fuzzy c-means
oFCM – Online Fuzzy c-means
GoFCM – Geometric Progressive Fuzzy c-means
MSERFCM – Minimum sample estimate random Fuzzy c-means
ELM K-means and ELM NMF – Extreme Learning machine and nonnegative matrix factorization
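The two sampling ideas behind rseFCM and spFCM in Table 1 can be sketched as follows. This is a simplified reading of the schemes described in [7], written for illustration only: all function names, the choice of initial centers from the data, and the final pass in `sp_fcm` that computes memberships for the whole dataset are assumptions of this sketch. `rse_fcm` runs FCM on a random sample of `ns` points and then extends the result to every point with Equation (4). `sp_fcm` reads the data once in chunks of `ns` points and carries the previous centers forward as weighted points.

```python
import numpy as np

def memberships(X, centers, m=2.0):
    """Equation (4): membership of each row of X in each cluster."""
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    inv = np.fmax(dist, 1e-12) ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def weighted_fcm(X, w, centers, m=2.0, n_iter=100, tol=1e-6):
    """FCM in which point i counts w[i] times; w = 1 reduces to Equation (3)."""
    for _ in range(n_iter):
        Um = w[:, None] * memberships(X, centers, m) ** m
        new_centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        shift = np.abs(new_centers - centers).max()
        centers = new_centers
        if shift < tol:
            break
    return centers

def rse_fcm(X, c, ns, m=2.0, seed=0):
    """Random sampling plus extension: cluster ns sampled points, extend to all of X."""
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), size=ns, replace=False)]
    init = sample[rng.choice(ns, size=c, replace=False)]
    centers = weighted_fcm(sample, np.ones(ns), init, m)
    return centers, memberships(X, centers, m)  # extension step

def sp_fcm(X, c, ns, m=2.0, seed=0):
    """Single pass over X in chunks of ns points, carrying weighted centers forward."""
    rng = np.random.default_rng(seed)
    pts = X[:ns]
    w = np.ones(len(pts))
    init = pts[rng.choice(len(pts), size=c, replace=False)]
    centers = weighted_fcm(pts, w, init, m)
    for start in range(ns, len(X), ns):
        # Condense everything seen so far into c centers weighted by total membership.
        w_centers = (memberships(pts, centers, m) * w[:, None]).sum(axis=0)
        chunk = X[start:start + ns]
        pts = np.vstack([centers, chunk])
        w = np.concatenate([w_centers, np.ones(len(chunk))])
        centers = weighted_fcm(pts, w, centers, m)
    # Assumed convenience step: memberships of the full dataset for the final centers.
    return centers, memberships(X, centers, m)
```

Because `sp_fcm` only ever sees one chunk plus `c` weighted centers, its result depends on the order of the rows. Shuffling the data first is one way to avoid the drop noted in Table 1 for data processed in the order it arrives.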
IV. CONCLUSION
In this paper, we compared various FCM techniques based on execution time, cluster quality, and their merits and
demerits. MSERFCM and GoFCM are better than rseFCM and spFCM in terms of runtime and performance. spFCM and oFCM have
the same runtime complexity, and oFCM is slow compared with the other clustering algorithms. Based on primary factors
such as execution time and cluster quality, ELM K-means and ELM NMF are suitable algorithms for efficient clustering
of big data.
REFERENCES
[1] Adil Fahad, Najlaa Alshatri, Zahir Tari, Abdullah Alamri, Ibrahim Khalil, Albert Y. Zomaya, Sebti Foufou, and Abdelaziz Bouras, “Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis”, in IEEE Transactions on Emerging Topics in Computing, vol. 2, September 2014.
[2] Saurabh Arora, Inderveer Chana, “A Survey of Clustering Techniques for Big Data Analysis”, in 5th International Conference – Confluence The Next Generation Information Technology Summit, 2014.
[3] Dharmarajan A, Velmurugan T, “Applications of Partition Based Clustering Algorithms: A Survey”, in International Conference on Computational Intelligence and Computing Research, 2013.
[4] Atiya Kazi, Prof. D.T. Kurian, “A Survey of Data Clustering Techniques”, in International Journal of Engineering Research & Technology, vol. 3, Issue 10, October 2014.
[5] Divya Sivanandini L, Mohan Raj M, “A Survey on Data Clustering Algorithms Based on Fuzzy Techniques”, in International Journal of Science and Research (IJSR), vol. 2, Issue 4, April 2013.
[6] P. IndiraPriya, Dr. D. K. Ghosh, “A Survey on Different Clustering Algorithms in Data Mining Technique”, in International Journal of Modern Engineering Research, vol. 3, Issue 1, pp. 267-274, Jan - Feb 2013.
[7] Timothy C. Havens, James C. Bezdek, Christopher Leckie, Lawrence O. Hall, and Marimuthu Palaniswami, “Fuzzy c-Means Algorithms for Very Large Data”, in IEEE Transactions on Fuzzy Systems, vol. 20, No. 6, December 2012.
[8] Jonathon K. Parker, “Accelerating Fuzzy c-means using an estimated subsample size”, in IEEE Transactions on Fuzzy Systems, vol. 22, Issue No. 5, October 2014.
[9] https://books.google.co.in/books?id=4BS6BQAAQBAJ&pg=PA225&lpg=PA225&dq=rsiofcm&source=bl&ots=myd3zm7kPb&sig=aHjFk9QVrBHA_jEwhIUXj6_xpno&hl=en&sa=X&ved=0CCQQ6AEwAWoVChMI96HEuM76xwIVAX4aCh0KSQnO#v=onepage&q=rsio-fcm&f=false
[10] Dhanesh Kothari, S. Thavasi Narayanan, K. Kiruthika Devi, “Extended Fuzzy c-means with Random Sampling Techniques for Clustering Large Data”, in International Journal of Innovative Research in Advanced Engineering, vol. 1, Issue 1, March 2014.
[11] http://scholarcommons.usf.edu/cgi/viewcontent.cgi?article=6125&context=etd
[12] Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining: Concepts and Techniques”, 3rd ed. Waltham, USA.
[13] Leena H. Patil, Dr. Mohammad Atique, “Candidate Cluster Extraction for Hierarchical Document Clustering”, in International Journal of Computer Science and Engineering, vol. 1, Issue 11, December 2011.
Authors
K. Vidhya (Assist. Prof. (Sr.G)) completed her B.E. (Computer Science and Engineering) at Muthayammal Engineering
College, Namakkal, and her M.E. at Government College of Technology, Coimbatore. She is pursuing research in the
domain of cloud based data analytics.
She is presently working as an Assistant Professor (Sr.G) in the Department of Computer Science and Engineering at
KPR Institute of Engineering and Technology, Coimbatore. She has 8.7 years of experience in the field of education.
Email id: [email protected]
R. Sivaramakrishnan (Assistant Professor) completed his B.E. (Computer Science and Engineering) at Tamilnadu
College of Engineering, Coimbatore, and his M.E. (Computer Science and Engineering) at Anna University Regional
Centre, Coimbatore, with Distinction.
He is presently working as an Assistant Professor in the Department of Computer Science and Engineering at KPR
Institute of Engineering and Technology, Coimbatore. He has more than half a decade of experience in the field of
education. His areas of interest include Cloud Computing, Programming, Theory of Computation and Compiler Design.
Email id: [email protected]
P. Sangeetha completed a B.E. (Computer Science and Engineering) at SNS College of Technology, Coimbatore, and is
now pursuing a PG degree in Computer Science and Engineering at K.P.R Institute of Engineering and Technology,
Coimbatore, Tamil Nadu.
Email id: [email protected]