IJCSMS (International Journal of Computer Science & Management Studies) Vol. 14, Issue 05
Publishing Month: May 2014
ISSN (Online): 2231-5268
www.ijcsms.com
Analysis of Clustering Algorithms in E-Commerce using WEKA

Goldy Rana (1) and Silky Azad (2)

(1) M.Tech. Student, CSE Deptt., Samalkha Group of Institutions, Hathwala, Panipat (Haryana), [email protected]
(2) Assistant Professor, CSE Deptt., Samalkha Group of Institutions, Hathwala, Panipat (Haryana), [email protected]
Abstract
Data clustering is a process of putting similar data into groups. A clustering algorithm partitions a data set into several groups such that the similarity within a group is larger than the similarity between groups. Moreover, the data collected in many problems seem to have inherent properties that lend themselves to natural groupings. Clustering algorithms are used extensively not only to organize and categorize data, but are also useful for data compression and model construction. This paper reviews several types of clustering techniques: k-means clustering, hierarchical clustering, DBSCAN clustering, OPTICS, and the EM algorithm.
Keywords: Data Mining, Clustering, DBSCAN.
I. Introduction

The process of knowledge discovery executes as an iterative sequence of steps: data cleaning, data integration, data selection and transformation, data mining, pattern evaluation, and knowledge presentation. Data mining functionalities include characterization and discrimination, mining of frequent patterns, association and correlation analysis, classification and prediction, cluster analysis, outlier analysis, and evolution analysis. Data mining often involves the analysis of data stored in a data warehouse. Three of the major data mining techniques are regression, classification and clustering [2]. Clustering is the process of grouping the data into classes or clusters, so that objects within a cluster have high similarity to one another and are very dissimilar to objects in other clusters. Dissimilarity is due to the attribute values that describe the objects [1].

The objects are grouped on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. First, the set of data is partitioned into groups on the basis of data similarity (e.g. by clustering), and labels are then assigned to the comparatively small number of groups [1].

II. What is Cluster Analysis?

Cluster analysis [3] groups objects (observations, events) based on the information found in the data describing the objects or their relationships. The goal is that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups [2]. The greater the likeness (or homogeneity) within a group, and the greater the disparity between groups, the better or more distinct the clustering. The definition of what constitutes a cluster is not well defined, and in many applications clusters are not well separated from one another. Nonetheless, most cluster analysis seeks, as a result, a crisp classification of the data into non-overlapping groups.

To better understand the difficulty of deciding what constitutes a cluster, consider figures 1a through 1b, which show fifteen points and three different ways that they can be divided into clusters. If we allow clusters to be nested, then the most reasonable interpretation of the structure of these points is that there are two clusters, each of which has three sub-clusters. However, the apparent division of the two larger clusters into three sub-clusters may simply be an artifact of the human visual system. Finally, it may not be unreasonable to say that the points form four clusters. Thus, we stress once again that the definition of what constitutes a cluster is imprecise, and the best definition depends on the type of data and the desired results [2].
Figure 1: Initial Points or Data in the Data Warehouse [2]

III. Data Clustering Techniques

In this section a detailed discussion of each technique is presented. Implementation and results are presented in the following sections [4].

A) K-Means Clustering
It is a partitioning technique which finds mutually exclusive clusters of spherical shape. It generates a specific number of disjoint, flat (non-hierarchical) clusters. A statistical method can be used to assign rank values to categorical data before clustering; here, categorical data have been converted into numeric form by assigning rank values [5]. The K-Means algorithm organizes objects into k partitions, where each partition represents a cluster. We start out with an initial set of means and classify cases based on their distances to the cluster centers. Next, we compute the cluster means again, using the cases that are assigned to each cluster; then, we reclassify all cases based on the new set of means. We keep repeating this step until the cluster means do not change between successive steps. Finally, we calculate the means of the clusters once again and assign the cases to their permanent clusters [4].
i) K-Means Algorithm Properties

• There are always K clusters.
• There is always at least one item in each cluster.
• The clusters are non-hierarchical and they do not overlap.
• Every member of a cluster is closer to its own cluster than to any other cluster, because closeness does not always involve the 'center' of clusters.

ii) K-Means Algorithm Process

• The dataset is partitioned into K clusters and the data points are randomly assigned to the clusters, resulting in clusters that have roughly the same number of data points.
• For each data point, calculate the distance from the data point to each cluster.
• If the data point is closest to its own cluster, leave it where it is. If the data point is not closest to its own cluster, move it into the closest cluster.
• Repeat the above step until a complete pass through all the data points results in no data point moving from one cluster to another. At this point the clusters are stable and the clustering process ends.
• The choice of initial partition can greatly affect the final clusters that result, in terms of inter-cluster and intra-cluster distances and cohesion [6].
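As an illustration of the process just described, the following is a minimal sketch in Python with NumPy. The paper's experiments use WEKA rather than Python, so the synthetic data, the value of K and all identifiers below are assumptions made purely for illustration.

import numpy as np

def k_means(points, k, seed=0):
    # points: (n, d) array of data points; k: desired number of clusters.
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign every point to one of the K clusters.
    labels = rng.integers(0, k, size=len(points))
    while True:
        # Compute the mean (centre) of each current cluster.
        # (Handling of clusters that become empty is omitted for brevity.)
        centres = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 2: reassign each point to the cluster whose centre is closest.
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop when a full pass moves no point between clusters.
        if np.array_equal(new_labels, labels):
            return labels, centres
        labels = new_labels

# Illustrative usage on synthetic 2-D data (not the paper's clothing dataset).
data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
cluster_labels, cluster_centres = k_means(data, k=2)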
B) Hierarchical Clustering

It builds a cluster hierarchy or, in other words, a tree of clusters, also known as a dendrogram. Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent.

Agglomerative (bottom up) [7]:
1. Start with each point as a singleton cluster.
2. Recursively merge two or more appropriate clusters.
3. Stop when k clusters have been formed.

Divisive (top down):
1. Start with one big cluster.
2. Recursively divide it into smaller clusters.
3. Stop when k clusters have been formed.
General steps of Hierarchical Clustering
Given a set of N items to be clustered, and an N*N
distance (or similarity) matrix, the basic process of
hierarchical clustering (defined by S.C. Johnson in
1967) is this [7]:
• Start by assigning each item to a cluster, so that if we have N items, we now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
• Find the closest (most similar) pair of clusters and merge them into a single cluster, so that we now have one cluster less.
• Compute the distances (similarities) between the new cluster and each of the old clusters.
• Repeat steps 2 and 3 until all items are clustered into K clusters.
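A compact sketch of the same agglomerative procedure is given below, using SciPy's hierarchical-clustering routines in place of WEKA's hierarchical clusterer; the random data and the cut at K = 2 clusters are illustrative assumptions, not values taken from the paper.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

points = np.random.rand(15, 2)              # illustrative data (N = 15 items)

# Pairwise distance matrix between the N items (condensed form).
distances = pdist(points, metric='euclidean')

# Repeatedly merge the closest pair of clusters, recording the whole merge
# tree (dendrogram); 'single' linkage uses the closest-pair merging rule.
merge_tree = linkage(distances, method='single')

# Cut the tree so that exactly K clusters remain (the stopping rule above).
K = 2
labels = fcluster(merge_tree, t=K, criterion='maxclust')
print(labels)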
C) DBSCAN Clustering

DBSCAN (Density Based Spatial Clustering of Applications with Noise) [4] grows clusters according to the density of neighborhood objects. It is based on the concepts of "density reachability" and "density connectivity", both of which depend on two input parameters: the size of the epsilon neighborhood e and the minimum number of points in terms of the local distribution of nearest neighbors. The e parameter controls the size of the neighborhoods and hence the size of the clusters. The algorithm starts with an arbitrary point that has not been visited [7]. The point's e-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started; otherwise the point is labeled as noise. The minimum-points parameter affects the detection of outliers. DBSCAN is targeted at low-dimensional spatial data; DENCLUE is a related density-based algorithm [8].
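A minimal sketch of the same idea with scikit-learn's DBSCAN follows (the paper's experiments use WEKA); eps corresponds to the epsilon neighborhood size and min_samples to the minimum-points parameter, and both values, like the synthetic data, are assumptions chosen only for illustration.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered points that should end up as noise.
X = np.vstack([rng.normal(0.0, 0.3, size=(40, 2)),
               rng.normal(4.0, 0.3, size=(40, 2)),
               rng.uniform(-2.0, 6.0, size=(5, 2))])

# eps: radius of the epsilon neighborhood; min_samples: how many points a
# neighborhood must contain before it can seed (grow) a cluster.
model = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Points labelled -1 were not density-reachable from any cluster: noise.
print("labels found:", sorted(set(model.labels_)))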
D) Farthest First Clustering

Farthest First is a variant of K-Means that places each cluster centre, in turn, at the point farthest from the existing cluster centres. This point must lie within the data area. This greatly speeds up the clustering in most cases, since less reassignment and adjustment is needed [9].
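Scikit-learn has no Farthest First clusterer, so the sketch below only illustrates the centre-selection idea described above (each new centre is the data point farthest from the centres already chosen); it is an assumed, simplified stand-in for WEKA's FarthestFirst, not the paper's implementation.

import numpy as np

def farthest_first_centres(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # Start from one randomly chosen data point.
    centres = [points[rng.integers(len(points))]]
    while len(centres) < k:
        # Distance from every point to its nearest already-chosen centre.
        nearest = np.min([np.linalg.norm(points - c, axis=1) for c in centres], axis=0)
        # The next centre is the point farthest from all existing centres.
        centres.append(points[int(np.argmax(nearest))])
    return np.array(centres)

# Illustrative usage: pick 2 centres, then assign points to the nearest one.
X = np.random.rand(100, 2)
centres = farthest_first_centres(X, k=2)
labels = np.argmin(np.linalg.norm(X[:, None] - centres[None, :], axis=2), axis=1)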
E) EM Algorithm

The EM algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables [7]. The EM iteration alternates between performing an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the current estimate of the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.

1. Expectation: Fix the model and estimate the missing labels.
2. Maximization: Fix the missing labels (or a distribution over the missing labels) and find the model that maximizes the expected log-likelihood of the data.

General EM Algorithm in English

Alternate the following steps until the model parameters stop changing appreciably:
E step: Estimate the distribution over labels given a certain fixed model.
M step: Choose new parameters for the model to maximize the expected log-likelihood of the observed data and hidden variables [7].
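As a concrete instance of the E and M steps above, the sketch below fits a Gaussian mixture with scikit-learn's EM implementation (WEKA's EM clusterer plays this role in the paper); the number of components and the synthetic data are assumptions for illustration.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])

# fit() alternates E steps (posterior distribution over the hidden component
# labels) and M steps (re-estimating means, covariances and mixing weights)
# until the expected log-likelihood stops improving.
gmm = GaussianMixture(n_components=2, max_iter=100, random_state=0).fit(X)

hard_labels = gmm.predict(X)        # most likely component for each point
posteriors = gmm.predict_proba(X)   # the E-step distribution over labels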
F) OPTICS

OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based method that generates an augmented ordering of the data representing its clustering structure [4]. It is a generalization of DBSCAN to multiple ranges, effectively replacing the e parameter with a maximum search radius that mostly affects performance; MinPts then essentially becomes the minimum cluster size to find. It is an algorithm for finding density-based clusters in spatial data which addresses one of DBSCAN's major weaknesses, namely detecting meaningful clusters in data of varying density. Its output is a cluster ordering: a linear list of all objects under analysis that represents the density-based clustering structure of the data. The epsilon parameter is not strictly necessary and can be set to its maximum value. OPTICS abstracts from DBSCAN by assigning each point a "core distance", which describes the distance to its MinPts-th nearest point. Both the core distance and the reachability distance are undefined if no sufficiently dense cluster (w.r.t. the epsilon parameter) is available [7].
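A brief scikit-learn sketch of OPTICS as described above follows; max_eps is the maximum search radius and min_samples acts roughly as the minimum cluster size. These values, together with the synthetic data, are illustrative assumptions rather than settings from the paper.

import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
# Two clusters of clearly different densities, which DBSCAN with a single
# eps value would struggle to separate cleanly.
X = np.vstack([rng.normal(0.0, 0.2, size=(50, 2)),
               rng.normal(3.0, 0.8, size=(50, 2))])

# max_eps bounds the neighborhood search radius (mainly a performance knob);
# min_samples is essentially the smallest cluster size to look for.
optics = OPTICS(min_samples=5, max_eps=np.inf).fit(X)

# The cluster ordering and reachability distances encode the density-based
# clustering structure; labels_ holds the extracted clusters (-1 = noise).
ordering = optics.ordering_
reachability = optics.reachability_[ordering]
print(sorted(set(optics.labels_)))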
IV. E-Commerce in Data Mining

Electronic commerce processes and data mining tools have revolutionized many companies. The data that businesses collect about customers and their transactions are among the greatest assets of those businesses. Data mining is a set of automated techniques used to extract buried or previously unknown pieces of information from large databases, using different criteria, which makes it possible to discover patterns and relationships. We survey articles that are specific to data mining implementations in e-commerce. The salient applications of data mining techniques are presented below.

A) Customer Profiling

It may be observed that customers drive the revenues of any organization. Acquiring new customers, delighting and retaining existing customers, and predicting buyer behavior will improve the availability of products and services and hence the profits. Thus the end goal of any data mining exercise in e-commerce is to improve processes that contribute to delivering value to the end customer. Consider an on-line store like http://www.dell.com, where the customer can configure a PC of his or her choice, place an order for it, track its movement, and pay for the product and services [10].
With the technology behind such a web site, Dell has the opportunity to make the retail experience exceptional. At the most basic level, the information available in web log files can detect what prospective customers are seeking from a site. Companies like Dell provide their customers access to details about all of the systems and configurations they have purchased, so they can incorporate that information into their capacity planning and infrastructure integration. Back-end technology systems for the website include sophisticated data mining tools that take care of knowledge representation of customer profiles and predictive modeling of scenarios of customer interactions. For example, once a customer has purchased a certain number of servers, they are likely to need additional routers, switches, load balancers, backup devices, etc. Rule-mining based systems could be used to propose such alternatives to the customers.
B) DM in Recommendation Systems

Systems have also been developed to keep customers automatically informed of important events of interest to them. The article by Jeng & Drissi (2000) [12] discusses an intelligent framework called PENS that has the ability not only to notify customers of events, but also to predict events and event classes that are likely to be triggered by customers [11]. The event notification system in PENS has the following components: event manager, event channel manager, registries, and proxy manager. The event-prediction system is based on association rule mining and clustering algorithms. The PENS system is used to actively help an e-commerce service provider to better forecast the demand for product categories. Data mining has also been applied in detecting how customers may respond to promotional offers made by a credit card e-commerce company [13]; techniques including fuzzy computing and interval computing are used to generate if-then-else rules.

Niu et al (2002) [14] present a method to build customer profiles in e-commerce settings, based on a product hierarchy, for more effective personalization. They divide each customer profile into three parts: a basic profile learned from customer demographic data, a preference profile learned from behavioural data, and a rule profile mainly referring to association rules. Based on the customer profiles, the authors generate two kinds of recommendations: interest recommendations and association recommendations. They also propose a special data structure called a profile tree for effective searching and matching [11].
C) DM and Multimedia E-Commerce

Applications in virtual multimedia catalogs are highly interactive, as in e-malls selling multimedia-content-based products. It is difficult in such situations to estimate the resource demands required for the presentation of catalog contents. Hollfelder et al [15] propose a method to predict presentation resource demands in interactive multimedia catalogs. The prediction is based on the results of mining the virtual mall action log file, which contains information about
previous user interests and browsing and buying
behavior [11].
V. Implementation

The clustering is performed on a clothing dataset downloaded from the internet, and the results are analyzed using the WEKA machine learning tool. The comparison is made between the number of clusters and the size of each cluster found by each algorithm. The comparison is shown in the table below:
Table 1: Comparison of Clustering Algorithms

Name of clustering algorithm    Number of clusters    Size of clusters
K-means                         2                     48%, 53%
Hierarchical                    2                     100%, 0%
EM                              4                     29%, 11%, 20%, 41%
Farthest First                  2                     100%, 1%
DB Scan                         1                     100%
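The table above was produced with WEKA on a clothing dataset that is not reproduced here. Purely to illustrate how such a comparison can be generated programmatically, the sketch below runs several scikit-learn clusterers on a synthetic dataset and reports the number and relative size of the clusters each finds; the algorithms, parameters and data are assumptions, and Farthest First is omitted because scikit-learn has no equivalent.

import numpy as np
from sklearn.cluster import DBSCAN, OPTICS, AgglomerativeClustering, KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

def cluster_sizes(labels):
    # Share of the points falling in each cluster, as percentages.
    labels = np.asarray(labels)
    return {int(c): round(100.0 * np.mean(labels == c), 1) for c in np.unique(labels)}

results = {
    "K-means":      KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X),
    "Hierarchical": AgglomerativeClustering(n_clusters=2).fit_predict(X),
    "EM":           GaussianMixture(n_components=4, random_state=0).fit(X).predict(X),
    "DB Scan":      DBSCAN(eps=0.5, min_samples=5).fit_predict(X),
    "OPTICS":       OPTICS(min_samples=5).fit_predict(X),
}

for name, labels in results.items():
    n_clusters = len(set(labels) - {-1})   # -1 marks noise, not a cluster
    print(name, n_clusters, cluster_sizes(labels))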
Figure 2: Comparison of the number of clusters found by K-Means, Hierarchical, EM, Farthest First and DB Scan

VI. Conclusion

Clustering is the process of grouping data into classes or clusters, so that objects within a cluster have high similarity to one another and are very dissimilar to objects in other clusters. Dissimilarity is based on the attribute values describing the objects. The objects are clustered or grouped on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. First the set of data is partitioned into groups based on data similarity (e.g. using clustering), and labels are then assigned to the relatively small number of groups. This paper analyzes the major clustering algorithms: K-Means, Farthest First, Hierarchical clustering, DBSCAN and the EM clustering algorithm, and also explains the role of data mining in e-commerce.

References

[1] Vishal Shrivastava, Prem Narayan Arya, "A Study of Various Clustering Algorithms on Retail Sales Data", Volume 1, No. 2, September-October 2012.
[2] Narendra Sharma, Aman Bajpai, Ratnesh Litoriya, "Comparison the various clustering algorithms of weka tools", Volume 2, Issue 5, May 2012.
[3] Yuni Xia, Bowei Xi, "Conceptual Clustering Categorical Data with Uncertainty", Indiana University - Purdue University Indianapolis, Indianapolis, IN 46202, USA.
[4] Aastha Joshi, "A Review: Comparative Study of Various Clustering Techniques in Data Mining", Volume 3, Issue 3, March 2013.
[5] Patnaik, Sovan Kumar, Soumya Sahoo, and Dillip Kumar Swain, "Clustering of Categorical Data by Assigning Rank through Statistical Approach", International Journal of Computer Applications 43.2: 1-3, 2012.
[6] Improved Outcomes Software, K-Means Clustering Overview. Retrieved from: http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/KMeans_Clustering_Overview.html [Accessed 22/02/2013].
[7] Manish Verma, Mauly Srivastava, Neha Chack, Atul Kumar Diswar, Nidhi Gupta, "A Comparative Study of Various Clustering Algorithms in Data Mining", Vol. 2, Issue 3, May-Jun 2012.
[8] Han, J., Kamber, M. 2012. Data Mining: Concepts and Techniques, 3rd ed., 443-49.
[9] Pallavi, Sunila Godara, "A Comparative Performance Analysis of Clustering Algorithms", Vol. 1, Issue 3, pp. 441-445.
[10] Rastegari, Hamid, Md Sap, and Mohd Noor, "Data mining and e-commerce: methods, applications, and challenges", Jurnal Teknologi Maklumat 20.2 (2008): 116-128.
[11] N. R. Srinivasa Raghavan, "Data mining in e-commerce: A survey", Vol. 30, Parts 2 & 3, April/June 2005, pp. 275-289.
[12] Jeng J. J., Drissi Y. 2000. PENS: a predictive event notification system for e-commerce environment. In The 24th Annual Int. Computer Software and Applications Conference, COMPSAC 2000, pp. 93-98.
[13] Zhang Y. Q., Shteynberg M., Prasad S. K., Sunderraman R. 2003. Granular fuzzy web intelligence techniques for profitable data mining. In 12th IEEE Int. Conf. on Fuzzy Systems, FUZZ '03 (New York: IEEE Comput. Soc.), pp. 1462-1464.
[14] Niu L., Yan X. W., Zhang C. Q., Zhang S. C. 2002. Product hierarchy-based customer profiles for electronic commerce recommendation. In Int. Conf. on Machine Learning and Cybernetics, pp. 1075-1080.
[15] Hollfelder S., Oria V., Ozsu M. T. 2000. Mining user behavior for resource prediction in interactive electronic malls. In IEEE Int. Conf. on Multimedia and Expo (New York: IEEE Comput. Soc.), pp. 863-866.