IJCSMS (International Journal of Computer Science & Management Studies)
Vol. 14, Issue 05, Publishing Month: May 2014
ISSN (Online): 2231-5268
www.ijcsms.com

Analysis of Clustering Algorithms in E-Commerce using WEKA

Goldy Rana (1) and Silky Azad (2)

(1) M.Tech. Student, CSE Deptt., Samalkha Group of Institutions, Hathwala, Panipat (Haryana), [email protected]
(2) Assistant Professor, CSE Deptt., Samalkha Group of Institutions, Hathwala, Panipat (Haryana), [email protected]

Abstract

Data clustering is the process of putting similar data into groups: a clustering algorithm partitions a data set into several groups such that the similarity within a group is larger than the similarity among groups. The set of data is first partitioned into groups on the basis of data similarity (e.g., by clustering), and labels are then assigned to the comparatively small number of groups [1]. Moreover, the data collected in many problems seem to have inherent properties that lend themselves to natural groupings. Clustering algorithms are therefore used extensively not only to organize and categorize data, but also for data compression and model construction. This paper reviews five clustering techniques: k-Means, Hierarchical, DBSCAN, OPTICS, and the EM algorithm.

Keywords: Data Mining, Clustering, DBSCAN.

I. Introduction

The process of knowledge discovery executes as an iterative sequence of steps: data cleaning, data integration, data selection and transformation, data mining, pattern evaluation, and knowledge presentation. Data mining functionalities include characterization and discrimination, mining frequent patterns, association and correlation, classification and prediction, cluster analysis, outlier analysis, and evolution analysis. Data mining often involves the analysis of data stored in a data warehouse; three of its major techniques are regression, classification, and clustering [2].

Clustering is the process of grouping data into classes or clusters, so that objects within a cluster have high similarity to one another and are very dissimilar to objects in other clusters. Dissimilarity arises from the attribute values that describe the objects [1]. The objects are grouped on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.

To better understand the difficulty of deciding what constitutes a cluster, consider figures 1a through 1b, which show fifteen points and three different ways that they can be divided into clusters. If we allow clusters to be nested, the most reasonable interpretation of the structure of these points is that there are two clusters, each of which has three sub-clusters. However, the apparent division of the two larger clusters into three sub-clusters may simply be an artifact of the human visual system. Finally, it may not be unreasonable to say that the points form four clusters. Thus, we stress once again that the definition of what constitutes a cluster is imprecise, and the best definition depends on the type of data and the desired results [2].

II. What is Cluster Analysis?

Cluster analysis [3] groups objects (observations, events) based on the information found in the data describing the objects or their relationships. The goal is that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups [2]. The greater the likeness (or homogeneity) within a group, and the greater the disparity between groups, the better or more distinct the clustering. The definition of what constitutes a cluster is not well defined, and in many applications clusters are not well separated from one another. Nonetheless, most cluster analysis seeks, as a result, a crisp classification of the data into non-overlapping groups.
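The idea that similarity within a group should exceed similarity among groups can be made concrete with a small sketch. This is an illustration of ours, not from the paper: the helper names and toy points are invented, and plain Euclidean distance stands in for whatever (dis)similarity measure a given application would use.

```python
# Illustrative sketch (not from the paper): compare the average distance
# within clusters to the average distance between clusters. A "good"
# clustering has a much smaller intra-cluster average.
import itertools
import math

def avg_intra(clusters):
    """Mean pairwise distance between points inside the same cluster."""
    pairs = [(p, q) for c in clusters for p, q in itertools.combinations(c, 2)]
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def avg_inter(clusters):
    """Mean pairwise distance between points of different clusters."""
    pairs = [(p, q) for c1, c2 in itertools.combinations(clusters, 2)
             for p in c1 for q in c2]
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

# Two tight, well-separated groups of 2-D points (invented data).
groups = [[(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)],
          [(5.0, 5.0), (5.1, 5.2), (4.9, 5.1)]]
```

For these points the intra-cluster average comes out well under 1 while the inter-cluster average is around 7, matching the intuition that a distinct clustering has small within-group and large between-group distances.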
First of all, the set of data is partitioned into groups on the basis of data similarity, and labels are then assigned to the groups [1].

Figure 1: Initial Points or Data in the Data Warehouse [2]

III. Data Clustering Techniques

In this section a detailed discussion of each technique is presented. Implementation and results are presented in the following sections [4].

A) K-Means Clustering

K-Means is a partitioning technique which finds mutually exclusive clusters of spherical shape. It generates a specific number of disjoint, flat (non-hierarchical) clusters. A statistical method can be used to cluster categorical data by assigning rank values; here, categorical data have been converted into numeric data by assigning rank values [5]. The K-Means algorithm organizes objects into k partitions, where each partition represents a cluster. We start with an initial set of means and classify cases based on their distances to the cluster centers. Next, we recompute the cluster means using the cases assigned to each cluster; then we reclassify all cases based on the new set of means. We repeat this step until the cluster means do not change between successive iterations. Finally, we calculate the cluster means once again and assign the cases to their permanent clusters [4].

K-Means Algorithm Process
* The dataset is partitioned into K clusters and the data points are randomly assigned to the clusters, resulting in clusters that have roughly the same number of data points.
* For each data point, calculate the distance from the data point to each cluster. If the data point is closest to its own cluster, leave it where it is; otherwise, move it into the closest cluster.
* Repeat the above step until a complete pass through all the data points results in no data point moving from one cluster to another.
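The process just described can be sketched in a few lines of pure Python. This is a minimal illustration of ours, not the paper's implementation (the paper's experiments use WEKA); the function names, the seed, and the toy data are invented.

```python
# Minimal k-means sketch, following the assign/recompute loop described above.
import random

def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def dist2(a, b):
    """Squared Euclidean distance (monotone in distance, so fine for argmin)."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)        # k distinct points as initial means
    while True:
        # Assignment step: each point joins its closest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, means[i]))].append(p)
        # Update step: recompute each mean from its members
        # (keep the old mean if a cluster happens to be empty).
        new_means = [centroid(c) if c else means[i]
                     for i, c in enumerate(clusters)]
        if new_means == means:           # no mean moved -> stable clustering
            return clusters
        means = new_means

# Two well-separated blobs should come out as two clusters of three points.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
        (5.0, 5.0), (5.1, 5.2), (5.2, 4.9)]
clusters = kmeans(data, 2)
```

Note how the sketch also exhibits the properties listed in the text: there are always K clusters, the clusters do not overlap, and each point sits with its closest mean.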
At this point the clusters are stable and the clustering process ends. The choice of initial partition can greatly affect the final clusters that result, in terms of inter-cluster and intra-cluster distances and cohesion [6].

K-Means Algorithm Properties
* There are always K clusters.
* There is always at least one item in each cluster.
* The clusters are non-hierarchical and they do not overlap.
* Every member of a cluster is closer to its cluster than to any other cluster, because closeness does not always involve the 'center' of clusters.

B) Hierarchical Clustering

Hierarchical clustering builds a cluster hierarchy or, in other words, a tree of clusters, also known as a dendrogram. Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent.

Agglomerative (bottom up) [7]:
1. Start with each point as a singleton cluster.
2. Recursively merge the two (or more) most appropriate clusters.
3. Stop when k clusters have been formed.

Divisive (top down):
1. Start with one big cluster.
2. Recursively divide it into smaller clusters.
3. Stop when k clusters have been formed.

General steps of Hierarchical Clustering

Given a set of N items to be clustered and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this [7]:
* Start by assigning each item to a cluster, so that if we have N items, we now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
* Find the closest (most similar) pair of clusters and merge them into a single cluster, so that we now have one cluster less.
* Compute distances (similarities) between the new cluster and each of the old clusters.
* Repeat steps 2 and 3 until all items are clustered into K clusters.

C) DBSCAN Clustering

DBSCAN (Density Based Spatial Clustering of Applications with Noise) [4] grows clusters according to the density of neighborhood objects. It is based on the concepts of "density reachability" and "density connectivity", both of which depend on two input parameters: the size e of the epsilon neighborhood and the minimum number of points in terms of the local distribution of nearest neighbors. The e parameter controls the size of the neighborhoods and hence of the clusters. DBSCAN starts with an arbitrary point that has not been visited [7]. The point's e-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started; otherwise the point is labeled as noise. The minimum-points parameter impacts the detection of outliers. A related density-based method for spatial data, the DENCLUE algorithm, is discussed in [8].

D) Farthest First Clustering

Farthest First is a variant of K-Means that places each cluster centre in turn at the point furthest from the existing cluster centres. This point must lie within the data area. Farthest First greatly speeds up the clustering in most cases, since less reassignment and adjustment is needed [9].

E) EM Algorithm

The EM algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models where the model depends on unobserved latent variables [7]. The EM iteration alternates between an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the current estimate of the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.

1. Expectation: fix the model and estimate the missing labels.
2. Maximization: fix the missing labels (or a distribution over the missing labels) and find the model that maximizes the expected log-likelihood of the data.

General EM Algorithm in English: alternate the following steps until the model parameters do not change much:
* E step: estimate the distribution over labels given a certain fixed model.
* M step: choose new parameters for the model to maximize the expected log-likelihood of the observed data and hidden variables [7].

F) OPTICS

OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based method that generates an augmented ordering of the data's clustering structure [4]. It is a generalization of DBSCAN to multiple ranges, effectively replacing the e parameter with a maximum search radius that mostly affects performance; MinPts then essentially becomes the minimum cluster size to find. OPTICS finds density-based clusters in spatial data and addresses one of DBSCAN's major weaknesses: detecting meaningful clusters in data of varying density. It outputs a cluster ordering, a linear list of all objects under analysis, which represents the density-based clustering structure of the data. The epsilon parameter is not strictly necessary and can simply be set to the maximum value. OPTICS abstracts from DBSCAN in that each point is assigned a "core distance", which describes the distance to its MinPts-th nearest point. Both the core distance and the reachability distance are undefined if no sufficiently dense cluster (w.r.t. the epsilon parameter) is available [7].

IV. E-Commerce in Data Mining

Electronic commerce processes and data mining tools have revolutionized many companies. The data that businesses collect about customers and their transactions are among the greatest assets of those businesses.
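Stepping back to Section III for a moment, the E step / M step alternation of the EM algorithm can be made concrete with a deliberately simplified example. This is a sketch of ours, not from the paper: a two-component, one-dimensional Gaussian mixture in which the variances are fixed to 1 and the mixture weights are equal (the full algorithm would re-estimate those too), with invented data and starting means.

```python
# EM sketch for a 1-D mixture of two unit-variance Gaussians with equal
# weights. Only the two means are unknown; the component labels are the
# hidden variables.
import math

def em_two_means(xs, init_means, iters=50):
    mu1, mu2 = init_means
    for _ in range(iters):
        # E step: responsibility of component 1 for each point under the
        # current means (a soft version of the missing labels).
        r = []
        for x in xs:
            p1 = math.exp(-0.5 * (x - mu1) ** 2)
            p2 = math.exp(-0.5 * (x - mu2) ** 2)
            r.append(p1 / (p1 + p2))
        # M step: re-estimate each mean as the responsibility-weighted
        # average of the data, maximizing the expected log-likelihood.
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / sum(1 - ri for ri in r)
    return mu1, mu2

# Points clustered around 0 and around 8; EM should recover means near both.
xs = [-0.2, 0.0, 0.3, 7.8, 8.0, 8.1]
mu1, mu2 = em_two_means(xs, (1.0, 6.0))
```

Because the two groups are well separated, the responsibilities quickly saturate near 0 or 1 and the recovered means settle close to the two group averages.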
Data mining is a set of automated techniques used to extract buried or previously unknown pieces of information from large databases using different criteria, which makes it possible to discover patterns and relationships. We survey articles that are very specific to data mining implementations in e-commerce, and the salient applications of data mining techniques are presented.

B) DM in Recommendation Systems

Systems have also been developed to keep customers automatically informed of important events of interest to them. The article by Jeng & Drissi (2000) [12] discusses an intelligent framework called PENS that has the ability not only to notify customers of events, but also to predict events and event classes that are likely to be triggered by customers [11]. The event notification system in PENS has the following components: event manager, event channel manager, registries, and proxy manager. The event-prediction system is based on association rule-mining and clustering algorithms. The PENS system is used to actively help an e-commerce service provider to better forecast the demand for product categories. Data mining has also been applied to detecting how customers may respond to promotional offers made by a credit card e-commerce company [13]; techniques including fuzzy computing and interval computing are used to generate if-then-else rules.

A) Customer Profiling

It may be observed that customers drive the revenues of any organization. Acquiring new customers, delighting and retaining existing customers, and predicting buyer behavior improve the availability of products and services and hence the profits. Thus the end goal of any data mining exercise in e-commerce is to improve processes that contribute to delivering value to the end customer. Consider an on-line store like http://www.dell.com, where the customer can configure a PC of his/her choice, place an order for it, track its movement, and pay for the product and services [10].
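The association rule-mining mentioned above can be illustrated with a tiny frequent-pair miner. This sketch is ours and is not the PENS implementation (the paper does not describe PENS at this level of detail); the transactions, thresholds, and helper names are all invented.

```python
# Toy association-rule sketch: find item pairs bought together often and
# turn them into simple "if X then Y" recommendations with a confidence score.
from itertools import combinations
from collections import Counter

transactions = [                 # invented purchase histories
    {"server", "switch", "backup"},
    {"server", "switch"},
    {"server", "router"},
    {"laptop", "dock"},
]

item_count = Counter(i for t in transactions for i in t)
pair_count = Counter(p for t in transactions
                     for p in combinations(sorted(t), 2))

def rules(min_support=2, min_conf=0.5):
    """Return (antecedent, consequent, confidence) for frequent pairs."""
    out = []
    for (a, b), n in pair_count.items():
        if n < min_support:      # pair not bought together often enough
            continue
        for x, y in ((a, b), (b, a)):
            conf = n / item_count[x]   # P(y in basket | x in basket)
            if conf >= min_conf:
                out.append((x, y, conf))
    return out
```

With `min_support=2` only the (server, switch) pair survives, yielding the rule "customers who bought a switch also bought a server" with confidence 1.0, which a recommender could surface as a suggested alternative or add-on.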
Niu et al (2002) [14] present a method to build customer profiles in e-commerce settings, based on a product hierarchy, for more effective personalization. They divide each customer profile into three parts: a basic profile learned from customer demographic data, a preference profile learned from behavioural data, and a rule profile mainly referring to association rules. Based on customer profiles, the authors generate two kinds of recommendations: interest recommendations and association recommendations. They also propose a special data structure called a profile tree for effective searching and matching [11].

With the technology behind such a web site, Dell has the opportunity to make the retail experience exceptional. At the most basic level, the information available in web log files can detect what prospective customers are seeking from a site. Back-end technology systems for the website include sophisticated data mining tools that take care of knowledge representation of customer profiles and predictive modeling of scenarios of customer interactions. Companies like Dell also provide their customers access to details about all of the systems and configurations they have purchased, so that customers can incorporate the information into their capacity planning and infrastructure integration. For example, once a customer has purchased a certain number of servers, they are likely to need additional routers, switches, load balancers, backup devices, etc. Rule-mining based systems could be used to propose such alternatives to the customers.

C) DM and Multimedia E-Commerce

Applications in virtual multimedia catalogs are highly interactive, as in e-malls selling multimedia content based products. It is difficult in such situations to estimate the resource demands required for the presentation of catalog contents. Hollfelder et al [15] propose a method to predict presentation resource demands in interactive multimedia catalogs.
The prediction is based on the results of mining the virtual mall action log file, which contains information about previous user interests and browsing and buying behavior [11].

V. Implementation

The clustering is performed on a clothing dataset downloaded from the internet, and the results are analyzed using the WEKA machine learning tool. The comparison is made between the number of clusters and the size of each cluster, as shown in Table 1.

Table 1: Comparison of Clustering Algorithms

Name of clustering algorithm   Number of clusters   Size of clusters
K-Means                        2                    48%, 53%
Hierarchical                   2                    100%, 0%
EM                             4                    29%, 11%, 20%, 41%
Farthest First                 2                    100%, 1%
DB Scan                        1                    100%

VI. Conclusion

Clustering is the process of grouping data into classes or clusters, so that objects within a cluster have high similarity to one another and are very dissimilar to objects in other clusters. Dissimilarity is based on the attribute values describing the objects. The objects are clustered or grouped on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity: first the set of data is partitioned into groups based on data similarity (e.g., using clustering) and labels are then assigned to the relatively small number of groups. This paper analyzed the major clustering algorithms K-Means, Farthest First, Hierarchical, DB Scan, and EM, and also explained the role of data mining in e-commerce.

References

[1] Vishal Shrivastava, Prem Narayan Arya, "A Study of Various Clustering Algorithms on Retail Sales Data", Volume 1, No. 2, September-October 2012.
[2] Narendra Sharma, Aman Bajpai, Ratnesh Litoriya, "Comparison of the various clustering algorithms of WEKA tools", Volume 2, Issue 5, May 2012.
Figure 2: Comparison of the number of clusters found by K-Means, Hierarchical, EM, Farthest First, and DB Scan (bar chart not reproduced).

[3] Yuni Xia, Bowei Xi, "Conceptual Clustering of Categorical Data with Uncertainty", Indiana University - Purdue University Indianapolis, Indianapolis, IN 46202, USA.
[4] Aastha Joshi, "A Review: Comparative Study of Various Clustering Techniques in Data Mining", Volume 3, Issue 3, March 2013.
[5] Patnaik, Sovan Kumar, Soumya Sahoo, and Dillip Kumar Swain, "Clustering of Categorical Data by Assigning Rank through Statistical Approach", International Journal of Computer Applications 43.2: 1-3, 2012.
[6] Improved Outcome Software, K-Means Clustering Overview. Retrieved from: http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/KMeans_Clustering_Overview.html [Accessed 22/02/2013].
[7] Manish Verma, Mauly Srivastava, Neha Chack, Atul Kumar Diswar, Nidhi Gupta, "A Comparative Study of Various Clustering Algorithms in Data Mining", Vol. 2, Issue 3, May-June 2012.
[8] Han, J., Kamber, M. 2012. Data Mining: Concepts and Techniques, 3rd ed., pp. 443-449.
[9] Pallavi, Sunila Godara, "A Comparative Performance Analysis of Clustering Algorithms", Vol. 1, Issue 3, pp. 441-445.
[10] Rastegari, Hamid, Md Sap, and Mohd Noor, "Data mining and e-commerce: methods, applications, and challenges", Jurnal Teknologi Maklumat 20.2 (2008): 116-128.
[11] N. R. Srinivasa Raghavan, "Data mining in e-commerce: A survey", Vol. 30, Parts 2 & 3, April/June 2005, pp. 275-289.
[12] Jeng J. J., Drissi Y. 2000. PENS: a predictive event notification system for e-commerce environment. In The 24th Annu. Int. Computer Software and Applications Conference, COMPSAC 2000, pp. 93-98.
[13] Zhang Y. Q., Shteynberg M., Prasad S. K., Sunderraman R. 2003. Granular fuzzy web intelligence techniques for profitable data mining. In 12th IEEE Int. Conf. on Fuzzy Systems, FUZZ '03 (New York: IEEE Comput. Soc.), pp. 1462-1464.
[14] Niu L., Yan X. W., Zhang C. Q., Zhang S. C. 2002. Product hierarchy-based customer profiles for electronic commerce recommendation. In Int. Conf. on Machine Learning and Cybernetics, pp. 1075-1080.
[15] Hollfelder S., Oria V., Ozsu M. T. 2000. Mining user behavior for resource prediction in interactive electronic malls. In IEEE Int. Conf. on Multimedia and Expo (New York: IEEE Comput. Soc.), pp. 863-866.