Clustering

John Owen Sarah Smith
University of Houston Clear Lake

ABSTRACT

This paper discusses types of clustering from a data mining perspective, and business examples for clustering.

1 INTRODUCTION

The price of data storage has dropped over the last several years, and businesses have been storing data at an increasing rate. The question asked by many businesses is “what do we do with the data we have collected?” The data may have been “cheap” to store in the beginning, but the cost of data storage exceeds the purchase price of the disks. For example, most businesses back up their data, so the more data that is collected, the higher the cost of archival. For data retained in a transactional database, performance becomes an issue as queries and update transactions take longer to process. As the data collected by businesses grows, it becomes increasingly important to make sense of large amounts of data. This data must be grouped in ways that can be interpreted and put to use for the business.

An obvious example of clustering is the common activity of sorting laundry. Dark colors go in one group, light colors in another. The clothes that are dark in color are alike, and they are different from the light colored clothes. Simply put, clustering is the activity of grouping together objects that are like one another and unlike the objects in other clusters. In a data warehouse, clustering “can be viewed as one of finding groupings in a set of events by extremizing some criterion function” [Wan, 1988].

Clustering is a technique that has its roots in statistical analysis. In statistics, one of the first calculations used when analyzing data is variance, which is the average of the squared deviations about the arithmetic mean for a set of numbers [Black, 2006]. Wan identified that one of the most widely used methods of clustering analysis is to use the sum of squares method and apply it to the Euclidean distances from the cluster center.
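The two calculations mentioned above, variance and the sum of squared Euclidean distances from a cluster center, can be illustrated with a minimal Python sketch. The function names and the sample points are hypothetical and chosen only for illustration; they do not come from Wan or Black.

```python
def variance(values):
    """Average of the squared deviations about the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def centroid(points):
    """Cluster center: the coordinate-wise mean of the points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sum_of_squares(points, center):
    """Sum of squared Euclidean distances from each point to the center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, center))
        for p in points
    )


# Hypothetical two-dimensional cluster (illustrative data only).
darks = [(1.0, 1.0), (1.0, 3.0), (3.0, 1.0), (3.0, 3.0)]
center = centroid(darks)              # (2.0, 2.0)
sse = sum_of_squares(darks, center)   # each point contributes 2.0, total 8.0
```

A smaller sum of squares means the points sit closer to their center, which is why it can serve as the criterion function that a clustering algorithm tries to minimize.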
In data mining, clustering is used to give a user a high-level view of what is going on in their database. Clustering can also be performed to make it easier to identify outliers [Berson & Smith, p. 409]. While the clustering algorithms themselves can be fairly complex, the general approach is relatively simple. There are five basic steps in a clustering task [Jain, 1988]:

1. Pattern representation. The analyst identifies the number, type, and scale of features available to the clustering algorithm.
2. Identification of the pattern proximity relative to the data domain. This is usually performed using Euclidean distances.
3. Grouping or clustering of the data.
4. Data abstraction.
5. Assessment of output.

2 TYPES OF CLUSTERING METHODS

There are many clustering methods available. Four of the methods described by Han & Kamber (p. 346-348) are partitioning, hierarchical, density-based, and grid-based.

Partitioning (k-means clustering)

One of the most popular methods of performing clustering is k-means, or partitioning. Unlike the other methods of clustering, k-means requires the analyst to know something about the underlying data. The analyst tells the system how many clusters it should create when analyzing the data. Partitioning consists of the classification of the data into k groups that meet two requirements: each group must contain at least one object, and each object must belong to exactly one group. There are two steps to creating the clusters:

1. “For each data item, assign it to the closest center, resolving ties arbitrarily. A proof can be found in that this phase gives the optimal partition for the given centers.”
2. “Recalculate all the centers. Each center is moved to the geometric centroid of the points assigned to it. A proof can be found in that this phase gives the optimal center.” [Foreman, 2000]

In simple terms, one decides how many clusters there should be, then creates the best fit of points to a cluster.
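The two quoted steps can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used by any of the cited authors. The starting centers are supplied by the caller, ties go to the lowest-numbered center, an empty cluster keeps its previous center, and the `max_iterations` cap is arbitrary; all of these are assumptions made for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Step 1: label each point with the index of its closest center."""
    return [
        min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
        for p in points
    ]


def recalculate(points, labels, centers):
    """Step 2: move each center to the centroid of its assigned points."""
    new_centers = []
    for i, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == i]
        if members:
            new_centers.append(
                tuple(sum(coords) / len(members) for coords in zip(*members))
            )
        else:
            # Assumption: a center with no points stays where it was.
            new_centers.append(old_center)
    return new_centers


def k_means(points, centers, max_iterations=100):
    """Alternate the two steps until the assignment stops changing."""
    labels = None
    for _ in range(max_iterations):
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
        centers = recalculate(points, labels, centers)
    return labels, centers


# Hypothetical data: the analyst asks for k = 2 by passing two start centers.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels, centers = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
# labels is [0, 0, 0, 1, 1, 1]
```

The number of clusters is fixed by how many starting centers are passed in, which mirrors the requirement that the analyst choose k before the analysis begins.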
In the figure below the analyst decided to create two clusters. The algorithm analyzes the data and creates a logical center for each cluster. Once a center, or centroid, has been created, the algorithm identifies the distance between each centroid and each individual data point and assigns that point to the cluster with the nearest centroid.

(Source: k-means clustering, http://www.togaware.com/datamining/survivor/K_Means.html)

Hierarchical

Unlike k-means clustering above, hierarchical clustering does not require the analyst to identify the number of clusters. While k-means clustering is the most prevalent method of clustering, hierarchical clustering is designed primarily for creating micro-clusters in large database sets. Micro-clusters are used when the data is so similar that the difference between the points is statistically insignificant [Yu, 2003]. The hierarchical method is either agglomerative (bottom-up) or divisive (top-down). The agglomerative approach works with data individually at first, and then merges it into clusters until all of the data is in one cluster. The divisive approach works in the opposite way, beginning with a single cluster and breaking the data away until each piece of data is isolated.

(Source: http://genome.imim.es/~eblanco/seminars/docs/clustering/index_types.html#hierarchy)

Density-Based

Density-based algorithms define clusters by the density of the data distribution. “Clusters are formed by connecting neighboring ’core’ objects and those ’noncore’ objects either serve as the boundaries of clusters or become outliers.” [Jaing, 2004] Density-based clustering also does not require the user to identify the number of clusters before beginning the data analysis. This clustering method will continue to grow a given cluster as long as the number of objects in the neighborhood (the density) exceeds a given threshold. This method is useful for dealing with outliers.
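The growth rule described above can be sketched as follows. This is a simplified illustration in the style of the well-known DBSCAN algorithm, which the cited survey is not confirmed to use. The parameter names `radius` and `min_points` are hypothetical, and treating "at least `min_points` neighbors" as the density threshold is an assumption.

```python
def neighbors(points, i, radius):
    """Indices of all points within `radius` of point i (including i)."""
    return [
        j for j, q in enumerate(points)
        if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= radius ** 2
    ]


def density_cluster(points, radius, min_points):
    """Label each point with a cluster id, or -1 if it is an outlier."""
    UNVISITED, OUTLIER = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(points, i, radius)
        if len(seeds) < min_points:
            labels[i] = OUTLIER            # not dense enough to be a core object
            continue
        labels[i] = cluster_id             # i is a core object: start a cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == OUTLIER:
                labels[j] = cluster_id     # non-core object on the boundary
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            reachable = neighbors(points, j, radius)
            if len(reachable) >= min_points:
                queue.extend(reachable)    # j is also core: keep growing
        cluster_id += 1
    return labels


# Hypothetical data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = density_cluster(points, radius=1.5, min_points=3)
# labels is [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The number of clusters is never specified; it falls out of the density threshold, and the isolated point is reported as an outlier rather than forced into a cluster.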
(Source, density-based clustering: http://klimt.iwr.uni-heidelberg.de/mip/research/hader_clust/)

Grid-Based

Grid-based clustering is an adaptation of density-based clustering. In grid-based clustering the data points are placed in a data grid. Each data grid is of equal size and can be decomposed into smaller and smaller data grids depending on the business need and the level of data abstraction required. These grids can be either fixed or adaptive. Fixed data grids are calculated by partitioning the data into discrete non-overlapping clusters. Then the algorithm creates a histogram for all possible data grids. An adaptive data grid does not require the user to identify the number of grids, grid size, or data density [Nagesh]. Instead, the grids are populated, and then histograms are created based on the data in each individual dimension instead of all of the data. This results in increased performance and data accuracy.

3 REQUIREMENTS FOR CLUSTERING

There are requirements for clustering algorithms in data mining (Han & Kamber, p. 337). The algorithms must be:

- Scalable
- Able to deal with different types of attributes
- Able to deal with clusters of varying shapes
- Insensitive to input parameters
- Able to deal with noisy data
- Insensitive to the order of input records
- Able to handle data that is highly dimensional
- Able to work within constraints
- Interpretable and usable

4 BUSINESS USES OF CLUSTERING

There are a number of ways in which data can be clustered and then used to support decision making. Marketing is a classic example of clustering, where potential customers are grouped into categories by common characteristics. Clustering also enables companies to identify outliers, which can be useful for determining which customers are in jeopardy of discontinuing services, or even for detecting credit card fraud. The use of clustering is in no way limited to business support decisions.
A review of ACM journal titles shows clustering being used in fields such as astrophysics, biology, chemistry, medicine, psychology, and sociology. Clustering techniques are being used to answer questions regarding the very makeup of a human being. “In recent years, clustering analysis has even become a valuable and useful tool for gene expression data.” [Tseng, 2005]

Conclusion

Clustering provides businesses a method of making sense of their data. While early clustering techniques were labor intensive and required analysts to pore over mounds of data, today clustering tools are readily available. Cluster analysis tools are built into statistical packages such as SAS, S-Plus, and SPSS [Han & Kamber, p. 336]. If the future of computing is in artificial intelligence, then that future relies on clustering. For an artificial intelligence engine to work properly, it must be capable of unsupervised learning from pattern recognition [Figueiredo, 2002].

REFERENCES

[Wan, 1988] S. J. Wan, S. K. M. Wong, and P. Prusinkiewicz, "An Algorithm for Multidimensional Data Clustering", ACM Transactions on Mathematical Software, Vol. 14, No. 2, June 1988, pp. 153-162.

[Black, 2006] Black, Ken, Business Statistics for Contemporary Decision Making, John Wiley & Sons, Hoboken, NJ.

[Figueiredo, 2002] M. Figueiredo, A. K. Jain, "Unsupervised Learning of Finite Mixture Models", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 3, March 2002.

[Tseng, 2005] Tseng, Vincent S. and Kao, Ching-Pin, "Efficiently Mining Gene Expression Data via a Novel Parameterless Clustering Method", IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 2, No. 4, October-December 2005.

[Jain, 1988] Jain, A. K. and Dubes, R. C., Algorithms for Clustering Data, Prentice-Hall, Upper Saddle River, NJ, 1988.

[Foreman, 2000] Forman, G. and Zhang, B., "Distributed Data Clustering Can Be Efficient and Exact", ACM SIGKDD Explorations, Volume 2, Issue 2, December 2000.

[Yu, 2003] Yu, Hwanjo, Yang, Jiong, and Han, Jiawei, "Classifying Large Data Sets Using SVMs with Hierarchical Clusters", SIGKDD ’03, August 24-27, 2003.

[Nagesh] Nagesh, Harsha, Goil, Sanjay, and Choudhary, Alok, "Adaptive Grids for Clustering Massive Data Sets", Society for Industrial and Applied Mathematics. Online at http://www.siam.org/meetings/sdm01/pdf/sdm01_07.pdf

[Jaing, 2004] Daxin Jiang, Chun Tang, Aidong Zhang, "Cluster Analysis for Gene Expression Data: A Survey", IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 11, pp. 1370-1386, November 2004.