Modelling Clusters of Arbitrary Shape with Agglomerative-Partitional Clustering

Statistical Techniques in Pattern Recognition, Prague, Czech Republic, June 1997

Eric W. Tyree and J. A. Long
Department of Business Computing Systems
City University
Northampton Square
London EC1V 0HB
United Kingdom
Tel. +44 (0171) 477-8551
e-mail: [email protected]

Abstract

A problem with modelling clusters as d-dimensional centroids is that centroids cannot convey much information about cluster shape, i.e. whether a cluster is elongated, circular, irregular, etc. The Agglomerative-Partitional Clustering (APC) methodology introduced here attempts to remedy this situation by joining together centroids that coexist within regions of relatively high density with line segments. Interconnected clusters are then modelled as the line segments rather than the original centroids, and all interconnected centroids are treated as a single cluster. In addition, APC allows the analyst to derive a hierarchical clustering tree based on inter-cluster density rather than distance. Performance comparisons with other clustering techniques are given.

1. Introduction

Regardless of how pattern partitioning is accomplished in cluster analysis¹, clusters are often modelled as the centroid of the observations composing the cluster. This approach to cluster analysis has two significant shortcomings. First, using a set of centroids to indicate the clustering of the data gives little information regarding the structure of the clusters themselves. For example, k-means [1] and moving methods [2,6] model clusters as centroids and cannot tell the analyst whether a given cluster is elongated, compact or of some other structure. Second, given that cluster centroids arrived at via many pattern partitioning algorithms such as k-means and moving methods correspond to clusters of roughly hyperspherical shape, subsequent analysis of how individual observations relate to their given cluster can be distorted if the actual structure of the clusters in the data is not of that form. If the clustering is being conducted as part of a feature extraction or data reduction operation, this can lead to a less than optimal representation of the structures within the data in the newly derived features.

An example of this can be seen in figs. 1(a) and 1(b), which display the k-means clustering of a two-cluster problem with k = 5 and k = 2. One cluster is roughly spherical and compact in shape, while the other has an elongated structure. While the k = 5 solution has adequately found the areas of high density, there is nothing in the k-means derived centroids to tell the analyst that the lower cluster is in fact a single cluster. The k = 2 solution appropriately treats the lower cluster as a single cluster, but on subsequent analysis will give misleading information regarding the relationship between individual observations and the cluster. This is illustrated in fig. 2. Given observations a and b, the analyst wants to know which observation is more "typical" of the cluster. Using the distance of the observations to the cluster centroid to make this decision, observation b would be judged as being more typical than observation a, as b is closer to the centroid. Assuming that the operating definition of a cluster is "a region of relatively high density", this is a misleading result, as observation a is situated in a much denser portion of the cluster. The k = 5 solution of fig. 1 would have correctly differentiated between the relative memberships of the two observations; in this case, however, the number of clusters in the data set has been overestimated.

¹ For general overviews of cluster analysis see [7], [8], or [9].
Fig. 1: Possible k-means solutions to a two cluster problem.

Fig. 2: Single centroid representation of clusters can be misleading.

This trade-off between overestimation of the number of clusters and adequate modelling of cluster structure is a direct result of modelling clusters as centroids. Unless all the clusters in the data set are roughly spherical in shape, as in the top of fig. 1, this dilemma will exist. There is a need, therefore, for alternative ways of modelling clusters that accommodate arbitrary cluster shapes. APC attempts to remedy this situation by joining together k-means (or otherwise derived) centroids with line segments when they coexist within regions of relatively high density. This not only allows the modelling of any cluster structure that can be approximated in a piecewise linear manner, but also allows more accurate modelling of how individual observations fit into the overall cluster structure of the data. In addition, APC preserves much of the speed and efficiency of pattern partitioning algorithms such as k-means and moving methods. Moreover, its performance in noisy conditions is very robust, and it incurs relatively low memory requirements, giving it a decided advantage over other commonly used hierarchical techniques such as single linkage and average linkage.

2. Overview of APC

Let X = {x_1, x_2, ..., x_n} be a set of d-dimensional data vectors, x_i = (x_{i1}, x_{i2}, ..., x_{id}). APC begins with an initial cluster analysis of the data. The goal at this point is to identify areas of relatively high density in the data; pattern partitioning techniques such as moving methods or k-means are ideally suited to this stage. Once the k clusters have been identified, each cluster is represented as a d-dimensional centroid,

$c_k = \frac{1}{p} \sum_{i=1}^{p} x_i$    (1)

where c_k is the centroid of cluster k and x_i ranges over the p observations belonging to cluster k.

APC then proceeds to estimate the density of the area between each pair of cluster centroids. If two centroids share an area of continuous high density, the two centroids are replaced with a line segment joining them. Assuming that the initial centroids found via the pattern partitioning span contiguous regions of high density, joining the centroids together and replacing them with a line segment gives a more accurate representation of the structures present within the data. The newly merged cluster is modelled as the line segment joining the centroids, and the distance of any observation from the cluster is defined as the distance between the observation and the line segment.

The density of the area between two centroids, c_1 and c_2, can be estimated as follows. First, provisionally connect the two centroids with a line segment. Next, find all observations that fall within a distance s_W of the line segment. Finally, calculate the density of the patterns falling within that distance. Effectively, the inter-cluster density is estimated from the distribution of observations that fall within a d-dimensional hypercylinder of radius s_W extending between the two centroids c_1 and c_2. If the density of the observations within this hypercylinder is above some threshold t (see below), the inter-cluster density of the two centroids is treated as being "uniform", and a new cluster, composed of the two original centroids, is defined as the line segment joining them.
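The paper gives no implementation, but the hypercylinder membership test can be sketched in a few lines. The following is a minimal Python sketch under our own assumptions (the function name, the NumPy dependency and the returned projected positions are ours, not the authors'):

```python
import numpy as np

def hypercylinder_members(X, c1, c2, s_w):
    """Observations of X lying inside the hypercylinder of radius s_w
    (the paper's s_W) between centroids c1 and c2, plus their projected
    positions in [0, 1] along the segment (used later for binning)."""
    seg = c2 - c1
    # Position of each observation's orthogonal projection along the
    # segment, expressed as a fraction of the segment length.
    pos = (X - c1) @ seg / np.dot(seg, seg)
    between = (pos >= 0.0) & (pos <= 1.0)
    # Perpendicular distance from each observation to the line.
    foot = c1 + np.outer(pos, seg)
    perp = np.linalg.norm(X - foot, axis=1)
    keep = between & (perp <= s_w)
    return X[keep], pos[keep]
```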
The simplest of the density estimation techniques is the use of histograms. To calculate the inter-cluster density, the line segment between the two centroids is divided into a number of sub-segments of length s_L, and the number of observations falling within each sub-segment is counted. As can be seen in fig. 3, each bin of the histogram corresponds to the number of observations found in each of the s_W × s_L sub-cylinders (actually, s_W × s_L rectangles in two dimensions). If the density in each of these blocks is above the threshold, the two centroids are joined together with the line segment.

Fig. 3: Inter-cluster density is estimated via a series of s_W × s_L blocks covering the inter-cluster area. The heavy dots are the k-means derived centroids.

The choice of the threshold t determines the minimum density of observations that must be present in each bin in order for the inter-cluster density to be treated as "uniform". Rather than use an absolute value (as this value will vary between data sets and between clusters within a data set), t is perhaps best set as a ratio of the lowest to the highest bin count found in the histogram: centroids are agglomerated where this ratio is above t. It is the relative density of observations found at the centroids and in the regions between them that is of importance, rather than the absolute density. Also note that by examining the value of t for each possible agglomeration of the centroids, a hierarchical structure of the data based on inter-cluster density can be produced. The analyst can then examine this hierarchy and decide on a suitable cut-off point or points.

Once all of the inter-cluster densities of the centroids have been estimated and all appropriate agglomerations of clusters have been completed, in subsequent analysis the distance of a given pattern from a merged cluster is the distance of the pattern from the line segment. If more than two centroids are linked together, the distance between a given observation and the string of continuously connected centroids (which are all treated as a single cluster) is defined as the distance between that observation and the closest line segment.
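Continuing the sketch, the histogram test and the distance from an observation to a string of connected centroids might look as follows. This assumes the lowest-to-highest bin-count reading of t described above and a simple, non-branching chain of centroids; the names are again ours:

```python
import numpy as np

def intercluster_density_ratio(pos, n_bins):
    """Bin the projected positions returned by hypercylinder_members
    into n_bins sub-segments (the paper's s_L blocks) and return the
    ratio of the lowest to the highest bin count; the two centroids
    are agglomerated when this ratio exceeds the threshold t."""
    counts, _ = np.histogram(pos, bins=n_bins, range=(0.0, 1.0))
    return 0.0 if counts.max() == 0 else counts.min() / counts.max()

def distance_to_chain(x, chain):
    """Distance from observation x to a merged cluster modelled as a
    chain of centroids: the minimum distance over its line segments."""
    best = np.inf
    for c1, c2 in zip(chain[:-1], chain[1:]):
        seg = c2 - c1
        # Clip the projection so distances are taken to the segment,
        # not the infinite line through the two centroids.
        pos = np.clip((x - c1) @ seg / np.dot(seg, seg), 0.0, 1.0)
        best = min(best, np.linalg.norm(x - (c1 + pos * seg)))
    return best
```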
3. Empirical Demonstrations of APC

3.1 APC vs. Standard Hierarchical Clustering

The following is a brief demonstration of APC's ability to model non-spherically shaped clusters in two dimensions. The performance of APC is compared with that of two standard agglomerative hierarchical methods: single linkage [4,3] and average linkage [5]. Two clustering problems were devised (fig. 4): a circular ring enclosing a Gaussian cluster, and a sine wave with Gaussian clusters on either side. In addition, a second set of clustering problems was generated by replacing 50% of the data in the first set with uniform noise (fig. 5). For the circular and sine wave data sets, APC was run with k = 10 and k = 7 respectively. K-means was seeded with randomly chosen patterns from the data sets, and cluster agglomeration was cut off at t = 0.3. For the standard hierarchical techniques, no cut-off was specified a priori; once the analysis was run, the cut-off was determined by visual inspection. The results are displayed in figs. 4 and 5.

The dashed lines show how single linkage and average linkage divided the data sets into two and three clusters on the ring and sine wave data respectively. The heavy dots and solid lines indicate the k-means derived centroids and how APC in turn agglomerated them into larger clusters. In the noise-free conditions (fig. 4), it can clearly be seen that APC and single linkage performed equally well. Both captured the basic structure of the data by separating the ring from the Gaussian cluster and the sine wave from the pair of Gaussian clusters. Average linkage, however, performed quite poorly, owing to the tendency of the algorithm to model clusters as small, compact and spherical in shape. Only at the point of dividing the data set into 10 different clusters did average linkage stop treating the central region and parts of the ring as having the same cluster membership. A similar result occurred with the sine wave data.

Under the noisy conditions (fig. 5), average linkage again performed poorly, producing results similar to before. Single linkage was also completely unable to model the noisy data; apparently it found two close points in the noisy region and began agglomerating from there. It is interesting to note that single linkage's tendency to produce long chaining clusters, which allowed it to perform well in the noise-free conditions, was the probable cause of its poor performance in the noisy conditions: the lack of clear distinction between clusters caused single linkage to chain points together without any regard to the actual structure of the data. APC, on the other hand, performed quite well on both noisy data sets. Although it did agglomerate some of the noisy patterns into a single cluster, it extracted the ring, sine and Gaussian clusters correctly.

Fig. 4: Noise-free data sets. The APC clustering solution is represented as the heavy dots and lines. Single linkage and average linkage solutions are represented with dashed lines.

Fig. 5: 50% noise data sets. The APC clustering solution is represented as the heavy dots and lines. Single linkage and average linkage solutions are represented with dashed lines.
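To make the two stages concrete, the ring experiment of this section can be mocked up end to end with the sketch functions above. Everything below is illustrative, not the authors' procedure: scikit-learn's KMeans stands in for the paper's k-means stage, and the data generation, s_w value, bin count and all-pairs merge test are our own choices, as the paper does not specify them.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# A synthetic stand-in for the first test problem: a circular ring
# enclosing a Gaussian cluster.
angles = rng.uniform(0.0, 2.0 * np.pi, 500)
ring = np.c_[np.cos(angles), np.sin(angles)] + rng.normal(0.0, 0.05, (500, 2))
blob = rng.normal(0.0, 0.1, (300, 2))
X = np.vstack([ring, blob])

# Stage 1: pattern partitioning with k-means, k = 10 as in the text.
centroids = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X).cluster_centers_

# Stage 2: agglomerate centroid pairs whose inter-cluster density
# ratio exceeds the cut-off t = 0.3 used in the experiments.
edges = []
for i, j in combinations(range(len(centroids)), 2):
    _, pos = hypercylinder_members(X, centroids[i], centroids[j], s_w=0.1)
    if len(pos) and intercluster_density_ratio(pos, n_bins=5) >= 0.3:
        edges.append((i, j))

print(edges)  # centroid index pairs joined by line segments
```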
3.2 A Real World Example

Hierarchical clustering based on inter-cluster density as opposed to distance can produce useful insights into data. A simple example of this can be seen when running APC on some US census data². The data (6551 observations) consisted of 5 variables considered good indicators of annual income. Table 1 shows the k-means derived centroids (k = 4) found on these five variables. At the bottom of the table is the ratio of the number of observations in each cluster whose income is greater than $50,000 to those whose income is less than $50,000. The k-means analysis produced two clusters (1 and 2) corresponding overwhelmingly to people with high and low incomes respectively; the other two clusters are somewhat mixed. APC was then applied to the four centroids (t = 0.3), which correspondingly agglomerated the two mixed clusters. This is an interesting result because, had one agglomerated the clusters based on inter-cluster distance (Euclidean or city block), clusters 3 and 2 would have been treated as being more alike than 3 and 4. However, based on the ratios of the number of people earning more than $50K to those earning less, clusters 3 and 4 are certainly more similar, at least in terms of income profile. APC confirmed that in the data space, clusters 3 and 4 occupy a continuous region of relatively high density; based on a definition of a cluster as a region of relatively high density, these two clusters should be treated as being more similar. Although this example is somewhat trivial, it does suggest that there is a role in cluster analysis for examining cluster agglomerations based on density.

² "Adult" data set from the UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html

Cluster Num.            1       2       3*      4*
Education               0.364   0.139   0.523   0.290
Age                     0.690   0.525   0.515   0.718
Capital Gains           0.377   0.007   0.013   0.009
Capital Losses          0.336   0.003   0.001   0.001
Weekly Working Hrs.     0.461   0.359   0.373   0.468
ratio: >50K / <50K      2.96    0.07    0.30    0.52

Table 1. APC agglomeration of k-means derived centroids on US census data. * indicates centroids agglomerated together by APC.

4. Summary

APC has the following advantageous properties: the ability to model clusters of arbitrary shape, reasonable computational speed, low memory requirements (no proximity matrix need be calculated or stored) and robust performance under noisy conditions. The use of line segments as opposed to centroids allows it to model clusters of almost any shape. Although not demonstrated here, this approach may provide a superior data reduction/feature extraction technique in situations where the data are composed of nonlinear spatial structures that cannot be dealt with properly by linear techniques such as principal component analysis or factor analysis.

References

[1] Forgy, E. W. (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics Society Meetings, Riverside, California (abstract in Biometrics, 21, 3, 768).
[2] Ismail, M. A. and Kamel, M. S. (1989) Multidimensional data clustering utilising hybrid search strategies. Pattern Recognition, 22, 75-89.
[3] McQuitty, L. L. (1957) Elementary linkage analysis for isolating orthogonal and oblique types and typal relevancies. Educational and Psychological Measurement, 17, 207-229.
[4] Sneath, P. H. A. (1957) The application of computers to taxonomy. Journal of General Microbiology, 17, 201-226.
[5] Sokal, R. R. and Michener, C. D. (1958) A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin, 38, 1409-1438.
[6] Zhang, Q. and Boyle, R. D. (1991) A new clustering algorithm with multiple runs of iterative procedures. Pattern Recognition, 24, 9, 835-848.
[7] Anderberg, M. R. (1973) Cluster Analysis for Applications. Academic Press, New York.
[8] Everitt, B. (1993) Cluster Analysis, 3rd ed. Heinemann Educational, London.
[9] Gordon, A. D. (1981) Classification: Methods for the Exploratory Analysis of Multivariate Data. Chapman and Hall, London.