Modelling Clusters of Arbitrary Shape with Agglomerative Partitional Clustering
Statistical Techniques in Pattern Recognition
Prague, Czech Republic
June, 1997
Eric W. Tyree and J. A. Long
Department of Business Computing Systems
City University
Northampton Square
London EC1V 0HB
United Kingdom
Tel. +44 (0171) 477-8551
e-mail: [email protected]
Abstract
A problem with modelling clusters as d-dimensional centroids is that centroids convey little information about cluster shape, e.g. whether a cluster is elongated, circular or irregular. The Agglomerative-Partitional Clustering (APC) methodology introduced here attempts to remedy this situation by joining centroids that coexist within regions of relatively high density with line segments. A merged cluster is then modelled as the connecting line segments rather than as the original centroids, and all interconnected centroids are treated as a single cluster. In addition, APC allows the analyst to derive a hierarchical clustering tree based on inter-cluster density rather than distance. Performance comparisons with other clustering techniques are given.
1. Introduction
Regardless of how pattern partitioning is accomplished in cluster analysis¹, clusters are often modelled as the centroid of the observations composing the cluster. This approach to cluster analysis has two significant shortcomings. First, using a set of centroids to indicate the clustering of the data gives little information regarding the structure of the clusters themselves. For example, k-means [1] and moving methods [2,6] model clusters as centroids and cannot tell the analyst whether a given cluster is elongated, compact or of some other structure. Second, given that cluster centroids arrived at via many pattern partitioning algorithms such as k-means and moving methods correspond to clusters of roughly hyperspherical shape, subsequent analysis of how individual observations relate to their given cluster can be distorted if the actual structure of the clusters in the data is not of that form. If the clustering is being conducted as part of a feature extraction or data reduction operation, this can lead to a less than optimal representation of the structures within the data in the newly derived features.
¹ For general overviews of cluster analysis see [7], [8], or [9].
An example of this can be seen in figs. 1A and 1B, which display the k-means clustering of a two-cluster problem with k = 5 and k = 2. One cluster is roughly spherical and compact in shape, while the other has an elongated structure. While the k = 5 solution has adequately found the areas of high density,
there is nothing in the k-means derived centroids to tell the analyst that the lower cluster is in fact a
single cluster. The k = 2 solution appropriately treats the lower cluster as a single cluster, but on
subsequent analysis will give misleading information regarding the relationship between individual
observations and the cluster. This is illustrated in Fig. 2. Given observations a and b, the analyst wants
to know which observation is more "typical" of the cluster. If they were to use the distance of the
observations to the cluster centroid to make this decision, observation b would be judged as being more
typical than observation a as b is closer to the centroid. Assuming that the operating definition of a
cluster is "a region of relatively high density", this is a misleading result as observation a is situated in
a much more dense portion of the cluster. The k = 5 solution in fig. 1(b) would have correctly differentiated between the relative memberships of the two observations; however, in this case the number of clusters in the data set has been overestimated.
Fig. 1: Possible k-means solutions to a two-cluster problem.
Fig. 2: Single-centroid representation of clusters can be misleading (observations A and B and their distances to the centroid).
This trade-off between overestimation of the number of clusters and adequate modelling of cluster structure is a direct result of modelling clusters as centroids. Unless all the clusters in the data set are roughly spherical in shape, as at the top of fig. 1, this dilemma will exist. There is a need, therefore, for alternative ways to model clusters that accommodate arbitrary cluster shapes. APC attempts to remedy this situation by joining together k-means or otherwise derived centroids with line segments when they coexist within regions of relatively high density. This not only allows the modelling of any cluster structure that can be approximated in a piecewise linear manner, but also allows more accurate modelling of how individual observations fit into the overall cluster structure of the data. In addition, APC preserves much of the speed and efficiency of pattern partitioning algorithms such as k-means and moving methods. Moreover, its performance in noisy conditions is very robust and it incurs relatively low memory requirements, giving it a decided advantage over other commonly used hierarchical techniques such as single linkage and average linkage.
2. Overview of APC
Let X = (x1, x2, ..., xn) be a set of d-dimensional data vectors, with each xi = (xi1, xi2, ..., xid). APC begins with an initial cluster analysis of the data. The goal at this point is to identify areas of relatively high density in the data. Pattern partitioning techniques such as moving methods or k-means are ideally suited for this stage. Once the k clusters have been identified, each cluster is represented as a d-dimensional centroid,

ck = (1/p) Σ_{i=1..p} xi        (1)
where ck is the cluster centroid, p is the number of observations in cluster k, and xi is an observation belonging to cluster k. APC then proceeds to estimate the density of the area between each pair of cluster centroids. If two centroids share an area of
continuous high density, the two centroids are replaced with a line segment joining the two original
centroids. Assuming that the initial centroids found via the pattern partitioning span contiguous regions
of high density, joining the centroids together and replacing them with a line segment will give a more
accurate representation of the structures present within the data. The newly merged cluster is now
modelled as the line segment joining the centroids and the distance of any observation from the cluster is
defined as the distance between the observation and the line segment.
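For concreteness, the distance from an observation to the line segment joining two centroids can be computed as in the following sketch (Python/NumPy; the function name and structure are my own and not from the paper):

```python
import numpy as np

def point_to_segment_distance(x, a, b):
    """Distance from observation x to the line segment joining centroids a and b.

    Illustrative sketch of APC's cluster-distance model; not code from the paper.
    """
    x, a, b = np.asarray(x, float), np.asarray(a, float), np.asarray(b, float)
    ab = b - a
    # Projection parameter of x onto the line through a and b, clamped to [0, 1]
    # so that the nearest point lies on the segment itself.
    t = np.clip(np.dot(x - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    nearest = a + t * ab
    return np.linalg.norm(x - nearest)
```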
The density of the area between two centroids, c1 and c2, can be estimated as follows: First,
provisionally connect the two centroids with a line segment. Next, find all observations that fall within a
distance of sW of the line segment. Finally, calculate the density of the patterns that fall within the
distance sW from the line segment.
Effectively, the inter-cluster density is an estimate of the distribution of observations that fall within a d-dimensional hypercylinder of radius sW extending between the two centroids, c1 and c2. If the density of the observations within this hypercylinder is above some threshold, t (see below), the inter-cluster density of the two centroids is treated as being "uniform". At this point, a new cluster composed of the two original centroids is defined as the line segment joining them.
The simplest of the density estimation techniques is the use of histograms. To calculate the inter-cluster density, the line segment between the two centroids is divided into a number of sub-segments of length sL, and the number of observations falling within each sub-segment is counted. As can be seen in fig. 3, each bin of the histogram corresponds to the number of observations found in each of the sW × sL sub-cylinders (actually, sW × sL rectangles in 2 dimensions). If the density in the rectangular blocks is above the threshold, the two centroids are joined together with the line segment.
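A minimal NumPy sketch of this histogram estimate follows, covering both the hypercylinder filter and the binning; only the parameter names sW and sL come from the paper, while the array handling and edge cases are my own assumptions:

```python
import numpy as np

def inter_cluster_histogram(X, c1, c2, s_w, s_l):
    """Count observations in the sW x sL blocks along the segment joining c1 and c2.

    Illustrative sketch of the histogram-based density estimate described in the
    text; everything beyond the parameter names sW (s_w) and sL (s_l) is assumed.
    """
    X = np.asarray(X, float)
    c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
    seg = c2 - c1
    length = np.linalg.norm(seg)
    # Position of each observation along the segment (0 at c1, 1 at c2) and its
    # perpendicular distance from the line supporting the segment.
    t = (X - c1) @ seg / (length ** 2)
    proj = c1 + np.outer(t, seg)
    radial = np.linalg.norm(X - proj, axis=1)
    # Keep only points inside the hypercylinder of radius sW between the centroids.
    inside = (t >= 0.0) & (t <= 1.0) & (radial <= s_w)
    # Divide the segment into sub-segments of length sL and count points per bin.
    n_bins = max(1, int(np.ceil(length / s_l)))
    counts, _ = np.histogram(t[inside] * length, bins=n_bins, range=(0.0, length))
    return counts
```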
Fig. 3: Inter-cluster density is estimated via a series of sW × sL blocks covering the inter-cluster area. The heavy dots are the k-means derived centroids.
The choice of the threshold t determines the minimum density of observations that must be present in each bin in order for the inter-cluster density to be treated as "uniform". Rather than use an absolute value (as this value will vary between data sets, and between clusters within a data set), t is perhaps best defined in terms of the ratio between the highest and lowest bin counts found in the histogram; in other words, agglomerate centroids where this ratio is above t. It is the relative density of observations found at the centroids and in the regions between them that is of importance, rather than the absolute density. Also note that by examining the value of this ratio for each possible agglomeration of the centroids, a hierarchical structure of the data based on inter-cluster density can be produced. The analyst can then examine this hierarchy and decide on a suitable cut-off point or points.
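One possible encoding of this rule is sketched below. It reads the ratio as lowest-to-highest bin count, so that a perfectly uniform inter-cluster region scores 1.0 and an empty gap scores 0.0, which is consistent with the cut-off t = 0.3 used in section 3; this reading, and the function names, are my own assumptions:

```python
import numpy as np

def bin_count_ratio(counts):
    """Ratio of the sparsest to the densest histogram bin (interpreted as min/max)."""
    counts = np.asarray(counts, float)
    if counts.max() == 0:
        return 0.0
    return counts.min() / counts.max()

def should_agglomerate(counts, t=0.3):
    """Treat the inter-cluster density as 'uniform' and join the two centroids
    with a line segment when the bin-count ratio exceeds the threshold t."""
    return bin_count_ratio(counts) > t
```

Recording this ratio for every candidate pair of centroids, and then varying the cut-off applied to it, would be one way of obtaining the density-based hierarchy mentioned above.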
Once all of the inter-cluster densities of the centroids have been estimated and all appropriate agglomerations of clusters have been completed, the distance of a given pattern from the new cluster is, in subsequent analysis, the distance of the pattern from the line segment. If more than two centroids
are linked together, the distance between a given observation and the string of continuously connected
centroids (which are all treated as a single cluster) is defined as the distance between that observation
and the closest line segment.
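Assuming, for illustration only, that the interconnected centroids are stored as an ordered chain, the distance from an observation to such a merged cluster could be computed as the minimum over its segments:

```python
import numpy as np

def distance_to_chain(x, chain):
    """Distance from observation x to a cluster modelled as a chain of connected
    centroids: the minimum distance to any of the line segments in the chain.

    Sketch only; `chain` is assumed to be an ordered (m, d) array of centroids,
    which is a simplification of how interconnected centroids may be stored.
    """
    x = np.asarray(x, float)
    chain = np.asarray(chain, float)
    best = np.inf
    for a, b in zip(chain[:-1], chain[1:]):
        ab = b - a
        t = np.clip(np.dot(x - a, ab) / np.dot(ab, ab), 0.0, 1.0)
        best = min(best, np.linalg.norm(x - (a + t * ab)))
    return best
```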
3. Empirical Demonstrations of APC
3.1 APC vs. Standard Hierarchical Clustering
The following is a brief demonstration of APC's ability to model non-spherically shaped clusters in two dimensions. The performance of APC is compared with that of two standard agglomerative hierarchical methods: single linkage [4,3] and average linkage [5]. Two clustering problems were devised (fig. 4): a circular ring enclosing a Gaussian cluster, and a sine wave with Gaussian clusters on either side. In addition, a second set of clustering problems was generated by replacing 50% of the data in the first set with uniform noise (fig. 5). For the circular and sine wave data sets APC was run with k = 10 and k = 7 respectively. The k-means stage was seeded with randomly chosen patterns from the data sets. Cluster agglomeration was cut off at t = 0.3. For the standard hierarchical techniques, no cut-off was specified a priori; once the analysis was run, the cut-off was determined by visual inspection.
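For a rough impression of the test setup, the snippet below generates data in the spirit of the first problem, a circular ring enclosing a Gaussian cluster, with an optional uniform-noise condition; all sizes, radii and spreads are my own guesses, as the paper does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring_and_gaussian(n=500, noise_fraction=0.0):
    """Generate a ring enclosing a Gaussian cluster, optionally with a fraction
    of the points replaced by uniform noise. Sizes, radii and spreads are
    illustrative guesses; the paper does not give them.
    """
    n_ring = n // 2
    theta = rng.uniform(0.0, 2.0 * np.pi, n_ring)
    ring = np.c_[0.4 + 0.3 * np.cos(theta), 0.4 + 0.3 * np.sin(theta)]
    ring += rng.normal(scale=0.02, size=ring.shape)
    blob = rng.normal(loc=[0.4, 0.4], scale=0.05, size=(n - n_ring, 2))
    X = np.vstack([ring, blob])
    # Replace a fraction of the observations with uniform background noise,
    # as in the 50% noise condition of fig. 5.
    n_noise = int(noise_fraction * len(X))
    idx = rng.choice(len(X), size=n_noise, replace=False)
    X[idx] = rng.uniform(0.0, 0.8, size=(n_noise, 2))
    return X
```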
The results are displayed in figs. 4 and 5. The dashed lines show how both single linkage and average
linkage divided up the data set into two and three clusters on the ring and sine wave data sets
respectively. The heavy dots and solid lines indicate the k-means derived centroids and how APC in turn
agglomerated them into larger clusters.
In the noise-free conditions (fig. 4), it can be clearly seen that both APC and single linkage performed equally well. Both were able to capture the basic structure of the data by separating the ring from the Gaussian cluster and the sine wave from the pair of Gaussian clusters. Average linkage, however, performed quite poorly. This is due to the tendency of the algorithm to model clusters as being small, compact and spherical in shape. Only at the point of dividing the data set into 10 different clusters did average linkage stop treating the central region and parts of the ring as having the same cluster membership. A similar result occurred with the sine wave data.
Under the noisy conditions (fig. 5), average linkage again performed poorly, producing similar results to before. Single linkage was also completely unable to model the noisy data: apparently it found two close points in the noisy region and began agglomerating from there. It is interesting to note that single linkage's tendency to produce long, chaining clusters, which allowed it to perform well in the noise-free conditions, was the probable cause of its poor performance in the noisy conditions. The lack of clear distinction between clusters caused single linkage to chain points together without any regard for the actual structure of the data. APC, on the other hand, performed quite well in both noisy data sets. Although it did agglomerate some of the noisy patterns into a single cluster, it extracted the ring, sine and Gaussian clusters correctly.
Fig. 4. Noise-free data sets. The APC clustering solution is represented by the heavy dots and lines. The single linkage and average linkage solutions are represented with dashed lines.
Fig. 5. 50% noise data sets. The APC clustering solution is represented by the heavy dots and lines. The single linkage and average linkage solutions are represented with dashed lines.
3.2 A Real-World Example
Hierarchical clustering based on inter-cluster density, as opposed to distance, can produce useful insights into data. A simple example of this can be seen when running APC on some US census data². The data (6551 observations) consisted of 5 variables considered good indicators of annual income. Table 1 shows the k-means derived centroids (k = 4) found on these five variables. At the bottom of the table is given the ratio of the number of observations in each cluster whose income is greater than $50,000 to those whose income is less than $50,000. The k-means analysis produced two clusters (1 and 2) that overwhelmingly correspond to people with high and low incomes respectively. The other two clusters are somewhat mixed. APC was then applied to the four centroids (t = 0.3), which correspondingly agglomerated the two mixed clusters. This is an interesting result because, had the clusters been agglomerated on the basis of inter-cluster distance (Euclidean or city block), clusters 3 and 2 would have been treated as being more alike than 3 and 4. However, based on the ratios of the number of people earning more than $50,000 to those earning less, clusters 3 and 4 are certainly more similar, at least in terms of income profile. APC confirmed that in the data space, clusters 3 and 4 occupy a continuous region of relatively high density. Based on a definition of a cluster as a region of relatively high density, these clusters should be treated as being more similar. Although this example is somewhat trivial, it does suggest that there is a role in cluster analysis for examining cluster agglomerations based on density.
² "Adult" data set from the UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Cluster Num.              1        2        3*       4*
Education               0.364    0.139    0.523    0.290
Age                     0.690    0.525    0.515    0.718
Capital Gains           0.377    0.007    0.013    0.009
Capital Losses          0.336    0.003    0.001    0.001
Weekly Working Hrs.     0.461    0.359    0.373    0.468
ratio: >50K / <50K      2.96     0.07     0.3      0.52
Table 1. APC agglomeration of k-means derived centroids on US census data. * indicates centroids
agglomerated together by APC.
4. Summary
APC has the following advantageous properties: the ability to model clusters of arbitrary shape, reasonable computational speed, low memory requirements (i.e. no proximity matrix need be calculated or stored) and robust performance under noisy conditions. The use of line segments as opposed to centroids allows it to model clusters of almost any shape. Although not demonstrated here, this approach may provide a superior data reduction/feature extraction technique in situations where the data is composed of non-linear spatial structures that cannot be dealt with properly by linear techniques such as principal component analysis or factor analysis.
References
[1] Forgy, E. W. (1965) Cluster analysis of multivariate data: efficiency versus interpretability of
classifications. Biometrics Society Meetings, Riverside, California (Abstract in Biometrics, 21,
3, 768).
[2] Ismail, M. A. and Kamel, M. S. (1989) Multi-dimensional data clustering utilising hybrid search
strategies. Pattern Recognition, Vol 22, 75 - 89.
[3] McQuitty, L. L. (1957) Elementary linkage analysis for isolating orthogonal and oblique types
and typal relevancies. Educational and Psychological Measurement, 17, 207 - 229.
[4] Sneath, P. H. A. (1957) The application of computers to taxonomy. Journal of General
Microbiology, 17, 201 - 226.
[5] Sokal, R. R. and Michener, C. D. (1958) A statistical method for evaluating systematic
relationships. University of Kansas Science Bulletin, 38, 1409 - 1438.
[6] Zhang, Q. and Boyle, R. D. (1991) A new clustering algorithm with multiple runs of iterative
procedures. Pattern Recognition, Vol 24, No 9, 835 - 848.
[7] Anderberg, M. R. (1973) Cluster Analysis for Applications. Academic Press, New York.
[8] Everitt, B. (1993) Cluster Analysis. 3rd Ed., Heinemann Educational, London.
[9] Gordon A. D. (1981) Classification: Methods for the Exploratory Analysis of Multivariate Data.
Chapman and Hall, London.