Cluster Analysis for Large, High-Dimensional
Datasets: Methodology and Applications
by
Iulian V. Ilieș
A thesis submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy in Statistics
Approved, Thesis Committee:
Prof. Dr. Adalbert F.X. Wilhelm
Prof. Dr.-Ing. Lars Linsen
Prof. Dr. Patrick J.F. Groenen
Date of defense: December 1, 2010
School of Humanities and Social Sciences
Acknowledgements
I am grateful to my supervisor, Prof. Dr. Adalbert Wilhelm, for his trust and support.
Thank you for providing me with the scientific support I needed, and for allowing me so
much freedom of research.
I would further like to thank Prof. Dr. Lars Linsen and Prof. Dr. Patrick Groenen for
their support as members of my dissertation committee.
I am grateful to my collaborator Arne Jacobs, and to Prof. Dr. Otthein Herzog, for
their support and productive discussions.
I am also grateful to Ruxandra Sîrbulescu for her continuous support, both within
and outside of the academic environment.
This work was supported by grants WI1584/8-1 and WI1584/8-2 of the Deutsche
Forschungsgemeinschaft (DFG).
List of publications
Publications included in this thesis
Ilies, I., & Wilhelm, A. (2010). Projection-based partitioning for large, high-dimensional
datasets. Journal of Computational and Graphical Statistics, 19(2), 474-492.
(Reprinted here with permission from the Journal of Computational and Graphical Statistics.
Copyright 2010 by the American Statistical Association. All rights reserved.)
Ilies, I., & Wilhelm, A. (submitted). Simulating cluster patterns.
Ilies, I., Jacobs, A., Herzog, O., & Wilhelm, A. (in press). Combining text and image processing
in an automated image classification system. Computing Science and Statistics.
Reports partially included in this thesis
Ilies, I., Jacobs, A., Wilhelm, A., & Herzog, O. (2009). Classification of news images using
captions and a visual vocabulary. Technical Report No. 50, Universität Bremen, TZI,
Bremen, Germany.
Ilies, I. (2008). A divisive partitioning toolbox for MATLAB. Technical Report, Jacobs
University Bremen, Germany. Submitted to the 2009 Student Paper Competition of
the Statistical Computing and Statistical Graphics Sections of the ASA.
Other related publications
Ilies, I., & Wilhelm, A. (2008). Projection-based clustering for high-dimensional data sets.
COMPSTAT 2008: Proceedings in Computational Statistics. Heidelberg, Germany:
Physica Verlag.
Ilies, I., & Jacobs, A. (in press). Automatic image annotation through concept propagation. In
P. Ludes (Ed.), Algorithms of Power – Key Invisibles.
Jacobs, A., Herzog, O., Wilhelm, A., & Ilies, I. (2008). Relaxation-based data mining on images
and text from news web sites. Proceedings of IASC 2008.
Abstract
Cluster analysis represents one of the most versatile methods in statistical science.
It is employed in empirical sciences for the summarization of datasets into groups of similar
objects, with the purpose of facilitating the interpretation and further analysis of the data.
Cluster analysis is of particular importance in the exploratory investigation of data of high
complexity, such as that derived from molecular biology or image databases. Consequently,
recent work in the field of cluster analysis, including the work presented in this thesis, has
focused on designing algorithms that can provide meaningful solutions for data with high
cardinality and/or dimensionality, under the natural restriction of limited resources.
In the first part of the thesis, a novel algorithm for the clustering of large, high-dimensional datasets is presented. The developed method is based on the principles of
projection pursuit and grid partitioning, and focuses on reducing computational
requirements for large datasets without loss of performance. To achieve that, the algorithm
relies on procedures such as sampling of objects, feature selection, and quick density
estimation using histograms. The algorithm searches for low-density points in potentially
favorable one-dimensional projections, and partitions the data by a hyperplane passing
through the best split point found. Tests on synthetic and reference data indicated that the
proposed method can quickly and efficiently recover clusters that are distinguishable from
the remaining objects in at least one direction; linearly non-separable clusters were usually
subdivided. In addition, the clustering solution proved to be robust in the presence of
moderate levels of noise, and when the clusters were partially overlapping.
In the second part of the thesis, a novel method for generating synthetic datasets
with variable structure and clustering difficulty is presented. The developed algorithm can
construct clusters with different sizes, shapes, and orientations, consisting of objects
sampled from different probability distributions. In addition, some of the clusters can have
multimodal distributions, curvilinear shapes, or they can be defined only in restricted
subsets of dimensions. The clusters are distributed within the data space using a greedy
geometrical procedure, with the overall degree of cluster overlap adjusted by scaling the
clusters. Evaluation tests indicated that the proposed approach is highly effective in
prescribing the cluster overlap. Furthermore, it can be extended to allow for the production
of datasets containing non-overlapping clusters with defined degrees of separation.
In the third part of the thesis, a novel system for the semi-supervised annotation of
images is described and evaluated. The system is based on a visual vocabulary of prototype
visual features, which is constructed through the clustering of visual features extracted
from training images with accurate textual annotations. Consequently, each training image
is associated with the visual words representing its detected features. In addition, each such
image is associated with the concepts extracted from the linked textual data. These two sets
of associations are combined into a direct linkage scheme between textual concepts and
visual words, thus constructing an automatic image classifier that can annotate new images
with text-based concepts using only their visual features. As an initial application, the
developed method was successfully employed in a person classification task.
Table of contents
Acknowledgements ..... ii
List of publications ..... iii
Abstract ..... iv
Table of contents ..... v
I. Introduction ..... 1
Scope of this thesis ..... 3
II. Cluster analysis techniques for large, high-dimensional datasets ..... 4
1. State of the art ..... 4
1.1. Data summarization ..... 4
1.2. Data sampling ..... 5
1.3. Domain decomposition ..... 5
1.4. Space partitioning ..... 6
1.5. Projected clusters ..... 7
1.6. Mixture models ..... 8
1.7. Machine learning approaches ..... 8
2. Proposed method ..... 9
2.1. Theoretical background of the algorithm ..... 10
2.2. Variable selection based on multimodality likelihood ..... 11
2.3. Sampling-based determination of high variance components ..... 13
2.4. Smoothed histograms as density estimators ..... 14
2.5. Local minima scoring ..... 14
3. Practical implementation ..... 15
3.1. Graphical interface ..... 16
3.2. Partitioning parameters ..... 17
4. Experimental results ..... 18
4.1. Large, high-dimensional synthetic datasets ..... 19
4.2. Comparison with common approaches ..... 21
4.3. High-dimensional real datasets ..... 22
4.4. Theoretical and empirical algorithm complexity ..... 23
III. Generation of synthetic datasets with clusters ..... 25
1. State of the art ..... 25
2. Proposed method ..... 27
2.1. Dataset generation algorithm ..... 27
2.2. Construction of nonlinear clusters ..... 29
2.3. Construction of the dataset ..... 37
3. Practical implementation ..... 41
3.1. Configurable program parameters ..... 42
3.2. Generation of random numbers ..... 44
3.3. Construction of different types of clusters ..... 46
3.4. Computational complexity of the algorithm ..... 49
4. Experimental validation ..... 50
4.1. Cluster rotations ..... 53
4.2. Cluster overlap ..... 54
4.3. Timing performance ..... 55
IV. Applications of cluster analysis in automatic image classification ..... 60
1. Background ..... 60
2. Methodology and data ..... 61
2.1. Image classification system ..... 61
2.2. Data collection and preprocessing ..... 63
2.3. Visual vocabulary construction ..... 64
2.4. Forward concept propagation ..... 64
2.5. Reverse concept propagation ..... 66
2.6. Algorithm validation ..... 66
3. Experimental results ..... 67
3.1. Associations between visual words and concepts ..... 67
3.2. Differences between classification procedures ..... 68
3.3. Differences between visual vocabularies ..... 69
3.4. Differences between classifiers with an optimal visual vocabulary ..... 70
3.5. Differences between visual vocabularies with optimal size ..... 73
3.6. Summary ..... 74
V. Discussion ..... 76
1. Cluster analysis for high-dimensional datasets ..... 76
2. Generation of synthetic datasets ..... 77
3. Automatic image classification ..... 78
4. Future developments ..... 79
4.1. Projection-based partitioning algorithm ..... 79
4.2. Synthetic datasets generator ..... 80
4.3. Automatic image annotation system ..... 80
VI. References ..... 82
I. Introduction
One of the most extensively explored problems in statistical research is that of
cluster analysis, the unsupervised grouping of objects from a dataset according to their
similarity. Commonly, this summarization process has the purpose of facilitating the
interpretation and any subsequent analyses of the data. From an applied perspective, the
clustering task can be related to other research fields such as pattern recognition, data
compression, or density estimation (Bishop, 2006; Hastie et al., 2003). Consequently, a large
number of algorithms, originating from both statistics and computer science, have been
proposed over the years (Jain et al., 1999; Berkhin, 2006).
Following the recent technological progress, it is possible to produce ever-increasing amounts of data of high complexity (e.g. sales histories or molecular biology
data) (Hinneburg & Keim, 1998). This results in datasets consisting of millions of objects
with tens to hundreds of dimensions, which are difficult to analyze. This impediment is
particularly evident in the context of analyzing data collected from the World Wide Web
(e.g. document collections, image databases, or traffic referral data). These types of datasets
are large in two or three directions simultaneously, i.e. in terms of the number of objects,
the number of dimensions, and, in some situations, the number of clusters.
Most importantly, such data mining imposes severe computational constraints
(Berkhin, 2006). However, traditional cluster analysis algorithms such as k-means
(Hartigan & Wong, 1978) do not usually address the problem of processing large datasets
with a limited amount of resources (i.e. system memory and processing time).
Consequently, during the last twenty years there has been growing emphasis on
exploratory analysis in very large databases (VLDBs) (Zhang et al., 1997). Attempts to
extend standard clustering methods to this setting have focused on reducing the working
data by squashing or sampling (e.g. Guha et al., 1998; Ganti et al., 1999), and/or on requiring
only one data pass (incremental mining). The most notable data reduction algorithm is
BIRCH (Zhang et al., 1997), which summarizes the data into a height-balanced tree. The
basic implementation, running an agglomerative hierarchical procedure on the leaves of the
tree, is available as TwoStep clustering in recent versions of SPSS. Breunig et al. (2001)
modified the summarization procedure such that additional information is stored in the
leaves. This allows for the usage of more complex, density-based clustering procedures that
provide solutions of increased quality, such as the OPTICS algorithm (Ankerst et al., 1999).
More recently, Teboulle and Kogan (2005) proposed a three-step clustering procedure
consisting of BIRCH, PDDP (Boley, 1998), and a smoothed version of the k-means algorithm.
High dimensionality of data presents additional problems beyond the computational
complexity. On one hand, the effect of the “curse of dimensionality” (Huber, 1985; Aggarwal
et al., 2001) is observed: the data become sparse, and the concept of proximity loses
meaning in more than 15 dimensions. The Euclidean distance to the nearest objects
becomes of the same order as the distance to any other object, and the proportion of
populated regions decays rapidly, even for the coarsest space-partitioning grids (Hinneburg
& Keim, 1999). Interestingly, fractional distance metrics (Lp metrics with 0 < p < 1) seem to
provide more meaningful similarity measures (Aggarwal et al., 2001), but have not been
sufficiently explored so far. The likely reason is that such metrics retain the high
computational cost associated with calculating distances in high dimensionality. Indeed, the
FastMap algorithm (Faloutsos & Lin, 1995), which maps objects to a low-dimensional space
in an almost distance-preserving manner, has proven rather successful (e.g. Ganti et al.,
1999). On the other hand, the higher the number of attributes, the more likely it is that many
of them are irrelevant, in which case the clusters become impossible to find
(Berkhin, 2006). The apparent solution is to reduce the dimensionality of the data (for a
survey, see Becher et al., 2000). However, the direct application of feature transformation or
selection techniques is susceptible to problems. If there are numerous irrelevant
dimensions (i.e. very high noise level), the effectiveness of factor analysis is significantly
decreased (Parsons et al., 2004). Similarly, since clusters usually reside in different
subspaces, it is difficult to restrict the set of dimensions without pruning some that are
relevant to only a few of the clusters.
These problems motivated the development of a variety of subspace clustering
algorithms during the last decade (Kriegel et al., 2009). CLIQUE (Agrawal et al., 1998) and
its extension MAFIA (Nagesh et al., 2001) search for maximally connected dense
unions of (hyper-) rectangular cells in subspaces of increasing dimensionality. They use a
recursive bottom-up procedure, with higher-dimensional dense cells obtained by joining
lower-dimensional cells with common faces. ProClus (Aggarwal et al., 1999) and its
derivatives OrClus (Aggarwal & Yu, 2000) and DOC (Procopiuc et al., 2002) are partition-relocation methods that construct clusters as subset-subspace pairs rather than just
subsets. The required condition is that the projection of every subset into the corresponding
subspace constitutes a cluster with low internal average Manhattan distance. OptiGrid
(Hinneburg & Keim, 1999), its variation O-Cluster (Milenova & Campos, 2002), and CLTree
(Liu et al., 2000) partition the data set recursively by multi-dimensional grids passing
through low-density regions, thus constructing a tree of high-density projected clusters.
They are essentially extensions of the mode analysis approach (e.g. Hartigan, 1981) to a
high-dimensional context – cluster separation is done using density level sets and, similarly
to hierarchical methods, no prior knowledge on the number or geometry of the clusters is
required. More recently, several model-based subspace clustering methods have been
developed: Fern and Brodley (2003) proposed a cluster-ensemble approach which
combines mixture models obtained using different random projections of the data on low-dimensional spaces; Raftery and Dean (2006) designed a variable selection method that
finds an optimal low-dimensional subspace where the actual clustering is performed;
finally, the SSC algorithm (Candillier et al., 2005) drastically reduces the number of
parameters by assuming that all covariance matrices are diagonal.
Due to the wide variety of methods available, it is imperative to conduct a
thorough assessment of the strengths and weaknesses of the different algorithms, in
order to select the most appropriate method for each context. However, the evaluation of
clustering algorithms has often been criticized as either improper or insufficient, because
simplistic measures are used and few or no comparisons to other available methods are
performed (Kriegel et al., 2009). Moreover, many algorithms are only assessed on sample
empirical datasets with known classifications, despite the fact that such an approach has
limited generalizability (Maitra & Melnykov, 2010).
Scope of this thesis
The present thesis aims to develop improved methods for the clustering of high-dimensional datasets, as well as to explore further applications of such algorithms in practical settings.
In the first part of the thesis (see Chapter II), a more detailed review of the
representative clustering algorithms focused on the analysis of very large or high-dimensional datasets is provided. Subsequently, a newly developed method for this purpose
(Ilies & Wilhelm, 2010) is described and evaluated. The developed algorithm is based on the
principles of projection pursuit and grid partitioning, and focuses on reducing
computational requirements for large datasets without loss of performance.
In the second part of the thesis (see Chapter III), a novel method for generating
synthetic datasets with variable structure and clustering difficulty, aimed at the
evaluation of clustering algorithms, is presented.
In the third part of the thesis (see Chapter IV), the applications of cluster analysis to
the field of automatic image classification are investigated. A novel system for the semi-supervised annotation of images is described and evaluated. The system is based on a
vocabulary of clusters of visual features extracted from images with known classification.
The thesis concludes with a summary and discussion of the described methods in
the context of the current state of the art (see Chapter V). Possible directions for future
research are also discussed.
II. Cluster analysis techniques for large, high-dimensional datasets
1. State of the art
1.1. Data summarization
The algorithm BIRCH (Balanced Iterative Reducing and Clustering using
Hierarchies; Zhang et al., 1997) is a data reduction method developed for very large
datasets. It addresses the case when the amount of memory available is much smaller than
the size of the data, and it aims at minimizing the cost of input-output operations during
clustering. This is achieved by summarizing the data into a height-balanced tree, whose
nodes represent tight groups of objects; the tree is constructed with only one pass through
the data. For each node, the number of objects, their centroid, and the total sum of squares
are stored; these values are sufficient for calculating typical measures such as the distance
between two clusters or the intra-cluster variance. The final leaves are the input to a
clustering algorithm of choice. Additional data passes can be done to refine the solution (by
reassigning points to the best possible clusters), or to resolve potential irregularities (e.g.
identical points ending up in different clusters; this may happen since the initial tree
construction is data-ordering dependent). BIRCH is very versatile, and can be easily adapted
to suit the employed clustering method. For example, Breunig et al. (2001) modified the
summarization procedure such that additional information is stored in the leaves (e.g.
distance to the nearest neighbors). This permits the usage of more complex, density-based
clustering procedures, for instance their algorithm OPTICS (Ordering Points To Identify the
Clustering Structure; Ankerst et al., 1999), that provide solutions of increased quality as
compared to the default hierarchical methods. A more recent application was developed by
Teboulle and Kogan (2005), who proposed a three-step clustering procedure consisting of
BIRCH, PDDP (Boley, 1998), and a smoothed version of the k-means method. Their SMOKA
algorithm was shown to be particularly effective on collections of documents (Kogan, 2007).
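The stored summaries (object count, linear sum, and total sum of squares) act as sufficient statistics: typical clustering quantities can be computed from them without revisiting the raw objects, and two tree nodes can be merged by simply adding their summaries. The following MATLAB sketch is an illustration of this idea only, not the BIRCH implementation; all variable names are hypothetical.

    % Cluster features (CF) of two groups of objects: count, linear sum, sum of squares
    X1 = randn(500, 10);                 % first group (rows = objects)
    X2 = randn(300, 10) + 2;             % second group, shifted away from the first
    cf = @(X) struct('n', size(X,1), 'ls', sum(X,1), 'ss', sum(sum(X.^2)));
    cf1 = cf(X1);  cf2 = cf(X2);

    % Centroid distance between the two groups, computed from the CFs alone
    c1 = cf1.ls / cf1.n;   c2 = cf2.ls / cf2.n;
    centroidDist = norm(c1 - c2);

    % Intra-cluster variance (mean squared distance to the centroid) from one CF
    intraVar = cf1.ss / cf1.n - sum(c1.^2);

    % Merging two tree nodes only requires adding their CF components
    cfMerged = struct('n', cf1.n + cf2.n, 'ls', cf1.ls + cf2.ls, 'ss', cf1.ss + cf2.ss);

This additivity of the summaries is what allows the tree to be built and updated in a single pass through the data.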
BUBBLE (Ganti et al., 1999) is a more general data squashing method that was
developed for arbitrary metric spaces. In contrast to vector spaces, in this context it is not
possible to add or subtract objects (and hence to calculate centroids), however one can still
calculate distances. Similarly to BIRCH, the algorithm summarizes the data into a height-balanced tree whose leaves represent groups of similar objects. For each leaf, the algorithm
stores the number of contained objects, a centrally located object and several other
representatives, and the radius (the distance from the central object to most of the others).
The central objects of the final set of leaves are then subjected to a hierarchical
clustering algorithm of choice; all remaining data points are distributed into clusters based
on their leaf’s center. To deal with expensive distance functions (e.g. for strings, or in high
dimensional settings) more efficiently, Ganti et al. (1999) developed a variant of the method
that incorporates the FastMap algorithm (Faloutsos & Lin, 1995). Given a set of objects and
a distance function, this procedure returns (in linear time) an equally sized set of vectors in
a low-dimensional Euclidean space (typically fewer than 10 dimensions), with the property that the
distance between any two such image vectors is approximately equal to the distance
between the corresponding objects. This allows for an easy approximation of distances
when descending new objects into the tree, and reduces computation times significantly
with small impact on the overall performance.
1.2. Data sampling
The algorithm CURE (Clustering Using Representatives; Guha et al., 1998) has an
inbuilt sampling mechanism that makes it scalable to VLDBs. It is a modified agglomerative
hierarchical method of the nearest-neighbor type; in contrast to the standard single linkage
algorithm, it employs subsets of several well-scattered representatives rather than all
objects when deciding which clusters to merge. This provides CURE with robustness in the
presence of outliers, while allowing it to detect non-spherical clusters and to separate
correctly between clusters of different sizes and densities. Like other hierarchical methods,
CURE has quadratic complexity with respect to the number of objects, and hence direct
application on VLDBs would not be very efficient. Instead, the method relies on sampling:
the main procedure runs on a random sample of sufficient size such that all clusters are
represented (calculated via Chernoff bounds; Hoeffding, 1963). To further reduce
computations, this sample is split into several partitions that are pre-clustered to a certain
level. The resulting sets of small clusters are combined into one set, and then the merging
process continues until the desired number of clusters is achieved. Afterwards, the rest of
the data is distributed into clusters based on the nearest representatives.
1.3. Domain decomposition
When the number of expected clusters is very large (hundreds or thousands), an
interesting alternative approach to reducing computation times arises in the form of
problem (domain) decomposition: dividing the data into subsets, and then running a
clustering algorithm of choice only within these sets. This reduces significantly the number
of distance computations, and, provided that the domains are chosen in an intelligent way,
clustering accuracy can be preserved. When using an agglomerative underlying clustering
algorithm, the solution is identical if every extant cluster can be covered by one domain or a
connected set of domains (McCallum et al., 2000). Critical to this approach is that the initial
partitioning into domains is fast enough to offer computational advantages, and good
enough to preserve accuracy. McCallum et al. (2000) proposed an efficient procedure for
constructing overlapping domains, using an inexpensive distance metric, e.g. the number of
dimensions on which objects are closer than some threshold for numerical data, or the
number of common words for text data. Their method covers the data iteratively with
disjoint balls of a given radius (using an appropriate fast metric), enlarges the balls, and
employs them as domains.
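A minimal MATLAB sketch of this covering step is given below; the cheap metric (number of dimensions on which two objects differ by less than 0.5) and the two similarity thresholds are arbitrary illustrative choices, not the values used by McCallum et al. (2000).

    X = randn(5000, 20);                               % numerical data, rows = objects
    cheapSim = @(a, B) sum(abs(B - a) < 0.5, 2);       % # of dimensions on which objects are close
    tight = 15;  loose = 12;                           % similarity thresholds (tight > loose)
    remaining = 1:size(X, 1);
    domains = {};
    while ~isempty(remaining)
        center = X(remaining(1), :);                   % next uncovered object becomes a center
        domains{end+1} = find(cheapSim(center, X) >= loose);      % enlarged, overlapping domain
        keep = cheapSim(center, X(remaining, :)) < tight;         % drop objects inside the tight ball
        remaining = remaining(keep);
    end
    % Each cell of 'domains' can now be clustered independently with any algorithm.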
1.4. Space partitioning
CLIQUE (Clustering In Quest; Agrawal et al., 1998) is one of the first algorithms that
attempts to find clusters within subspaces of the data set. It searches for dense units
(elementary rectangular cells) of increased dimensionality via a recursive bottom-up
procedure, by joining lower dimensional units with common faces. To avoid an exponential
explosion of the search space, all subspaces that fall below a minimal coverage criterion (i.e.
containing few dense units) are pruned before increasing the dimension. In each selected
subspace, the algorithm tries to form clusters as maximally connected dense regions
(constructed with a greedy scheme). A different implementation (Cheng et al., 1999)
measures densities and coverage indirectly, via entropy. The end result is a series of cluster
systems in different subspaces, rather than a partitioning of the data; clusters may overlap
each other, and some objects may not belong to any cluster. The CLIQUE algorithm is rather
robust, being able to find clusters of various shapes and dimensionalities, provided that the
input parameters (initial grid size and density threshold) are well tuned to the data. Its
extension MAFIA (Merging of Adaptive Finite Intervals; Nagesh et al., 2001) removed this
parameter dependency by using adaptive grids and relative density criteria, while also
focusing on parallelization.
The algorithm OptiGrid (Optimal Grid Partitioning) was introduced by Hinneburg
and Keim (1999) as an extension to high dimensional data of their previously developed
density-based method DENCLUE (Hinneburg & Keim, 1998). The data is partitioned
recursively by multi-dimensional grids passing through low-density regions; only the highly
populated grid cells are further refined. The algorithm returns as clusters all final grid cells
with density above a certain noise threshold. Their algorithm is particularly interesting
since it requires no knowledge of the number, dimensionality, or relative density of the
clusters. The mechanism for selecting cutting points is however somewhat cumbersome. On
each selected projection (typically the coordinate axes), the algorithm calculates the object
densities (with kernel estimators), searches for the left- and rightmost density maxima that
are above a certain noise level, and then for the minimal density between these two. The
procedure was simplified by Milenova and Campos (2002) – their algorithm looks simply
for a pair of maxima with a minimum between them where the difference between bin
counts is statistically significant (assessed with the chi-squared test).
PDDP (Principal Direction Divisive Partitioning; Boley, 1998) is a partitioning
method developed for document collections. At each step, the current group of objects is
bisected by a hyperplane passing through its centroid and orthogonal to the first principal
component. The direction is calculated via singular value decomposition (SVD) of the zero-centered data matrix. To reduce computational demands when analyzing large datasets,
PDDP relies on a special SVD implementation that calculates only the first eigenvector. The
main shortfall of their approach is that the true clusters can be frequently fragmented, the
reason being that the splitting plane always passes through a fixed point.
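The PDDP bisection step itself is simple; a minimal MATLAB sketch of a single split (illustrative data, not Boley's optimized SVD implementation) is:

    X = [randn(400, 20); randn(400, 20) + 1.5];   % example data, rows = objects
    mu = mean(X, 1);
    Xc = X - mu;                                  % zero-centered data matrix
    [~, ~, v] = svds(Xc, 1);                      % first principal direction only
    scores = Xc * v;                              % projection onto that direction
    left  = X(scores <= 0, :);                    % hyperplane through the centroid,
    right = X(scores >  0, :);                    % orthogonal to the principal component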
1.5. Projected clusters
ProClus (Projected Clustering; Aggarwal et al., 1999) is an iterative algorithm for
finding projected clusters: subset-subspace pairs with the property that the projection of
the subset into the corresponding subspace constitutes a cluster with low internal average
Manhattan distance. At each step, the cluster subspaces are calculated, the objects are
reassigned to the nearest cluster based on within-subspace distances, and the cluster with
lowest quality (large intra-cluster average distance) is pruned and replaced with a new
random one. The selection is restricted to a set of possible medoids constructed greedily
before the iterative phase. The algorithm was extended by Aggarwal and Yu (2000) to
subspaces that are not necessarily parallel to the axes. A different alteration, by Kim et al.
(2004), partially addressed the difficulty in specifying the input parameters (the number of
clusters, and most importantly the average cluster dimensionality). They constructed a
heuristics-based method for calculating the number of associated dimensions of any given
medoid. A more interesting variation was developed by Procopiuc et al. (2002), who
improved the actual search mechanism. Instead of constructing all clusters at once, their
algorithm DOC (Density-based Optimal projected Clustering) builds clusters one at a time,
following a greedy scheme. DOC finds the best projected cluster by comparison with a
reference sample of the current data (extracted using Monte Carlo techniques). The
associated dimensions are those where the distance from the current center to the sample
is larger than a certain threshold, and the cluster contains all points in the corresponding
inner hypercube. After convergence, the process restarts using the remaining data.
The algorithm FINDIT (a Fast and Intelligent subspace clustering algorithm using
Dimension voting; Woo et al., 2004) represents a different approach to the dimension
selection process. They introduced a dimension-oriented similarity index for objects: the
total number of attributes on which the distance is lower than some threshold. This
measure is employed in determining the nearest neighbors of cluster medoids. In turn, the
thus selected neighbors determine the key dimensions of the clusters – each neighbor votes
for all attributes on which it lies close to the medoid, and then the most voted dimensions
are selected. The algorithm starts with the random selection of two samples: a
representative set that contains points from all clusters (of size calculated via Chernoff
bounds; Hoeffding, 1963), and a smaller set of possible medoids. For this, the algorithm
requires the minimal cluster size rather than the number of clusters. The algorithm then
finds the nearest neighbors of medoids from the distribution sample, calculates the
associated dimensions, and assigns all points in the sample to the nearest medoid. Next,
these clusters are merged until the minimal distance between them exceeds some user-defined threshold. The process is repeated for different values of the distance threshold
employed in the similarity measure, and the best solution is retained. Finally, the cluster
membership is extended in a natural way to the rest of the data.
1.6. Mixture models
The SSC method (Statistical Subspace Clustering; Candillier et al., 2005) represents a
probabilistic approach for finding projected clusters in high dimensional data. It uses the
expectation-maximization algorithm (EM; an iterative, two-step method for log-likelihood
optimization; Dempster et al., 1977) to find the best mixture-of-distributions model for the
data. The method supports both discrete and continuous attributes; data is assumed to
follow Gaussian distributions on the continuous dimensions, and multinomial distributions
on the discrete ones. To achieve linear dependence on dimensionality, the algorithm makes
the additional assumption that clusters follow independent distributions on each
dimension. Since the EM algorithm is very sensitive to the initial conditions, the clustering
procedure is run repeatedly with randomized starting points, and the best solution is kept.
To speed up the process, the standard stopping condition is replaced with a fuzzy k-means-like criterion (the most probable cluster for each object does not change), which guarantees
faster convergence. The EM phase is followed by a post-processing stage, where a minimal
description of the clusters (in terms of the associated dimensions) is derived as a set of
rules.
Other model-based approaches have focused on examining different subsets of
attributes in order to find an optimal representation of the entire set of clusters. Fern and
Brodley (2003) proposed a cluster ensemble approach which combines mixture models
obtained using different projections of the data. The underlying assumption is that different
projections uncover different parts of the structure in the data, and hence complement each
other. Consequently, their method performs several clustering runs, each consisting of a
random projection of the high dimensional data, followed by the clustering of the reduced
data using the EM algorithm. These results are aggregated in a similarity matrix, which is
then used to produce the final clusters via an agglomerative hierarchical procedure.
In a more rigorous approach, Raftery and Dean (2006) incorporated a variable
selection mechanism in the basic mixture-model clustering paradigm, by recasting the
problem of comparing two nested sets of attributes as a model comparison problem. Their
algorithm performs a greedy search to find the optimal low dimensional representation of
the clustering structure, starting with the pair of attributes that shows the most evidence of
multivariate clustering. Their method is therefore able to select dimensions, the number of
clusters, and the clustering model simultaneously. Regrettably, the approach is very slow,
requiring hours of processing for even moderately large datasets.
1.7. Machine learning approaches
The decision tree approach (Quinlan, 1986) is a supervised learning method for the
classification of multivariate data into known classes. The algorithm constructs a tree where
non-terminal nodes correspond to value tests on single attributes, while the leaves indicate
the class. In essence, it represents a partitioning of the data space in hyper-rectangular
regions, some of which correspond to the extant classes of objects, while others are empty.
The partitioning is done by a divide and conquer algorithm: at each step, the algorithm
chooses the best cut, following an information maximization criterion. CLTree (Clustering
based on decision Trees; Liu et al., 2000) is a modification of the basic algorithm, adapted
for the task of cluster analysis. Hypothetical uniformly distributed noise is added to the data, so
that the supervised decision tree algorithm can be applied to the unlabeled data. Furthermore, the
best cut is subject to the additional condition of passing through a low-density region. The
procedure continues until no improvement can be made, resulting in a complex tree that
often partitions the space more than necessary. To correct for that, a user-driven pruning of
the final leaves is performed. The process is controlled by two factors: the minimal cluster
size, and a minimal relative density for merging of adjacent leaves. While the final solution
is very sensitive with respect to these two parameters, the basic algorithm is fairly robust
and scales well with increasing number of objects or dimensions.
U*C (Ultsch, 2005) is an unsupervised clustering algorithm employing self-organizing maps (SOMs; Kohonen, 1990). Using an intrinsic system of
competitive generation and elimination of attraction centers (activated neurons excite their
neighbors and inhibit the rest), SOMs are able to group similar objects by mapping them to
neighboring units. The U*C algorithm distances itself from previous SOM-based methods
(see Wann & Thomopoulos, 1997, for a few examples) in that it does not attempt to directly
identify neurons with clusters. Their implementation employs a SOM with a large number of
units (typically in the range of thousands), which provides a topographical projection of the
high-dimensional input data onto the plane. For each unit, the algorithm calculates its
density (number of objects) and the distance to its neighbors (the average distance between
the objects mapped to it and those mapped to the neighboring units). This information can
be displayed as a two-dimensional “geographical” map, which allows for the immediate
identification of clusters. Clusters that are typically hard to distinguish (e.g. linearly non-separable
ones) are correctly recognized. Ultsch and Hermann (2006) developed an automatic method for
determining the number of clusters and their members.
2. Proposed method
We propose a projection-based clustering method following the fundamental
principles of grid partitioning, as outlined by Hinneburg and Keim (1999). Our algorithm
searches for low-density points (local minima) in selected one-dimensional projections, and
divides the current data by an orthogonal hyperplane passing through the best split point
found, if any. The process terminates when no more splits can be found for any of the extant
subsets (see schematic representation of the algorithm in Figure II-1). To find good
projections, we follow the heuristics of projection pursuit (Huber, 1985), locating the
directions with the highest variances via principal component analysis (PCA). To avoid
introducing noise from irrelevant attributes, the PCA is restricted to the subspace of
dimensions with possible multimodal distributions that are involved in the largest
correlations (see Sections II.2.2-2.3). If no split is found, the search is extended to the
coordinate projections, in decreasing order of likelihood of multimodality.
Aiming for a method truly applicable to large datasets, we focused on reducing both
memory and processor loads wherever possible. Solutions include a non-recursive
implementation of the algorithm (extant clusters are stored as a dynamic list of structures),
relying on sampling of objects and dimensions for quick finding of optimal projections, and
using average shifted histograms (ASH; Scott, 1992) for easy detection and scoring of local
minima. These procedures are presented in more detail below (Sections II.2.2 – II.2.5).
Similarly to other grid partitioning procedures (see Section II.1.4), our algorithm
does not require the user to specify the expected number or size of clusters, and therefore
constitutes a powerful tool for exploratory analysis. Such parameters are nevertheless
supported as optional inputs, and can influence the final solution (e.g. by prohibiting very
small clusters; note though that, due to methodological constraints, the minimal size
threshold is not absolute, and smaller clusters may still be produced when splitting larger
clusters). If the number of clusters is limited, the decision on which group to divide next is
based mainly on the quality of the split (Savaresi et al., 2002), with a logarithmic correction
term that favors very large clusters (more than 10000 objects) over smaller ones.
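A highly simplified, self-contained MATLAB sketch of such a non-recursive divisive loop is given below. It keeps the extant clusters in a dynamic list, but replaces the modality-guided projection search and ASH-based scoring described in Sections II.2.2-2.5 with a plain first-principal-component projection and a crude histogram valley test; it illustrates only the overall control flow, not the actual toolbox function.

    function labels = divisivePartitionSketch(X, maxClusters, minSize)
    % Illustrative divisive partitioning: split clusters at low-density points
    % on one-dimensional projections until no acceptable split remains.
    % Example: labels = divisivePartitionSketch(randn(2000, 10), 8, 50);
    labels = ones(size(X, 1), 1);
    queue = {(1:size(X, 1))'};                 % dynamic list of extant clusters (index sets)
    k = 1;
    while ~isempty(queue) && k < maxClusters
        idx = queue{1};  queue(1) = [];
        if numel(idx) < minSize, continue; end
        Xc = X(idx, :) - mean(X(idx, :), 1);
        [~, ~, v] = svds(Xc, 1);               % stand-in projection: first principal component
        p = Xc * v;
        [counts, edges] = histcounts(p, 30);   % stand-in density estimate: plain histogram
        inner = 2:numel(counts) - 1;
        [cmin, j] = min(counts(inner));        % deepest interior valley
        if cmin >= 0.5 * max(counts), continue; end   % no clear valley: leave this cluster alone
        cut = (edges(inner(j)) + edges(inner(j) + 1)) / 2;
        k = k + 1;
        labels(idx(p > cut)) = k;              % divide by the hyperplane through the cut point
        queue{end+1} = idx(p <= cut);          % both halves remain candidates for further splits
        queue{end+1} = idx(p >  cut);
    end
    end

In the actual algorithm, the order in which extant clusters are processed additionally depends on the split quality and on cluster size, as described above.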
Figure II-1: Schematic representation of the proposed partitioning
algorithm. Rectangular cells represent calculation steps, while hexagonal
cells represent conditional nodes. Dark cells mark the partition-finding
process: modality- and PCA-guided search for good projections, histogram-based search and scoring of split points, and cluster division following the
best split. This process is bypassed for clusters smaller than the user-defined
threshold, if any. The procedure continues until the maximum number of
clusters is reached, if specified, or until no more splits are found.
2.1. Theoretical background of the algorithm
The idea of partitioning objects using one-dimensional projections has been
employed in classification for a relatively long time. The decision trees approach (e.g.
Quinlan, 1986) splits the training data into subsets by value tests on single attributes,
essentially partitioning the data-space in hyper-rectangular regions that either correspond
to classes of objects or are empty. Recently, this principle has been imported in cluster
analysis (e.g. Liu et al., 2000) in the context of high-dimensional data. Here, the cutting
planes are drawn through low-density regions with increased likelihood of discriminating
between clusters. The success of this approach stems from the reliance on contracting
projections; these have the intrinsic property that the density at any point in the projected
space constitutes an upper bound for the densities at all points from the original space that
project onto it. Therefore, objects that constitute a cluster in a given subspace will also be
part of a cluster in any sub-subspace (Agrawal et al., 1998), and low-density regions that do
not lie at the borders of the projected space will be good candidates for cutting planes.
Furthermore, under the additional assumptions that clusters are spherical and of similar
sizes, the misclassification rate when using cutting planes parallel to the coordinate axes
decreases exponentially with cluster dimensionality (Hinneburg & Keim, 1999). In general,
denser areas are better preserved, while smaller or multimodal clusters will be subdivided,
and this fragmenting behavior is augmented if multiple cuts are performed at the same time
(Milenova & Campos, 2002).
While the framework outlined above does not make any assumption on the
projections other than that they are contracting, the employment of projections other than
the coordinates has not been explored in detail. Perhaps the only significant recent
application is PDDP (Boley, 1998), a document clustering procedure that splits the data
successively with a plane orthogonal to the principal component at its midpoint. The
shortfalls of the approach are immediately noticeable – since the splitting plane always
passes through a fixed point rather than through a low density point, fragmentation of
clusters can occur very frequently. Although the PDDP implementation is not optimal, the
guiding heuristic of PCA-driven projection (or projection pursuit; Huber, 1985) remains
valid. PCA provides the directions of highest variance (i.e. the natural axes) of the data via
an eigenvalue decomposition of the covariance or correlation matrix. Assuming that the
data contains a structure describable by only a few variables (e.g. a subspace cluster), and
that the observed attributes are linear combinations of these underlying variables and
noise, then leading principal components tend to pick projections with interesting
characteristics (e.g. good discrimination between clusters), while the noise is relegated to
the trailing components. Using correlations rather than covariances provides the advantage
that the solution is both balanced (all dimensions are treated equally) and scale-invariant
(since correlations are not affected by the rescaling of attributes), thereby significantly
diminishing the influence of noise variables with dominant variances. Additionally, the
decision on how many components to consider for further analysis is facilitated: since the
initial coordinates are all normalized to unit variance, one can simply select all components
that are equivalent to at least one of the original dimensions, up to some error (i.e. with
eigenvalues > 0.95 in our implementation).
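In MATLAB, this selection of candidate directions from the correlation matrix could be sketched as follows (illustrative data and variable names; not the thesis implementation):

    X = [randn(1000, 5) * randn(5, 8), randn(1000, 4)];   % correlated block plus pure noise attributes
    R = corrcoef(X);                                      % correlation matrix: balanced and scale-invariant
    [V, lambda] = eig(R, 'vector');
    [lambda, order] = sort(lambda, 'descend');
    V = V(:, order);
    keep = lambda > 0.95;                                 % components worth at least ~one original dimension
    Xs = (X - mean(X, 1)) ./ std(X, 0, 1);                % standardized coordinates (unit variance)
    projections = Xs * V(:, keep);                        % candidate projections for the split search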
2.2. Variable selection based on multimodality likelihood
To avoid introducing noise from irrelevant attributes, the search for projections
should be restricted to a relevant subset. Optimally, the selected subset would consist of
dimensions where at least one cluster is at least partially distinguishable from the rest of
the data, and would include sufficiently many dimensions to separate at least one cluster.
These conditions can be reduced to the simpler requirement that the selected dimensions
exhibit some degree of bi- or multimodality. In order to quickly quantify the potential
multimodality of each attribute, we defined a simple criterion based on low-order statistics.
We use the ratio between the standard deviation and the average absolute deviation,
\[
R = \frac{\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(x_i - \bar{x}\right)^2}}{\frac{1}{m}\sum_{i=1}^{m}\left|x_i - \bar{x}\right|},
\]
where x_1, ..., x_m denote the values of the current attribute on the m objects and \bar{x} their mean.
R is a kurtosis-like measure that is lower for symmetric bimodal distributions than for
unimodal ones. The value obtained is corrected for skewness by subtracting the fraction of
objects lying between the mean and the median. This second term can be computed in a
linear time, as half the absolute average value of the sign function over the centered data:
\[
S = \frac{1}{2}\left|\frac{1}{m}\sum_{i=1}^{m}\operatorname{sign}\left(x_i - \bar{x}\right)\right|.
\]
The difference R - S allows for an efficient modality likelihood assessment: most
multimodal distributions score below 1.2, with only uniform-like unimodal distributions
scoring in the same range (see Figure II-2 for several examples). Multimodal distributions
with large number of modes tend to score similarly to the uniform distribution, near 1.15.
While other, more exact modality tests exist (e.g. the Dip Test; Hartigan & Hartigan, 1985),
our criterion is overall faster to compute. Furthermore, the final result of the partitioning
procedure is unaffected by any accidental selection of unimodal distributions at this step,
since the algorithm will not be able to find split points within unimodal projections.
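A direct MATLAB transcription of this criterion (per attribute, objects in rows; illustrative code based on the formulas as reconstructed above, not the toolbox function) is:

    X = [[randn(500, 1); randn(500, 1) + 4], randn(1000, 1)];  % a bimodal and a unimodal attribute
    Xc = X - mean(X, 1);                                       % centered data
    R = std(X, 1, 1) ./ mean(abs(Xc), 1);     % std / average absolute deviation (kurtosis-like)
    S = 0.5 * abs(mean(sign(Xc), 1));         % fraction of objects between the mean and the median
    score = R - S;                            % values below ~1.2 suggest bi- or multimodality
    [~, order] = sort(score);                 % attributes ranked by multimodality likelihood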
Figure II-2: Attribute multimodality likelihood assessment. A: Location
of several univariate distributions in a pseudo kurtosis-skewness space.
Horizontal axis: the ratio between the standard deviation and the average
absolute deviation, a kurtosis-like measure. Vertical axis: the fraction of
objects lying between the mean and the median, a skewness-like measure.
The dashed segments mark several lines of equal multimodality likelihood
scores. Filled squares represent multimodal distributions, while filled circles
represent unimodal distributions. B – F: various bimodal distributions; their
density functions are depicted in the corresponding panels. 3M – a trimodal
distribution; B2 – Beta(2, 2); N – standard normal distribution; T – a t-distribution; U – uniform distribution; XP – exponential distribution.
Using this criterion, we split the set of attributes into three subsets: the first
contains all dimensions with scores below 1.2 that are involved in the highest correlations;
the second contains all dimensions with scores lower than that of the normal distribution (1.25)
that were not included in the first one; and the third contains all the remaining dimensions
(i.e. with scores > 1.25). The subspace defined by the first set of attributes is considered to
be the most favorable – here, the algorithm will look for splits along the largest principal
components (eigenvalues > 0.95), and also along the coordinates. If no suitable split is
found, the algorithm will search the coordinate projections from the second set, and, if
needed, from the third set.
2.3. Sampling-based determination of high variance components
In order to estimate correlations fast and efficiently, we resort to sampling: the
algorithm selects randomly (based on object index; for potential data ordering
dependencies, see Section V.1) a small group of objects, and calculates the correlations
between the attributes within this reduced set. To ensure that the error introduced by
sampling is low, we first examined the error values for different data sizes, sampling rates,
and correlation degrees, on random datasets with normal or uniform distributions (Figure
II-3A, B). The implemented sampling rate is a decreasing function of the current data size m
(see Figure II-3C). All objects are employed when the current data has low cardinality (m smaller than
approximately 6000 objects). The sampling rate decreases polynomially with rate 2/3 for
increasingly larger datasets. This choice guarantees that the median relative error is lower
than 5%, while the median absolute error is lower than 0.01, even if all correlations are low
(absolute value < 0.2). For the more interesting correlations in the upper range (absolute
value > 0.5), the errors fall below these thresholds in more than 95% of the cases.
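A MATLAB sketch of this sampling-based correlation estimate is shown below. The sampling-rate formula and the number of retained attributes are placeholders consistent with the description above (rate 1 below roughly 6000 objects, polynomial decay with exponent 2/3 above), not the precise values used by the algorithm.

    X = randn(200000, 40);                          % large dataset, objects in rows
    m = size(X, 1);
    rate = min(1, (6000 / m)^(2/3));                % assumed form of the sampling rate
    sampleIdx = 1:round(1 / rate):m;                % index-based subsample of the objects
    C = corrcoef(X(sampleIdx, :));                  % correlations estimated on the reduced set
    C(1:size(C, 1) + 1:end) = 0;                    % ignore the diagonal
    [~, ranked] = sort(max(abs(C), [], 2), 'descend');
    nKeep = min(size(X, 2), 10);                    % placeholder for the d-dependent cutoff
    selected = ranked(1:nKeep);                     % attributes involved in the largest correlations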
Figure II-3: Correlation estimation errors due to sampling. A, B: Median
relative errors (%) for low correlations (absolute value < 0.2; A) and large
correlations (absolute value between 0.5 and 0.8; B), for different data sizes
(horizontal axis; logarithmic scale) and sampling rates (vertical axis). The
error rates are represented as filled contour plots with logarithmic scales.
Each map is interpolated between 25 points; the values at each point are
summaries for 1000 random tests on uniformly and normally distributed
two-dimensional data (with no inner structure). C: Implemented sampling
rate, chosen such that the median relative errors are always less than 5%,
independent of the actual scale of the correlations.
Since directions of high variance lie in subspaces consisting of (strongly) intercorrelated dimensions, the PCA can be restricted to the attributes involved in the largest
correlations. Given that the sample size varies within the same run (clusters obtained at
different steps have different sizes), we decided to select a number of attributes dependent
on the total dimensionality d rather than the actual scale of the correlations. In the current
implementation, the number of retained dimensions is a function of d alone – corresponding
to no information loss for lower-dimensional datasets that can be processed fast, and
consistent with the general expectation that clusters in high-dimensional data reside in
low-dimensional subspaces.
2.4. Smoothed histograms as density estimators
In order to approximate the distribution functions of the objects on the different
one-dimensional projections in a fast way, we use ASHs (Scott, 1992, pp. 113-117). These
are a smoothed version of histograms, obtained by averaging several histograms of the
same bin size and different offsets. Alternatively, ASHs can be constructed as fine-gridded
histograms that are then smoothed with a moving-average filter. It is recommended (Scott,
1992, pp. 118-120) to set the narrow bin width such that the range of the data is covered by
50 to 500 bins (proportional to the size of the current data, m) with some additional empty
bins on both sides, while the averaging parameter should be at least three. Our algorithm calculates the number of narrow bins as the nearest integer to a slowly growing function of m, with an upper bound of 500, and uses a triangular moving-average filter of length 7 (corresponding to an averaging parameter of 4). With this choice of parameters, the bin width of the averaged histograms is equal to approximately 70% of the oversmoothing bin width h_OS (Scott, 1992, pp. 72-75), ensuring that major features of the density distribution are correctly represented in the ASH. Indeed, for distributions with small skewness coefficients (absolute value < 1), the resultant bin width lies within 30% of the optimal value h* with respect to the mean integrated square error (MISE) criterion (Scott, 1992, pp. 55-57) for small datasets (approximately 1000 objects; h* ≈ 3.49 σ m^(-1/3) for normal distributions), and within 30% of 2h* for very large datasets (approximately one million objects).
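A minimal MATLAB sketch of this construction is given below; the bin-count rule is a simplified cube-root rule capped at 500 bins, used here only to stand in for the exact formula.

    function [f, centers] = ashDensity(x)
    % Averaged shifted histogram of a one-dimensional projection x, built as a
    % fine-gridded histogram smoothed with a triangular moving-average filter.
        x = x(:);
        nBins = min(500, max(50, round(10 * numel(x)^(1/3))));  % simplified rule
        edges = linspace(min(x), max(x), nBins + 1);
        counts = histcounts(x, edges);           % fine-gridded histogram
        w = conv([1 1 1 1], [1 1 1 1]);          % triangular filter of length 7
        w = w / sum(w);                          % (averaging parameter 4)
        padded = [zeros(1, 3), counts, zeros(1, 3)];     % empty bins on both sides
        f = conv(padded, w, 'valid') / numel(x);         % smoothed relative frequencies
        centers = (edges(1:end-1) + edges(2:end)) / 2;
    end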
2.5. Local minima scoring
In their initial paper presenting the grid partitioning framework and the OptiGrid
method, Hinneburg and Keim (1999) suggested to detect potential splitting points by first
finding pairs of density maxima that are above a certain noise level, and then by finding the
minimal density between them (with the quality of the split proportional to the density at
that point). The presence of a noise level parameter makes their procedure difficult to use in
an efficient way, while the actual scoring is too simplistic since it ignores the geometry of
the maxima. Milenova and Campos (2002) improved the criterion by contrasting the bin
count at the minimum point with the average of the two maxima via a chi-square test, and
ranking split points by test significance. However, their system is still prone to errors: since
it does not take into account the width of the maxima, it is likely to rank highly those cutting
planes that separate distribution-tail noise or outliers from the rest of the data.
We further improved the scoring system by employing cumulative densities instead
of maximal values (as also suggested by Stuetzle (2003) for limiting the number of noise
clusters). Our criterion is very similar to the runt excess mass proposed by Stuetzle and
Nugent (2010): the score of each minimum point is defined as the geometric average
(rather than the minimum) of the excess masses (Hartigan, 1987) of the two resulting subclusters (i.e. the sums of bin parts above the trough level; Figure II-4A).
Furthermore, to ensure comparability between different clusters, we use
histograms based on relative frequency counts. This measure has a maximum value of 0.5
(complete separation between two equal groups), and penalizes asymmetric cases (e.g.
Figure II-4B, C), thus eliminating the need for a minimal value for local maxima. Our tests
suggest using values of 0.05-0.15 as minimal score thresholds, and hence as stopping
criteria (see also Figure II-4B-F). Values larger than 0.1 are more appropriate for data
with clusters of similar sizes, or when trying to minimize the effects of noise. If the data is
expected to contain a large number of clusters and/or clusters of highly variable sizes,
threshold values between 0.05 and 0.1 generally yield better results. Non-meaningful
fluctuations in the ASHs score just below 0.05.
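The following sketch illustrates the score for one candidate trough of such a relative-frequency ASH (f sums to one, and iMin is the index of the local minimum under consideration); it is a minimal rendering of the criterion, not the full scoring routine.

    function score = splitScore(f, iMin)
    % Score of a candidate splitting point at the local minimum f(iMin):
    % geometric mean of the normalized excess masses of the two resulting
    % groups, i.e. of the bin parts lying above the trough level.
        level = f(iMin);
        leftExcess  = sum(max(f(1:iMin)   - level, 0));
        rightExcess = sum(max(f(iMin:end) - level, 0));
        score = sqrt(leftExcess * rightExcess);   % at most 0.5 (two equal groups)
    end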
Figure II-4: Local minima scoring. A: The goodness-of-fit of a local
minimum as a splitting point is calculated as the geometric average of the
normalized excess masses of the two resulting subclusters (graphically
corresponding to the areas of the associated peaks lying above the trough
level). B – F: Scoring examples on several bimodal distributions with
variable degrees of separation and symmetry. Vertical lines mark the best
splitting points; the corresponding scores are printed above the graphs.
3. Practical implementation
The proposed clustering algorithm is implemented as a MATLAB function with four
input arguments and two outputs. The input arguments are: the dataset to be partitioned
(mandatory), the maximum number of clusters, the minimum cluster size, and the minimum
split threshold (Section II.2.5). The outputs are the list of cluster indices for each object and,
optionally, the decision tree constructed during the partitioning process. Additionally, we
have implemented a graphical interface for the algorithm (see Section II.3.1). The interface
allows for the specification of additional partitioning parameters (see Section II.3.2), and
also includes several visual tools that facilitate the exploration of the provided solutions.
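A typical call could look as follows; the function name projpart and the data file are purely illustrative, since only the order and meaning of the arguments are fixed by the description above.

    % Hypothetical call: partition dataset X (objects in rows) into at most 20
    % clusters of at least 50 objects each, with a minimum split score of 0.1.
    load('mydata.mat', 'X');                        % illustrative data file
    [labels, tree] = projpart(X, 20, 50, 0.1);      % illustrative function name
    disp(accumarray(labels(:), 1).');               % resulting cluster sizes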
3.1. Graphical interface
The graphical interface (Figure II-5) consists of five modules: two for data and parameter selection; two for visualizing the solution provided by the partitioning algorithm; and one for displaying the message log. Each panel is presented in more detail below. Contextual
help is available for every element of the interface in the form of tooltips.
Figure II-5: Graphical interface for the proposed partitioning
algorithm. The interface allows for the configuration of different
partitioning parameters. In addition, it includes several visualization tools.
The data selection panel, located in the upper left corner of the interface, allows the
user to select a dataset for partitioning from the variables extant in the global MATLAB
workspace. If the desired dataset has not been loaded yet, the panel provides the option of
browsing for data files on the local disk directly from the graphical interface. The program
will attempt to load MATLAB data files (“.mat”) and comma separated value files (“.csv”)
directly, and will open the data import wizard for other types of files.
The parameter setup panel is located in the center-left region of the interface. It
contains a variety of parameters that can influence the outcome of the partitioning process
(see Section II.3.2). To accommodate potential conflicts between the various parameters,
each of them is assigned a priority level. The priorities can be changed easily via the linked
slider bars located under the name of each parameter. If at any time during the partitioning
procedure a conflict occurs, the parameter with the highest priority is used.
The dendrogram panel, located in the upper right region of the interface, displays a
compact, horizontal version of the most recent partitioning tree (an example is included in
Figure II-5). For user reference, a copy of the tree is also saved in the main workspace. The
internodes are colored in black, while the final leaves are colored in red and are labeled
incrementally in the order in which they were produced. To prevent over-cluttering of the
image, the scale of the tree is fixed. If the tree does not fit completely in the figure, the side
scrollbars activate, allowing the user to explore the entire image.
The matrix panel is located in the lower right part of the interface. It contains a
representation of the clusters corresponding to the final leaves (see example in Figure II-5),
aimed at highlighting the cluster-subspace associations that occur in high dimensional
settings. Each cluster is represented by a row of colored squares (one for each dimension)
with fixed size. For each dimension, the corresponding color is proportional to the difference between the cluster mean and the grand mean on that dimension. Red hues denote positive differences, blue hues denote negative differences, while green hues represent differences close to zero. Intense blue or red colorings are indicative of strong cluster-dimension associations. The labels of the clusters are the same as in the dendrogram plot. If
the matrix is too large, the side scrollbars activate, thus enabling the complete visualization.
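The values underlying this color coding can be computed as in the short sketch below (mapping the differences to actual red-green-blue hues is omitted).

    % X: n-by-d dataset; labels: cluster index (1..k) of each object.
    % M(i,j) > 0 corresponds to red hues, M(i,j) < 0 to blue hues, and values
    % close to zero to green hues in the matrix panel.
    k = max(labels);
    grandMean = mean(X, 1);
    M = zeros(k, size(X, 2));
    for i = 1:k
        M(i, :) = mean(X(labels == i, :), 1) - grandMean;
    end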
The message log panel is located in the lower left corner of the interface. It displays
all error, warning, and informative messages generated at parameter setup or during the
partitioning process. Multiple selection of the printed text, both continuous (using the
SHIFT key) and discontinuous (using the CTRL key) is supported. Any selection is copied to
the system clipboard and can be pasted in a text editor of choice.
3.2. Partitioning parameters
The interface permits the definition of four different parameters that can control the
partitioning process: the split score; the number and size of clusters; and the partitioning
level. Each parameter is presented in more detail below. For every parameter, the user can
define both lower and upper bounds. A number of constraints stemming from the
interactions between the different parameters apply when defining multiple values (e.g. the
product of the minimum number of clusters and the minimum cluster size must not exceed
the total number of objects). Apart from these restrictions, any combination of values is
possible, as well as defining only certain parameters or none at all. The latter choice will
result in a complete dendrogram.
The expected number of clusters is a simple parameter that may be the easiest to
work with for less experienced users. Although usually associated with iterative methods
such as k-means or density-based algorithms, it can have a significant influence on the
outcome of partitioning methods as well. If the user has knowledge about the structure of
the data to be analyzed, for example from previous test runs, the solution can be greatly
improved by specifying a target number of clusters.
The partitioning level is perhaps the most commonly employed stopping criterion
for hierarchical methods. Setting the number of successive divisions provides a balanced
tree with respect to the number of branches, although it may have negative effects on the
size of the final leaves. The interface offers a more flexible approach, allowing the user to
specify both a minimal and a maximal value for this parameter. Used in conjunction with
restrictions on the size of the clusters, it can result in equilibrated solutions.
The cluster size is a more sophisticated parameter, in that it requires prior
information about the data for a successful usage. Setting a minimum threshold limits the
number of very small clusters that may just represent noise in the data. Correspondingly,
imposing a maximal value forces the partitioning algorithm to produce a more balanced
tree, with the objects more evenly distributed among the final clusters.
The split score, as defined in Section II.2.5, is a parameter measuring the quality of
the partitioning of any analyzed cluster. Despite having an unbalancing effect on the cluster
tree, defining a minimal and/or maximal value will usually lead to a more accurate solution.
Setting a lower bound guarantees that groups with no internal structure will not be
subdivided further. Correspondingly, specifying an upper bound guarantees that even the
finer structure in the data is revealed.
4. Experimental results
As a first assessment, we tested the proposed method on several simple datasets
with problematic configurations (several preliminary results using an early version of the
algorithm are reported in Ilies & Wilhelm, 2008). The results (see examples in Figure II-6)
indicated that our algorithm could detect clusters that were linearly separable and non-overlapping. If the former condition does not hold, any interlocked clusters are subdivided
into smaller groups, resulting in several clusters that are subsets of the actual clusters
(Figure II-6C). The algorithm effectively cuts away pieces from such clusters until the
remaining data is linearly separable. If the clusters are overlapping, then misclassifications
may occur in the overlap regions, or where the clusters are close enough to each other such
that the objects in the contact area cannot be assigned unambiguously (Figure II-6A, B).
In most cases, following the projection pursuit heuristic proves to be beneficial, and
allows for fewer misclassification errors compared to employing only the coordinate
projections, as OptiGrid does. Figure II-6A illustrates such a situation; using principal
component projections, the error rate is reduced from 6.3% to 1.6%. However, there are
situations when the principal components do not contain useful information (Chang, 1983)
– e.g. if there are many isotropically distributed clusters (Huber, 1985), or if the clusters are
oriented such that the correlations are close to zero (e.g. Figure II-6B). In such cases, the
inclusion of coordinate projections provides additional chances to differentiate between the
clusters; in the example shown in Figure II-6B, this additional step decreases the
misclassification rate from 3.9% to 1.5%. At worst, the proposed algorithm partitions the
data only with planes parallel to the coordinate axes, similarly to OptiGrid (Figure II-6B-C).
Figure II-6: Qualitative performance assessment. The clusters recovered
by the proposed method are represented in different hues and markers. A:
Since our algorithm considers the dominant principal component (diagonal
line) as a projection, it is able to recover all clusters. Overlapped clusters
(lower pair) are separated as best as possible; some misclassifications will
still occur in the overlap region. By contrast, OptiGrid shows reduced
performance, since it analyzes only coordinate projections (cutting planes
shown as dashed lines). B: Principal components are not always useful, and
any partitioning procedure relying only on PCA to find good projections may
produce inaccurate solutions (cutting planes shown as dashed lines). Our
method incorporates both coordinate and PCA projections, and is therefore
able to recover the clusters correctly. C: An intrinsic limitation of the
proposed approach is the subdivision of linearly non-separable clusters.
4.1. Large, high-dimensional synthetic datasets
We further analyzed the performance of our method on synthetic datasets of large
cardinality (10000-89000 objects, median = 45000) and dimensionality (10-50 attributes,
median = 30.5) (see example in Figure II-7D). The datasets were generated using an early
version of the algorithm presented in Chapter III. Objects were distributed into 2 to 20
(median = 12) ellipsoidal clusters of variable size (within 50% of average size), following
diagonal multivariate normal distributions. Each cluster was associated with a random
subspace of random size; on these selected attributes, the mean and standard deviation
were set at random values, while on all other dimensions the values were sampled from the
standard normal distribution, with mean zero and unit standard deviation (see e.g. Figure
6F). In half of the cases, up to 24% (median = 4%) noise objects, sampled from spherical
multivariate normal distributions with zero means and unit standard deviations on every
attribute, were appended to the data set. This generating procedure allowed us to create
clusters with various degrees of overlap, by varying the number of associated dimensions
(1-5, median = 3), and the distribution parameters on these attributes (cluster mean
ranging from 0.4 to 3.8, median = 1.8; standard deviation between 0.3 and 0.6, median =
0.4). To quantify the overlap, we used a purely geometrical measure. For each cluster, we
calculated the percentage of objects lying inside the convex hull of the inner 99% of the
remaining data, within the associated subspace. Thus, a 0% overlap corresponds to well-spaced, linearly separable clusters, while a 100% overlap indicates a rather continuous
distribution of objects. Every generated data set (total = 540) was used as input to our
method (Figure II-7D, E shows a typical example). We set the split threshold at 0.1, and left
both the maximal number and minimal size of the clusters undefined, in order to test
whether our algorithm can make appropriate stopping decisions.
We investigated the accuracy of the provided solutions by examining the
contingency of the initial (constructed) clusters and the ones found by our method. We
calculated the Adjusted Rand Index (ARI; Hubert & Arabie, 1985), a partition-similarity
measure that takes values between 0 (for random partitions) and 1 (for perfect agreement).
In addition, we computed the fractions of recovered clusters and of objects classified
correctly. An original cluster was said to be recovered as one of the clusters found by the
algorithm if their intersection contained at least 75% of the objects from either. The objects
belonging to such intersections were considered to be correctly classified.
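For reference, the ARI can be computed from the contingency table of the constructed and recovered partitions as in the following sketch (both label vectors are assumed to contain positive integer cluster indices).

    function ari = adjustedRandIndex(trueLabels, foundLabels)
    % Adjusted Rand Index (Hubert & Arabie, 1985) between two partitions,
    % computed from their contingency table.
        N = accumarray([trueLabels(:), foundLabels(:)], 1);   % contingency table
        n = sum(N(:));
        pairs    = sum(N(:) .* (N(:) - 1)) / 2;               % agreements in both
        rowPairs = sum(sum(N, 2) .* (sum(N, 2) - 1)) / 2;     % pairs in partition 1
        colPairs = sum(sum(N, 1) .* (sum(N, 1) - 1)) / 2;     % pairs in partition 2
        expected = rowPairs * colPairs / (n * (n - 1) / 2);
        ari = (pairs - expected) / ((rowPairs + colPairs) / 2 - expected);
    end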
Figure II-7: Performance on synthetic datasets. A – C: The Adjusted Rand
Index (A), the percentage of clusters recovered (B), and the percentage of
objects correctly classified (C), as functions of the average cluster overlap.
Error bars represent standard errors of the means (46-53 sets per point).
The datasets were generated as explained in Section II.4.1, with 10000-89000 objects, 10-50 dimensions, and 2-20 hyper-ellipsoidal clusters. D – E:
Typical example. The left panel (D) shows the constructed clusters, while the
right panel (E) shows the clusters uncovered by our method. Corresponding
clusters are linked with arrows. Note that the noise objects (bottom group in
panel E) were also recovered as an independent cluster. Color map: -4
(black) to 4 (white). F: Two-dimensional projection of the example dataset
within the definition subspace of one of the clusters (marked with a star in
panel E). Gray squares mark objects from the selected cluster, while black
dots correspond to all other objects (cluster overlap = 10%).
Results (see Figure II-7A-C) revealed that clusters satisfying the most basic
differentiation criterion – being distinguishable from noise (or other clusters) on at least
one dimension – were correctly recovered. If the average degree of overlap was less than
25%, almost all clusters were retrieved (including the group of noise objects, if present),
and the ARI was larger than 0.9, while for an overlap of up to 50%, at least 65% of the
clusters were found on average, with corresponding ARI values of 0.6-0.9 (Figure II-7A, B).
In contrast, at high overlap degrees (more than 85%), very few of the modes in the data
were uncovered, as expected given that the original clusters were merged by construction.
The fraction of objects correctly classified (Figure II-7C) showed a similar pattern,
albeit with somewhat lower absolute values. The misclassifications originated mostly from
the overlap regions, where objects were assigned to the different clusters based on their
position with respect to the minimal density cuts. A secondary effect was that the clusters
found could have different sizes than the initial ones (e.g. the uppermost cluster in Figure II-7D,
E), since intersection areas were redistributed unevenly, or even separated as distinct
clusters (e.g. lowest cluster in Figure II-7E). At low and medium degrees of overlap, the
provided solution contained a significant number of small clusters (up to 50% of the total),
most likely the result of noise, outliers, or cluster overlapping.
4.2. Comparison with common approaches
To assess the performance of our MATLAB-implemented method relative to the
already available algorithms, we compared it to several other approaches, which are either
commonly used or considered to be very effective: k-means and k-medians (vectorized
MATLAB versions); average linkage (MATLAB implementation); the BIRCH-based TwoStep
cluster analysis from SPSS; and the model-based clustering Mclust (Fraley & Raftery, 1999)
from R. We included k-medians in addition to k-means since it should perform better in a
high-dimensional context due to its reliance on Manhattan distances (Aggarwal et al., 2001).
We tested the selected methods on ten moderately large synthetic datasets consisting of
6000-16500 objects (median = 10400) defined in 10-20 dimensions (median = 18). Objects
were grouped into 3-6 clusters (median = 5) with subspace sizes of 1-5 (median = 3), and
low degrees of overlap (0-38%, median = 16%). For our method, we set the minimal split
threshold at 0.1 and left the other parameters undefined. For k-means, k-medians, and
average linkage we specified the true number of clusters as input. For TwoStep and Mclust,
we used the default settings, allowing the algorithm to choose the best model with at most
nine clusters.
Since the datasets are essentially mixtures of partially overlapping multivariate
normal distributions with diagonal covariance matrices, we expected the k-centroid
methods, and especially the model-based clustering procedure, to perform particularly well.
Indeed, Mclust was able to recover 98% of the clusters (range = 80-100% over the ten
datasets; mean ARI of 0.98, range = 0.9-1), and to classify correctly 96% of the objects
(range = 72-100%), albeit in a very long time (80-280 seconds). Our method attained
similar results, recovering 100% of the clusters (mean ARI of 0.96, range = 0.88-1), and
classifying correctly 98% of the objects (range = 89-100%) very quickly (in at most 0.5
seconds). TwoStep and k-medians also performed well, with 93-95% of the clusters
recovered, an average ARI of 0.93-0.94, and 91-93% of the objects classified correctly, also
in very short times (averages of 2.2 and 0.7 seconds, respectively). K-means was slightly
worse, recovering 87% of the clusters (mean ARI of 0.9), and classifying 85% of the objects
correctly, and was roughly as fast as k-medians. Only average linkage performed markedly
worse, being able to recover only 39% of the clusters and to classify correctly just 39% of
the objects (mean ARI of 0.55) in, on average, 12 seconds. These results indicate that,
similarly to state-of-the-art methods like Mclust and TwoStep, our algorithm is able to
provide solutions of good quality without prior information on the number of clusters. In
addition, it is very fast (with running times several times smaller than k-medians or
TwoStep), and this ability scales well with increasing the number of objects or dimensions
(see Section II.4.4 below). By contrast, we could not run Mclust on data with more than 5000
objects and 75 dimensions on the computer used in this study.
4.3. High-dimensional real datasets
To complete the assessment of the proposed method, we tested it on several real
datasets from the UCI Machine Learning Repository (Asuncion & Newman, 2007). We
searched the repository for multivariate datasets with high dimensionality and numerical-only attributes that were marked as appropriate for classification, and selected the
following: the University of Wisconsin Breast Cancer Original Data Set (Mangasarian &
Wolberg, 1990) – 683 observations described by nine integer valued variables, distributed
into two classes (444 benign, 239 malignant); the Johns Hopkins University Ionosphere Database (Sigillito et al., 1989) – 351 radar returns described by 34 continuous attributes,
distributed into two classes (225 good, 126 bad); the Statlog Landsat Satellite Dataset –
6435 observations with 36 hue values each (range = 0-255), grouped in six classes (e.g. red
soil, cotton crop) of sizes 626, 703, 707, 1358, 1508, and 1533; the Statlog Image
Segmentation Dataset – 16 color and contrast measures of 2310 regions cropped from
outdoor photos, grouped into seven classes (e.g. foliage, cement, sky) of equal size; the
SPAM E-mail Database – 4601 e-mails described by 57 continuous attributes (various
character and word frequencies), distributed in two classes (1813 spam, 2788 not spam);
the Statlog Vehicle Silhouettes Dataset (Siebert, 1987) – 846 observations described by 18
continuous attributes obtained from the processing of standardized photos of cars, grouped
into four classes (e.g. bus, van) of sizes 199, 212, 217, 218; and the Multiple Features Digit
Dataset – six sets of measurements (with 6-240 features per set, median = 70) of 2000
monochrome images of digits, distributed into ten classes (“0” to “9”) of equal size.
Since we had no prior information about the separability of the classes of objects,
we opted not to define the maximum number of clusters. Instead, in each case, we set the
minimal cluster size threshold at the size of the smallest extant class. Thus, assuming that
the true clusters are not linearly separable, and therefore would likely be subdivided by our
algorithm, our parameter choice guaranteed that any fragments that are smaller than a true
cluster will not be further partitioned, as they probably span only one cluster.
Correspondingly, we relaxed the cluster recovery rule defined in Section II.4.1 to allow for
smaller cluster fragments. We associated the clusters found with the original clusters if
their intersection contained at least 75% of the objects from the retrieved cluster, and at
least 25% of the objects from the true cluster. The performance was rather modest, with a
mean ARI of 0.31 (range = 0.14-0.53 over the twelve datasets). On average, at least one
fragment was recovered for 52% of the clusters (range = 20-100%), with a mean number of
1.4 fragments per cluster, and 38% of the objects (range = 10-91%) were correctly
classified. When allowing for more heterogeneity within the retrieved clusters, by lowering
the 75% requirement mentioned above to 50%, the proportion of recovered clusters
increased to 76% (range = 50-100%), while the fraction of correctly classified objects
increased to 50% (range = 32-91%). These results suggest that the considered datasets
contained significantly overlapping, linearly non-separable clusters. Indeed, errors were
typically larger when the algorithm had to discriminate between very similar images, e.g.
damp gray soil and very damp gray soil, cement patches and paths, or the digits “6” and “9”.
For comparison, we also clustered the twelve test datasets using the k-means,
TwoStep, and Mclust algorithms. For k-means and TwoStep we specified the correct
number of clusters beforehand, while for Mclust we employed the default automatic model
selection. The performances were similar to that of our method, albeit more varied across
the datasets. They had slightly lower rates of cluster recovery (averages: 38-43%, range: 0-100%) and correct classification of objects (averages: 32-35%, range: 0-96%), but
somewhat larger ARI values (averages: 0.4-0.44, range: 0-0.88) due to the reduced cluster
fragmentation.
4.4. Theoretical and empirical algorithm complexity
To calculate the theoretical complexity, let n be the total number of objects, d the number of attributes, s the average cluster subspace size, m the size of a group of objects analyzed at some point, and q the number of true clusters contained in this group. The assessment of the multimodality of the distributions of this group on all dimensions involves the calculation of O(d) sums over the m objects, and hence has complexity O(md). On average, only a reduced number of dimensions, of the order of q·s, is employed in the subsequent search for optimal projections. The estimation of the correlations between the selected attributes is based on a subsample whose size grows much more slowly than m (Section II.2.3), and is therefore comparatively cheap. The determination of the principal components for the dimensions involved in the largest correlations requires an eigenvalue decomposition of the corresponding correlation submatrix, whose size does not depend on m; the subsequent calculation of the projections of the objects on the retained principal directions is linear in m. The construction of histograms is of order at most O(md), when all coordinate projections are also analyzed. The analysis of local minima depends quadratically on the number of bins, which is bounded above (Section II.2.4), and therefore contributes only a bounded cost per projection. Aggregating, the expected analysis time for one cluster is dominated by the terms linear in m and d. Let now k be the final number of clusters found by the algorithm. Then at most 2k − 1 clusters of progressively smaller size and number of children are analyzed in total. The decrease in size and number of subclusters should be approximately linear with respect to the index. Therefore, on average, analyzing additional clusters will amount to a sub-linear increase in computation time. Summarizing, the overall expected algorithm complexity is approximately linear in both n and d, and sub-linear in the number of clusters found.
To verify the above calculations empirically, we performed three sets of tests
(consisting of 160-250 tests each) using randomly generated data with very small overlap
between the clusters (generated as explained in Section II.4.1). In each set, we varied one of
the following three parameters – number of objects (10000 to one million), number of
dimensions (1-500), and number of clusters (2-240), while keeping the other two
restricted. The results (see Figure II-8A-C) confirmed our expectations: the running time
was linear with respect to the number of objects (exponent = 0.997 with a 95%-confidence
interval of ± 0.045, adjusted R2 = 0.989), slightly superlinear with respect to the number of
dimensions (exponent = 1.067 ± 0.023, adjusted R2 = 0.995), and sublinear with respect to
the number of clusters found (exponent = 0.774 ± 0.074, adjusted R2 = 0.932). Furthermore,
the total time was generally very low, on the order of seconds to tens of seconds on a regular
computer (all tests were performed on a computer with 2 GHz dual-core Intel processor
and 2 GB of RAM).
Figure II-8: Timing performance. A – C: Running time, as function of the
number of objects (A; 5-6 clusters in 20 dimensions), the number of
dimensions (B; 50000-55000 objects in 5-6 clusters), and the number of
clusters found (C; 50000-55000 objects in 20 dimensions). Clusters were
defined in 1-5 dimensions, with an average degree of overlap of at most 5%.
Dashed lines represent the best approximations by power functions
(adjusted R2 > 0.93 in all cases). The exponents are 0.997 for the number of objects, 1.067 for the dimensionality, and 0.774 for the number of clusters found.
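The reported exponents correspond to ordinary least-squares fits in log-log coordinates, as sketched below; the vectors nObjects and runTime are placeholders for the recorded test results.

    % nObjects: dataset sizes used in one test series; runTime: corresponding
    % running times in seconds (both placeholders for the measurements).
    x = log(nObjects(:));  y = log(runTime(:));
    p = polyfit(x, y, 1);                       % linear fit in log-log coordinates
    exponent = p(1);                            % estimated power-law exponent
    res = y - polyval(p, x);
    R2 = 1 - sum(res.^2) / sum((y - mean(y)).^2);
    R2adj = 1 - (1 - R2) * (numel(y) - 1) / (numel(y) - 2);   % adjusted R^2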
We conducted an additional set of 100 tests on synthetic data of similar structure as
above (24000-81000 objects, 10-90 dimensions, 3-133 clusters found), where we
monitored the time spent by the algorithm in the various sections. On average,
approximately 45% of the time was used for constructing histograms, 23% for calculating
projections, 19% for the modality tests, 5% for estimating correlations, and 5% for
analyzing local minima (with the remaining 3% taken up by overhead). The low value registered for the correlation estimation step is an effect of the attribute selection procedure, and is consistent with the observed low-order dependence on the number of
dimensions (Figure II-8B); without prior feature selection, the dependence would have
been quadratic.
III. Generation of synthetic datasets with clusters
1. State of the art
The only effective technique for the evaluation of clustering algorithms relies on the
analysis of a large number of synthetic datasets (Milligan, 1996). Since the exact cluster
structure of computer-generated datasets is known, the performance of any clustering
algorithm can be measured efficiently by calculating the agreement between the partition
found by the algorithm and the true partition. Moreover, depending on the method
employed, it is possible to generate datasets with variable degrees of clustering difficulty,
for example by adding measurement error, by appending different amounts of noise, or by
varying the degree of separation between the clusters. Correspondingly, clustering
algorithms that perform poorly on datasets with well-separated clusters can be considered
of low quality, while algorithms which perform well on closely-spaced clusters can be
considered of high quality (Qiu & Joe, 2006).
Even when using simulated datasets, generalizability remains questionable, since
the results of the performance analysis may only be valid for datasets with similar
structures (Milligan, 1996). Many of the early approaches for constructing artificial datasets
(see Steinley & Henson, 2005, for a review) have relatively simplistic implementations of
cluster separation and/or overlap. The procedure of McIntyre and Blashfield (1980)
generates mixtures of multivariate normal distributions with random covariance matrices
and uniformly distributed means. The overlap between clusters is varied in a heuristic
manner, by scaling the covariance matrices of all clusters with a common factor.
Consequently, although the overall degree of overlap increases with the scaling factor, some
pairs of clusters may overlap much more simply because their centers lie close to each
other. This is also a problem for the method of Overall and Magee (1992), which constructs
mixtures of diagonal multivariate normal distributions with normally distributed means
using a similar implementation of cluster overlap. The widely used algorithm of Milligan
(1985) generates datasets consisting of well-separated clusters sampled from independent
truncated normal distributions on each dimension. Measurement error as well as outliers
can be added to the datasets, however with unpredictable effects on the intended cluster
separation (Atlas & Overall, 1994).
Many of the generators of synthetic datasets employed by researchers to validate
their clustering algorithms have similar drawbacks. Datasets constructed with the method
of Zhang et al. (1997) consist of simple clusters with diagonal multivariate normal
distributions and with centers selected uniformly or placed on a regular grid. While
clustering difficulty can be increased by adding uniform noise, the effects of the potential
overlap between clusters are not taken into account. The algorithm described by Zaït and
Messatfa (1997) builds datasets with hyper-rectangular clusters sampled from multivariate
uniform distributions, featuring very narrow ranges on selected dimensions. The degree of
separation between clusters depends on the number of such special dimensions and on the
widths of the clusters on these dimensions, all of which must be specified by the user. A
more refined method for creating datasets containing this type of clusters has been
developed by Procopiuc et al. (2002). In the latter approach, clusters are sampled from
independent normal or uniform distributions on selected dimensions, and from uniform
distributions spanning the entire data space on all other dimensions. Each successive
cluster shares some of the special dimensions with the previous cluster, and it is placed
such that it overlaps moderately with the previous cluster on their common dimensions.
However, since all these decisions are random, it is difficult to quantify the resulting degree
of cluster overlap.
More recently, significant theoretical work has addressed the production of artificial
datasets having deterministic levels of separation or overlap between the clusters. The
algorithm of Waller et al. (1999) can produce a variety of multivariate clusters with
different volumes and covariance matrices, and with objects sampled from normal-like
distributions with different degrees of skewness and kurtosis. The positioning of the
clusters and their relative degrees of overlap are calculated using what the authors define
as cluster indicator validities: measures related to the total between-clusters variance. The
method of Qiu and Joe (2006) constructs mixtures of multivariate normal distributions with
arbitrary covariance matrices, centered at the vertices of a multi-dimensional grid of
equilateral simplexes. In order to increase the complexity of such datasets, normally
distributed noise variables or uniformly distributed noise objects can be appended to the
dataset. Furthermore, the entire dataset can be subjected to a random rotation, in order to
ensure that clusters are not distinguishable from each other in low-dimensional coordinate
projections. To adjust the degree of overlap, the clusters are rescaled such that the
geometrical gap between each pair of clusters satisfies the user-prescribed minimal
requirements. By contrast, the method developed by Steinley and Henson (2005) relies on a
more precise definition of the degree of overlap between any two clusters, which requires
the analytical estimation of the intersection of the marginal probability distributions of the
clusters. This latter approach can produce variable datasets, containing clusters with
different covariance matrices and objects sampled from different probability distributions,
but it depends heavily on the user input. For example, it is necessary to specify the desired
degree of overlap or separation between each pair of clusters, with the restriction that no
cluster can overlap with more than two other clusters. The procedure of Maitra and
Melnykov (2010) uses a similar measure of exact overlap: for any two clusters, their overlap
score is calculated as the combined probability of misclassification. This measure offers a
very accurate description of the clustering difficulty of a dataset, but it is rather difficult to
calculate for generic clusters. Consequently, the authors restrict their approach to mixtures
of multivariate normal distributions with random covariance matrices.
2. Proposed method
As the methods outlined above show, there is a trade-off between the capacity to
generate datasets containing clusters of different types or shapes, and the ability to specify
the degree of separation or overlap between the clusters. In the present work, we propose a
novel procedure for the creation of synthetic datasets, which offers an optimal balance
between these two aspects. Similarly to the method of Steinley and Henson (2005), our
algorithm can construct clusters with different sizes, shapes, and orientations (see Section
III.3.3), consisting of objects sampled from different probability distributions (Section
III.3.2). As in the method developed by Procopiuc et al. (2002), some or all clusters can be
well-defined in restricted subsets of dimensions. However, unlike previous methods, our
proposed algorithm allows some clusters to also have curvilinear shapes (see Section
III.2.2), or multimodal distributions (Section III.3.3). Despite this flexibility, which allows
users a wide variety of choices in the construction of clusters, the proposed generator has a
relatively low number of inputs, that are intuitive and are accessible through a user-friendly
interface (see Sections III.3.1-III.3.2). The clusters are distributed within the data space
using a greedy geometrical procedure (Section III.2.3). The overall degree of cluster overlap
is adjusted by scaling the clusters, as in Qiu and Joe (2006). Although heuristic, our approach is highly effective in prescribing the cluster overlap (see Section III.4.2). Furthermore, it can be extended to allow for the production of datasets with non-overlapping clusters with defined degrees of separation.
2.1. Dataset generation algorithm
We propose a bottom-up approach for generating synthetic datasets containing
clusters (see Figure III-1), where clusters are broadly defined as groups of objects that are
similar to each other, but dissimilar to the other objects in the dataset. In the present work,
we consider several specific types of clusters. Unimodal or center-defined clusters have
density functions with a unique, global maximum, and can therefore be described in
reference to that point. By contrast, multimodal clusters have density functions with
multiple maxima, and can be interpreted as collections of overlapping unimodal subclusters
centered at the local maxima points. Rotated or ellipsoidal clusters have the principal axes
rotated with respect to the coordinate system, either orthogonally or obliquely, and
therefore have non-diagonal covariance matrices. Nonlinear or curvilinear clusters have
one or more principal axes curvilinear or mapped to a non-trivial manifold. Subspace or
projected clusters are well-defined, and hence distinguishable from the rest of the data, in
only a subset of dimensions, while on the remaining, non-relevant dimensions they follow
noise distributions spanning the entire data space. With the exception of unimodal and
multimodal, all types of clusters presented above are non-exclusive.
The proposed algorithm starts by randomly choosing the size, modality,
dimensionality, shape and orientation of each cluster from the set of possible values
prescribed by the user. In the next step, the algorithm constructs the clusters sequentially
according to the selected parameters. For each unimodal cluster, the values taken by all its
objects on each of the attributes are sampled from independent probability distributions,
and scaled such that the value ranges are conforming to the desired aspect ratio and volume
of the cluster (Section III.3.3). Multimodal clusters are built as sub-datasets consisting of
several overlapping unimodal subclusters (Figure III-1). The subclusters are generated
similarly to unimodal clusters, and combined with the same procedure as that employed for
assembling the entire dataset (see below). Next, the current cluster may be subjected to a
nonlinear distortion (see Section III.2.2), followed by a random rotation, either orthogonal
or oblique. To generate arbitrary rotations, the algorithm uses a variation of the method
developed by Anderson et al. (1987) (see Section III.3.3). After all clusters have been
created, the final dataset is assembled by placing the clusters in the data space. Here, we use
a custom procedure that allows for the precise control of the degrees of cluster overlap and
of spatial heterogeneity of the dataset (Section III.2.3).
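As an illustration of the overall pipeline, the following runnable MATLAB sketch builds a dataset of diagonal Gaussian subspace clusters plus background noise, in the spirit of the simplified generator used in Section II.4.1; all numerical choices are illustrative, and none of the refinements described in this chapter (multimodality, nonlinear distortions, rotations, overlap control) are included.

    rng(1);                                             % reproducible example
    d = 20;  nClusters = 5;  clusterSize = 1000;  nNoise = 500;
    X = [];  labels = [];
    for c = 1:nClusters
        nRelevant = randi([1 5]);                       % size of the cluster subspace
        relevant  = randperm(d, nRelevant);             % attributes defining the cluster
        mu = zeros(1, d);  sigma = ones(1, d);          % standard normal elsewhere
        mu(relevant)    = 0.4 + 3.4 * rand(1, nRelevant);
        sigma(relevant) = 0.3 + 0.3 * rand(1, nRelevant);
        C = bsxfun(@plus, bsxfun(@times, randn(clusterSize, d), sigma), mu);
        X = [X; C];  labels = [labels; repmat(c, clusterSize, 1)];  %#ok<AGROW>
    end
    X = [X; randn(nNoise, d)];  labels = [labels; zeros(nNoise, 1)];  % noise objects
    perm = randperm(size(X, 1));                        % randomize the object order
    X = X(perm, :);  labels = labels(perm);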
Figure III-1: Schematic representation of the proposed procedure for
generating synthetic datasets with clusters. Rectangular cells represent
calculation steps, rounded cells represent optional calculation steps, and
hexagonal cells represent conditional nodes. Unimodal clusters (blue cells),
and subclusters of multimodal clusters (purple cells), are constructed
consecutively through the application of several linear and nonlinear
transformations to blocks of data sampled from independent distributions
on each dimension. Subclusters are combined to form multimodal clusters,
which can then be subjected to further alterations. The dataset is then
assembled by combining the generated clusters with a prescribed degree of
overlap. Optionally, noise objects or dimensions can be added during dataset
construction. In addition, the order of the objects or dimensions can be
randomized (yellow cells).
In the next stage, noise objects and/or dimensions sampled from predefined
probability distributions with range equal to the size of the data space can be appended to
the data in the desired proportions. In addition, if the dataset contains subspace clusters,
the coordinates of their objects on the non-relevant dimensions are also generated at this
step. In the last stage, the dataset is formatted for output, e.g. by randomly permuting the
objects and/or dimensions, or by normalizing the values recorded on each attribute. With
this procedure, it is possible to produce datasets of varied degrees of clustering difficulty
and containing clusters with highly variable characteristics (see Section III.4).
2.2. Construction of nonlinear clusters
Clusters with nonlinear principal axes can be obtained from generic, axes-parallel
clusters through the application of various space-distorting transformations such as
functions which send lines to curves or planes to convex surfaces. Such transformations
involve the mapping of subspaces to manifolds of equal dimensionality embedded in higher
dimensional subspaces. The coordinate system of the transformed subspace is mapped to
the locally defined set of directions tangent to the manifold. Correspondingly, the
dimensions from the embedding subspace that do not belong to the transformed subspace
are mapped to the set of directions normal to the manifold. Dimensions outside the
embedding subspace, if any, will remain unchanged. In the present work, we consider
transformations involving up to three dimensions, and for which the target manifolds have
fixed curvatures. Under these conditions, three different types of distortions are possible:
mapping planes to cylindrical surfaces, by transforming lines into either circles or helixes;
mapping planes to convex surfaces; and mapping planes to concave surfaces. In the
following, each of these cases will be described in more detail.
Let C be a cluster considered for nonlinear deformation. Denote by m the number of
objects contained, and by s the dimensionality of the cluster. Assume the cluster is
represented in numerical matrix form, with objects on rows and attributes on columns.
Then C(i, ·) are the values that the i-th object takes on all attributes, while C(·, j) is a list of the values taken by all objects on the j-th attribute. Let R_j denote the range of the cluster C on dimension j, i.e. R_j = max(C(·, j)) − min(C(·, j)). Without loss of generality, we assume the cluster is centered on each dimension: |max(C(·, j))| = |min(C(·, j))| = R_j / 2 for all j in {1, …, s}.
The simplest nonlinear transformation we consider is the deformation of parallel lines into concentric circles. Let j1 and j2 in {1, …, s} be two randomly selected dimensions, denote by x1 and x2 the coordinates of an object on these dimensions and by R1 and R2 the corresponding ranges of C, and assume we intend to curve the j1 axis in the (j1, j2) plane. Let φ in (0, 2π] be the desired angle that the cluster will subtend after distortion. To ensure that C does not intersect with itself after transformation, the angle φ must not exceed 2π. The corresponding curvature radius of the principal j1 axis of C is r = max(R1/φ, R2/2), where the latter term is included to ensure injectivity of the nonlinear transform over the convex hull of C. To make larger angles possible, the involved dimensions should be selected such that R1 is large relative to R2. The desired distortion can be achieved by mapping the objects in C to circles centered at (0, −r), with radii equal to their distance from this reference point on the j2 axis, and angles proportional to the ratio between their j1 values and the curvature radius r:

(x1, x2) ↦ ((r + x2) sin(x1/r), (r + x2) cos(x1/r) − r).

According to this transformation, lines parallel to the j1 axis in the original setting are mapped to circles centered at (0, −r), while lines parallel to the j2 axis are mapped to radial spokes passing through this reference point. Furthermore, the mapping is injective over the region {(x1, x2) : |x1| < πr, x2 > −r}, and hence over the convex hull of C. The projection of the transformed cluster on the (j1, j2)-plane will be an annular or, more rarely, a circular sector. Correspondingly, three-dimensional projections including the j1 and j2 axes will show C as a cylindrical fragment (see Figure III-2B, C).
Since lines with constant x2 values that lie closer to (0, −r) are mapped to circles of smaller radii, and hence lengths, than lines further away, different areas are compressed or enlarged depending on their position relative to the reference point. Therefore, the object densities are changed, and no longer reflect the original probability distributions of C in the new coordinate system. For example, if C is uniformly distributed on both the j1 and j2 directions, the transformed cluster will exhibit a uniform angular distribution of objects, but its density will decrease with the radius. To maintain the original distributions, the area changes must be compensated by modifications of opposite magnitude in object density. Here, we eliminate a fraction of the objects lying in the compressed regions, proportional to the observed changes in the areas of surface elements, i.e. proportional to the Jacobian determinant of the distortion map. Let dA(x1, x2) be the surface element near the point (x1, x2) in the original coordinate system, and dA'(ρ, θ) be the surface element near the image point lying at radius ρ = r + x2 and with angle θ = x1/r in the new polar system. Then the relative area change, J(x1, x2), is:

J(x1, x2) = dA'(ρ, θ) / dA(x1, x2) = ρ |∂θ/∂x1| |∂ρ/∂x2| = (r + x2)/r.

Let us set the probability of objects being retained after the application of the nonlinear deformation to the following:

p(x1, x2) = J(x1, x2) / max J = (r + x2) / (r + R2/2).

With this adjustment, the down-sampled, transformed cluster will correctly display the probability distributions of C in the new coordinates.
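A minimal sketch of this transformation, including the density-preserving down-sampling, is given below; the first two columns of the cluster matrix play the roles of the selected dimensions j1 and j2, and the function reflects the reconstruction above rather than the exact implementation.

    function C = bendCluster(C, phi)
    % Curve the first axis of the centered cluster C (objects in rows) towards
    % the second axis by an angle phi in (0, 2*pi], then thin the compressed
    % regions so that the original distributions are preserved.
        R1 = max(C(:, 1)) - min(C(:, 1));
        R2 = max(C(:, 2)) - min(C(:, 2));
        r  = max(R1 / phi, R2 / 2);               % curvature radius (injective over C)
        x1 = C(:, 1);  x2 = C(:, 2);
        C(:, 1) = (r + x2) .* sin(x1 / r);        % lines x2 = const become circles
        C(:, 2) = (r + x2) .* cos(x1 / r) - r;    % centered at (0, -r)
        pKeep   = (r + x2) / (r + R2 / 2);        % retention probability ~ area change
        C       = C(rand(size(C, 1), 1) <= pKeep, :);
    end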
A variation of the curvilinear transformation outlined above is the mapping of lines to three-dimensional curves. We consider here the special case of helixes, which have constant curvature and constant pitch. The corresponding nonlinear deformation is a composition of the circular distortion described above and a shearing function. Let j1, j2 and j3 in {1, …, s} be three randomly selected attributes, with object coordinates x1, x2, x3 and cluster ranges R1, R2, R3, and assume we intend to curve the j1 axis in the (j1, j2) plane, while shearing with respect to j3. Let φ in (0, 2π] be the desired angle that cluster C will subtend after distortion in the (j1, j2) projection. As above, the corresponding curvature radius of the principal j1 axis of C is r = max(R1/φ, R2/2). Furthermore, let h be the pitch of the helix, bounded below so that successive turns do not overlap along the j3 axis. The lower limits for r and h ensure that the mapping will be injective over the convex hull of C. The desired nonlinear function is then:

(x1, x2, x3) ↦ ((r + x2) sin(x1/r), (r + x2) cos(x1/r) − r, x3 + h·x1/r).

Depending on the relative sizes of R1, R2, R3, and h, the transformed cluster will have the appearance of a true helix, a spiral ramp, or a sheared cylinder fragment (e.g. Figure III-2D). As before, the correct transfer of the original probability distribution of C to the new cylindrical coordinate system can be achieved by weighted down-sampling, by setting the probability that any object from C remains in the transformed cluster to:

p(x1, x2, x3) = (r + x2) / (r + R2/2).
The second class of nonlinear transformations that can be performed in a three-dimensional space consists of functions that send two-dimensional planes to convex surfaces. Such mappings can be constructed from two successive cylindrical distortions. Let j1, j2 and j3 in {1, …, s} be three randomly selected dimensions, with object coordinates x1, x2, x3 and cluster ranges R1, R2, R3, and assume we intend to curve the j1 and j2 axes with respect to the j3 axis. Without loss of generality, assume that the distortion of the j1 axis is performed first. Let φ1 and φ2 be the desired curvature angles. To prevent the cluster from intersecting with itself after deformation, both angles must be bounded as in the cylindrical case, and, in order to ensure that the distorted cluster will not have concave regions, the second bend may not be tighter than the surface produced by the first one. The corresponding curvature radii are r1 = max(R1/φ1, R3/2) for the principal j1 axis, and r2 = max(R2/φ2, c2) for the principal j2 axis, where the lower bound c2 depends on R3, r1, and φ1. These minimal values are introduced to make sure that the reference points of the two cylindrical distortions, located at distances r1 and r2 below the cluster center along the (possibly already bent) j3 direction, lie outside the convex hull of the cluster. Combining the two nonlinear mappings yields the following:

(x1, x2, x3) ↦ ( (r1 + x3) sin(x1/r1),
(r2 + (r1 + x3) cos(x1/r1) − r1) sin(x2/r2),
(r2 + (r1 + x3) cos(x1/r1) − r1) cos(x2/r2) − r2 ).
The deformation function presented above compresses the regions of cluster C lying at negative x3 values and large absolute x1 values, and stretches the regions with positive x3 coordinates and x1 values close to zero (see Figure III-2G, H). In order to map the probability distribution of C correctly, a proportion of the objects belonging to the compressed regions must be discarded. This can be achieved by setting the probability of object retention proportional to the relative change of the local volume. Given that the two deformations are applied sequentially, the ratio J(x1, x2, x3) between the final and the original volume elements can be calculated as the product of the two successive changes in the areas of surface elements:

J(x1, x2, x3) = [(r1 + x3)/r1] · [(r2 + (r1 + x3) cos(x1/r1) − r1)/r2].

For fixed x2 and x3 values, the ratio J(x1, x2, x3) has its maximum at x1 = 0. Consequently, setting x1 to zero reduces J to (1 + x3/r1)(1 + x3/r2), a quadratic function in x3 that is monotonically increasing over the range of C, and is therefore largest at x3 = R3/2. Thus, the probability of objects not being discarded can be calculated as:

p(x1, x2, x3) = J(x1, x2, x3) / J(0, x2, R3/2)
= [(r1 + x3)(r2 + (r1 + x3) cos(x1/r1) − r1)] / [(r1 + R3/2)(r2 + R3/2)].
The final convex shape of the transformed cluster depends on the order in which the two
deformations are performed. The differences are particularly noticeable at large distortion
angles and similar curvature radii (see Figure III-2G, H). As an alternative, we developed a
method for mapping planes to a spherical surface in a more symmetrical manner. Let r be the common curvature radius of the principal j1 and j2 axes of the cluster C, chosen, as above, large enough that the reference point located at distance r below the cluster center along the j3 axis lies outside the convex hull of C. The following function performs both distortions at the same time, mapping (j1, j2)-planes in C directly onto spheres centered at that reference point; writing ρ = sqrt(x1² + x2²), it is:

(x1, x2, x3) ↦ ( (r + x3) sin(ρ/r) x1/ρ, (r + x3) sin(ρ/r) x2/ρ, (r + x3) cos(ρ/r) − r ).
Under this transformation, lines parallel to the j1 or j2 axes, as well as circles centered at the origin in the (j1, j2) plane, are mapped to circles lying on spheres centered at the reference point. Correspondingly, lines parallel to the j3 axis are mapped to radial spokes passing through this reference point. The distortion function compresses the regions of cluster C lying at negative x3 values and large absolute x1 and x2 values, and stretches the regions with positive x3 coordinates and x1 and x2 values close to zero (see Figure III-2I). To maintain the original probability distributions in the new coordinate system, it is necessary to eliminate some of the objects, proportional to the observed variations in local volume elements. Let dV(x1, x2, x3) be the volume element near the point (x1, x2, x3) in the original coordinate system, and dV'(R, θ, ψ) be the volume element near the image point lying at radius R = r + x3, inclination θ = ρ/r, and azimuth angle ψ = arctan(x2/x1) in the new spherical coordinate system. Then the relative volume change, J(x1, x2, x3), is:

J(x1, x2, x3) = dV'(R, θ, ψ) / dV(x1, x2, x3) = [(r + x3)²/r²] · [sin(θ)/θ].

In this case, J(x1, x2, x3) is a product of two independent terms: a quadratic function of x3, and a multiple of the function sin(θ)/θ, where θ is defined as above. The former function is monotonically increasing over the range of C on the j3 axis, while the latter has a global maximum at θ = 0. The corresponding probability of object retention is:

p(x1, x2, x3) = J(x1, x2, x3) / J(0, 0, R3/2) = [(r + x3)²/(r + R3/2)²] · [sin(θ)/θ].

In addition, to ensure that the cluster does not intersect with itself at large distortion angles, all objects outside the cylindroid {(x1, x2, x3) : sqrt(x1² + x2²) ≤ πr} are discarded. With these adjustments, the down-sampled, transformed cluster correctly exhibits the original probability distributions of C in the new spherical coordinate system.
The third class of three-dimensional nonlinear transformations considered here
consists of functions that map planes to concave surfaces with constant curvatures.
Figure III-2: Effects of different nonlinear transformations on a sample
three-dimensional cluster. Wire frames represent the example cluster,
while black lines denote its principal axes. Several types of transformation
can be applied, including cylindrical, concave, and convex. A: Original,
undistorted cluster. B, C: Examples of cylindrical deformations, with one axis
curved towards one of the other axes. D: Example of helicoidal deformation,
with one axis curved towards one of the other axes, and the third axis
sheared with respect to the curved one. E: Example of saddle deformation,
with two axes curved successively towards opposite directions of the third
axis. F: Example of concave deformation, with one axis curved towards one
of the other axes, and the third axis curved towards the previously curved
one. G, H: Examples of convex deformations, with two axes curved
successively towards the same direction of the third axis. Note the visible
effect of performing the distortions of the two axes in reverse order
(compare panels H and G). I: Example of spherical deformation, with two
axes curved simultaneously towards the same direction of the third axis.
Similarly to the mappings of planes to convex surfaces discussed above, deformations of planes into concave surfaces can also be constructed from two successive cylindrical distortions. Let j1, j2 and j3 in {1, …, s} be three randomly selected dimensions, with object coordinates x1, x2, x3 and cluster ranges R1, R2, R3, and assume we intend to curve the j1 and j2 axes with respect to the j3 axis, but in opposite directions. Let φ1 in (0, π] and φ2 in (0, π] be the desired angles that the cluster will subtend after deformation. The corresponding curvature radii are r1 = max(R1/φ1, R3/2) for the principal j1 axis and r2 = max(R2/φ2, R3/2) for the principal j2 axis of C, where the R3/2 lower limit is enforced in order to ensure that the distortion map is injective over C. Without loss of generality, assume that the distortion of the j1 dimension with respect to j3 is performed first:

(x1, x2, x3) ↦ ((r1 + x3) sin(x1/r1), x2, (r1 + x3) cos(x1/r1) − r1).

The second cylindrical deformation sends lines which are parallel to the j2 axis to circles centered at the point (0, r2) in the (j2, j3) projection, i.e. it curves the j2 axis towards the opposite direction of the j3 axis:

(y1, y2, y3) ↦ (y1, (r2 − y3) sin(y2/r2), r2 − (r2 − y3) cos(y2/r2)).

The composite map is therefore:

(x1, x2, x3) ↦ ( (r1 + x3) sin(x1/r1),
(r2 + r1 − (r1 + x3) cos(x1/r1)) sin(x2/r2),
r2 − (r2 + r1 − (r1 + x3) cos(x1/r1)) cos(x2/r2) ).
Following this deformation, the
projection of the cluster is a saddle-like shape
with thickness equal to
(e.g. Figure III-2E, Figure III-8E). However, since different
regions of are distorted to different degrees, the original probability distributions are
distorted as well. As in the case of convex deformations, it is possible to calculate the pointby-point change of the volume element, (
), as the product of the two changes in the
area of
and
surface elements induced by the two successive cylindrical
transformations:
(
)
( ⁄ )
(
)
(
) (
)
As in the case of mappings to convex surfaces discussed above, the change in volume does not depend on the coordinate along the undistorted axis. For fixed values of the second curved coordinate, it is a decreasing function of a simple trigonometric rescaling of the first, so its maximum in that direction is attained at the corresponding boundary of the cluster range. Fixing the first coordinate accordingly reduces the change in volume to a function of the second coordinate that, depending on the prescribed curvature angles, is either a convex quadratic, a linear function with positive slope, or a concave quadratic; in each case the location of the maximum over the cluster range can be determined analytically, yielding a closed-form, piecewise expression in terms of the two curvature angles.
The corresponding probability of an object being retained after distortion is again the local change in volume divided by this maximal value.
In the nonlinear transformation presented above, both elementary deformations are performed about the same axis. However, this is not required in order to construct concave surfaces. Let three dimensions again be selected at random, and assume we first curve the first of the corresponding axes with respect to the third. Next, instead of curving the second axis towards the opposite direction, we distort the third axis with respect to the second. Given the two desired curvature angles, the corresponding curvature radii are derived for the two principal axes involved, with enforced lower bounds that, similarly to the previous transformations, ensure that the distortion map is injective over the convex hull of the cluster. The first distortion maps lines parallel to the first selected axis to circles in the plane of the first and third dimensions, while the second sends lines parallel to the third axis to circles in the plane of the second and third dimensions.
Combining the two nonlinear mappings yields the composite distortion that is applied to the cluster objects.
As before, the original probability distributions of the cluster can be preserved by down-sampling proportionally to the observed changes in the size of the volume elements. This change in volume is again the product of the two area changes induced by the successive cylindrical distortions. For fixed values of one of the curved coordinates it increases monotonically in the other, so its maximal value in that direction is attained at the corresponding boundary; with that coordinate fixed, it becomes a convex quadratic function of the remaining coordinate whose critical point lies outside the cluster range, and its maximum over the cluster is therefore also attained on the boundary. The corresponding object retention probability is again the local change in volume divided by this maximal value.
The transformed cluster has the appearance of a toroidal segment (Figure III-2F), but of a different type than if it had been deformed with the previous method (Figure III-2E).
2.3. Construction of the dataset
The process of synthetic data generation serves the primary purpose of providing
testing material for the evaluation of clustering and classification algorithms: sample
datasets with variable size, structure, and clustering complexity. Consequently, a
fundamental concern is having the ability to control in a reliable manner the clustering
difficulty of the simulated data – usually defined as a measure of the overlap between the
different clusters. Existing approaches (Section III.1) range from simplistic solutions
such as setting the distances between cluster centers depending on the cluster variances
(e.g. Procopiuc et al., 2002), to probabilistic methods that iteratively scale the clusters until
the degree of overlap between each pair of clusters falls in a prescribed interval (Maitra &
Melnykov, 2010).
Here, we rely on a greedy algorithm that tries to minimize the separation between clusters while preventing their overlap. For each cluster already appended to the dataset, the algorithm stores its minimal and maximal values on each of the d dimensions, as well as the position of its midpoint within the data space, computed on each dimension as the average of these two bounds. In addition, for each cluster and each oriented dimension (positive or negative direction along a coordinate axis), the algorithm stores the size, location, and contact face of the hyper-rectangular cell of maximal volume which extends from that cluster in the chosen direction and does not intersect any of the other already placed clusters (e.g. Figure III-3A–F). For each new cluster, the algorithm examines all currently stored cells, checking whether the cluster would fit in the available space. If so, it then calculates the increase in the volume of the data space, measured as the growth of the bounding box of all placed clusters, and the corresponding deviation from the desired aspect ratio of the data space, that would result from placing the new cluster within this cell, adjacent to the contact face. Consequently, the algorithm inserts the cluster in the cell which minimizes a weighted combination of these two quantities,
where the weight lies in [0, 1] and is proportional to k. Thus, for earlier clusters the emphasis is
on limiting the increase in volume of the data space, while for later clusters the emphasis
changes to meeting the targeted aspect ratio of the data space. If the new cluster is the first one, then its midpoint is positioned randomly within the unit cube. After the placement of the new cluster, any cell that is associated with the previous clusters and that intersects it is downsized correspondingly. In addition, a new set of cells, extending from the new cluster to infinity, is created. The
intersections of these new cells with the previously placed clusters are similarly trimmed.
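A minimal sketch of the selection criterion is given below, assuming that lo and hi hold the current bounding box of the placed clusters, that candLo and candHi describe the extent the new cluster would occupy in a candidate cell, and that w is the weight shifting from the volume term to the aspect-ratio term as more clusters are placed; all names, and the use of the longest-to-shortest extent ratio as the aspect-ratio summary, are assumptions made for illustration.

    % Score one candidate placement as a weighted combination of the growth
    % in data-space volume and the deviation from a target aspect ratio.
    function score = placementScore(lo, hi, candLo, candHi, targetRatio, w)
        newLo = min(lo, candLo);  newHi = max(hi, candHi);
        dVol = prod(newHi - newLo) - prod(hi - lo);    % increase in volume
        ranges = newHi - newLo;
        dAspect = abs(max(ranges) / min(ranges) - targetRatio);
        score = (1 - w) * dVol + w * dAspect;          % smaller is better
    end

The candidate cell with the smallest score is selected for the new cluster.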
The process outlined above is repeated until all clusters are positioned in the data
space. The resulting dataset generally exhibits compactly distributed clusters (e.g. Figure
III-3G). The spatial variability of the dataset can be increased through the random
interspersion of a number of empty spacers with sizes similar to those of the actual clusters. Thus,
some of the spaces reserved for clusters will not be used, and the dataset will have a more
heterogeneous appearance (e.g. Figure III-3H). Depending on their shapes, orientations, and
probability distributions, some clusters may have common borders, but they will never
overlap. To produce datasets of increased clustering complexity, a part or all of the clusters
can be expanded such that they exceed their assigned space and thus overlap with their
immediate neighbors. Alternatively, the same effect can be achieved by scaling down the
information about the ranges of the clusters passed to the cluster placing algorithm. The
larger the scaling factors, the more difficult it becomes to distinguish between neighboring clusters, and vice versa. Furthermore, different clusters can be assigned different scaling factors, both above and below one, thus yielding heterogeneous datasets with large spatial
variability in cluster overlap. Following the observations of Steinley and Henson (2005) on
the asymptotic relationship between marginal and joint overlap, we use maximal scaling
factors proportional to the number of dimensions. Thus, datasets of different
dimensionalities that are constructed with the same overlap settings, have similar degrees
of full-dimensional cluster overlap.
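Expanding or contracting a cluster about its midpoint, as used here to control its overlap with neighboring clusters, amounts to a single rescaling step; in the sketch below, X is an example cluster and s an assumed scaling factor.

    X = randn(200, 3);     % example cluster (assumed data)
    s = 1.5;               % expansion factor; s > 1 exceeds the assigned space
    mid  = (min(X) + max(X)) / 2;                          % 1-by-d midpoint
    Xnew = bsxfun(@plus, mid, s * bsxfun(@minus, X, mid)); % rescaled cluster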
If the dataset contains subspace clusters which are only defined on some of the
dimensions, the algorithm described above cannot be used directly. The main impediment
is that, because such clusters follow noise distributions that span the entire data space on
the non-relevant dimensions, it is not possible to know all their value ranges beforehand.
Moreover, subspace clusters can intersect with each other, e.g. if they have no relevant
dimension in common. Therefore, the cluster placement algorithm is modified such that the
subspace of each new cluster is treated as a separate setting. Instead of storing a global list
of unoccupied, hyper-rectangular cells which are updated after each cluster addition, the
modified algorithm constructs a different list of cells before placing each new cluster .
Previously placed clusters that do not have relevant dimensions in common with the new cluster are ignored. For all other clusters, and for all relevant dimensions j that each of these clusters shares with the new one, the modified algorithm defines cells extending from each of them in both the positive and the negative directions of j within the subspace of the new cluster (see Figure III-4A–F).
Figure III-3: Algorithm for placing full-dimensional clusters in the data
space. A – F: The placement of the first six clusters in a sample two-dimensional dataset. Clusters are represented by dashed ellipses. New
clusters (orange) are placed adjacently to previously existing ones (purple).
Continuous lines represent the contact regions of places available for the
addition of further clusters. The center of the first cluster placed in the data
space is marked with a black cross. G: The final configuration of the sample
dataset, after 15 iterations. Note the compact placement of the clusters
within the data space. H: The corresponding dataset, superimposed on the
configuration illustrated in panel G (light blue). Objects are represented by
crosses. Note that several empty clusters were randomly inserted in the list
passed to the placing algorithm. The spaces assigned to these clusters
remain unoccupied, increasing the spatial variability of the dataset. Also,
note that clusters do not overlap unless they are expanded beyond their
assigned spaces.
Figure III-4: Algorithm for placing subspace clusters in the data space.
A – F: The placement of the first six clusters in a sample two-dimensional
dataset. Clusters are represented by dashed ellipses. New clusters (orange)
are added adjacently to already existing ones (purple). The center of the first
cluster is marked with a black cross. The contact regions of spaces that are
available for the placement of new clusters are indicated by continuous
lines. Note that new subspace clusters can only be added along their
relevant dimensions, because they grow with the dataset in any other
direction. G: The final configuration of the sample dataset, after 15
iterations. H: The corresponding dataset, superimposed on the configuration
illustrated in panel G (light blue). Objects in full-dimensional clusters are
indicated by crosses. Objects in clusters which are defined only on the
horizontal axis are represented by circles. Squares denote objects in clusters
defined only on the vertical axis. As in the example shown in Figure III-3,
spaces assigned to empty clusters remain unoccupied, thus increasing the
spatial variability of the dataset.
The intersections of these cells with the projections of the relevant clusters within the considered subspace are subsequently removed. For this step, the ranges of the selected clusters on their irrelevant dimensions are extended to cover the entire data space, in order to ensure that the new cluster will not overlap with any of them within its subspace. Consequently, the algorithm examines each of the cells, measuring its suitability for hosting the new cluster, and then places it in the optimal location. If no other previously placed cluster was defined on any of the dimensions relevant for the new cluster, then its midpoint is positioned randomly within the unit cube.
This procedure is repeated for every cluster added to the dataset. In general, the
resulting dataset consists of compactly placed clusters (Figure III-4G). Indeed, if all clusters were full-dimensional, the cluster placement produced by the modified algorithm
would be very similar to that produced by the basic, unaltered version, although the process
would be slower. The only noticeable differences are the presence of clusters that span the
entire data range on some dimensions, and the fact that such subspace clusters can intersect
each other if they are defined in different subspaces (e.g. Figure III-4G). The methods
described above for adjusting the spatial heterogeneity and clustering difficulty of the
dataset can be employed with no further modifications. Interspersing empty cluster spacers
can create datasets looking less uniform (e.g. Figure III-4H), while expanding some or all of
the clusters results in datasets with various degrees of cluster overlap.
3. Practical implementation
The developed methodology is implemented in MATLAB 7 (R2010a; The
MathWorks, Inc., Natick, MA), in the form of a collection of “.m” files. In addition to the main
function which carries out the complete dataset generating procedure outlined in Figure
III-1, the package includes several stand-alone functions which perform the different steps
of the algorithm. The basic functions are: creating unimodal clusters or subclusters (see
Section III.3.3), anisotropic scaling (Section III.3.3), applying nonlinear transformations
(Section III.2.2), applying orthogonal or oblique rotations (Section III.3.3), and placing full-dimensional or subspace clusters in the data space (Section III.2.3). In addition, we have
implemented a graphical interface for the dataset generating function (see Figure III-5). The
interface facilitates the configuration of a variety of parameters for the construction of
clusters and datasets, from common inputs such as the number of clusters, to more
advanced, optional settings such as the degree of distortion of nonlinear clusters (Section
III.3.1). For numerical parameters, the interface allows for the setting of both minimal and
maximal limits, as well as choosing how values are selected within that interval (see Section
III.3.2). Related parameters, such as the dataset dimensionality, the proportion of noise
dimensions, and the cluster subspace size, are dynamically linked in order to prevent the
occurrence of incompatibilities. Basic information about each of the parameters is displayed
in tooltips, and more details are available through a companion help tool (Figure III-5).
Furthermore, the interface allows for the saving and loading of parameter configurations to
and from disk, as well as the batch production of synthetic datasets.
3.1. Configurable program parameters
The set of configurable parameters is distributed within the graphical interface in
four panels with different background colors, based on their complexity and importance
(Figure III-5).
The main panel, which is located in the upper left part of the interface, includes
several necessary parameters pertaining to the construction of the clusters and the dataset.
This panel allows users to define the numbers of objects, dimensions, and clusters in the
dataset, as well as the degrees of cluster overlap and of spatial variability in the placement
of clusters within the data space. The degree of cluster overlap controls the scaling of
clusters with respect to the spaces assigned by the cluster placing algorithms (Section
III.2.3). It is measured on an ordinal scale, and can take values from –50 for well-separated
clusters, to 0 for adjacent, but non-overlapping clusters, and up to 50 for highly overlapping
clusters. The corresponding scaling factors applied to the clusters increase with this degree and are adjusted according to the average cluster dimensionality d. The degree of spatial variability
influences the number of cluster spacers used by the cluster placing algorithms to produce
datasets looking less uniform. It can range from 0 for compactly-placed clusters to 100 for
using a number of spacers equal to the number of clusters in the dataset. The main panel
also permits the selection of the probability distributions employed in the construction of
clusters (see Section III.3.2), as well as in the determination of the relative sizes, scales, and
aspect ratios of the clusters (Section III.3.3). In addition, the main panel includes measures
of the overall scale and aspect ratio of the dataset. These parameters are sufficient for the
generation of highly variable synthetic datasets containing full-dimensional clusters with
principal axes parallel to the coordinate axes.
The second and third panels contain advanced settings that allow for the presence
of more specific types of clusters and of noise. All parameters present in these panels are
optional. They are deactivated by default, and appear grayed out in the interface. They can
be switched on and off via the checkboxes located in the vicinity of their names in the
interface (see Figure III-5).
The second panel, located in the upper right part of the interface, includes two
groups of parameters. The inputs in the first group allow users to specify the proportions of
noise objects and dimensions, as well as to select the probability distribution of noise. The
second group of parameters in this panel pertains to cluster rotations. Users can define the
proportions of rotated and obliquely rotated clusters, expressed as percentages, as well as
the corresponding degrees of rotation and of oblique distortion, measured on ordinal scales
from 0 (no rotation/distortion) to 100 (maximal rotation/distortion). The degree of
rotation is proportional to the number of elementary two-dimensional rotations applied to
each cluster, and to the average angle of these rotations. Correspondingly, the degree of
obliqueness is proportional to the number of distorted elementary rotations, and to the
average angular deviation from orthogonality of these rotations (see Section III.3.3).
Figure III-5: Graphical interface for the proposed synthetic dataset
generator. The interface allows users to configure a variety of settings for
the construction of clusters and datasets. Based on their complexity and
relevance, the parameters are grouped in several panels: parameters
concerning the structure of the dataset and the sizes and shapes of clusters
(upper left corner); optional parameters pertaining to noise and cluster
rotations (upper right corner); optional specifications pertaining to
particular cluster types (lower left corner); and options related to the output
format (lower right corner). Program functions such as saving the current
settings for later use, or retrieving help information about the parameters,
are also accessible from the interface (lower panels).
The third panel, located in the lower left part of the interface, contains three groups
of parameters, pertaining to subspace, multimodal, and nonlinear clusters, respectively. For
subspace clusters, the interface allows users to specify the proportion and subspace
dimensionality. For multimodal clusters, it is possible to define the number, dimensionality,
and type of subclusters within multimodal clusters (Section III.3.3), as well as to specify the
proportion of clusters in the dataset which are multimodal. For nonlinear clusters, the panel
permits the selection of the types and degrees of nonlinear distortions applied to the
clusters (Section III.2.2), in addition to the specification of the proportion of distorted
clusters. In the latter case, the degree of distortion is proportional to the curvature of the
nonlinear distortion applied to the cluster, and can take values from 0 for no distortion to
100 for distortions to closed shapes, e.g. cylinders, spheres or tori.
The fourth and last panel, located in the lower right corner of the interface, contains
several basic parameters concerning the format of the generated datasets. Up to three
different randomization procedures can be applied to each dataset: the random
permutation of objects, dimensions, and clusters. For normalization, the coordinates of all
objects can be centered at zero on each dimension, have the lower bound at zero on each
dimension, or occupy different ranges with arbitrary offsets from zero on the different
dimensions. Furthermore, the data can be discretized with a prescribed step size.
Additionally, the last panel allows users to set the number of datasets to be generated, and
to define the names of the variables storing the datasets and the corresponding lists
specifying the cluster membership of each object. For simplicity, a numerical index is
automatically appended to the variable names if more than one dataset is generated.
3.2. Generation of random numbers
For the sampling of random data from specific probability distributions, our
algorithms rely on the two built-in functions from the core MATLAB package “randfun”,
“rand” and “randn”. Starting with MATLAB R2008b, the “rand” function uses the Mersenne
Twister algorithm (Matsumoto & Nishimura, 1998) for generating uniformly distributed
pseudorandom numbers. Correspondingly, the “randn” function relies on uniformly
distributed numbers to produce normally distributed values with the ziggurat method
(Marsaglia & Tsang, 2000). The Mersenne Twister generator has a very long period of 2^19937 – 1, and can be initialized from a large number of different seeds. To ensure
independence between different sessions, each of the functions from our package (see
above) changes the seed to a value dependent on the time indicated by the system clock.
The method developed here uses the MATLAB “rand” function for the uniform
selection of various weights or parameter values within specified limits (e.g. Figure III-6C),
as well as for generating uniformly distributed clusters and noise. Similarly, the “randn”
function is employed in the generation of normally distributed clusters and noise, as well as
in the Gaussian selection of parameter values within a given range. For convenience, the
employed normal distribution is restricted to the middle 99% (Figure III-6A). Any values
that lie outside this finite support are repeatedly resampled until they fall within the
desired range. Two additional, narrower Gaussian distributions (see Figure III-6A) are used
to generate weights centered at 1, when defining the relative sizes and scales of clusters, or
the relative ranges of clusters along the different attributes (see Section III.3.3).
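A minimal sketch of this resampling scheme, relying only on the core randn function, is given below; the constant 2.5758 is the 99.5% quantile of the standard normal, so the retained support covers the middle 99% of the distribution.

    % Draw n values from a zero-mean Gaussian with standard deviation sigma,
    % truncated to its middle 99% by resampling out-of-range values.
    function v = truncatedNormal(n, sigma)
        bound = 2.5758 * sigma;
        v = sigma * randn(n, 1);
        out = abs(v) > bound;
        while any(out)
            v(out) = sigma * randn(nnz(out), 1);   % redraw rejected values
            out = abs(v) > bound;
        end
    end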
Figure III-6: Probability density functions of the distributions
employed by the proposed synthetic dataset generator. All distributions
are mapped to the [–1, 1] interval. Note the different scales of the vertical
axes. A: Truncated Gaussian distributions with different degrees of kurtosis,
as measured by the standard deviation (σ). B: Truncated lognormal
distributions with various degrees of skewness, as measured by the
standard deviation (σ) of the respective underlying Gaussian distributions.
C: Uniform distribution. D: Linear distributions with different ratios (ρ)
between the maximal and minimal probabilities. E: Truncated Laplace
(double exponential) distribution. F: Exponential distributions, truncated to
different degrees. The distributions illustrated in A (dotted line), B–E, and F
(dotted line) are used in the production of clusters and noise. The
distributions depicted in A, C, and with continuous and dashed lines in F are
used in the selection of weights and parameter values.
In order to increase the variability within and between the generated datasets, we
also use values sampled from several other probability distributions that can be accessed
indirectly through the “rand” and “randn” functions. An array of lognormal distributions
with various degrees of skewness are available for the construction of clusters and noise. To
sample values from these distributions, we simply apply the exponential function to
normally distributed numbers. Here, we set the mean of the underlying Gaussian to zero,
and we select the standard deviation within a bounded interval. Figure III-6B illustrates
three such lognormal distributions, having the minimum, average, and maximum degrees of
skewness supported by our methods. As in the case of the Gaussian distributions discussed
above, we truncate the lognormal distributions to the leftmost 99%, in order to ensure
finite support. In addition, a collection of linearly varying distributions with different
degrees of skewness (see Figure III-6D), constructed from the uniform distribution, is also
available for the creation of both clusters and noise. To sample numbers from linear
distributions, we apply a square root transformation to uniformly distributed values.
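Both constructions reduce to one-line transformations of the core generators, as sketched below with example settings for the sample size and the skewness parameter (these values are illustrative, not prescribed by the generator).

    n = 1000; sigma = 0.5;            % example sample size and skewness setting
    logn = exp(sigma * randn(n, 1));  % lognormal via exponentiated Gaussian draws
    lin  = sqrt(rand(n, 1));          % linearly increasing density on [0, 1]

The square root is the inverse cumulative distribution function of the density 2x on [0, 1], which is why it turns uniform draws into linearly distributed values.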
Moreover, we rely on the inverse probability integral transform to generate
exponentially and Laplace distributed values from the uniform distribution. Taking the
negative logarithm of values sampled from the uniform distribution on (0, 1] yields
exponentially distributed numbers with scale parameter 1. Further multiplication of these
values with a randomly chosen sign, either –1 or +1, produces numbers distributed
according to the double-exponential Laplace distribution with mean 0 and scale parameter
1. The exponential and Laplace distributions are truncated to the leftmost 99% (Figure
III-6F) and the middle 99% (Figure III-6E), respectively, and then used in the production of
clusters, as well as for the generation of noise. In addition, several more severely truncated
exponential distributions (e.g. Figure III-6F) are used in the selection of various weights or
parameter values within specified limits (see Section III.3.3).
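As a sketch of these two inverse-transform constructions (the subsequent truncation is omitted, and n is only an example sample size):

    n = 1000;
    expo  = -log(rand(n, 1));              % exponential with scale parameter 1
    signs = 2 * (rand(n, 1) > 0.5) - 1;    % random signs, -1 or +1
    lapl  = signs .* expo;                 % Laplace with mean 0 and scale 1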
3.3. Construction of different types of clusters
Within the proposed synthetic dataset generator, the construction of unimodal
clusters is a process comprising several steps (see Figure III-1). In the first stage, each
cluster is constructed by sampling the coordinates of all objects independently on each
dimension from one of the six supported probability distributions (Section III.3.2). The
distribution used can be the same for all dimensions of a given cluster (e.g. Figure III-7A–F),
or it can be randomly selected for each dimension.
For subspace clusters, only relevant dimensions are included, while their non-relevant dimensions are filled with noise at a later stage (see Section III.2.1). The number of
objects belonging to each cluster is calculated in advance, and depends on a random weight
selected for the cluster from one of the available probability distributions (see Figure III-6A,
C, and F). However, for clusters that will undergo a nonlinear distortion, the initial number
of objects is increased by a factor proportional to the distortion type, in order to offset the
subsequent density correcting down-sampling (Section III.2.2). For convenience, the
sampled values are scaled to the [–1, 1] interval. Consequently, each generated cluster has
equal range on all relevant dimensions after the first step.
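A minimal sketch of this first stage, limited to two of the supported distributions and to illustrative cluster dimensions (the actual generator draws these choices from the configured settings), is given below.

    n = 500; d = 4;                                       % example cluster size
    gens = {@(m) randn(m, 1), @(m) 2 * rand(m, 1) - 1};   % normal or uniform
    X = zeros(n, d);
    for j = 1:d
        g = gens{randi(numel(gens))};                     % distribution for dimension j
        v = g(n);
        X(:, j) = 2 * (v - min(v)) / (max(v) - min(v)) - 1;   % rescale to [-1, 1]
    end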
In the second stage, each cluster is scaled anisotropically to a randomly defined
aspect ratio. The distribution used for sampling the scaling weights of each dimension
depends on the specific shape assigned to the cluster from those supported by our program.
Spheroidal clusters have ranges distributed normally around 2 with standard deviation 0.1.
Ellipsoidal and eccentric clusters also have ranges distributed normally around 2, however
with standard deviation 0.2 and 0.4, respectively (Figure III-6A). To create clusters having a
few long principal axes and the remaining axes short, the target ranges are sampled from
the standard exponential distribution, truncated to an interval corresponding approximately to the median 25% of the distribution (Figure III-6F). With this choice of
minimal and maximal scaling weights, clusters with ratios between the longest and the
shortest principal axes of up to 20 can be produced. Such elongated clusters are ideal
candidates for cylindrical and helicoidal nonlinear distortions outlined in Section III.2.2 (e.g.
Figure III-7H). Correspondingly, to generate clusters having the majority of the principal
axes longer than the average, the scaling weights are obtained by taking the reciprocals of
values sampled from the same truncated exponential distribution. Such flattened clusters
are most appropriate for convex and concave deformations (Figure III-7I). Finally, to obtain
clusters with more general aspect ratios, the scaling weights are sampled from the uniform
distribution on a bounded interval.
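For the normally distributed range weights, this second stage reduces to a per-dimension rescaling; the sketch below assumes a cluster X with range 2 on every dimension and a parameter sigma of roughly 0.1, 0.2, or 0.4 for spheroidal, ellipsoidal, and eccentric shapes, respectively.

    % Scale a cluster with ranges of 2 on every dimension to randomly drawn
    % target ranges centered at 2, yielding the desired shape class (sketch).
    function Y = scaleAnisotropically(X, sigma)
        d = size(X, 2);
        ranges = 2 + sigma * randn(1, d);      % target range per dimension
        Y = bsxfun(@times, X, ranges / 2);     % rescale each coordinate
    end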
The resulting, transformed clusters can take a variety of shapes (e.g. Figure III-7A–F), but they have similar volumes. For increased dataset variability, each cluster is scaled by
an additional factor on all dimensions. Similarly to cluster sizes, the values of these scaling
factors are sampled from one of the six probability distributions available for weight
generation (see Figure III-6A, C, F). If desired, the scaling procedure can be followed by a
nonlinear distortion of specified type and angle (e.g. Figure III-7G).
In the last, optional stage, clusters undergo rotations of various amplitudes and
degrees of obliqueness. To generate random rotations with specific average amplitudes, we
use an approach similar to that described by Anderson et al. (1987). We select a random
subset of pairs of dimensions, and for each selected pair we uniformly choose a rotation
angle and a direction of rotation (clockwise or anticlockwise). The minimal and maximal
allowed angles are chosen such that the expected value of the average rotation angle across
all possible pairs of dimensions is proportional to the prescribed degree of rotation. Thus,
clusters with small degrees of rotation are produced by using few such elementary
rotations with small angles, while clusters with large degrees of rotation are created by
using a large number of elementary rotations with large angles. To generate obliquely
rotated clusters, a proportion of the employed elementary rotations are modified in such a
way that the orthogonality of the involved axes is not preserved. The fraction of altered
rotations and the magnitudes of the changes are chosen such that the average angular
deviation from orthogonality across all pairs of dimensions is proportional to the desired
degree of obliqueness. The sign of the angular deviation is chosen such that the correlation
between the two directions involved is increased by the distortion process. For any given
pair of dimensions and corresponding rotation angle, the largest allowed oblique distortion
is obtained when rotating only one of the axes and keeping the other one fixed.
Figure III-7: Examples of clusters produced by the proposed synthetic
dataset generator. Objects within the clusters are represented by crosses.
Level sets of the corresponding object densities (orange contours) are
calculated using smoothed histograms. A: Rotated cluster with eccentric
ellipsoidal shape and normally distributed objects. B: Obliquely rotated
cluster with ellipsoidal shape and uniformly distributed objects. C:
Spheroidal cluster with objects sampled from a Laplace distribution. D:
Mirrored ellipsoidal cluster with objects sampled from lognormal
distributions. E: Rotated spheroidal cluster with objects sampled from linear
distributions. F: Obliquely rotated cluster with eccentric shape and objects
distributed exponentially. G: Multimodal cluster consisting of several highly
overlapping subclusters with objects distributed normally. H: Elongated
cluster with normally distributed objects, nonlinearly distorted to a helix. I:
Flattened multimodal cluster transformed through a cylindrical distortion.
Similarly to the example shown in G, the subclusters have normally
distributed objects, but they overlap to a smaller degree.
To ensure that clusters subjected to rotations with larger angles consistently appear
more rotated to human observers than clusters subjected to rotations with smaller angles,
we limit the used angles to π/4. Any elementary rotation with a larger angle can be decomposed into a rotation with an angle equal to a multiple of π/2, obtainable via axis reflections and/or permutations, and a rotation with an angle within [–π/4, π/4]. Thus, to ensure that all possible rotations of a given cluster can be achieved, we combine the product of the elementary rotations with a random non-orientation-preserving permutation of the dimensions. The resulting transformation is applied to the cluster objects, yielding the final, rotated cluster (e.g. Figure III-7A, B, E, F).
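The essential step of this construction, composing random elementary rotations into a single d-by-d matrix, can be sketched as follows; the angle limits, the oblique distortions, and the final permutation described above are omitted for brevity, so this is an illustration of the general idea rather than the exact procedure.

    % Compose nPairs random elementary rotations, each acting on a randomly
    % chosen pair of axes with a random signed angle of at most maxAngle.
    function R = randomRotation(d, nPairs, maxAngle)
        R = eye(d);
        for t = 1:nPairs
            ij = randperm(d); i = ij(1); j = ij(2);    % random pair of axes
            a = maxAngle * (2 * rand - 1);             % random signed angle
            G = eye(d);
            G([i j], [i j]) = [cos(a), -sin(a); sin(a), cos(a)];
            R = G * R;                                 % accumulate the rotation
        end
    end

A cluster stored as an n-by-d matrix X would then be rotated as X * R'.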
As mentioned in Section III.2.1, the construction of multimodal clusters is a more
complex process, carried out recursively. Each multimodal cluster is built as a sub-dataset
consisting of several overlapping sub-clusters (Figure III-1). The procedure employed is the
same as for the actual datasets, however with several restrictions. The selection of
subcluster shapes is limited to three alternatives: spheroidal subclusters, with similar
numbers of objects and scaling factors; ellipsoidal subclusters, with variable relative sizes
and scales; and arbitrary subclusters, with uniformly selected relative sizes, scales, and
aspect ratios. Most importantly, the overlap between subclusters must be larger than the
overlap between clusters, in order to be able to distinguish between multimodal clusters
and groups of overlapping unimodal clusters. Additionally, subclusters must exist in the
same subspace if they are not full-dimensional (e.g. Figure III-7I), and cannot undergo
nonlinear transformations, although the resulting multimodal cluster may be subjected to a
curvilinear distortion (e.g. Figure III-7I).
3.4. Computational complexity of the algorithm
Let n be the total number of objects, d the total number of dimensions, and k the total number of clusters. Without loss of generality, assume that the data has no noise objects or dimensions, since they are not involved in more complex procedures such as the rotation of clusters. Also, for simplicity, assume that all clusters are unimodal and full-dimensional. The cumulative computational complexity of the different data generation and rescaling processes, such as creating clusters and noise or normalizing the final dataset, grows linearly with both n and d. The further application of nonlinear distortions is at most of the same order, reached when all clusters are transformed (Section III.3.3), since its cost is dominated by the additional objects that have to be sampled. Generating a random rotation requires a number of operations that grows with the third power of the dimensionality (Anderson et al., 1987), while rotating a cluster requires a number of operations proportional to the number of its objects and to the square of the dimensionality; assuming that all clusters are rotated, the combined cost is the sum of these two contributions over all clusters. Computing the positions of full-dimensional clusters in the data space (Section III.2.3) has a cost that grows quadratically with the number of clusters. Increasing the spatial variability of the dataset by the addition of cluster spacers at this step, as outlined in Section III.2.3, modifies this cost by a proportional factor. Overall, the maximal complexity for constructing datasets containing unimodal, full-dimensional clusters is the sum of these terms, and the same bound remains valid when the dataset contains noise objects and/or dimensions. Correspondingly, the cost of generating a full-dimensional multimodal cluster exceeds that of an equally sized unimodal cluster by a factor proportional to the number of its subclusters, since the subclusters have to be created and placed within the cluster individually. Therefore, the overall complexity for constructing datasets containing combinations of rotated unimodal and multimodal full-dimensional clusters remains of the same order, with an additional factor proportional to the average number of subclusters. If some of the clusters are not full-dimensional, then the modified, slower algorithm for placing subspace clusters in the data space is used (see Section III.2.3). Its average computational complexity grows with the cube of the number of clusters and with the average subspace size l across all clusters; in the worst case, when most clusters are full-dimensional, this cost reaches its maximum, with the full dimensionality d taking the place of l. Correspondingly, the computational complexity of generating a dataset that contains all supported types of clusters is given by the sum of the generation, rotation, and subspace-placement terms. If none of the clusters is rotated, the rotation terms disappear, and if, in addition, all clusters are unimodal and full-dimensional, the complexity is further reduced to that of the basic procedure described above.
4. Experimental validation
The developed generator can produce a wide variety of synthetic datasets. Several
low-dimensional and high-dimensional examples are illustrated in Figures III-8 and III-9.
Clusters can be constructed using several different probability distributions (e.g. Figure
III-7A-F) or combinations thereof (Figure III-8A, F). Moreover, different clusters can have
different sizes, aspect ratios, and orientations (Figure III-8B), or can be distorted into
various nonlinear shapes (Figure III-8D, E). Clusters can be both unimodal as well as
multimodal, the latter consisting of several overlapping subclusters (e.g. Figure III-7G). The
spatial distribution of clusters can range from well-separated (Figure III-8A), through
adjacent (e.g. Figure III-8B) and moderately overlapping (Figure III-8C) to highly
overlapping (Figure III-9D). Some of the clusters can be well-defined only within specific
subspaces, usually in the case of high-dimensional datasets (Figure III-9) but not exclusively
so (e.g. Figure III-8F). The datasets can also contain a variable proportion of noise objects
(Figure III-9A, D) or dimensions (Figure III-9A, B).
Figure III-8: Examples of low-dimensional datasets created with the
proposed synthetic dataset generator. A: Dataset containing 15 well-separated clusters with various probability distributions and volumes
(crosses). The dataset also contains 5% uniformly distributed noise objects
(circles). B: Dataset containing 15 non-overlapping clusters of uniformly
distributed objects (crosses). The clusters have various aspect ratios and
orientations. The dataset also includes 10% Gaussian noise (circles). C:
Dataset containing 15 moderately-overlapping clusters with normally
distributed objects. The clusters have various densities and orientations. D:
Dataset containing unimodal clusters (crosses), multimodal clusters
(squares), and nonlinearly distorted clusters (circles). E: Dataset containing
three-dimensional nonlinear clusters. Such clusters can be deformed into
cylindrical (crosses), convex (circles), or saddle shapes (squares). Black lines
indicate the coordinate axes. F: Dataset containing one full-dimensional
cluster (crosses), three two-dimensional subspace clusters (circles), and one
one-dimensional subspace cluster (squares). All clusters have normal
distributions within their subspaces and uniform distributions on their non-relevant dimensions. Black lines indicate the coordinate axes.
Figure III-9: Examples of high-dimensional datasets produced with the
proposed synthetic dataset generator. Image plots of sample datasets,
with objects shown on rows and dimensions shown on columns. Each
dataset consists of 10000 objects described on 20 dimensions, and contains
10 subspace clusters (separated by black lines). The clusters have normal
distributions on all dimensions, both relevant and non-relevant. A: Dataset
containing well separated clusters with similar subspace dimensionality (3–
7). The dataset includes 20% noise dimensions and 20% noise objects. B:
Dataset containing non-overlapping clusters which share the same subspace.
The dataset includes 20% noise dimensions. C: Dataset containing
moderately overlapping clusters with similar subspace dimensionality (7–
11). D: Dataset containing overlapping clusters with variable subspace
dimensionality (1–11). The dataset includes 20% noise objects.
4.1. Cluster rotations
The first validation experiment assessed the effectiveness of the proposed
implementation of cluster rotations (see Section III.3.3). We constructed 2070 clusters with
different degrees of rotation and obliqueness, and calculated the average absolute
correlation between all possible pairs of dimensions for each cluster. Each cluster consisted
of 1000 objects defined in 5 to 20 dimensions, sampled from normal distributions. The
clusters were scaled to arbitrary aspect ratios (Section III.3.3) and subsequently rotated,
either orthogonally or obliquely. The degrees of rotation were selected uniformly between
0, corresponding to no rotation, and 100, corresponding to the successive application of
two-dimensional rotations with the maximal allowed angle for all possible pairs of dimensions in a given
cluster.
Furthermore, for each cluster, the degree of oblique distortion was set randomly to
one of the three following values: 0 for orthogonal rotations, half of the assigned degree of
rotation for moderately oblique rotations, or equal to the degree of rotation for highly
oblique rotations. Since rotations, in particular oblique ones, introduce linear dependencies
between the coordinate axes, we expected the average correlations to be proportional to
the prescribed degrees of rotation and obliqueness. Thus, to quantify the results, we
calculated the Pearson correlation coefficient between the prescribed degrees of rotation
and the observed average absolute within-cluster correlations for each of the three types of
degrees of obliqueness (N = 690 clusters within each category). Additionally, we fitted
linear regression models without constant terms to each of the three sets of measurements.
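The average absolute correlation between all pairs of dimensions can be computed with the core corrcoef function; the sketch below is a straightforward illustration of this measurement, with X denoting an n-by-d matrix holding one cluster.

    % Average absolute correlation over all pairs of distinct dimensions.
    function r = meanAbsCorrelation(X)
        C = corrcoef(X);                    % d-by-d correlation matrix
        mask = triu(true(size(C)), 1);      % strictly upper-triangular pairs
        r = mean(abs(C(mask)));
    end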
The results (Figure III-10A) indicated that the employed cluster rotation method is
performing as intended. The average within-cluster correlation increased with the
prescribed degree of rotation, reaching maximal values of approximately 0.6 for clusters
with large degrees of rotation and obliqueness. This linear relationship was less
pronounced for clusters undergoing orthogonal rotations, where the
rate of increase in within-cluster correlations diminished at larger degrees of rotation. For
moderately oblique rotations, the increase in within-cluster correlations resulting from the
deviations from orthogonality contributed to the maintenance of a linear association with
the degree of rotation. Correspondingly, the dependence between the average within-cluster correlations
and the prescribed degrees of rotation reached the largest proportionality factor for
clusters undergoing highly oblique rotations. Overall, the within-cluster correlations were found to increase both
with the degree of rotation applied to the cluster, and with the degree of oblique distortion
applied to the rotation, as intended. Interestingly though, for low degrees of rotation there
was virtually no difference between the average within-cluster correlations of clusters with
different degrees of distortion (Figure III-10A). The most likely reason is that, in our
implementation, oblique distortion degrees cannot exceed the actual rotation degrees. Thus,
for low degrees of rotation, the maximal possible deviations from orthogonality are also
small, and hence do not affect the within-cluster correlations to a noticeable degree.
4.2. Cluster overlap
In a second experiment, we tested whether expanding clusters with respect to the
spaces assigned to them by the cluster placing algorithm has the intended effect of altering
the overlap between clusters in a consistent manner. We generated 2070 datasets
consisting of different types of convex clusters scaled to different degrees of overlap, and
measured the resulting overlap between the clusters in each case. Here, we calculated the
actual overlap as the average proportion of objects from each cluster that were located
inside the convex hulls of other clusters. For subspace clusters, only their relevant
dimensions were included in this analysis. To compute the convex hulls of clusters, we used
the MATLAB function “convhulln”, which implements the Quickhull algorithm of Barber et
al. (1996). Because the computational cost of calculating the convex hull of a cluster grows
exponentially with the number of dimensions, we restricted our analysis to small datasets
consisting of 1000 objects grouped in 5 to 15 clusters with subspace dimensionality at most
5. Clusters were either full-dimensional in a space with 1–5 dimensions, or they inhabited
subspaces of dimensionality 1–5 in a 20-dimensional space. In one third of the datasets, the
clusters were unimodal with ellipsoidal shapes and similar sizes and scales, and did not
undergo rotations or nonlinear transformations (see Section III.3.3). In another third of the
datasets, the clusters had more variable sizes, scales, and aspect ratios, and were rotated to
medium degrees. In the remaining one third of the datasets, the clusters could be both
unimodal and multimodal, and were all rotated obliquely. For each dataset, the number of
cluster spacers employed by the cluster placing algorithm for increasing the spatial
variability of the dataset (Section III.2.3) was set randomly to either 0, half of the number of
clusters, or equal to the number of clusters in the dataset. Furthermore, the degree of
overlap of each dataset was selected uniformly between –5, corresponding to clusters
slightly contracted with respect to the spaces assigned by the cluster placing algorithm, and
50, corresponding to clusters expanded two to three times over the assigned space on each
dimension. Within each dataset, the overlap scaling factors assigned to the different clusters
were allowed to vary within a fixed fraction of the average value.
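The overlap measurement itself can be illustrated as follows; the thesis computes convex hulls with convhulln, while the sketch below uses the equivalent Delaunay-based membership test available through delaunayn and tsearchn (X and Y are the object matrices of two clusters, restricted to their relevant dimensions).

    % Proportion of objects of X lying inside the convex hull of Y. Points
    % inside the hull belong to some simplex of the Delaunay triangulation
    % of Y; tsearchn returns NaN for points outside it.
    function frac = fractionInHull(X, Y)
        T = delaunayn(Y);                  % triangulate the second cluster
        idx = tsearchn(Y, T, X);           % locate each object of X
        frac = mean(~isnan(idx));
    end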
The results indicated that our approach was effective in creating datasets with
increased clustering difficulty. The measured average cluster overlap increased linearly
with the prescribed degree of overlap, and the proportionality factor was independent of
the proportion of cluster spacers (Figure III-10B), as well as of the type of the clusters
present in the dataset (data not shown). The correlation between the intended and achieved
degrees of overlap was very strong across the entire sample of 2070 datasets, suggesting that our implementation of cluster overlap is relatively accurate. The
linearity of this relationship was further confirmed by a linear regression analysis. The
resulting regression coefficient had a value of 0.014, while the intercept equaled –0.05. Correspondingly, the observed average proportion of objects from each cluster
lying within the convex hulls of the other clusters in the dataset increased by approximately
7% for every increase of 5 points in the prescribed degree of overlap, up to a maximum of
65% at the largest degree of overlap implemented (Figure III-10B).
Figure III-10: Evaluation of the effectiveness of the implementation of
cluster rotations and cluster overlap. A: Average within-cluster
correlations, in absolute values, for clusters with various degrees of rotation
and oblique distortion. Clusters undergoing orthogonal rotations (purple)
exhibit moderate correlation values. Clusters which undergo moderate
oblique distortions, with angular deviations from orthogonality equal to half
the rotation angles (green), have higher correlations. Rotated clusters with
large degrees of obliqueness, having angular deviations from orthogonality
equal to their rotation angles (orange), show the highest correlations. Note
that for low degrees of rotation there is virtually no difference in the values
of the correlations obtained for the three cluster categories shown. Each
point of the graphs represents an average over 15 clusters with arbitrary
aspect ratios, each containing 1000 objects in 5–25 dimensions. Error bars
represent standard errors of the means. B: Observed cluster overlap as a
function of the prescribed degree of overlap. Datasets with increased
degrees of cluster overlap were obtained by expanding all clusters with a
factor proportional to the desired degree. The actual overlap was calculated
as the average proportion of objects from each cluster that are located
within the convex hulls of other clusters. Note that the effectiveness of the
implementation is maintained regardless of the proportion of cluster
spacers employed to increase the spatial variability of the datasets. Each
point on the graphs represents an average over 6–31 datasets, each
consisting of 1000 objects distributed in 5–15 clusters of various types and
dimensionalities. Error bars denote standard errors of the means.
4.3. Timing performance
To verify the theoretical dependencies of the time requirements on the input
parameters outlined in Section III.3.4, we performed three sets of tests. For the first set of
tests, we constructed 600 different datasets with both the number of clusters and the
number of dimensions set to 20, and the number of objects selected uniformly between
1000 and one million. For the second set of tests, 600 datasets with 10000 objects, 20
clusters, and between 1 and 1000 dimensions were generated. For the third set of tests, we
constructed 600 datasets consisting of 10000 objects in 20 dimensions, distributed into 1–
100 clusters. In each of the three sets of tests, the 600 datasets were further partitioned into
six subsets of 100 datasets. In each of the subsets, at most one of the following advanced
options (see Section III.3.1) was enabled: cluster rotations, subspace clusters, multimodal
clusters, nonlinear clusters, addition of cluster spacers, noise. For each dataset in each test
session, we measured the total time needed for its production. Subsequently, for each
subset of 100 datasets, we assessed the degree of dependence between the required
processing time and the altered parameter, by fitting power functions of the form
t(x) = a·x^b with real-valued coefficients. To confirm the inferred degree of dependence, we subsequently fitted the measurements with polynomial functions of degrees equal to the integer nearest to the exponent b. All tests were run on a computer with a 2.4 GHz dual-core Intel processor and 4 GB of RAM.
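One simple way to estimate such an exponent, shown here purely as an illustration, is an ordinary least-squares fit on the logarithms of the measurements.

    % Fit t = a * x.^b to timing measurements via a linear fit on logarithms.
    % x and t are column vectors of parameter values and measured times.
    function [a, b] = fitPowerLaw(x, t)
        p = polyfit(log(x), log(t), 1);
        b = p(1);                        % exponent of the power function
        a = exp(p(2));                   % multiplicative constant
    end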
The time requirements increased linearly with the number of objects, independently of the types of clusters included in the dataset (Figure III-11A, B). The exponents of the best fitting power functions were very close to 1, ranging from 0.99 for datasets with unimodal clusters to 1.03 for datasets with multimodal clusters with 10 modes each.
The observed linear dependence of the time needed to generate the datasets on the number
of objects was in general of very small amplitude. For simple unimodal clusters defined in
20 dimensions, an increase in the number of objects from 1000 to one million corresponded
to an increase in the required time of approximately 4 seconds. As expected, using only a proportion of the objects in the construction of clusters and treating the remainder as noise had limited impact. However, if all clusters were nonlinearly distorted, the slope increased five-fold, while the intercept remained virtually unchanged. The observed increase in slope was approximately
equal to the implemented increase in the number of objects assigned initially to each cluster
in order to offset the down-sampling associated with nonlinear transformations (see
Section III.2.2). Therefore, the computational cost of the actual application of nonlinear
distortions is dominated by the cost of sampling more objects. By contrast, if all clusters
were multimodal or if they were treated as subspace clusters, there was no variation in the
slope, but the constant terms increased from 0.9 for unimodal clusters to 6.7 and 4.5 (simple linear regressions), respectively. For
multimodal clusters, the increase was caused by the additional steps of combining the
subclusters within each cluster, which depend only on the number and dimensionality of
the subclusters, and not on the number of objects. Correspondingly, for subspace clusters
the additional processing was an effect of using the slower cluster placing algorithm (see
Section III.2.3), which is also independent of the number of objects in the dataset.
Interestingly, for rotated clusters there was also no change in the slope due to the additional
calculations needed to apply the rotations to each object. It is likely that these calculation
steps are dominated by the main process of creating clusters through sampling objects from
different probability distributions, as observed for nonlinear clusters. The constant term did
however increase to 1.9 (simple linear regression), most likely due to the
additional cost of generating rotation matrices for each of the clusters in the dataset.
Figure III-11: Experimental evaluation of time requirements for the
proposed generator of synthetic datasets. A, B: Time requirements
increase linearly as a function of the number of objects included in the
dataset. The observed linear relationship is independent of the type of
clusters present in the dataset and of the presence of noise objects (squares
in B). Each point represents a dataset comprising 20 clusters in 20
dimensions. Note the different scales on the time axis. C, D: The time
requirements increased quadratically with the number of dimensions when
the cluster dimensionality was equal to or proportional with the total
dimensionality, and linearly (circles in D) or super-linearly (squares in D)
when cluster dimensionality was fixed. Each point represents a dataset
consisting of 10000 objects grouped in 20 clusters. Note the different scales.
E, F: The time requirements depended cubically on the number of clusters if
they were subspace clusters (squares in E), and quadratically if all clusters
were full-dimensional. Using cluster spacers to have a less uniform
distribution of clusters within the dataset (circles in E), or including
multimodal clusters (circles and squares in F) increased the time
requirements proportionally, but did not affect the degree of the
relationship. Each point represents a dataset consisting of 1000 objects
defined on 20 dimensions.
The relationship between the required time and the dimensionality of the dataset
depended on the average cluster subspace size (Figure III-11C, D). If the dimensionality of
the clusters was equal to the total number of dimensions, the dependence was quadratic
(Figure III-11C), with the best fitting power functions having exponents ranging between
1.89 and 1.93 (all
). To assess the exact dependencies, we fitted the
measurements with linear regression models with three predictors: a constant term, the
number of dimensions, and the square of the number of dimensions. For unimodal fulldimensional clusters, the leading coefficient had a value of
(linear regression,
), corresponding to construction times ranging from 1s for three-dimensional
datasets to approximately 15s for 100-dimensional datasets comprising of 10000 objects
and 20 clusters. If all clusters were considered to be subspace clusters, and therefore a
more complex cluster placing algorithm was used (see Section III.2.3), the leading
coefficient increased five-fold to
(linear regression,
). Similarly, if all
clusters underwent rotations, the leading coefficient increased to
(linear
regression,
). This increase shows that, for datasets with sufficiently large
dimensionality, the computational cost of rotating clusters is no longer dominated by the
cost of generating the clusters (see above). However, the expected third-order dependency
of time on the number of dimensions arising from the process of generating random
rotations (see Section III.3.4) was not detected. Most likely, as long as the clusters have
many more objects than dimensions, the time needed to calculate the rotation matrices is
much smaller than that needed to apply the rotations to the clusters.
We further examined the dependence of the dataset construction time on the
number of dimensions for different configurations of subspace clusters (Figure III-11D). If a
constant proportion of dimensions were designated as noise, the dependence remained
quadratic, with the best fitting power function having an exponent of 1.57.
Compared to the set of full-dimensional clusters, the leading coefficient of the
corresponding optimal polynomial fit decreased markedly when 95% of the dimensions were noise. Further limiting the
number of relevant dimensions to a maximal value changed the dependence to linear. If the
number of non-noise dimensions was set to 20, the exponent of the best fitting power
function decreased to 1.28. Accordingly, the measurements could be
approximated by a linear fit, with the slope reflecting the increased cost of creating objects of higher dimensionality.
Interestingly, if the cluster dimensionality was set to 20 but the clusters were allowed to
inhabit independent subspaces, the dependence became non-monotonic (see Figure
III-11D). The time requirements increased with the dataset dimensionality, reached a
maximum at approximately 100 dimensions, and then started to decrease asymptotically.
The best fit for the recorded measurements was a rational function with a cubic numerator and a quadratic denominator. This dependence can be decomposed into a linear term, corresponding to the process of generating the objects in the dataset, and an inverse quadratic term, corresponding to the cluster placing procedure (Section III.2.3). In our
implementation, the placement of each subspace cluster requires a number of computations
proportional to the square of the number of clusters that share dimensions with it. The
latter number of clusters is in turn inversely proportional to the total number of dimensions
if the subspace size is fixed.
In general, time requirements increased quadratically with the number of clusters
(Figure III-11E, F). The exponents of the best fitting power functions ranged from 1.44 for
datasets with multimodal clusters to 1.83 for datasets with unimodal clusters placed non-uniformly in the dataset. In the most basic parameter configuration
examined, with simple unimodal clusters distributed compactly in a 20-dimensional space,
the time needed to construct a dataset with 10000 objects increased quadratically from 0.1s
if the objects were grouped in 2 clusters, to 1s for 20 clusters and 15s for 100 clusters. The
corresponding coefficient of the quadratic term calculated via linear regression was of
(
). When cluster spacers were used to distribute the clusters less
uniformly in the data space, the time requirements increased proportionally. If the number
of spacers equaled the number of clusters, and therefore the cluster placing algorithm had
to process twice as many potential clusters, the leading coefficient of the polynomial fit
increased approximately four-fold. By contrast, the coefficient
of the quadratic term did not change significantly if all clusters were multimodal. When
each cluster consisted of 5 subclusters, the leading coefficient remained essentially unchanged, while when each cluster consisted of 10 subclusters the coefficient increased slightly. However, the coefficients of
the linear term changed more strongly, increasing from 0.03 for unimodal clusters to 0.13
and 0.34 for multimodal clusters with 5 and 10 subclusters, respectively. The extents of
these increases were roughly proportional to the square of the average number of
subclusters per cluster, and corresponded to the additional steps of placing the subclusters
within each cluster using the cluster placing algorithm (Section III.2.3). If all clusters were
rotated to large degrees, the results were very similar to those observed when the clusters
were multimodal. For rotated clusters, the coefficient of the linear term increased to 0.08,
while the coefficient of the quadratic term did not change as compared to the value
recorded for simple unimodal clusters. Most probably, this
increase corresponds to the additional steps of constructing rotation matrices for each
cluster. The only exception from the above pattern of quadratic dependence was the set of measurements performed on datasets containing subspace clusters, where the exponent of the best fitting power function had a value of 2.63. Consequently, the measurements in this set were best described by a polynomial function of the third degree, as expected given
that a different, slower algorithm was used for the placement of clusters in the data space.
IV. Applications of cluster analysis in automatic image classification
1. Background
The World Wide Web is home to a constantly increasing collection of images. For
fast access and efficient usage of this wealth of visual data, new indexing and retrieval
methods are needed, particularly since most of the existing approaches do not exploit all
available information. The prevalent approach in current web search engines is to associate
images with text, e.g. file names and captions, other metadata such as manually-added tags,
or keywords extracted from the surrounding articles. This method restricts the set of
searchable images to those associated with text, and can lead to errors if the existing
associations are incomplete or incorrect. To enable more efficient queries based on visual
information, alternative procedures relying on image processing have been proposed. In
semantic search (e.g. Schober et al., 2005), images are annotated automatically by object or
scene recognition algorithms. While this strategy is certainly appealing, progress within the
field of image understanding remains limited, with algorithms being successful only in
restricted settings such as face recognition (e.g. Bolme et al., 2003). Another candidate is the
pure visual search, which relies on low-level image features such as color or textures (e.g.
Hermes et al., 2005). However, it is of limited usefulness in textual queries, unless the
extracted features represent real concepts.
Our goal is to combine the advantages of these different methods into an integrated
framework for automatic image search and retrieval that would support both visual queries
– e.g. finding images similar to, or keywords relevant for, a given image –, as well as
standard text-based queries – e.g. finding images associated with the input keyword, either
directly or through similar images. Consequently, the focal point of the work presented here
is the establishment of a dual-layered linkage system between different images, based on
common keywords and on similar visual features, or, equivalently, the definition of a co-occurrence-based association scheme between keywords and visual features. Such
association rules between textual and visual concepts can indeed be exploited in queries
(Jacobs et al., 2007), but, more importantly, they can be used as an annotation system. Thus,
images whose textual information does not match the represented objects or visual features
can have their associations corrected. Furthermore, images with no captions or tags
can be automatically labeled, and therefore made available for text-based queries.
If a large, heterogeneous set of images with sufficiently accurate textual descriptions
is available for training, such a linkage scheme can be easily constructed (Figure IV-1). In
the first stage, relevant keywords are extracted from the associated text of each training
image by using natural language processing tools. This can be done with specialized concept
detectors such as named entity recognizers (Drozdzynski et al., 2004), or by using term
relevance measures such as TF-IDF (Salton & Buckley, 1988). In the latter case, it may be
necessary, however, to use dimensionality reduction techniques such as singular value
decomposition, which can properly correct for synonymy or multi-word expressions
(Deerwester et al., 1990). In the second stage, visual equivalents of keywords are extracted
from the training images using image processing algorithms. This can be achieved using
specialized detectors, such as face recognition algorithms (e.g. Wiskott et al., 1997), as
shown in our previous work (Jacobs et al., 2008). Alternatively, visual words can be built
through vector quantization of low-level features found by generic detectors (Lowe, 1999),
as described in the visual vocabulary approach of Sivic and Zisserman (2003). In the last
step (see Figure IV-1), the original associations between images and their captions or labels
are translated into links between visual and textual concepts.
Figure IV-1: Construction of a linkage scheme between textual and
visual concepts. Keywords are extracted from captions (step 1) and visual
features are extracted from images (step 2) by using concept detectors, and
clustered into groups of related words or similar features if necessary. The
associations existent in the training data are translated into a set of relations
between textual and visual concepts (step 3) which can be subsequently
employed as an automatic image classifier. Purple cells represent data, while
yellow and blue cells represent algorithms.
2. Methodology and data
2.1. Image classification system
We propose an automatic system for classifying novel images into text-derived
categories, even when there is no co-occurring text available for extrapolation (see Figure
IV-2). The classifier relies on a set of associations between textual concepts and prototypical
visual features inferred from a sufficiently large set of representative images and their
related text, e.g. captions. In the first stage, the most relevant concepts and features are
extracted from the training data (Section IV.2.2). The image captions are analyzed with
natural language processing tools focused on the detection of keywords or concepts; for the
work reported here, we use the named entity recognizer of Drozdzynski et al. (2004).
Similarly, the training images are examined using image understanding techniques that
search for salient visual features; here, we use the scale-invariant feature transform of Lowe
(1999). The extracted features are subsequently grouped into similarity-based clusters
(Section IV.2.3); each such cluster represents a prototypical visual feature, or visual word
(Sivic & Zisserman, 2003). In the next step, the linkage degrees between the resulting
textual concepts and clusters of features are calculated. These represent the central
component of the proposed classifier. The natural, co-occurrence-based associations
between captions and textual concepts are transferred directly to the related images, and
further to the visual features found in these images. The linkage degrees between each
cluster and the different textual concepts are then obtained by aggregating the association
values of the basic visual features belonging to that cluster (Section IV.2.4). Once the
cluster-concept associations are calculated, this process can also be executed in the
opposite direction: by directly transferring the linkage degrees with textual concepts from
the clusters of features to any visual feature assigned to that cluster, and then by averaging
across all features belonging to an image to find out which concepts are associated with that
image (Section IV.2.5). Consequently, the constructed classifying system is able to annotate
new images with text-derived concepts using only their visual features (Figure IV-2, right).
Figure IV-2: Constructing and evaluating the proposed image classifier.
Visual words are obtained by clustering the visual features extracted from
training images (yellow cells). Keywords are extracted from the
corresponding training captions, and then sequentially associated with
images, their visual features, and visual words (blue cells). Conversely, the
associations can be transferred from visual words to new visual features,
and then to the test images from which these features were extracted
(purple cells).
2.2. Data collection and preprocessing
The employed data set consisted of 1054 images and associated text captions
published on the “www.faz.de” and “www.tagesschau.de” German news websites between
2006 and 2007. Such sites feature strongly structured articles that have well defined
relations between images and co-occurring text, and can therefore be parsed automatically.
The images were harvested and preprocessed as described by Jacobs et al. (2007). The SIFT
algorithm (Lowe, 1999) was used to identify and quantify interest points in the images:
local, low-level visual features such as edges, corners, and other regions of variable contrast.
Each such interest point was described by a 128-dimensional vector representing the local
image gradient distribution at eight uniformly-spaced angles, starting from the direction
with the largest gradient. For each orientation, the image gradients were measured at the
nodes of a four-by-four regular grid using pixel differences, after normalization for color
and scale. In total, 174370 interest point descriptors were extracted from the available
image data set, with a median of 110 (range = 27–1336) descriptors per image. In addition,
a named entity detector was used on the image captions (Drozdzynski et al., 2004). More
than 50 distinct person names were identified in the considered set of images, including
personalities from politics (e.g. George W. Bush, Angela Merkel, Tony Blair) and sports (e.g.
Jan Ullrich, Patrik Sinkewitz). Every image was associated with at least one and at most four
names, with 9% of the images having more than one person mentioned in the caption. The
number of images associated with each person ranged from 1 to 98 (median = 18). For
simplicity, the least frequent persons were merged into one category. This procedure
resulted in 51 different concepts: 50 named persons, and “other”. The resulting image-concept associations (Figure IV-3A) were verified manually, and found to be fairly accurate:
compared to the ground truth (Figure IV-3B; data from Jacobs et al., 2008), the proportion
of correct associations, i.e. the precision, was 81%, while the proportion of actual associations recovered, i.e. the recall, was 87%.
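The following Python sketch illustrates how interest point descriptors of this kind can be obtained; it assumes OpenCV's SIFT implementation rather than the exact preprocessing pipeline of Jacobs et al. (2007), and the file name is only a placeholder.

import cv2

# Load one news image in grayscale (placeholder file name).
image = cv2.imread("news_photo.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)
# descriptors is an (n_points, 128) array: for each interest point, gradient
# orientations in 8 angular bins over a 4x4 spatial grid, as described above.
print(descriptors.shape)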
Figure IV-3: Associations between images and concepts within the
employed data set. A: Concepts detected in the image captions. B: Manual
image annotations.
2.3. Visual vocabulary construction
The visual vocabulary is an image representation system proposed by Sivic and
Zisserman (2003), focused on structuring images in such a way that they can be analyzed
with established information retrieval algorithms, such as bag-of-words classifiers (e.g.
Joachims, 1998). The system allows for the coding of images or portions of images as word
frequency vectors, similarly to text documents. To construct the vocabulary, the interest
point descriptors extracted from the set of images available for training are subjected to
cluster analysis. Each resulting cluster is considered to be a prototypical visual feature, or
visual word, and is represented by its average descriptor. The optimal vocabulary size is
however strongly dependent on the problem addressed, and might be considered a
trainable parameter if finding clusters in high-dimensional data would not require
significant processing power (Sivic & Zisserman, 2003). To find the optimal representation
for our setting, we varied the number of clusters in several steps between 100 and 2500,
and tested four different clustering algorithms: the commonly used k-means (Hartigan &
Wong, 1979); k-medians (Mulvey & Crowder, 1979), a similar method that employs
Manhattan distances and is therefore more suitable for high-dimensional data such as the
128-dimensional descriptors produced by the SIFT algorithm; TwoStep cluster analysis, a
fast algorithm from the SPSS software package based on the method of Zhang et al. (1997);
and the fast projection-based partitioning algorithm described in Chapter II.
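A minimal sketch of the visual vocabulary idea is given below, assuming the training descriptors are stacked in a single array; scikit-learn's mini-batch k-means stands in here for the four algorithms compared in this section.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(all_descriptors, n_words=100, seed=0):
    # Cluster the 128-dimensional descriptors; cluster centers act as visual words.
    vocabulary = MiniBatchKMeans(n_clusters=n_words, random_state=seed, n_init=3)
    vocabulary.fit(all_descriptors)
    return vocabulary

def image_to_histogram(image_descriptors, vocabulary):
    # Code an image as relative visual-word frequencies, as in a bag-of-words model.
    words = vocabulary.predict(image_descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / hist.sum()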
2.4. Forward concept propagation
Taking advantage of the additional information provided by the image captions, we
extended the visual vocabulary framework by calculating linkage degrees between visual
words and the concepts extracted by the named entity recognizer (Figure IV-2, left). Firstly,
the natural associations between each concept and the captions it was found in were
transferred directly to the corresponding training images, and further to all visual features
found in those images. Secondly, we calculated the average relative frequencies of the
different concepts within each cluster of visual features, as well as across the entire training
set. Lastly, the association probabilities between each feature cluster and each textual
concept were calculated as a function of the corresponding local and global concept
frequencies. For this purpose, we defined and implemented four different calculation
methods (see Figure IV-4). In general, pairs of prototypical visual features and textual
concepts with local frequencies similar to the global ones were assigned association
probabilities near 0.5, pairs with local frequencies less than one half of the global ones
received linkage degrees close to 0, while pairs with local frequencies more than twice as
large as the global ones had linkage scores close to 1.
The first method (Figure IV-4A) was a heuristic principle that was observed to
perform well during preliminary testing (Ilies et al., 2009). For each cluster-concept pair, it
truncated the ratio between the local and global concept frequencies to the [0.5, 1.3]
interval, then simply mapped the result linearly to the interval [0, 1]. The second method relied on a logistic sigmoid transform (Figure IV-4B): it applied a logistic function to the ratio between the local and global frequencies of each examined pair, with a trainable steepness coefficient. The third method (Figure IV-4C) employed a rational sigmoid-like transformation of the considered frequency ratios, with a trainable exponent controlling the steepness.
The fourth method (Figure IV-4D) was a statistical approach based on Pearson’s chi-square
test, the only criterion to take into account the sizes of the feature clusters. It first calculated
the significance level of obtaining the local concept frequency, assuming that the expected
rate was the global frequency of that concept. The corresponding linkage degree was then set to a value below 0.5 if the local frequency was smaller than the global one, and to a value above 0.5 otherwise, with more significant differences mapped further away from 0.5.
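The sketch below restates the four mappings in code form, applied to the ratio r between the local and global frequencies of a concept; the steepness values and the exact chi-square variant are illustrative, as the corresponding constants are not all fixed by the text above.

import numpy as np
from scipy.stats import chi2

def linkage_heuristic(r):
    # Truncate r to [0.5, 1.3], then map that interval linearly onto [0, 1].
    return (np.clip(r, 0.5, 1.3) - 0.5) / 0.8

def linkage_sigmoid(r, steepness=5.0):
    # Logistic transform of the frequency ratio, centered at r = 1.
    return 1.0 / (1.0 + np.exp(-steepness * (r - 1.0)))

def linkage_rational(r, exponent=3.0):
    # Rational sigmoid-like transform of the frequency ratio.
    return r ** exponent / (1.0 + r ** exponent)

def linkage_chi_square(local_count, cluster_size, global_freq):
    # Chi-square goodness-of-fit of the local count against the global rate;
    # the significance level pushes the score away from 0.5 (illustrative variant).
    expected = np.array([global_freq, 1.0 - global_freq]) * cluster_size
    observed = np.array([local_count, cluster_size - local_count])
    p = chi2.sf(((observed - expected) ** 2 / expected).sum(), df=1)
    return 0.5 * p if local_count < expected[0] else 1.0 - 0.5 * p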
Figure IV-4: Proposed methods for calculating the association degrees
between clusters of visual features and textual concepts. A: Linear,
heuristic-based criterion. B: Sigmoid transform. C: Rational transform. D:
Statistical criterion using the chi-square test. All methods take into account
the relative frequencies of the concepts over the entire set of visual features
(horizontal axis), as well as within each cluster (vertical axis). Note that the
scale on both axes is logarithmic.
2.5. Reverse concept propagation
The procedure outlined above for transferring associations with textual concepts
from images to visual words could be easily adapted for execution in the opposite direction,
therefore providing the desired classifier functionality. For every image from the test set,
the interest point descriptors extracted by the visual feature detecting algorithm were
mapped to the feature cluster with the nearest average descriptor. The metric employed
depended on the clustering algorithm used for the construction of the visual vocabulary:
Euclidean distance for k-means, Manhattan distance for k-medians, and log-likelihood for
TwoStep clustering (Section IV.2.3). For cluster systems constructed with the partitioning
method described in Chapter II, the test descriptors were assigned to their clusters using
the decision trees provided by the algorithm. Each test visual feature inherited the linkage
degrees from the most similar prototypical feature, i.e. from the cluster it was assigned to.
These values were subsequently averaged across all visual features belonging to each test
image, yielding association probabilities between that image and the different concepts. To
obtain crisp associations, the resulting averages were compared against a minimal threshold.
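A minimal sketch of this reverse propagation step, assuming Euclidean nearest-center assignment (as used for k-means vocabularies), a precomputed matrix of cluster-concept linkage degrees, and an illustrative threshold value.

import numpy as np

def annotate_image(descriptors, word_centers, linkage, threshold=0.5):
    # descriptors: (n, 128) test features; word_centers: (k, 128) visual words;
    # linkage: (k, n_concepts) cluster-concept association degrees.
    dists = np.linalg.norm(descriptors[:, None, :] - word_centers[None, :, :], axis=2)
    words = dists.argmin(axis=1)                # nearest visual word per descriptor
    image_scores = linkage[words].mean(axis=0)  # average inherited linkage degrees
    return image_scores >= threshold            # crisp image-concept associations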
2.6. Algorithm validation
To test the classifier outlined above, as well as its underlying concept propagation
techniques, we employed a hybrid cross-validation design. The available data set was
partitioned into six random subsets of similar sizes (range = 163–197). Images were
distributed to the subsets in a stratified manner, based on their ground-truth associations
with the 51 concepts (Figure IV-5). The first five subsets were employed in a standard five-fold cross-validation scheme, while the sixth was reserved for testing. Thus, for each of the
examined classifier variants, e.g. using different visual vocabularies (Section IV.2.3) or
different concept propagation strategies (Section IV.2.4), five separate test runs were
performed. In each of the runs, four of the data subsets, or approximately two thirds of the
images and their associated text, were used for training, while the remaining one third of
images was used for testing.
The number of trainable parameters depended on the forward concept propagation
method employed by the examined classifier. If using the heuristic or the chi-square criteria, the only trainable parameter was the minimal probability threshold applied to the associations between test images and textual concepts (Section IV.2.5), with possible
values between 0.5 and 1. If relying on the sigmoid or rational transforms, then the steepness coefficient of the respective function (Section IV.2.4) was also trainable, within a bounded support interval. For each run, the values of the parameters were chosen such that the F1 score, the harmonic mean of precision and recall, was maximized on the
training images, either for each concept separately, or for all concepts simultaneously. Since
the F1 score is a discrete-valued function, a fact especially noticeable on the relatively small
sets of images used in this study, we employed an optimization procedure that can handle
discontinuities – the simplex search method of Lagarias et al. (1998). To ensure that the
algorithm always reached the global optimum point, it was initialized from the best position
found on an equidistant grid with 25 points per dimension, and covering the parameter
ranges. The performance of the resulting classifier was subsequently measured by
calculating the overall F1 score on the corresponding set of test images.
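A sketch of this threshold-training step under the stated assumptions: the F1 computation and the simplex search are delegated to scikit-learn and SciPy, the grid mirrors the 25-point initialization described above, and the input arrays are illustrative names for the per-image association probabilities and ground-truth labels of one concept.

import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import f1_score

def train_threshold(train_scores, train_labels):
    def neg_f1(theta):
        predictions = (train_scores >= theta[0]).astype(int)
        return -f1_score(train_labels, predictions, zero_division=0)

    # Initialize from the best point on an equidistant grid over [0.5, 1].
    grid = np.linspace(0.5, 1.0, 25)
    start = grid[np.argmin([neg_f1([t]) for t in grid])]
    # The Nelder-Mead simplex search copes with the discontinuities of the F1 score.
    result = minimize(neg_f1, x0=[start], method="Nelder-Mead")
    return float(result.x[0])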
Figure IV-5: Concept frequencies within the cross-validation subsets.
The available images were distributed into six stratified subsets of similar
sizes, based on their ground-truth concept associations. A–F: The number of
images (vertical axis) from the corresponding subset which were associated
to each of the 51 concepts (horizontal axis). Subsets A–E (purple) were
employed in a five-fold cross-validation scheme, while subset F (green) was
used only for testing.
3. Experimental results
3.1. Associations between visual words and concepts
Several preliminary tests were conducted in order to assess whether the visual
vocabulary approach is applicable to our data. For the first test, we considered the
simplified task of discriminating between two categories. To maximize the chances of
observing significant effects, we selected two persons that would differ in both gender and
clothing: the politician Angela Merkel, and the cyclist Patrik Sinkewitz. Consequently, we
restricted our analysis to the interest-point descriptors extracted from images associated
with either of the persons. We partitioned the reduced set of descriptors into 100 clusters
with the k-means algorithm. Within each resulting cluster, we calculated the proportions of
descriptors associated with either person. We then compared the obtained within-cluster
person frequencies to the overall frequencies using the chi-square test. The differences
were significant for 6 (16) of the clusters (with and without Bonferroni
correction for multiple comparisons, respectively), thus showing that it could be possible to
distinguish between persons by using only the visual information stored in interest-point
descriptors. In order to verify that the above observations hold in a more general context,
we repeated the test using an extended dataset, consisting of all descriptors found in images
associated with the most frequent 13 persons. As above, we partitioned the selected set of
descriptors into 100 clusters using the k-means algorithm, and measured the within-cluster
and global frequencies of each of the 13 persons. The observed person frequency
distributions were significantly different from the global ones for 86 (96) out of the 100
clusters (with and without Bonferroni correction for multiple comparisons,
respectively). Essentially, almost every cluster exhibited a bias towards one or several of
the 13 persons considered for this preliminary test.
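A sketch of the within-cluster frequency comparison used in this preliminary analysis, with hypothetical counts; SciPy's chi-square goodness-of-fit test stands in for the test described above.

import numpy as np
from scipy.stats import chisquare

def cluster_bias_test(cluster_counts, global_freqs, n_clusters=100, alpha=0.05):
    # cluster_counts: observed per-person counts within one descriptor cluster;
    # global_freqs: overall relative frequencies of the same persons.
    expected = np.asarray(global_freqs) * np.sum(cluster_counts)
    stat, p = chisquare(cluster_counts, f_exp=expected)
    return p < alpha / n_clusters, p   # significance with Bonferroni correction

# Hypothetical two-person example (e.g. Merkel vs. Sinkewitz descriptor counts):
significant, p = cluster_bias_test([140, 60], [0.55, 0.45])
print(significant, p)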
3.2. Differences between classification procedures
The first experiment was designed to assess the effectiveness of the different
classification procedures that were developed. For each of the five cross-validation sets, we
clustered the visual features found in the training images into a simple vocabulary of 100
visual words using the k-means algorithm. For classifier construction, we employed both
ground-truth and caption-based image-concept associations, in combination with each of
the four forward concept propagation methods outlined in Section IV.2.4. In each case, the
parameter values were optimized locally for each individual concept, globally for all
concepts simultaneously, or were fixed to specific values based on the results of preliminary
tests. For each cross-validation set and each classification procedure, we computed the F1
scores of the concept associations of the training and the test sets of images with respect to
the ground-truth. The obtained values were compared through a repeated-measures
ANOVA, with three between-subjects factors – type of training associations, forward
concept propagation method, and parameter optimization strategy –, and one within-subjects factor – training vs. test images.
Interestingly, there was only a modest difference in performance between classifiers
using the different concept propagation methods, indicating that the exact formula of the probability calculation function was actually less
important than simply having the correct monotonicity. Consequently, the results of the
four concept propagation methods were aggregated (see Figure IV-6). Classifiers
constructed from ground-truth associations were better than those relying on the image
captions. This result was expected, considering that
the captions were only approximately 84% correct (Section IV.2.2). The difference between
the two categories of classifiers was however less prominent when measuring the accuracy
on test images (F1 scores of 30% and 28%, respectively) than when measuring the fitting of
the training data (F1 scores of 49% and 44%, respectively; Wilks’ lambda test). Classifiers that optimized parameters for each concept
separately were better than those that optimized the parameters globally, which were in
turn better than those with fixed parameters (with Scheffé correction for multiple comparisons). This effect was larger on the training images
(F1 scores of 53%, 44%, and 42%, respectively) than on the test ones (F1 scores of 33%,
27%, and 26%, respectively; Wilks’ lambda test). Overall,
the classifiers that performed best were those using ground-truth associations for training,
and having their parameters optimized for each concept separately. They achieved an
average F1 score of 56% on the training images, and were able to predict the associations of
test images with 34% accuracy.
Figure IV-6: Performance of different classification procedures, with a
fixed visual vocabulary of 100 feature clusters. Classifiers constructed
from ground-truth associations were better than those constructed from
caption-based associations, both on training (green) and on test images
(yellow). Moreover, optimizing training parameters for each concept
separately yielded better results than doing so globally or setting the
parameters to fixed values. Values indicate average F1 scores (N = 20
measurements per data point). Error bars represent standard deviations.
3.3. Differences between visual vocabularies
In the second experiment, we investigated the effects of the size and structure of the
visual vocabulary (see Section IV.2.3) on the efficiency of the proposed image classification
system. We fixed the training data to the more realistic caption-based associations, and the
parameter optimization strategy to the method that performed best in the previous
experiment, i.e. choosing the optimal values for each concept separately. We then varied the
clustering algorithm and the number of clusters.
In the first stage, we complemented the 100-cluster vocabularies used in the first
experiment with cluster systems of the same size produced by the k-medians and TwoStep
algorithms. While the resulting classifiers performed rather similarly (see Figure IV-7A), k-means and k-medians were markedly slower than TwoStep: the average processing times per cross-validation set were approximately 6 hours, 10 hours, and 10 minutes, respectively, on a computer with a 2 GHz dual-core Intel processor and 2 GB of RAM (see
Figure IV-7B). Therefore, we used only the TwoStep method in the construction of larger
vocabularies, having 500, 1000, and approximately 1900 clusters. For each of these six
clustering systems and for each cross-validation set, we constructed four different
classifiers, using the four different concept propagation methods outlined in Section IV.2.4.
We subsequently calculated the F1 scores of each classifier on both training and test images.
The resulting values were analyzed using a repeated-measures ANOVA test with two
between-subjects factors – clustering system, and forward concept propagation method –,
and one within-subject factor – training vs. test images.
The results (Figure IV-7A; values aggregated over cross-validation sets and concept
propagation methods) showed that the performance of the classifiers increased
significantly with the number of clusters, while the employed algorithm had only a minimal
effect (Scheffé post-hoc comparisons). The average F1 scores improved from 50-51% on training images and 27-32% on test images for the 100-cluster vocabularies, up to 83% and 65%, respectively, for the 1900-cluster vocabulary.
This trend was paralleled by the processing times needed for the construction of the
TwoStep clustering systems (Figure IV-7B), which increased from 10 minutes per dataset at
100 clusters to 2 hours at 1900 clusters. These values remained below the levels registered
for k-means and k-medians in the initial 100-cluster setting. However, the time
requirements of assigning the test visual features to their nearest clusters had similar
values (Figure IV-7B), suggesting that using a complex, log-likelihood based distance
measure may not be particularly efficient in a real application. Nevertheless, the TwoStep
algorithm proved to be able to prevent over-fitting the training data. In the 1900-cluster
setup, the algorithm was not able to generate the desired number of 2000 clusters for all
cross-validation sets, due to the insufficient variance of the corresponding sets of visual
feature descriptors. The number of clusters produced ranged between 1850 and 2000, with
an average of 1920. With these visual vocabularies, the training, caption-based associations
were reproducible with an accuracy of 97%. The large number of clusters (1900) needed
for the optimal representation of the 51 classes is most likely an effect of the complexity of
the task. Since each person is defined by a combination of visual features, the corresponding
clusters are probably highly non-linear. Such clusters are subdivided by the clustering
algorithms used here, with the number of fragments proportional to the dimensionality.
3.4. Differences between classifiers with an optimal visual vocabulary
In the third experiment, we reevaluated all the proposed classification procedures,
using the optimal visual vocabulary size found in the second experiment (Section IV.3.3).
For each of the five cross-validation sets, we used the maximal vocabularies of between
1850 and 2000 visual words, which were previously constructed using the TwoStep
algorithm. As in the first experiment (Section IV.3.2), we examined every possible
combination of training association type, concept propagation method, and parameter
optimization strategy. We then computed the F1 scores for training and test images for each
such classifier and each cross-validation set, and analyzed the results using a repeated-measures ANOVA test. Similarly to the first experiment, the differences between the results
of classifiers using the various concept propagation procedures were very small (within
2%), even if statistically significant. Therefore, the reported values
were averaged over the cross-validation sets and over the concept propagation procedures.
Figure IV-7: Assessment of different visual vocabularies. All classifiers
were constructed from caption-based associations, and used parameters
optimized for each concept separately. A: The performance, as measured by
the F1 scores on training (green) and test images (yellow), increased with
the number of clusters, but was not affected by the algorithm used. B: The
time necessary to construct the clusters (green) and to assign test objects
(yellow) followed a similar trend, although the clustering algorithm had a
greater impact. Note the logarithmic scale of the time axis. All values are
shown as means ± standard deviations (N = 20 measurements per data
point).
Figure IV-8: Performance of different classification procedures, with an
optimal visual vocabulary of 1900 feature clusters. The best classifiers
(left-most columns) were able to fit the training data almost perfectly (F1
score of 97%), and to predict the associations of test images with 70%
accuracy. In general, classifiers constructed from ground-truth associations
were better than those constructed from caption-based associations, but less
so on test images (yellow) than on training images (green). Optimizing
training parameters for each concept separately was only marginally better
than doing so globally or using fixed parameters. Values indicate means ±
standard deviations (N = 20 measurements per data point).
The results (Figure IV-8) followed a similar pattern as in the first experiment,
although the recorded F1 scores were noticeably larger. All classifiers fitted their training
data almost perfectly, but since the initial caption-based associations were only 84%
accurate, classifiers constructed using ground-truth associations performed better. Nonetheless, this difference was significantly smaller
on test images (average F1 scores of 69% if using ground-truth associations for training, and
65% if using caption-based associations) than on training images (F1 scores of 95% and
82%, respectively; Wilks’ lambda test). The best
parameter optimization strategy was choosing values for each concept separately (F1 scores
of 90% at training and 68% at testing), followed by choosing values globally (F1 scores of
89% and 68%, respectively), while using predefined values resulted in the worst performance (F1 scores of 88% and 66%, respectively; Scheffé post-hoc test). The differences between parameter optimization strategies were however
diminished within test images (Wilks’ lambda test), as well as in classifiers constructed from caption-based concept associations. In both categories, there was no difference in accuracy between the local
and global parameter optimization strategies. For example, in the setup most similar to a
real application, caption-based classifiers that were trained globally achieved an average
accuracy of 66% on test images, while those trained for each concept separately had an
accuracy of 65.5%. In comparison, the best performing classifiers – based on ground-truth
associations, and with parameters optimized for each concept separately – reached an
average F1 score of 70%.
3.5. Differences between visual vocabularies with optimal size
In the fourth experiment, we reassessed the influence of the structure of the visual
vocabulary on the performance of the proposed image classifier, while keeping the number
of visual words close to the optimal value found in the second experiment (see Section
IV.3.3). We fixed the training data to the more realistic caption-based associations, and
limited the parameter optimization strategies to those with trainable parameters. We then
complemented the 1900-cluster vocabularies constructed with the TwoStep algorithm with
cluster systems of similar size produced with the fast projection-based partitioning method
of Ilies and Wilhelm (2010). We used a weak cluster split criterion and allowed the
algorithm to successively divide all clusters with more than 125 descriptors. The resulting
clustering systems consisted of 1890 to 1960 clusters, with an average of 1935 over the five
cross-validation sets. Compared to when using the TwoStep algorithm, the process was
significantly faster, taking less than one minute per cross-validation set. To further confirm
that the vocabulary sizes suggested by TwoStep are indeed optimal, we constructed another
set of vocabularies with increased size by using the same partitioning method. By
decreasing the maximal cluster size to 100 descriptors, we constructed vocabularies of sizes
ranging from 2320 to 2410 clusters, with an average of 2380. For each of these three
clustering systems and for each cross-validation set, we constructed eight different
classifiers, using the four different concept propagation methods and with the parameters
optimized either locally or globally. We subsequently calculated the F1 scores of each
classifier on both training and test images. The resulting values were analyzed using a
repeated-measures ANOVA test with three between-subjects factors – clustering system, concept propagation method, and parameter optimization strategy –, and one within-subject factor – training vs. test images.
The results (Figure IV-9; values aggregated over concept propagation methods and
cross-validation sets) showed that using different clustering procedures for constructing
the visual vocabulary has a limited effect on the performance of the image classifier if the vocabularies
are of optimal size. The average F1 scores recorded in the two 1900-cluster settings were
roughly equal, with values of 83% for training images and 66% for test images. Classifiers
based on visual vocabularies of 2400 clusters showed a minor but significant increase in
performance, reaching F1 scores of 83% on training images and 67% on test images (Scheffé post-hoc test). This result confirms that, for the
analyzed cross-validation datasets, the ideal visual vocabulary size is approximately 2000,
and that further increasing the number of visual words may only lead to the over-fitting of
the training data. Furthermore, as also observed in the previous experiment (see Section
IV.3.4), the differences between training strategies are greatly diminished for large visual
vocabularies. Optimizing parameters for each concept separately rather than globally had
no effect on the performance on test images (average F1 score of 66% for both strategies),
and minimally increased the fitting of the training images (average F1 scores of 82.5% if
optimizing globally and 83% if optimizing locally).
Figure IV-9: Assessment of different visual vocabularies with large
numbers of feature clusters. All classifiers were constructed from caption-based associations. There was no difference between classifiers based on
vocabularies with the optimal size of 1900 words that were constructed
with different clustering algorithms. Further increasing the number of
clusters had a minimal effect on the performance. Optimizing training
parameters for each concept separately was only marginally better than
doing so globally. Values indicate means ± standard deviations (N = 20
measurements per data point).
3.6. Summary
The experiments presented here indicated that the developed image classification
procedure is feasible, being able to annotate new images with textual concepts with an
accuracy of up to 70% under optimal training conditions. The most influential factor proved
to be the size of the visual vocabulary. Classifiers constructed with an optimal number of
visual words made approximately half as many errors as those having an arbitrarily
selected number (Section IV.3.3). By contrast, the differences in performance between
classifiers built with several different concept propagation functions and following different
training strategies were of small amplitude. The differences in performance were even
smaller when the number of visual words was appropriately representing the variability of
the set of visual features (Figure IV-8).
In this latter setup (Sections IV.3.4, IV.3.5), the F1 scores measured on the sets of test
images did not differ between the classifiers optimized for each concept separately and
those trained for all concepts simultaneously. The effect of using the less accurate caption-based associations as training data was also significantly diminished (Figure IV-8). These
results are especially relevant in the context of a real application, given that more complex
training procedures require significantly more processing resources (Figure IV-10). A
reliable image classifier can thus be built more efficiently by using a simpler concept
propagation procedure, such as the statistical chi-square criterion, and by optimizing the
association probability thresholds globally rather than for each concept separately. If the
caption-based concept associations are known to be sufficiently accurate, and if the
clustering algorithm chosen for building the visual vocabulary can select the number of
clusters automatically, as e.g. TwoStep or the algorithm described in Chapter II.2, then the
entire classifier construction process can be executed without external input.
Figure IV-10: Timing differences between different training strategies.
Classifiers with two parameters were markedly slower to train than those
with only one parameter. Furthermore, optimizing the parameters for each
concept separately necessitated more time than doing so globally, or than
setting the parameters to fixed values. By contrast, the differences between
classifiers with large visual vocabularies (green) and those with small
vocabularies (yellow) were of smaller amplitude. Note the logarithmic scale
of the time axis. Values indicate means ± standard deviations (N = 40
measurements per data point).
V. Discussion
1. Cluster analysis for high-dimensional datasets
In Chapter II, we described a novel algorithm for partitioning large, high-dimensional data, combining an innovative feature selection technique and fast correlation and density estimators into a projection pursuit approach. Results indicate that our method can quickly and efficiently recover groups of objects that are well defined (i.e. separable from
the other clusters) in their corresponding subspace. The method’s performance does not
decrease if different groups share dimensions. Clusters that are clearly separated from the
other objects in their subspace or are lying in larger subspaces will typically be extracted
more accurately. The basis for this effect is that the quality of the separation on the high
variance projections increases with the number of relevant dimensions employed in the
PCA. Considering only a subset of the associated dimensions may impede the accurate
recovery of a cluster, as such lower-dimensional projections may overlap with the rest of
the data. This suggests that redundant attributes should not necessarily be discarded, contrary to what is recommended in the literature (e.g. Becher et al., 2000; Hennig, 2004).
Nonetheless, feature selection is necessary when working in a high-dimensional
context if aiming for low processing times and memory requirements, especially if
employing PCA. The implemented feature selection rule (Section II.2.2) guarantees that the
number of dimensions included in the analysis is proportional with the average cluster
subspace dimensionality and the number of subclusters, rather than the total number of
attributes. This “effective dimensionality” term decreases rapidly when descending in the
cluster tree, and hence the algorithm may need to calculate a large correlation matrix only
for the first several steps. Overall, the time and memory requirements for calculating
correlations are very low (Section II.4.4). This is especially important if analyzing datasets
with high dimensionality compared to the number of objects (e.g. gene expression data),
where even if the datasets fit into memory, attempting to perform a PCA on the full set of
attributes may generate out-of-memory errors.
Importantly, our method does not require the user to provide the number of
clusters. If no parameters are specified, the algorithm will produce a cut through the cluster
tree at a level defined by the default split score threshold. This constitutes a preview of the
structure in the data, and will typically contain several large clusters with visible
characteristics (e.g. narrow value ranges on some dimensions), and a number of
considerably smaller clusters representing grid cells with sparse populations (e.g. Figure
II-7). Such small clusters could be either discarded as noise (as suggested by Hinneburg &
Keim, 1999), or retained as expressions of interesting patterns in the data (e.g. subgroups of
objects with additional internal cohesion). In the former case, a second analysis of the data
with the inferred parameters (optimal number of clusters, and minimal cluster size) would
yield a superior solution. Alternatively, the production of small clusters could be inhibited
by modifying the scoring system such that asymmetric cases are penalized more heavily,
e.g. by using the minimum of the excess masses (Stuetzle & Nugent, 2010).
In addition, the order of the objects has no disruptive effect on performance, since
object sampling is randomized. The algorithm makes no prior assumptions on the object
index. Even if the ordering may expose data structures (e.g. if all objects in a cluster have
consecutive indices), it does not affect the final solution. Furthermore, since it is based on
correlations, our method is invariant to scaling: the data can have different ranges on the
various attributes with no impact on the solution provided. On the downside, since our
algorithm relies on hyper-planes for partitioning the data, it is intrinsically linear, and
cannot separate interlocked clusters. Such clusters will however be subdivided into smaller
groups (Figure II-6C), and hence could be recovered at post-processing, for example by
allowing neighboring groups with continuous subspace densities to merge (Hinneburg &
Keim, 1999). Alternatively, the algorithm could be modified to incorporate non-linear
projections that could separate clusters better, e.g. kernel PCA (Schölkopf et al., 1998) or
other support vector machine techniques. Such additional processing steps would of course
be rather costly, and may not offer many advantages: in high-dimensional spaces, even if
clusters of such irregular shapes exist, they are more difficult to interpret, and
decomposition into smaller groups that are easier to describe may be preferable.
2. Generation of synthetic datasets
In Chapter III, we described a new method for constructing synthetic datasets with
clusters. Compared to previously developed methods, our procedure can generate a wider
variety of datasets, featuring clusters of different types and with different degrees of
separation or overlap (e.g. Figure III-8). Like most other procedures, it permits the addition of
noise objects and dimensions. However, our method is one of the few allowing for the usage
of probability distributions other than normal and uniform in the construction of clusters or
noise (other examples are Waller et al., 1999, Steinley & Henson, 2005), as well as one of the
few allowing for the presence of subspace clusters (e.g. Zaït & Messatfa, 1997, Procopiuc et
al., 2002). To our knowledge, it is the first dataset generator to feature multimodal clusters,
nonlinearly distorted clusters, and obliquely rotated clusters (Figure III-6).
As a trade-off, we rely on a simpler definition of cluster overlap (Section III.2.3), compared to other methods using more precise implementations (e.g.
Maitra & Melnykov, 2010; Steinley & Henson, 2005). Nevertheless, our approach is effective
in generating datasets with specific degrees of cluster overlap (see Figure III-10B).
Moreover, it permits the precise definition of the degrees of cluster separation for datasets
or parts of datasets with non-overlapping clusters (as in Qiu & Joe, 2006).
The proposed generator of synthetic datasets has relatively low computational
complexity (see Section III.3.4). The time necessary to construct a dataset depends linearly
on the number of objects, at most quadratically on the number of dimensions, and at most
cubically on the number of clusters (Figure III-11). Large datasets, consisting of thousands
of objects defined in dozens of dimensions and grouped in dozens of clusters, can be
generated in less than one minute on a personal computer. As an additional advantage, our
generator is easy to use: all input parameters are accessible through an intuitive graphical
interface (Figure III-5), and groups of related advanced options can be enabled and disabled
concurrently. Considering the short dataset generation times and the variety of cluster
configurations which can be produced, our method should prove an excellent tool for the
evaluation and calibration of both traditional (e.g. k-means, Hartigan & Wong, 1979) and
modern clustering algorithms (e.g. Zhang et al., 1997, Procopiuc et al., 2002, Ilies & Wilhelm,
2010). Additionally, it may be used for assessing the robustness of techniques used for
determining the number of clusters in a dataset (e.g. Atlas & Overall, 1994) in the presence
of various proportions of noise and for clusters with different degrees of separation or
overlap.
3. Automatic image classification
In Chapter IV, we described a novel system for classifying images, which relies on a
linkage scheme between keywords found in file captions and representative visual features
appearing in the actual images (Section IV.2.1). We first presented a framework for
propagating associations with textual concepts from the set of training images to the visual
vocabulary of prototypical features, and then to the images from the test set (Sections
IV.2.4–IV.2.5). Subsequently, we employed the developed methodology in a complex person
classification task (Sections IV.3.2–IV.3.4) conducted on a non-standardized, heterogeneous
data set of images harvested from news websites (Section IV.2.2). The top classifiers were
highly effective, with almost perfect characterization and no over-fitting of the training data
(average F1 score of 97%), and could be used to annotate novel images with 70% accuracy.
While the results may have been affected by the presence of duplicate images, a likely
occurrence in such a realistic setting, many of the copies found had different scales, and
therefore, from an image processing point of view, they were different images, with
different sets of detectable and quantifiable visual features.
In general, classifiers that optimized the minimal probability thresholds applied to
image-concept associations for each concept separately performed better than those that
used a global optimization strategy (Figure IV-6), but they took longer to construct (Figure
IV-10). The increased effectiveness of locally trained classifiers was likely an effect of the
different concepts being represented to different extents in the data set. Indeed, during
preliminary testing, the optimal association probability thresholds appeared to be
moderately correlated to the global concept frequencies. Interestingly, the differences
between parameter optimization strategies diminished with increasing the size of the
vocabulary (Figure IV-8). Presumably, the impact of having a visual vocabulary of
appropriate size is so high (Figure IV-7A) that the influence of the other parameters is minimized when the vocabulary size is appropriate, and accentuated otherwise. This is made especially evident by
the fact that the choice of the clustering algorithm employed in the construction of the
visual vocabulary has no particular importance (Sections IV.3.3, IV.3.5). Therefore, faster
algorithms such as the partitioning method described in Chapter II (Ilies & Wilhelm, 2010)
can be used, even if the provided solution is not optimal. As Stommel & Herzog (2009) have
observed, it may even be possible to use clusters centered at uniformly distributed random
points. Unfortunately, this flexibility translates into low interpretability of the visual words:
for a human observer, it may be difficult to relate the visual feature clusters to recognizable
objects, or to understand why some images are classified correctly and others are not (see
examples in Figure V-1 below).
Figure V-1: Example images from the dataset employed in Chapter IV.
The images are grouped by subset (training vs. test) and by the accuracy of
the associations generated by one of the tested classifiers. Visual features
detected by the image processing algorithm are marked with overlaid
contours of appropriate location. The size of each contour is proportional to
the scale of the corresponding feature. Visual features associated with the
person shown in the images (Angela Merkel) are represented by squares,
while those associated with other persons are represented by circles.
4. Future developments
4.1. Projection-based partitioning algorithm
While fully functional, the proposed partitioning method could be extended in a
number of directions. A necessary improvement would be to include a post-processing
stage for fine-tuning the solution, e.g. by revising the assignment of objects near cluster
boundaries, removing outliers and/or creating a noise category. Equally important would
be to add support for missing values and for more varied data formats. Besides continuous numerical data, ordinal discrete attributes could be employed after conversion to a numerical scale and restricted noising. The latter step is necessary to avoid over-binning: the
occurrence of numerous empty bins in the histograms, caused by the data having only a
limited number of possible values (as compared to the group-size dependent number of
bins). A low-cost solution would be to displace each object on all attributes by a small random fraction (drawn from a uniform distribution) of the corresponding step size.
By construction, the proposed algorithm is able to process large datasets in
relatively short times. For example, partitioning a dataset of 150000 objects and 128
dimensions into approximately 2000 clusters takes just under one minute on a normal
computer, compared with several hours for TwoStep or k-means (see Section IV.3). The
method can be easily adapted to working with datasets of size larger than the available
system memory, as it needs only a fraction of the data at any given time. If data is stored on
the hard disk, the algorithm will need to perform only several (three to five) block reads per
cluster analyzed, and therefore on the order of log(K) reads of the entire dataset (where K is the total number of clusters found), a process which does not entail significant computational costs.
Performance can be further increased by parallelization: if the data and the current cluster
system are stored in a common repository, any number of different processors could work
in parallel on different subsets, as there is no object redistribution involved.
4.2. Synthetic datasets generator
While the presented method for generating synthetic datasets is fully functional and
should therefore constitute a significant contribution to the field, there are several
computational aspects that would benefit from further investigation. The algorithm for
placing subspace clusters in the data space (Section III.2.3) requires increased processing
resources for datasets containing large numbers of clusters with high subspace
dimensionality. The process could be made more efficient, for example by exploiting the fact
that calculations made for a given subspace are also valid for its sub-subspaces. In addition,
the nonlinear distortion of clusters (Section III.2.2) requires the initial generation of
significantly more objects in order to counteract the subsequent down-sampling necessary
for the preservation of the probability density functions of distorted clusters. This is a
resource-costly process, especially for clusters with skewed distributions, which are
particularly affected by the down-sampling. The computational cost could be reduced by
pre-computing the dimensions involved in the nonlinear distortion and generating
additional values only on these dimensions. Other possible improvements include support
for more general cluster deformations, the implementation of user-defined subspace overlap
for subspace clusters (as in Procopiuc et al., 2002), and the use of distributions with
prescribed degrees of skewness and kurtosis (as in Waller et al., 1999).
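A possible realization of the proposed saving is sketched below in MATLAB; sampleFcn, distortedDims, and the oversampling factor are hypothetical placeholders for the generator's internal sampling routine and parameters, and the sketch assumes that the distorted dimensions are generated independently of the remaining ones.

% Generate additional candidate objects by redrawing values only on the
% dimensions affected by the nonlinear distortion, reusing existing values
% on all other dimensions.
% X: current n-by-d sample of the cluster; sampleFcn(m, dims): assumed handle
% returning an m-by-numel(dims) draw from the cluster's marginals on dims.
function Xextra = oversampleDistortedDims(X, sampleFcn, distortedDims, factor)
    n = size(X, 1);
    nExtra = round((factor - 1) * n);
    % Reuse randomly selected existing rows for the unaffected dimensions...
    Xextra = X(randi(n, nExtra, 1), :);
    % ...and draw fresh values only where the distortion acts.
    Xextra(:, distortedDims) = sampleFcn(nExtra, distortedDims);
end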
4.3. Automatic image annotation system
A generalization of the proposed image classification system to nonspecific concepts
obtained via latent semantic analysis (Deerwester et al., 1990) of captions, conducted on a
significantly extended set of images, is currently in progress. Other potential developments
include refining the final cluster-concept linkage probabilities by relying on a more selective
transfer process of associations from images to visual features, and enabling the system to
work in an online setting, e.g. as part of a web crawler that continuously processes and labels
image files from the Internet. The former objective could be achieved geometrically, by
calculating linkage degrees between textual concepts and visual sentences, i.e. groups of
visual features co-occurring at similar relative positions (Leibe & Schiele, 2008).
Alternatively, the cluster-concept linkage probabilities could be improved analytically, by
first emphasizing the associations of training visual features when both the source image
and the cluster agree on the represented concept, and then recalculating the cluster-concept
linkage degrees. In order to enable the system to work in an online setting, given that all
preliminary processing and classifier training steps are automatic, the only component that
would need alterations is the visual vocabulary constructor. Consequently, we aim to extend
the partitioning algorithm presented in Chapter II such that it can process data streams.
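As an illustration, the analytic refinement outlined above could take the following form in MATLAB; the matrices, the boost factor, and all names are hypothetical and serve only to indicate the intended re-weighting, assuming that feature-level association strengths and the concept labels of images and clusters are available.

% Emphasize the associations of training visual features whose source image
% and cluster agree on the represented concept, then recompute the
% cluster-concept linkage degrees as normalized sums over each cluster.
% A: nFeatures-by-nConcepts association strengths; clusterOf(i): cluster of
% feature i; imageConcept(i): concept of its source image; clusterConcept(c):
% concept currently linked to cluster c; boost > 1: assumed emphasis factor.
function L = refineLinkage(A, clusterOf, imageConcept, clusterConcept, boost)
    for i = 1:size(A, 1)
        if imageConcept(i) == clusterConcept(clusterOf(i))
            A(i, :) = A(i, :) * boost;
        end
    end
    nClusters = max(clusterOf);
    L = zeros(nClusters, size(A, 2));
    for c = 1:nClusters
        rows = (clusterOf == c);
        L(c, :) = sum(A(rows, :), 1);
        L(c, :) = L(c, :) / max(sum(L(c, :)), eps);  % normalize to sum to one
    end
end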
VI. References
Aggarwal, C. C., & Yu, P. S. (2000). Finding generalized projected clusters in high
dimensional spaces. Proceedings of the 2000 ACM SIGMOD International Conference
on Management of Data (pp. 70-81). New York, NY: ACM.
Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (2001). On the surprising behavior of distance
metrics in high dimensional spaces. Proceedings of the 8th International Conference
on Database Theory (pp. 420-434). Berlin, Germany: Springer.
Aggarwal, C. C., Wolf, J. L., Yu, P. S., Procopiuc, C., & Park, J. S. (1999). Fast algorithms for
projected clustering. Proceedings of the 1999 ACM SIGMOD International Conference
on Management of Data (pp. 61-72). New York, NY: ACM.
Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1998). Automatic subspace clustering
of high dimensional data for data mining applications. Proceedings of the 1998 ACM
SIGMOD International Conference on Management of Data (pp. 94-105). New York,
NY: ACM.
Anderson, T. W., Olkin, I., & Underhill, L. G. (1987). Generation of random orthogonal
matrices. SIAM Journal on Scientific and Statistical Computing, 8(4), 625-629.
Ankerst, M., Breunig, M. M., Kriegel, H.-P., & Sander, J. (1999). OPTICS: Ordering points to
identify the clustering structure. Proceedings of the 1999 ACM SIGMOD International
Conference on Management of Data (pp. 49-60). New York, NY: ACM.
Asuncion, A., & Newman, D. J. (2007). UCI Machine Learning Repository.
http://www.ics.uci.edu/~mlearn/MLRepository.html. Irvine, CA: University of
California at Irvine, School of Information and Computer Science.
Atlas, R., & Overall, J. (1994). Comparative evaluation of two superior stopping rules for
hierarchical cluster analysis. Psychometrika, 59(4), 581-591.
Barber, C. B., Dobkin, D. P., & Huhdanpaa, H. T. (1996). The quickhull algorithm for convex
hulls. ACM Transactions on Mathematical Software, 22(4), 469-483.
Becher, J., Berkhin, P., & Freeman, E. (2000). Automating exploratory data analysis for
efficient data mining. Proceedings of the 6th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (pp. 424-429). New York, NY: ACM.
Berkhin, P. (2006). A survey of clustering data mining techniques. In J. Kogan, C. Nicholas, &
M. Teboulle (Eds.), Grouping Multidimensional Data: Recent Advances in Clustering (pp. 25-71). Berlin, Germany: Springer.
Bishop, C. M. (2006). Pattern recognition and machine learning. Berlin, Germany: Springer.
Boley, D. L. (1998). Principal direction divisive partitioning. Data Mining and Knowledge
Discovery, 2(4), 325-344.
Bolme, D., Beveridge, R., Teixeira, M., & Draper, B. (2003). The CSU face identification
system: Its purpose, features and structure. International Conference on Vision
Systems, (pp. 304-311).
Breunig, M., Kriegel, H.-P., Kröger, P., & Sander, J. (2001). Data bubbles: quality preserving
performance boosting for hierarchical clustering. Proceedings of the 2001 ACM
SIGMOD International Conference on Management of Data (pp. 79-90). New York, NY:
ACM.
Candillier, L., Tellier, I., Torre, F., & Bousquet, O. (2005). SSC : Statistical subspace clustering.
Machine Learning and Data Mining in Pattern Recognition: 4th International
Conference (pp. 100-109). Berlin, Germany: Springer.
Chang, W. C. (1983). On using principal components before separating a mixture of two
multivariate normal distributions. Journal of the Royal Statistical Society, Series C
(Applied Statistics), 32(3), 267-275.
Cheng, C., Fu, A., & Zhang, Y. (1999). Entropy-based subspace clustering for mining
numerical data. Proceedings of the 5th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, (pp. 84-93). San Diego, CA.
Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990).
Indexing by latent semantic analysis. Journal of the American Society for Information
Science, 41(6), 391-407.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society B (Methodological),
39(1), 1-38.
Drozdzynski, W., Krieger, H. U., Piskorski, J., Schäfer, U., & Xu, F. (2004). Shallow processing
with unification and typed feature structures - foundations and applications.
Künstliche Intelligenz, 18(1), 17-23.
Faloutsos, C., & Lin, K. (1995). FastMap: A fast algorithm for indexing, data mining and
visualization of traditional and multimedia datasets. Proceedings of the 1995 ACM
SIGMOD International Conference on Management of Data (pp. 163-174). New York,
NY: ACM.
Fern, X. Z., & Brodley, C. E. (2003). Random projection for high dimensional data clustering:
a cluster ensemble approach. Proceedings of the 20th International Conference on
Machine Learning (pp. 186-193). Menlo Park, CA: AAAI Press.
Fraley, C., & Raftery, A. E. (1999). MCLUST: Software for model-based cluster analysis. Journal of
Classification, 16(2), 297-306.
Ganti, V., Ramakrishnan, R., Gehrke, J., Powell, A., & French, J. (1999). Clustering large
datasets in arbitrary metric spaces. Proceedings of the 15th International Conference
on Data Engineering (pp. 502-511). Washington, DC: IEEE Computer Society.
Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large
databases. Proceedings of the 1998 ACM SIGMOD International Conference on
Management of Data (pp. 73-84). New York, NY: ACM.
Hartigan, J. A. (1981). Consistency of single linkage for high-density clusters. Journal of the
American Statistical Association, 76(374), 388-394.
Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of
the American Statistical Association, 82(397), 267-270.
Hartigan, J. A., & Hartigan, P. M. (1985). The dip test of unimodality. The Annals of Statistics,
13(1), 70-84.
Hartigan, J., & Wong, M. (1979). Algorithm AS136: A k-means clustering algorithm. Applied
Statistics, 28(1), 100-108.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2003). The elements of statistical learning: Data
mining, inference and prediction (3rd ed.). Berlin, Germany: Springer.
Hennig, C. (2004). Asymmetric linear dimension reduction for classification. Journal of
Computational and Graphical Statistics, 13(4), 930-945.
Hermes, T., Miene, A., & Herzog, O. (2005). Graphical search for images by PictureFinder.
International Journal of Multimedia Tools and Applications, 27(2), 229-250.
Hinneburg, A., & Keim, D. A. (1998). An efficient approach to clustering large multimedia
databases with noise. Proceedings of the 4th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, (pp. 58-65). New York, NY.
Hinneburg, A., & Keim, D. A. (1999). Optimal grid-clustering: Towards breaking the curse of
dimensionality in high-dimensional clustering. Proceedings of the 25th International
Conference on Very Large Data Bases (pp. 506-517). San Francisco, CA: Morgan
Kaufmann.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables.
Journal of the American Statistical Association, 58, 13-30.
Huber, P. J. (1985). Projection pursuit. Annals of Statistics, 13(2), 435-475.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193-218.
Ilies, I., & Wilhelm, A. (2008). Projection-based clustering for high-dimensional data sets.
COMPSTAT 2008: Proceedings in Computational Statistics. Heidelberg, Germany:
Physica Verlag.
Ilies, I., & Wilhelm, A. (2010). Projection-based partitioning for large, high-dimensional
datasets. Journal of Computational and Graphical Statistics, 19(2), 474-492.
Ilies, I., Jacobs, A., Wilhelm, A., & Herzog, O. (2009). Classification of news images using
captions and a visual vocabulary. Technical Report No. 50, Universität Bremen, TZI,
Bremen, Germany.
Jacobs, A., Hermes, T., & Wilhelm, A. (2007). Automatic image annotation by association
rules. Electronic Imaging and the Visual Arts EVA 2007, (pp. 108-112). Berlin,
Germany.
Jacobs, A., Herzog, O., Wilhelm, A., & Ilies, I. (2008). Relaxation-based data mining on images
and text from news web sites. Proceedings of IASC2008, (pp. 736-743). Yokohama,
Japan.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM Computing
Surveys, 31(3), 264-323.
Joachims, T. (1998). Text categorization with support vector machines. Proceedings of the
European Conference on Machine Learning. Graz, Austria.
Kim, M., Yoo, H., & Ramakrishna, R. S. (2004). Cluster validation for high-dimensional
datasets. Proceedings of the 11th International Conference on Artificial Intelligence:
Methodology, Systems, Application, (pp. 178-187).
Kogan, J. (2007). Introduction to clustering large and high-dimensional data. New York, NY:
Cambridge University Press.
Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464-1480.
Kriegel, H.-P., Kröger, P., & Zimek, A. (2009). Clustering high-dimensional data: A survey on
subspace clustering, pattern-based clustering, and correlation clustering. ACM
Transactions on Knowledge Discovery from Data, 3(1), 1-58.
Lagarias, J. C., Reeds, J., Wright, M. H., & Wright, P. E. (1998). Convergence properties of the
Nelder-Mead simplex method in low dimensions. SIAM Journal of Optimization, 9(1),
112-147.
Leibe, B., & Schiele, B. (2004). Scale-invariant object categorization using a scale-adaptive
mean-shift search. Proceedings of the 26th DAGM Symposium on Pattern Recognition,
(pp. 145-153). Tübingen, Germany.
Liu, B., Xia, Y., & Yu, P. S. (2000). Clustering through decision tree construction. Proceedings
of the 9th International Conference on Information and Knowledge Management (pp.
20-29). New York, NY: ACM.
Lowe, D. G. (1999). Object recognition from local scale-invariant features. Proceedings of the
7th IEEE International Conference on Computer Vision, (pp. 1150-1157). Kerkyra,
Greece.
Maitra, R., & Melnykov, V. (2010). Simulating data to study performance of finite mixture
modelling and clustering algorithms. Journal of Computational and Graphical
Statistics, 19(2), 354-376.
Mangasarian, O. L., & Wolberg, W. H. (1990). Cancer diagnosis via linear programming. SIAM
News, 23(5), 1-18.
Marsaglia, G., & Tsang, W. W. (2000). The ziggurat method for generating random variables.
Journal of Statistical Software, 5, 1-7.
Matsumoto, M., & Nishimura, T. (1998). Mersenne twister: A 623-dimensionally
equidistributed uniform pseudo-random number generator. ACM Transactions on
Modeling and Computer Simulation, 8(1), 3-30.
McCallum, A., Nigam, K., & Ungar, L. H. (2000). Efficient clustering of high-dimensional data
sets with application to reference matching. Proceedings of the 6th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, (pp. 169-178).
Boston, MA.
McIntyre, R. M., & Blashfield, R. K. (1980). A nearest-centroid technique for evaluating the
minimum-variance clustering procedure. Multivariate Behavioral Research, 15(2),
225-238.
Milenova, B. L., & Campos, M. M. (2002). O-cluster: scalable clustering of large high
dimensional data sets. Proceedings of the 2002 IEEE International Conference on
Data Mining (pp. 290-297). Washington, DC: IEEE Computer Society.
Milligan, G. W. (1985). An algorithm for generating artificial test clusters. Psychometrika,
50(1), 123-127.
Milligan, G. W. (1996). Clustering validation: Results and implications for applied analyses.
In P. Arabie, L. J. Hubert, & G. De Soete (Eds.), Clustering and Classification (pp. 341-375). River Edge, NJ: World Scientific.
Mulvey, J. M., & Crowder, H. P. (1979). Cluster analysis: An application of Lagrangian
relaxation. Management Science, 25(4), 329-340.
Nagesh, H., Goil, S., & Choudhary, A. (2001). Adaptive grids for clustering massive data sets.
Proceedings of the first SIAM conference on data mining. Philadelphia, PA: SIAM.
Overall, J. E., & Magee, K. N. (1992). Replication as a rule for determining the number of
clusters in hierarchical cluster analysis. Applied Psychological Measurement, 16(2),
119-128.
Parsons, L., Haque, E., & Liu, H. (2004). Subspace clustering for high dimensional data: A
review. ACM SIGKDD Explorations Newsletter, 6(1), 90-105.
Procopiuc, C. M., Jones, M., Agarwal, P. K., & Murali, T. M. (2002). A Monte Carlo algorithm
for fast projective clustering. Proceedings of the 2002 ACM SIGMOD International
Conference on Management of Data (pp. 418-427). New York, NY: ACM.
Qiu, W., & Joe, H. (2006). Generation of random clusters with specified degree of separation.
Journal of Classification, 23(2), 315-334.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.
Raftery, A. E., & Dean, N. (2006). Variable selection for model-based clustering. Journal of the
American Statistical Association, 101(473), 168-178.
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval.
Information Processing and Management, 24, 513-523.
Savaresi, S. M., Boley, D. L., Bittanti, S., & Gazzaniga, G. (2002). Cluster selection in divisive
clustering algorithms. Proceedings of the 2nd SIAM International Conference on Data
Mining (pp. 299-314). Philadelphia, PA: SIAM.
Schober, J. P., Hermes, T., & Herzog, O. (2005). PictureFinder: Description logics for semantic
image retrieval. Proceedings of the 2005 IEEE International Conference on
Multimedia and Expo, (pp. 1571-1574). Amsterdam, The Netherlands.
Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation, 10(5), 1299-1319.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. New
York, NY: Wiley.
Siebert, J. P. (1987). Rule-based vehicle recognition. Research Memorandum, Turing
Institute, Glasgow, UK.
Sigillito, V. G., Wing, S. P., Hutton, L. V., & Baker, K. B. (1989). Classification of radar returns
from the ionosphere using neural networks. Johns Hopkins APL Technical Digest,
10(3), 262-266.
Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching
in videos. Proceedings of the 9th IEEE International Conference on Computer Vision,
(pp. 1470-1477). Nice, France.
Steinley, D., & Henson, R. (2005). OCLUS: An analytic method for generating clusters with
known overlap. Journal of Classification, 22(2), 221-250.
Stommel, M., & Herzog, O. (2009). SIFT-based object recognition with fast alphabet creation
and reduced curse of dimensionality. Image and Vision Computing New Zealand, (pp.
136-141).
Stuetzle, W. (2003). Estimating the cluster tree of a density by analyzing the minimal
spanning tree of a sample. Journal of Classification, 20(1), 25-47.
Stuetzle, W., & Nugent, R. (2010). A generalized single linkage method for estimating the
cluster tree of a density. Journal of Computational and Graphical Statistics, 19(2),
397-418.
Teboulle, M., & Kogan, J. (2005). Deterministic annealing and a k-means type smoothing
optimization algorithm for data clustering. Proceedings of the Workshop on
Clustering High Dimensional Data and its Applications (pp. 13-22). Philadelphia, PA:
SIAM.
Ultsch, A., & Herrmann, L. (2006). Automatic clustering with U*C. Technical Report, Philipps-Universität, Department of Mathematics and Computer Science, Marburg, Germany.
Waller, N. G., Underhill, J. M., & Kaiser, H. A. (1999). A method for generating simulated
plasmodes and artificial test clusters with user-defined shape, size, and orientation.
Multivariate Behavioral Research, 34(2), 123-142.
Wann, C. D., & Thomopoulos, S. A. (1997). A comparative study of self-organizing clustering
algorithms Dignet and ART2. Neural Networks, 10(4), 737-743.
Wiskott, L., Fellous, J. M., Kruger, N., & von der Malsburg, C. (1997). Face recognition by
elastic graph matching. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 19(7), 775-779.
Woo, K.-G., Lee, J.-H., Kim, M.-H., & Lee, Y.-J. (2004). FINDIT: a fast and intelligent subspace
clustering algorithm using dimension voting. Information and Software Technology,
46, 255-271.
Zaït, M., & Messatfa, H. (1997). A comparative study of clustering methods. Future
Generation Computer Systems, 13(2-3), 149-159.
Zhang, T., Ramakrishnan, R., & Livny, M. (1997). BIRCH: A new data clustering algorithm
and its applications. Data Mining and Knowledge Discovery, 1(2), 141-182.