Cluster Analysis for Large, High-Dimensional Datasets: Methodology and Applications

by Iulian V. Ilieș

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Statistics

Approved, Thesis Committee:
Prof. Dr. Adalbert F.X. Wilhelm
Prof. Dr.-Ing. Lars Linsen
Prof. Dr. Patrick J.F. Groenen

Date of defense: December 1, 2010
School of Humanities and Social Sciences

Acknowledgements

I am grateful to my supervisor, Prof. Dr. Adalbert Wilhelm, for his trust and support. Thank you for providing me with the scientific support I needed, and for allowing me so much freedom of research. I would further like to thank Prof. Dr. Lars Linsen and Prof. Dr. Patrick Groenen for their support as members of my dissertation committee. I am grateful to my collaborator Arne Jacobs, and to Prof. Dr. Otthein Herzog, for their support and productive discussions. I am also grateful to Ruxandra Sîrbulescu for her continuous support, both within and outside of the academic environment. This work was supported by grants WI1584/8-1 and WI1584/8-2 of the Deutsche Forschungsgemeinschaft (DFG).

List of publications

Publications included in this thesis

Ilies, I., & Wilhelm, A. (2010). Projection-based partitioning for large, high-dimensional datasets. Journal of Computational and Graphical Statistics, 19(2), 474-492. (Reprinted here with permission from the Journal of Computational and Graphical Statistics. Copyright 2010 by the American Statistical Association. All rights reserved.)

Ilies, I., & Wilhelm, A. (submitted). Simulating cluster patterns.

Ilies, I., Jacobs, A., Herzog, O., & Wilhelm, A. (in press). Combining text and image processing in an automated image classification system. Computing Science and Statistics.

Reports partially included in this thesis

Ilies, I., Jacobs, A., Wilhelm, A., & Herzog, O. (2009). Classification of news images using captions and a visual vocabulary. Technical Report No. 50, Universität Bremen, TZI, Bremen, Germany.

Ilies, I. (2008). A divisive partitioning toolbox for MATLAB. Technical Report, Jacobs University Bremen, Germany. Submitted to the 2009 Student Paper Competition of the Statistical Computing and Statistical Graphics Sections of the ASA.

Other related publications

Ilies, I., & Wilhelm, A. (2008). Projection-based clustering for high-dimensional data sets. COMPSTAT 2008: Proceedings in Computational Statistics. Heidelberg, Germany: Physica Verlag.

Ilies, I., & Jacobs, A. (in press). Automatic image annotation through concept propagation. In P. Ludes (Ed.), Algorithms of Power – Key Invisibles.

Jacobs, A., Herzog, O., Wilhelm, A., & Ilies, I. (2008). Relaxation-based data mining on images and text from news web sites. Proceedings of IASC 2008.

Abstract

Cluster analysis represents one of the most versatile methods in statistical science. It is employed in empirical sciences for the summarization of datasets into groups of similar objects, with the purpose of facilitating the interpretation and further analysis of the data. Cluster analysis is of particular importance in the exploratory investigation of data of high complexity, such as that derived from molecular biology or image databases. Consequently, recent work in the field of cluster analysis, including the work presented in this thesis, has focused on designing algorithms that can provide meaningful solutions for data with high cardinality and/or dimensionality, under the natural restriction of limited resources.

In the first part of the thesis, a novel algorithm for the clustering of large, high-dimensional datasets is presented. The developed method is based on the principles of projection pursuit and grid partitioning, and focuses on reducing computational requirements for large datasets without loss of performance. To achieve that, the algorithm relies on procedures such as sampling of objects, feature selection, and quick density estimation using histograms.
The algorithm searches for low-density points in potentially favorable one-dimensional projections, and partitions the data by a hyperplane passing through the best split point found. Tests on synthetic and reference data indicated that the proposed method can quickly and efficiently recover clusters that are distinguishable from the remaining objects in at least one direction; linearly non-separable clusters were usually subdivided. In addition, the clustering solution proved robust in the presence of moderate levels of noise, and when the clusters were partially overlapping.

In the second part of the thesis, a novel method for generating synthetic datasets with variable structure and clustering difficulty is presented. The developed algorithm can construct clusters with different sizes, shapes, and orientations, consisting of objects sampled from different probability distributions. In addition, some of the clusters can have multimodal distributions or curvilinear shapes, or they can be defined only in restricted subsets of dimensions. The clusters are distributed within the data space using a greedy geometrical procedure, with the overall degree of cluster overlap adjusted by scaling the clusters. Evaluation tests indicated that the proposed approach is highly effective in prescribing the cluster overlap. Furthermore, it can be extended to allow for the production of datasets containing non-overlapping clusters with defined degrees of separation.

In the third part of the thesis, a novel system for the semi-supervised annotation of images is described and evaluated. The system is based on a visual vocabulary of prototype visual features, which is constructed through the clustering of visual features extracted from training images with accurate textual annotations. Consequently, each training image is associated with the visual words representing its detected features.
In addition, each such image is associated with the concepts extracted from the linked textual data. These two sets of associations are combined into a direct linkage scheme between textual concepts and visual words, thus constructing an automatic image classifier that can annotate new images with text-based concepts using only their visual features. As an initial application, the developed method was successfully employed in a person classification task.

Table of contents

Acknowledgements
List of publications
Abstract
Table of contents
I. Introduction
   Scope of this thesis
II. Cluster analysis techniques for large, high-dimensional datasets
   1. State of the art
      1.1. Data summarization
      1.2. Data sampling
      1.3. Domain decomposition
      1.4. Space partitioning
      1.5. Projected clusters
      1.6. Mixture models
      1.7. Machine learning approaches
   2. Proposed method
      2.1. Theoretical background of the algorithm
      2.2. Variable selection based on multimodality likelihood
      2.3. Sampling-based determination of high variance components
      2.4. Smoothed histograms as density estimators
      2.5. Local minima scoring
   3. Practical implementation
      3.1. Graphical interface
      3.2. Partitioning parameters
   4. Experimental results
      4.1. Large, high-dimensional synthetic datasets
      4.2. Comparison with common approaches
      4.3. High-dimensional real datasets
      4.4. Theoretical and empirical algorithm complexity
III. Generation of synthetic datasets with clusters
   1. State of the art
   2. Proposed method
      2.1. Dataset generation algorithm
      2.2. Construction of nonlinear clusters
      2.3. Construction of the dataset
   3. Practical implementation
      3.1. Configurable program parameters
      3.2. Generation of random numbers
      3.3. Construction of different types of clusters
      3.4. Computational complexity of the algorithm
   4. Experimental validation
      4.1. Cluster rotations
      4.2. Cluster overlap
      4.3. Timing performance
IV. Applications of cluster analysis in automatic image classification
   1. Background
   2. Methodology and data
      2.1. Image classification system
      2.2. Data collection and preprocessing
      2.3. Visual vocabulary construction
      2.4. Forward concept propagation
      2.5. Reverse concept propagation
      2.6. Algorithm validation
   3. Experimental results
      3.1. Associations between visual words and concepts
      3.2. Differences between classification procedures
      3.3. Differences between visual vocabularies
      3.4. Differences between classifiers with an optimal visual vocabulary
      3.5. Differences between visual vocabularies with optimal size
      3.6. Summary
V. Discussion
   1. Cluster analysis for high-dimensional datasets
   2. Generation of synthetic datasets
   3. Automatic image classification
   4. Future developments
      4.1. Projection-based partitioning algorithm
      4.2. Synthetic datasets generator
      4.3. Automatic image annotation system
VI. References

I. Introduction

One of the most extensively explored problems in statistical research is that of cluster analysis, the unsupervised grouping of objects from a dataset according to their similarity. Commonly, this summarization process has the purpose of facilitating the interpretation and any subsequent analyses of the data. From an applied perspective, the clustering task can be related to other research fields such as pattern recognition, data compression, or density estimation (Bishop, 2006; Hastie et al., 2003). Consequently, a large number of algorithms, originating from both statistics and computer science, have been proposed over the years (Jain et al., 1999; Berkhin, 2006).
Following recent technological progress, it is possible to produce ever-increasing amounts of data of high complexity (e.g. sales histories or molecular biology data) (Hinneburg & Keim, 1998). This results in datasets consisting of millions of objects with tens to hundreds of dimensions, which are difficult to analyze. This impediment is particularly evident in the context of analyzing data collected from the World Wide Web (e.g. document collections, image databases, or traffic referral data). These types of datasets are large in two or three directions simultaneously, i.e. in terms of the number of objects, the number of dimensions, and, in some situations, the number of clusters. Most importantly, mining such data imposes severe computational constraints (Berkhin, 2006). However, traditional cluster analysis algorithms such as k-means (Hartigan & Wong, 1978) do not usually address the problem of processing large datasets with a limited amount of resources (i.e. system memory and processing time).

Consequently, during the last twenty years there has been growing emphasis on exploratory analysis in very large databases (VLDBs) (Zhang et al., 1997). Attempts to extend standard clustering methods to this setting have focused on reducing the working data by squashing or sampling (e.g. Guha et al., 1998; Ganti et al., 1999), and/or on requiring only one data pass (incremental mining). The most notable data reduction algorithm is BIRCH (Zhang et al., 1997), which summarizes the data into a height-balanced tree. The basic implementation, running an agglomerative hierarchical procedure on the leaves of the tree, is available as TwoStep clustering in recent versions of SPSS. Breunig et al. (2001) modified the summarization procedure such that additional information is stored in the leaves. This allows for the usage of more complex, density-based clustering procedures that provide solutions of increased quality, such as the OPTICS algorithm (Ankerst et al., 1999).
More recently, Teboulle and Kogan (2005) proposed a three-step clustering procedure consisting of BIRCH, PDDP (Boley, 1998), and a smoothed version of the k-means algorithm.

High dimensionality of data presents additional problems beyond the computational complexity. On one hand, the effect of the “curse of dimensionality” (Huber, 1985; Aggarwal et al., 2001) is observed: the data become sparse, and the concept of proximity loses meaning in more than 15 dimensions. The Euclidean distance to the nearest objects becomes of the same order as the distance to any other object, and the proportion of populated regions decays rapidly, even for the coarsest space-partitioning grids (Hinneburg & Keim, 1999). Interestingly, fractional distance metrics (Lp metrics with p < 1) seem to provide more meaningful similarity measures (Aggarwal et al., 2001), but have not been sufficiently explored so far. The likely reason is that such metrics retain the high computational cost associated with calculating distances in high dimensionality. Indeed, the FastMap algorithm (Faloutsos & Lin, 1995), which maps objects to a low-dimensional space in an almost distance-preserving manner, has proven rather successful (e.g. Ganti et al., 1999). On the other hand, the higher the number of attributes, the more likely it is to have an increased number of irrelevant ones, making the clusters impossible to find (Berkhin, 2006). The apparent solution is to reduce the dimensionality of the data (for a survey, see Becher et al., 2000). However, the direct application of feature transformation or selection techniques is susceptible to problems. If there are numerous irrelevant dimensions (i.e. a very high noise level), the effectiveness of factor analysis is significantly decreased (Parsons et al., 2004).
Similarly, since clusters usually reside in different subspaces, it is difficult to restrict the set of dimensions without pruning some that are relevant to only a few of the clusters. These problems motivated the development of a variety of subspace clustering algorithms during the last decade (Kriegel et al., 2009). CLIQUE (Agrawal et al., 1998) and its extension MAFIA (Nagesh et al., 2001) search for maximally connected dense unions of (hyper-)rectangular cells in subspaces of increasing dimensionality. They use a recursive bottom-up procedure, with higher-dimensional dense cells obtained by joining lower-dimensional cells with common faces. ProClus (Aggarwal et al., 1999) and its derivatives OrClus (Aggarwal & Yu, 2000) and DOC (Procopiuc et al., 2002) are partition-relocation methods that construct clusters as subset-subspace pairs rather than just subsets. The required condition is that the projection of every subset into the corresponding subspace constitutes a cluster with a low internal average Manhattan distance. OptiGrid (Hinneburg & Keim, 1999), its variation O-Cluster (Milenova & Campos, 2002), and CLTree (Liu et al., 2000) partition the dataset recursively by multi-dimensional grids passing through low-density regions, thus constructing a tree of high-density projected clusters. They are essentially extensions of the mode analysis approach (e.g. Hartigan, 1981) to a high-dimensional context – cluster separation is done using density level sets and, similarly to hierarchical methods, no prior knowledge of the number or geometry of the clusters is required.
More recently, several model-based subspace clustering methods have been developed: Fern and Brodley (2003) proposed a cluster-ensemble approach which combines mixture models obtained using different random projections of the data onto low-dimensional spaces; Raftery and Dean (2006) designed a variable selection method that finds an optimal low-dimensional subspace where the actual clustering is performed; finally, the SSC algorithm (Candillier et al., 2005) drastically reduces the number of parameters by assuming that all covariance matrices are diagonal.

Due to the wide variety of methods available, it is imperative to conduct an appropriate assessment of the strengths and weaknesses of the different algorithms, in order to select the most appropriate method for each context. However, the evaluation of clustering algorithms has often been criticized as either improper or insufficient, because simplistic measures are used and few or no comparisons to other available methods are performed (Kriegel et al., 2009). Moreover, many algorithms are only assessed on sample empirical datasets with known classifications, despite the fact that such an approach has limited generalizability (Maitra & Melnykov, 2010).

Scope of this thesis

The present thesis aims to develop improved methods for the clustering of high-dimensional datasets, as well as further applications of such algorithms in practical settings. In the first part of the thesis (see Chapter II), a more detailed review of representative clustering algorithms focused on the analysis of very large or high-dimensional datasets is provided. Subsequently, a newly developed method for this purpose (Ilies & Wilhelm, 2010) is described and evaluated. The developed algorithm is based on the principles of projection pursuit and grid partitioning, and focuses on reducing computational requirements for large datasets without loss of performance.
In the second part of the thesis (see Chapter III), a novel method for generating synthetic datasets with variable structure and clustering difficulty, aimed at the evaluation of clustering algorithms, is presented. In the third part of the thesis (see Chapter IV), the applications of cluster analysis to the field of automatic image classification are investigated. A novel system for the semi-supervised annotation of images is described and evaluated. The system is based on a vocabulary of clusters of visual features extracted from images with known classification. The thesis concludes with a summary and discussion of the described methods in the context of the current state of the art (see Chapter V). Possible directions for future research are also discussed.

II. Cluster analysis techniques for large, high-dimensional datasets

1. State of the art

1.1. Data summarization

The algorithm BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies; Zhang et al., 1997) is a data reduction method developed for very large datasets. It addresses the case when the amount of memory available is much smaller than the size of the data, and it aims at minimizing the cost of input-output operations during clustering. This is achieved by summarizing the data into a height-balanced tree, whose nodes represent tight groups of objects; the tree is constructed with only one pass through the data. For each node, the number of objects, their centroid, and the total sum of squares are stored; these values are sufficient for calculating typical measures such as the distance between two clusters or the intra-cluster variance. The final leaves are the input to a clustering algorithm of choice. Additional data passes can be done to refine the solution (by reassigning points to the best possible clusters), or to resolve potential irregularities (e.g. identical points ending up in different clusters; this may happen since the initial tree construction is data-ordering dependent).
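The node summaries described above can be illustrated with a minimal sketch (a hypothetical illustration, not the BIRCH implementation; class and method names are invented): each node stores only the object count, the linear sum, and the sum of squares, and these three quantities are additive, so two nodes can be merged without revisiting the raw data.

```python
import numpy as np

class ClusteringFeature:
    """Per-node summary in the style of BIRCH: object count, linear sum,
    and total sum of squares."""

    def __init__(self, n, linear_sum, square_sum):
        self.n = n
        self.ls = np.asarray(linear_sum, dtype=float)
        self.ss = float(square_sum)

    @classmethod
    def from_points(cls, points):
        points = np.asarray(points, dtype=float)
        return cls(len(points), points.sum(axis=0), (points ** 2).sum())

    def merge(self, other):
        # All three summary quantities are additive.
        return ClusteringFeature(self.n + other.n,
                                 self.ls + other.ls,
                                 self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Root mean squared distance of the member objects to the centroid,
        # computed from the summary alone.
        return np.sqrt(max(self.ss / self.n - (self.centroid() ** 2).sum(), 0.0))
```

For example, the summary of the points (0, 0) and (2, 0) yields the centroid (1, 0) and a radius of 1, and merging it with the summary of (4, 0) moves the centroid to (2, 0) with no access to the original points.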
BIRCH is very versatile, and can be easily adapted to suit the employed clustering method. For example, Breunig et al. (2001) modified the summarization procedure such that additional information is stored in the leaves (e.g. distances to the nearest neighbors). This permits the usage of more complex, density-based clustering procedures, for instance their algorithm OPTICS (Ordering Points To Identify the Clustering Structure; Ankerst et al., 1999), that provide solutions of increased quality as compared to the default hierarchical methods. A more recent application was developed by Teboulle and Kogan (2005), who proposed a three-step clustering procedure consisting of BIRCH, PDDP (Boley, 1998), and a smoothed version of the k-means method. Their SMOKA algorithm was shown to be particularly effective on collections of documents (Kogan, 2007).

BUBBLE (Ganti et al., 1999) is a more general data squashing method that was developed for arbitrary metric spaces. In contrast to vector spaces, in this context it is not possible to add or subtract objects (and hence to calculate centroids); however, one can still calculate distances. Similarly to BIRCH, the algorithm summarizes the data into a height-balanced tree whose leaves represent groups of similar objects. For each leaf, the algorithm stores the number of contained objects, a centrally located object and several other representatives, and the radius (the distance from the central object to most of the others). The central objects of the final set of leaves are then subjected to a hierarchical clustering algorithm of choice; all remaining data points are distributed into clusters based on their leaf’s center. To deal with expensive distance functions (e.g. for strings, or in high-dimensional settings) more efficiently, Ganti et al. (1999) developed a variant of the method that incorporates the FastMap algorithm (Faloutsos & Lin, 1995).
Given a set of objects and a distance function, this procedure returns, in linear time, an equal-sized set of vectors in a low-dimensional Euclidean space (typically of less than 10 dimensions), with the property that the distance between any two such image vectors is approximately equal to the distance between the corresponding objects. This allows for an easy approximation of distances when descending new objects into the tree, and reduces computation times significantly with small impact on the overall performance.

1.2. Data sampling

The algorithm CURE (Clustering Using Representatives; Guha et al., 1998) has an inbuilt sampling mechanism that makes it scalable to VLDBs. It is a modified agglomerative hierarchical method of the nearest-neighbor type; in contrast to the standard single linkage algorithm, it employs subsets of several well-scattered representatives rather than all objects when deciding which clusters to merge. This provides CURE with robustness in the presence of outliers, while allowing it to detect non-spherical clusters and to correctly separate clusters of different sizes and densities. Like other hierarchical methods, CURE has quadratic complexity with respect to the number of objects, and hence direct application to VLDBs would not be very efficient. Instead, the method relies on sampling: the main procedure runs on a random sample of sufficient size such that all clusters are represented (calculated via Chernoff bounds; Hoeffding, 1963). To further reduce computations, this sample is split into several partitions that are pre-clustered to a certain level. The resulting sets of small clusters are combined into one set, and then the merging process continues until the desired number of clusters is achieved. Afterwards, the rest of the data is distributed into clusters based on the nearest representatives.

1.3.
Domain decomposition

When the number of expected clusters is very large (hundreds or thousands), an interesting alternative approach to reducing computation times arises in the form of problem (domain) decomposition: dividing the data into subsets, and then running a clustering algorithm of choice only within these sets. This significantly reduces the number of distance computations, and, provided that the domains are chosen in an intelligent way, clustering accuracy can be preserved. When using an agglomerative underlying clustering algorithm, the solution is identical if every extant cluster can be covered by one domain or a connected set of domains (McCallum et al., 2000). Critical to this approach is that the initial partitioning into domains is fast enough to offer computational advantages, and good enough to preserve accuracy. McCallum et al. (2000) proposed an efficient procedure for constructing overlapping domains, using an inexpensive distance metric, e.g. the number of dimensions on which objects are closer than some threshold for numerical data, or the number of common words for text data. Their method covers the data iteratively with disjoint balls of a given radius (using an appropriate fast metric), enlarges the balls, and employs them as domains.

1.4. Space partitioning

CLIQUE (Clustering In Quest; Agrawal et al., 1998) is one of the first algorithms that attempt to find clusters within subspaces of the dataset. It searches for dense units (elementary rectangular cells) of increasing dimensionality via a recursive bottom-up procedure, by joining lower-dimensional units with common faces. To avoid an exponential explosion of the search space, all subspaces that fall below a minimal coverage criterion (i.e. containing few dense units) are pruned before increasing the dimension. In each selected subspace, the algorithm tries to form clusters as maximally connected dense regions (constructed with a greedy scheme).
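The bottom-up candidate generation just described can be sketched as follows (a hypothetical illustration, not the CLIQUE implementation; the unit-cube grid and function names are invented, and the density re-check of joined candidates against the data is omitted): dense one-dimensional units are found by counting points per grid cell, and higher-dimensional candidates are formed, Apriori-style, by joining units that share all but one cell.

```python
from collections import Counter
from itertools import combinations

def dense_units_1d(points, grid_size, threshold):
    """Dense 1-D units: (dimension, cell index) pairs whose grid cell
    contains more than `threshold` points (data in the unit cube assumed)."""
    counts = Counter()
    for p in points:
        for dim, value in enumerate(p):
            cell = min(int(value * grid_size), grid_size - 1)
            counts[(dim, cell)] += 1
    return {frozenset([u]) for u, c in counts.items() if c > threshold}

def join_units(units_k):
    """Apriori-style join: two k-dimensional dense units sharing k-1 cells
    form a (k+1)-dimensional candidate, kept only if its cells span k+1
    distinct dimensions. Candidates must still pass a density check."""
    candidates = set()
    for u, v in combinations(units_k, 2):
        merged = u | v
        if len(merged) == len(u) + 1:
            dims = {dim for dim, _ in merged}
            if len(dims) == len(merged):
                candidates.add(merged)
    return candidates
```

On a toy set with five points near (0.05, 0.05) and one near (0.9, 0.9), a 2x2 grid with threshold 2 yields two dense 1-D units, whose join gives a single two-dimensional candidate cell.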
A different implementation (Cheng et al., 1999) measures densities and coverage indirectly, via entropy. The end result is a series of cluster systems in different subspaces, rather than a partitioning of the data; clusters may overlap each other, and some objects may not belong to any cluster. The CLIQUE algorithm is rather robust, being able to find clusters of various shapes and dimensionalities, provided that the input parameters (initial grid size and density threshold) are well tuned to the data. Its extension MAFIA (Merging of Adaptive Finite Intervals; Nagesh et al., 2001) solved the parameter dependency (by using adaptive grids and relative density criteria, respectively), while also focusing on parallelization.

The algorithm OptiGrid (Optimal Grid Partitioning) was introduced by Hinneburg and Keim (1999) as an extension to high-dimensional data of their previously developed density-based method DENCLUE (Hinneburg & Keim, 1998). The data is partitioned recursively by multi-dimensional grids passing through low-density regions; only the highly populated grid cells are further refined. The algorithm returns as clusters all final grid cells with density above a certain noise threshold. The algorithm is particularly interesting since it requires no knowledge of the number, dimensionality, or relative density of the clusters. The mechanism for selecting cutting points is, however, somewhat cumbersome. On each selected projection (typically the coordinate axes), the algorithm calculates the object densities (with kernel estimators), searches for the left- and rightmost density maxima that are above a certain noise level, and then for the minimal density between these two. The procedure was simplified by Milenova and Campos (2002) – their algorithm looks simply for a pair of maxima with a minimum between them where the difference between bin counts is statistically significant (assessed with the chi-squared test).
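The cutting-point search can be sketched roughly as follows (a hypothetical illustration on pre-computed histogram bin counts; the function name is invented, kernel density estimation is replaced by a plain histogram, and the significance test of Milenova and Campos is replaced by a simple noise threshold): locate the left- and rightmost local maxima above the noise level, then return the lowest-density bin between them.

```python
import numpy as np

def best_cut_bin(counts, noise_level):
    """Split-point search in the spirit of OptiGrid: find the left- and
    rightmost density maxima above `noise_level`, then return the index of
    the lowest-density bin between them (None if no valid pair exists)."""
    counts = np.asarray(counts)
    # Local maxima: bins at least as high as both neighbors
    # (sentinel value -1 lets boundary bins qualify).
    padded = np.concatenate(([-1], counts, [-1]))
    is_max = (padded[1:-1] >= padded[:-2]) & (padded[1:-1] >= padded[2:])
    peaks = [i for i in np.flatnonzero(is_max) if counts[i] > noise_level]
    if len(peaks) < 2:
        return None
    lo, hi = peaks[0], peaks[-1]
    # The cut goes through the least populated bin between the two peaks.
    return lo + int(np.argmin(counts[lo:hi + 1]))
```

For the bin counts [1, 8, 2, 1, 3, 9, 2] with noise level 2, the peaks at indices 1 and 5 bracket the minimum at index 3, which is returned as the cut position.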
PDDP (Principal Direction Divisive Partitioning; Boley, 1998) is a partitioning method developed for document collections. At each step, the current group of objects is bisected by a hyperplane passing through its centroid and orthogonal to the first principal component. The direction is calculated via singular value decomposition (SVD) of the zero-centered data matrix. To reduce computational demands when analyzing large datasets, PDDP relies on a special SVD implementation that calculates only the first eigenvector. The main shortfall of the approach is that the true clusters can frequently be fragmented, because the splitting plane always passes through a fixed point.

1.5. Projected clusters

ProClus (Projected Clustering; Aggarwal et al., 1999) is an iterative algorithm for finding projected clusters: subset-subspace pairs with the property that the projection of the subset into the corresponding subspace constitutes a cluster with low internal average Manhattan distance. At each step, the cluster subspaces are calculated, the objects are reassigned to the nearest cluster based on within-subspace distances, and the cluster with the lowest quality (largest intra-cluster average distance) is pruned and replaced with a new random one. The selection is restricted to a set of possible medoids constructed greedily before the iterative phase. The algorithm was extended by Aggarwal and Yu (2000) to subspaces that are not necessarily parallel to the axes. A different alteration, by Kim et al. (2004), partially addressed the difficulty of specifying the input parameters (the number of clusters and, most importantly, the average cluster dimensionality): they constructed a heuristics-based method for calculating the number of associated dimensions of any given medoid. A more interesting variation was developed by Procopiuc et al. (2002), who improved the actual search mechanism.
Instead of constructing all clusters at once, their algorithm DOC (Density-based Optimal projected Clustering) builds clusters one at a time, following a greedy scheme. DOC finds the best projected cluster by comparison with a reference sample of the current data (extracted using Monte Carlo techniques). The associated dimensions are those on which the distance from the current center to the sample is smaller than a certain threshold, and the cluster contains all points in the corresponding inner hypercube. After convergence, the process restarts on the remaining data. The algorithm FINDIT (a Fast and Intelligent subspace clustering algorithm using Dimension voting; Woo et al., 2004) represents a different approach to the dimension selection process. It introduces a dimension-oriented similarity index for objects: the total number of attributes on which the distance is lower than some threshold. This measure is employed in determining the nearest neighbors of cluster medoids. The selected neighbors in turn determine the key dimensions of the clusters: each neighbor votes for all attributes on which it lies close to the medoid, and the most-voted dimensions are selected. The algorithm starts with the random selection of two samples: a representative set that contains points from all clusters (of a size calculated via Chernoff bounds; Hoeffding, 1963), and a smaller set of possible medoids. For this, the algorithm requires not the number of clusters, but only the minimal cluster size. The algorithm then finds the nearest neighbors of the medoids within the representative sample, calculates the associated dimensions, and assigns all points in the sample to the nearest medoid. Next, these clusters are merged until the minimal distance between them exceeds some user-defined threshold. The process is repeated for different values of the distance threshold employed in the similarity measure, and the best solution is retained.
Finally, the cluster membership is extended in a natural way to the rest of the data.

1.6. Mixture models

The SSC method (Statistical Subspace Clustering; Candillier et al., 2005) represents a probabilistic approach for finding projected clusters in high-dimensional data. It uses the expectation-maximization algorithm (EM; an iterative, two-step method for log-likelihood optimization; Dempster et al., 1977) to find the best mixture-of-distributions model for the data. The method supports both discrete and continuous attributes; the data is assumed to follow Gaussian distributions on the continuous dimensions, and multinomial distributions on the discrete ones. To achieve linear dependence on dimensionality, the algorithm makes the additional assumption that clusters follow independent distributions on each dimension. Since the EM algorithm is very sensitive to the initial conditions, the clustering procedure is run repeatedly with randomized starting points, and the best solution is kept. To speed up the process, the standard stopping condition is replaced with a fuzzy k-means-like criterion (the most probable cluster for each object does not change), which guarantees faster convergence. The EM phase is followed by a post-processing stage, where a minimal description of the clusters (in terms of their associated dimensions) is derived as a set of rules. Other model-based approaches have focused on examining different subsets of attributes in order to find an optimal representation of the entire set of clusters. Fern and Brodley (2003) proposed a cluster ensemble approach that combines mixture models obtained using different projections of the data. The underlying assumption is that different projections uncover different parts of the structure in the data, and hence complement each other.
Consequently, their method performs several clustering runs, each consisting of a random projection of the high-dimensional data, followed by clustering of the reduced data using the EM algorithm. The results are aggregated in a similarity matrix, which is then used to produce the final clusters via an agglomerative hierarchical procedure. In a more rigorous approach, Raftery and Dean (2006) incorporated a variable selection mechanism into the basic mixture-model clustering paradigm, by recasting the problem of comparing two nested sets of attributes as a model comparison problem. Their algorithm performs a greedy search to find the optimal low-dimensional representation of the clustering structure, starting with the pair of attributes that shows the most evidence of multivariate clustering. Their method is therefore able to select the dimensions, the number of clusters, and the clustering model simultaneously. Regrettably, the approach is very slow, requiring hours of processing for even moderately large datasets.

1.7. Machine learning approaches

The decision tree approach (Quinlan, 1986) is a supervised learning method for the classification of multivariate data into known classes. The algorithm constructs a tree where non-terminal nodes correspond to value tests on single attributes, while the leaves indicate the class. In essence, it represents a partitioning of the data space into hyper-rectangular regions, some of which correspond to the extant classes of objects, while others are empty. The partitioning is done by a divide-and-conquer algorithm: at each step, the algorithm chooses the best cut, following an information maximization criterion. CLTREE (Clustering based on decision Trees; Liu et al., 2000) is a modification of the basic algorithm, adapted for the task of cluster analysis.
Hypothetical uniform noise is added to the data as a second class, so that the supervised decision tree algorithm can be applied directly. Furthermore, the best cut is subject to the additional condition that it pass through a low-density region. The procedure continues until no improvement can be made, resulting in a complex tree that often partitions the space more than necessary. To correct for that, a user-driven pruning of the final leaves is performed. The process is controlled by two factors: the minimal cluster size, and a minimal relative density for the merging of adjacent leaves. While the final solution is very sensitive to these two parameters, the basic algorithm is fairly robust and scales well with increasing numbers of objects or dimensions. U*C (Ultsch, 2005) is an unsupervised clustering algorithm employing self-organizing neural networks (SOMs; Kohonen, 1990). Using an intrinsic system of competitive generation and elimination of attraction centers (activated neurons excite their neighbors and inhibit the rest), SOMs are able to group similar objects by mapping them to neighboring units. The U*C algorithm distances itself from previous SOM-based methods (see Wann & Thomopoulos, 1997, for a few examples) in that it does not attempt to directly identify neurons with clusters. The implementation uses a SOM with a large number of units (typically in the range of thousands), which provides a topographical projection of the high-dimensional input data onto the plane. For each unit, the algorithm calculates its density (the number of objects mapped to it) and the distance to its neighbors (the average distance between the objects mapped to it and those mapped to the neighboring units). This information can be displayed as a two-dimensional "geographical" map, which allows for the immediate identification of clusters. Clusters that are typically hard to distinguish (e.g. linearly non-separable ones) are correctly recognized.
Ultsch and Hermann (2006) developed an automatic method for determining the number of clusters and their members.

2. Proposed method

We propose a projection-based clustering method following the fundamental principles of grid partitioning, as outlined by Hinneburg and Keim (1999). Our algorithm searches for low-density points (local minima) in selected one-dimensional projections, and divides the current data by an orthogonal hyperplane passing through the best split point found, if any. The process terminates when no more splits can be found for any of the extant subsets (see the schematic representation of the algorithm in Figure II-1). To find good projections, we follow the heuristics of projection pursuit (Huber, 1985), locating the directions with the highest variances via principal component analysis (PCA). To avoid introducing noise from irrelevant attributes, the PCA is restricted to the subspace of dimensions with possible multimodal distributions that are involved in the largest correlations (see Sections II.2.2-2.3). If no split is found, the search is extended to the coordinate projections, in decreasing order of likelihood of multimodality. Aiming for a method truly applicable to large datasets, we focused on reducing both memory and processor loads wherever possible. Solutions include a non-recursive implementation of the algorithm (extant clusters are stored as a dynamic list of structures), relying on sampling of objects and dimensions for quickly finding optimal projections, and using average shifted histograms (ASH; Scott, 1992) for easy detection and scoring of local minima. These procedures are presented in more detail below (Sections II.2.2 – II.2.5). Similarly to other grid partitioning procedures (see Section II.1.4), our algorithm does not require the user to specify the expected number or size of clusters, and therefore constitutes a powerful tool for exploratory analysis.
Such parameters are nevertheless supported as optional inputs, and can influence the final solution (e.g. by prohibiting very small clusters; note though that, due to methodological constraints, the minimal size threshold is not absolute, and smaller clusters may still be produced when splitting larger clusters). If the number of clusters is limited, the decision on which group to divide next is based mainly on the quality of the split (Savaresi et al., 2002), with a logarithmic correction term that favors very large clusters (more than 10000 objects) over smaller ones.

Figure II-1: Schematic representation of the proposed partitioning algorithm. Rectangular cells represent calculation steps, while hexagonal cells represent conditional nodes. Dark cells mark the partition-finding process: modality- and PCA-guided search for good projections, histogram-based search and scoring of split points, and cluster division following the best split. This process is bypassed for clusters smaller than the user-defined threshold, if any. The procedure continues until the maximum number of clusters is reached, if specified, or until no more splits are found.

2.1. Theoretical background of the algorithm

The idea of partitioning objects using one-dimensional projections has been employed in classification for a relatively long time. The decision trees approach (e.g. Quinlan, 1986) splits the training data into subsets by value tests on single attributes, essentially partitioning the data space into hyper-rectangular regions that either correspond to classes of objects or are empty. Recently, this principle has been imported into cluster analysis (e.g. Liu et al., 2000) in the context of high-dimensional data. Here, the cutting planes are drawn through low-density regions with increased likelihood of discriminating between clusters.
The success of this approach stems from the reliance on contracting projections; these have the intrinsic property that the density at any point in the projected space constitutes an upper bound for the densities at all points from the original space that project onto it. Therefore, objects that constitute a cluster in a given subspace will also be part of a cluster in any sub-subspace (Agrawal et al., 1998), and low-density regions that do not lie at the borders of the projected space will be good candidates for cutting planes. Furthermore, under the additional assumptions that clusters are spherical and of similar sizes, the misclassification rate when using cutting planes parallel to the coordinate axes decreases exponentially with cluster dimensionality (Hinneburg & Keim, 1999). In general, denser areas are better preserved, while smaller or multimodal clusters will be subdivided, and this fragmenting behavior is augmented if multiple cuts are performed at the same time (Milenova & Campos, 2002). While the framework outlined above does not make any assumption on the projections other than that they are contracting, the employment of projections other than the coordinates has not been explored in detail. Perhaps the only significant recent application is PDDP (Boley, 1998), a document clustering procedure that splits the data successively with a plane orthogonal to the principal component at its midpoint. The shortfalls of the approach are immediately noticeable – since the splitting plane always passes through a fixed point rather than through a low density point, fragmentation of clusters can occur very frequently. Although the PDDP implementation is not optimal, the guiding heuristic of PCA-driven projection (or projection pursuit; Huber, 1985) remains valid. PCA provides the directions of highest variance (i.e. the natural axes) of the data via an eigenvalue decomposition of the covariance or correlation matrix. 
Assuming that the data contains a structure describable by only a few variables (e.g. a subspace cluster), and that the observed attributes are linear combinations of these underlying variables and noise, the leading principal components tend to pick projections with interesting characteristics (e.g. good discrimination between clusters), while the noise is relegated to the trailing components. Using correlations rather than covariances provides the advantage that the solution is both balanced (all dimensions are treated equally) and scale-invariant (since correlations are not affected by the rescaling of attributes), thereby significantly diminishing the influence of noise variables with dominant variances. Additionally, the decision on how many components to consider for further analysis is facilitated: since the initial coordinates are all normalized to unit variance, one can simply select all components that are equivalent to at least one of the original dimensions, up to some error (i.e. with eigenvalues > 0.95 in our implementation).

2.2. Variable selection based on multimodality likelihood

To avoid introducing noise from irrelevant attributes, the search for projections should be restricted to a relevant subset. Optimally, the selected subset would consist of dimensions where at least one cluster is at least partially distinguishable from the rest of the data, and would include sufficiently many dimensions to separate at least one cluster. These conditions can be reduced to the simpler requirement that the selected dimensions exhibit some degree of bi- or multimodality. In order to quickly quantify the potential multimodality of each attribute, we defined a simple criterion based on low-order statistics. We use the ratio between the standard deviation and the average absolute deviation,

R = \sqrt{\tfrac{1}{m}\sum_{i=1}^{m}(x_i - \bar{x})^2} \,\Big/\, \tfrac{1}{m}\sum_{i=1}^{m}|x_i - \bar{x}|.

R is a kurtosis-like measure that is lower for symmetric bimodal distributions than for unimodal ones.
The value obtained is corrected for skewness by subtracting the fraction of objects lying between the mean and the median. This second term can be computed in linear time as half the absolute average value of the sign function over the centered data:

S = \tfrac{1}{2}\left|\tfrac{1}{m}\sum_{i=1}^{m}\operatorname{sign}(x_i - \bar{x})\right|.

The difference R − S allows for an efficient modality likelihood assessment: most multimodal distributions score below 1.2, with only uniform-like unimodal distributions scoring in the same range (see Figure II-2 for several examples). Multimodal distributions with a large number of modes tend to score similarly to the uniform distribution, near 1.15. While other, more exact modality tests exist (e.g. the Dip Test; Hartigan & Hartigan, 1985), our criterion is overall faster to compute. Furthermore, the final result of the partitioning procedure is unaffected by any accidental selection of unimodal distributions at this step, since the algorithm will not be able to find split points within unimodal projections.

Figure II-2: Attribute multimodality likelihood assessment. A: Location of several univariate distributions in a pseudo kurtosis-skewness space. Horizontal axis: the ratio between the standard deviation and the average absolute deviation, a kurtosis-like measure. Vertical axis: the fraction of objects lying between the mean and the median, a skewness-like measure. The dashed segments mark several lines of equal multimodality likelihood scores. Filled squares represent multimodal distributions, while filled circles represent unimodal distributions. B – F: various bimodal distributions; their density functions are depicted in the corresponding panels. 3M – a trimodal distribution; B2 – Beta(2, 2); N – standard normal distribution; T – a t-distribution; U – uniform distribution; XP – exponential distribution.
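For illustration, the full criterion (R minus the skewness correction S) can be sketched in plain Python. This is a sketch of the formulas above, not the thesis's MATLAB code, and the function name is ours:

```python
import math

def modality_score(x):
    """Pseudo-kurtosis multimodality score R - S.

    R = standard deviation / average absolute deviation (lower for
    symmetric bimodal distributions than for unimodal ones);
    S = fraction of objects between the mean and the median, computed
    in linear time as half the absolute average sign of the centered
    data.  Scores below roughly 1.2 suggest multimodality."""
    m = len(x)
    mean = sum(x) / m
    centered = [xi - mean for xi in x]
    std = math.sqrt(sum(c * c for c in centered) / m)
    aad = sum(abs(c) for c in centered) / m
    r = std / aad
    s = 0.5 * abs(sum(math.copysign(1.0, c) if c != 0 else 0.0
                      for c in centered) / m)
    return r - s
```

As a sanity check, a symmetric two-point sample (half the mass at −1, half at +1) scores 1.0 (well below the 1.2 threshold), while an evenly spaced uniform sample scores about 1.15, matching the values quoted in the text.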
Using this criterion, we split the set of attributes into three subsets: the first contains all dimensions with scores below 1.2 that are involved in the highest correlations; the second contains all dimensions with scores lower than that of the normal distribution (1.25) that were not included in the first one; and the third contains all the remaining dimensions (i.e. with scores > 1.25). The subspace defined by the first set of attributes is considered to be the most favorable: here, the algorithm will look for splits along the largest principal components (eigenvalues > 0.95), and also along the coordinates. If no suitable split is found, the algorithm will search the coordinate projections from the second set, and, if needed, from the third set.

2.3. Sampling-based determination of high variance components

In order to estimate correlations fast and efficiently, we resort to sampling: the algorithm randomly selects (based on object index; for potential data ordering dependencies, see Section V.1) a small group of objects, and calculates the correlations between the attributes within this reduced set. To ensure that the error introduced by sampling is low, we first examined the error values for different data sizes, sampling rates, and correlation degrees, on random datasets with normal or uniform distributions (Figure II-3A, B). The implemented sampling rate is

f(m) = \min\{1, (6000/m)^{2/3}\}

(see Figure II-3C). All objects are employed when the current data has low cardinality (m smaller than approximately 6000 objects); the sampling rate decreases polynomially with rate 2/3 for increasingly larger datasets. This choice guarantees that the median relative error is lower than 5%, while the median absolute error is lower than 0.01, even if all correlations are low (absolute value < 0.2). For the more interesting correlations in the upper range (absolute value > 0.5), the errors fall below these thresholds in more than 95% of the cases.
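The sampling-based correlation estimate can be sketched in plain Python. The rate formula below is our reading of the reconstructed expression (all objects up to roughly 6000, then polynomial decay with exponent 2/3); the function names are ours, not the thesis code:

```python
import math
import random

def sampling_rate(m):
    # Our reading of the sampling-rate formula: keep everything for
    # small data (m <= ~6000), then decay polynomially with rate 2/3.
    return min(1.0, (6000.0 / m) ** (2.0 / 3.0))

def sampled_correlation(x, y, rng=random):
    """Estimate the Pearson correlation between two attributes from a
    random subsample, as in the fast correlation step described above."""
    m = len(x)
    n = max(2, int(round(sampling_rate(m) * m)))
    idx = rng.sample(range(m), n)
    xs = [x[i] for i in idx]
    ys = [y[i] for i in idx]
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)
```

For strongly correlated attributes the subsampled estimate stays close to the full-data value, which is the regime the text reports errors below the 5% threshold for.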
Figure II-3: Correlation estimation errors due to sampling. A, B: Median relative errors (%) for low correlations (absolute value < 0.2; A) and large correlations (absolute value between 0.5 and 0.8; B), for different data sizes (horizontal axis; logarithmic scale) and sampling rates (vertical axis). The error rates are represented as filled contour plots with logarithmic scales. Each map is interpolated between 25 points; the values at each point are summaries for 1000 random tests on uniformly and normally distributed two-dimensional data (with no inner structure). C: Implemented sampling rate, chosen such that the median relative errors are always less than 5%, independent of the actual scale of the correlations.

Since directions of high variance lie in subspaces consisting of (strongly) intercorrelated dimensions, the PCA can be restricted to the attributes involved in the largest correlations. Given that the sample size varies within the same run (clusters obtained at different steps have different sizes), we decided to select a number of attributes dependent on the total dimensionality d rather than on the actual scale of the correlations. In the current implementation, the number of retained dimensions grows sublinearly with d: this corresponds to no information loss for lower-dimensional datasets, which can be processed fast, and is consistent with the general expectation that clusters in high-dimensional data reside in low-dimensional subspaces.

2.4. Smoothed histograms as density estimators

In order to approximate the distribution functions of the objects on the different one-dimensional projections in a fast way, we use ASHs (Scott, 1992, pp. 113-117). These are a smoothed version of histograms, obtained by averaging several histograms with the same bin size and different offsets. Alternatively, ASHs can be constructed as fine-gridded histograms that are then smoothed with a moving-average filter. It is recommended (Scott, 1992, pp.
118-120) to set the narrow bin width such that the range of the data is covered by 50 to 500 bins (proportional to the size of the current data, m), with some additional empty bins on both sides, while the averaging parameter should be at least three. Our algorithm calculates the number of narrow bins as an increasing function of m, bounded above by 500, and uses a triangular moving-average filter of length 7 (corresponding to an averaging parameter of 4). With this choice of parameters, the bin width of the averaged histograms is equal to approximately 70% of the oversmoothed bin width (Scott, 1992, pp. 72-75), ensuring that major features of the density distribution are correctly represented in the ASH. Indeed, for distributions with small skewness coefficients (absolute value < 1), the resultant bin width lies within 30% of the optimal value h* with respect to the mean integrated square error (MISE) criterion (Scott, 1992, pp. 55-57) for small datasets (approximately 1000 normally distributed objects), and within 30% of 2h* for very large datasets (approximately one million objects).

2.5. Local minima scoring

In their initial paper presenting the grid partitioning framework and the OptiGrid method, Hinneburg and Keim (1999) suggested detecting potential splitting points by first finding pairs of density maxima that are above a certain noise level, and then finding the minimal density between them (with the quality of the split inversely proportional to the density at that point). The presence of a noise level parameter makes their procedure difficult to use efficiently, while the actual scoring is too simplistic, since it ignores the geometry of the maxima. Milenova and Campos (2002) improved the criterion by contrasting the bin count at the minimum point with the average of the two maxima via a chi-squared test, and ranking split points by test significance.
However, their system is still prone to errors: since it does not take into account the width of the maxima, it is likely to rank highly those cutting planes that separate distribution-tail noise or outliers from the rest of the data. We further improved the scoring system by employing cumulative densities instead of maximal values (as also suggested by Stuetzle (2003) for limiting the number of noise clusters). Our criterion is very similar to the runt excess mass proposed by Stuetzle and Nugent (2010): the score of each minimum point is defined as the geometric average (rather than the minimum) of the excess masses (Hartigan, 1987) of the two resulting subclusters (i.e. the sums of the bin parts above the trough level; Figure II-4A). Furthermore, to ensure comparability between different clusters, we use histograms based on relative frequency counts. This measure has a maximum value of 0.5 (complete separation between two equal groups), and penalizes asymmetric cases (e.g. Figure II-4B, C), thus eliminating the need for a minimal value for local maxima. Our tests suggest using values of 0.05-0.15 as minimal score thresholds, and hence as stopping criteria (see also Figure II-4B-F). Values larger than 0.1 are more appropriate for data with clusters of similar sizes, or when trying to minimize the effects of noise. If the data is expected to contain a large number of clusters and/or clusters of highly variable sizes, threshold values between 0.05 and 0.1 generally yield better results. Non-meaningful fluctuations in the ASHs score just below 0.05.

Figure II-4: Local minima scoring. A: The goodness-of-fit of a local minimum as a splitting point is calculated as the geometric average of the normalized excess masses of the two resulting subclusters (graphically corresponding to the areas of the associated peaks lying above the trough level).
B – F: Scoring examples on several bimodal distributions with variable degrees of separation and symmetry. Vertical lines mark the best splitting points; the corresponding scores are printed above the graphs.

3. Practical implementation

The proposed clustering algorithm is implemented as a MATLAB function with four input arguments and two outputs. The input arguments are: the dataset to be partitioned (mandatory), the maximum number of clusters, the minimum cluster size, and the minimum split threshold (Section II.2.5). The outputs are the list of cluster indices for each object and, optionally, the decision tree constructed during the partitioning process. Additionally, we have implemented a graphical interface for the algorithm (see Section II.3.1). The interface allows for the specification of additional partitioning parameters (see Section II.3.2), and also includes several visual tools that facilitate the exploration of the provided solutions.

3.1. Graphical interface

The graphical interface (Figure II-5) consists of five modules: two for data and parameter selection, two for visualizing the solution provided by the partitioning algorithm, and one for displaying the message log. Each panel is presented in more detail below. Contextual help is available for every element of the interface in the form of tooltips.

Figure II-5: Graphical interface for the proposed partitioning algorithm. The interface allows for the configuration of different partitioning parameters. In addition, it includes several visualization tools.

The data selection panel, located in the upper left corner of the interface, allows the user to select a dataset for partitioning from the variables extant in the global MATLAB workspace. If the desired dataset has not been loaded yet, the panel provides the option of browsing for data files on the local disk directly from the graphical interface.
The program will attempt to load MATLAB data files (".mat") and comma-separated value files (".csv") directly, and will open the data import wizard for other types of files. The parameter setup panel is located in the center-left region of the interface. It contains a variety of parameters that can influence the outcome of the partitioning process (see Section II.3.2). To accommodate potential conflicts between the various parameters, each of them is assigned a priority level. The priorities can be changed easily via the linked slider bars located under the name of each parameter. If a conflict occurs at any time during the partitioning procedure, the parameter with the highest priority is used. The dendrogram panel, located in the upper right region of the interface, displays a compact, horizontal version of the most recent partitioning tree (an example is included in Figure II-5). For user reference, a copy of the tree is also saved in the main workspace. The internodes are colored black, while the final leaves are colored red and are labeled incrementally in the order in which they were produced. To prevent over-cluttering of the image, the scale of the tree is fixed. If the tree does not fit completely in the figure, the side scrollbars activate, allowing the user to explore the entire image. The matrix panel is located in the lower right part of the interface. It contains a representation of the clusters corresponding to the final leaves (see example in Figure II-5), aimed at highlighting the cluster-subspace associations that occur in high-dimensional settings. Each cluster is represented by a row of colored squares of fixed size (one for each dimension). For each dimension, the corresponding color is proportional to the difference between the cluster mean and the grand mean on that dimension.
Red hues denote positive differences and blue hues negative differences, while green hues represent differences close to zero. Intense blue or red colorings are indicative of strong cluster-dimension associations. The labels of the clusters are the same as in the dendrogram plot. If the matrix is too large, the side scrollbars activate, thus enabling the complete visualization. The message log panel is located in the lower left corner of the interface. It displays all error, warning, and informative messages generated at parameter setup or during the partitioning process. Multiple selection of the printed text, both continuous (using the SHIFT key) and discontinuous (using the CTRL key), is supported. Any selection is copied to the system clipboard and can be pasted into a text editor of choice.

3.2. Partitioning parameters

The interface permits the definition of four different parameters that can control the partitioning process: the split score, the number of clusters, the cluster size, and the partitioning level. Each parameter is presented in more detail below. For every parameter, the user can define both lower and upper bounds. A number of constraints stemming from the interactions between the different parameters apply when defining multiple values (e.g. the product of the minimum number of clusters and the minimum cluster size must not exceed the total number of objects). Apart from these restrictions, any combination of values is possible, as is defining only certain parameters or none at all. The latter choice will result in a complete dendrogram. The expected number of clusters is a simple parameter that may be the easiest to work with for less experienced users. Although usually associated with iterative methods such as k-means or density-based algorithms, it can have a significant influence on the outcome of partitioning methods as well.
If the user has knowledge about the structure of the data to be analyzed, for example from previous test runs, the solution can be greatly improved by specifying a target number of clusters.

The partitioning level is perhaps the most commonly employed stopping criterion for hierarchical methods. Setting the number of successive divisions provides a balanced tree with respect to the number of branches, although it may have negative effects on the size of the final leaves. The interface offers a more flexible approach, allowing the user to specify both a minimal and a maximal value for this parameter. Used in conjunction with restrictions on the size of the clusters, it can result in equilibrated solutions.

The cluster size is a more sophisticated parameter, in that it requires prior information about the data for successful use. Setting a minimum threshold limits the number of very small clusters that may just represent noise in the data. Correspondingly, imposing a maximal value forces the partitioning algorithm to produce a more balanced tree, with the objects more evenly distributed among the final clusters.

The split score, as defined in Section II.2.5, is a parameter measuring the quality of the partitioning of any analyzed cluster. Despite having an unbalancing effect on the cluster tree, defining a minimal and/or maximal value will usually lead to a more accurate solution. Setting a lower bound guarantees that groups with no internal structure will not be subdivided further. Correspondingly, specifying an upper bound guarantees that even the finer structure in the data is revealed.

4. Experimental results

As a first assessment, we tested the proposed method on several simple datasets with problematic configurations (several preliminary results using an early version of the algorithm are reported in Ilies & Wilhelm, 2008).
The results (see examples in Figure II-6) indicated that our algorithm could detect clusters that were linearly separable and non-overlapping. If the former condition does not hold, any interlocked clusters are subdivided into smaller groups, resulting in several clusters that are subsets of the actual clusters (Figure II-6C). The algorithm effectively cuts away pieces from such clusters until the remaining data is linearly separable. If the clusters are overlapping, then misclassifications may occur in the overlap regions, or where the clusters are close enough to each other that the objects in the contact area cannot be assigned unambiguously (Figure II-6A, B). In most cases, following the projection pursuit heuristic proves to be beneficial, and allows for fewer misclassification errors compared to employing only the coordinate projections, as OptiGrid does. Figure II-6A illustrates such a situation; using principal component projections, the error rate is reduced from 6.3% to 1.6%. However, there are situations when the principal components do not contain useful information (Chang, 1983) – e.g. if there are many isotropically distributed clusters (Huber, 1985), or if the clusters are oriented such that the correlations are close to zero (e.g. Figure II-6B). In such cases, the inclusion of coordinate projections provides additional chances to differentiate between the clusters; in the example shown in Figure II-6B, this additional step decreases the misclassification rate from 3.9% to 1.5%. At worst, the proposed algorithm partitions the data only with planes parallel to the coordinate axes, similarly to OptiGrid (Figure II-6B-C).

Figure II-6: Qualitative performance assessment. The clusters recovered by the proposed method are represented in different hues and markers. A: Since our algorithm considers the dominant principal component (diagonal line) as a projection, it is able to recover all clusters.
Overlapping clusters (lower pair) are separated as well as possible; some misclassifications will still occur in the overlap region. By contrast, OptiGrid shows reduced performance, since it analyzes only coordinate projections (cutting planes shown as dashed lines). B: Principal components are not always useful, and any partitioning procedure relying only on PCA to find good projections may produce inaccurate solutions (cutting planes shown as dashed lines). Our method incorporates both coordinate and PCA projections, and is therefore able to recover the clusters correctly. C: An intrinsic limitation of the proposed approach is the subdivision of linearly non-separable clusters.

4.1. Large, high-dimensional synthetic datasets

We further analyzed the performance of our method on synthetic datasets of large cardinality (10000-89000 objects, median = 45000) and dimensionality (10-50 attributes, median = 30.5) (see example in Figure II-7D). The datasets were generated using an early version of the algorithm presented in Chapter III. Objects were distributed into 2 to 20 (median = 12) ellipsoidal clusters of variable size (within 50% of average size), following diagonal multivariate normal distributions. Each cluster was associated with a random subspace of random size; on these selected attributes, the mean and standard deviation were set at random values, while on all other dimensions the values were sampled from the standard normal distribution, with mean zero and unit standard deviation (see e.g. Figure II-7F). In half of the cases, up to 24% (median = 4%) noise objects, sampled from spherical multivariate normal distributions with zero means and unit standard deviations on every attribute, were appended to the dataset.
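The core of this generating procedure can be sketched in a few lines. The following Python code is a simplified stand-in for the MATLAB generator of Chapter III; the function name, the specific parameter values, and the sampling choices shown here are illustrative assumptions.

```python
import random

def make_subspace_cluster(n_objects, n_dims, subspace, mean, sd):
    """Sample one diagonal-Gaussian cluster that is informative only on
    the dimensions listed in `subspace`; on all other dimensions it
    follows the standard normal distribution (a simplified sketch of
    the procedure described in the text)."""
    cluster = []
    for _ in range(n_objects):
        obj = [random.gauss(mean, sd) if j in subspace else random.gauss(0.0, 1.0)
               for j in range(n_dims)]
        cluster.append(obj)
    return cluster

random.seed(0)
n_dims = 30
# One cluster defined on a random subspace of 3 of the 30 attributes.
subspace = set(random.sample(range(n_dims), 3))
data = make_subspace_cluster(500, n_dims, subspace, mean=2.0, sd=0.4)
```

A full dataset would repeat this for each cluster with randomly drawn subspace sizes, means, and standard deviations, and optionally append standard-normal noise objects.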
This generating procedure allowed us to create clusters with various degrees of overlap, by varying the number of associated dimensions (1-5, median = 3) and the distribution parameters on these attributes (cluster mean ranging from 0.4 to 3.8, median = 1.8; standard deviation between 0.3 and 0.6, median = 0.4). To quantify the overlap, we used a purely geometrical measure. For each cluster, we calculated the percentage of objects lying inside the convex hull of the inner 99% of the remaining data, within the associated subspace. Thus, a 0% overlap corresponds to well-spaced, linearly separable clusters, while a 100% overlap indicates a rather continuous distribution of objects. Every generated dataset (total = 540) was used as input to our method (Figure II-7D, E shows a typical example). We set the split threshold at 0.1, and left both the maximal number and minimal size of the clusters undefined, in order to test whether our algorithm can make appropriate stopping decisions. We investigated the accuracy of the provided solutions by examining the contingency of the initial (constructed) clusters and the ones found by our method. We calculated the Adjusted Rand Index (ARI; Hubert & Arabie, 1985), a partition-similarity measure that takes values between 0 (for random partitions) and 1 (for perfect agreement). In addition, we computed the fractions of recovered clusters and of objects classified correctly. An original cluster was said to be recovered as one of the clusters found by the algorithm if their intersection contained at least 75% of the objects from either. The objects belonging to such intersections were considered to be correctly classified.

Figure II-7: Performance on synthetic datasets. A – C: The Adjusted Rand Index (A), the percentage of clusters recovered (B), and the percentage of objects correctly classified (C), as functions of the average cluster overlap.
Error bars represent standard errors of the means (46-53 sets per point). The datasets were generated as explained in Section II.4.1, with 10000-89000 objects, 10-50 dimensions, and 2-20 hyper-ellipsoidal clusters. D – E: Typical example. The left panel (D) shows the constructed clusters, while the right panel (E) shows the clusters uncovered by our method. Corresponding clusters are linked with arrows. Note that the noise objects (bottom group in panel E) were also recovered as an independent cluster. Color map: -4 (black) to 4 (white). F: Two-dimensional projection of the example dataset within the definition subspace of one of the clusters (marked with a star in panel E). Gray squares mark objects from the selected cluster, while black dots correspond to all other objects (cluster overlap = 10%).

Results (see Figure II-7A-C) revealed that clusters satisfying the most basic differentiation criterion – being distinguishable from noise (or other clusters) on at least one dimension – were correctly recovered. If the average degree of overlap was less than 25%, almost all clusters were retrieved (including the group of noise objects, if present), and the ARI was larger than 0.9, while for an overlap of up to 50%, at least 65% of the clusters were found on average, with corresponding ARI values of 0.6-0.9 (Figure II-7A, B). In contrast, at high overlap degrees (more than 85%), very few of the modes in the data were uncovered, as expected given that the original clusters were merged by construction. The fraction of objects correctly classified (Figure II-7C) showed a similar pattern, albeit with somewhat lower absolute values. The misclassifications originated mostly from the overlap regions, where objects were assigned to the different clusters based on their position with respect to the minimal density cuts. A secondary effect was that the clusters found could have different sizes than the initial ones (e.g.
the uppermost cluster in Figure II-7D, E), since intersection areas were redistributed unevenly, or even separated as distinct clusters (e.g. the lowest cluster in Figure II-7E). At low and medium degrees of overlap, the provided solution contained a significant number of small clusters (up to 50% of the total), most likely the result of noise, outliers, or cluster overlap.

4.2. Comparison with common approaches

To assess the performance of our MATLAB-implemented method relative to already available algorithms, we compared it to several other approaches that are either commonly used or considered to be very effective: k-means and k-medians (vectorized MATLAB versions); average linkage (MATLAB implementation); the BIRCH-based TwoStep cluster analysis from SPSS; and the model-based clustering package Mclust (Fraley & Raftery, 1999) from R. We included k-medians in addition to k-means since it should perform better in a high-dimensional context due to its reliance on Manhattan distances (Aggarwal et al., 2001). We tested the selected methods on ten moderately large synthetic datasets consisting of 6000-16500 objects (median = 10400) defined in 10-20 dimensions (median = 18). Objects were grouped into 3-6 clusters (median = 5) with subspace sizes of 1-5 (median = 3) and low degrees of overlap (0-38%, median = 16%). For our method, we set the minimal split threshold at 0.1 and left the other parameters undefined. For k-means, k-medians, and average linkage we specified the true number of clusters as input. For TwoStep and Mclust, we used the default settings, allowing the algorithm to choose the best model with at most nine clusters. Since the datasets are essentially mixtures of partially overlapping multivariate normal distributions with diagonal covariance matrices, we expected the k-centroid methods, and especially the model-based clustering procedure, to perform particularly well.
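Throughout these comparisons, agreement with the true partition is measured with the Adjusted Rand Index introduced in Section II.4.1. It can be computed directly from the contingency table of the two partitions; the following pure-Python sketch follows the Hubert & Arabie (1985) formula (the function and variable names are our own).

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_found):
    """Adjusted Rand Index of two partitions given as label sequences.

    Returns values near 0 for unrelated partitions and 1 for identical
    ones (Hubert & Arabie, 1985)."""
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_found))   # contingency table
    rows = Counter(labels_true)
    cols = Counter(labels_found)
    sum_cells = sum(comb(c, 2) for c in pairs.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    total = comb(n, 2)
    expected = sum_rows * sum_cols / total
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:      # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

Note that the index is invariant to label permutations, so a perfect recovery with relabeled clusters still scores 1.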
Indeed, Mclust was able to recover 98% of the clusters (range = 80-100% over the ten datasets; mean ARI of 0.98, range = 0.9-1), and to classify correctly 96% of the objects (range = 72-100%), albeit in a very long time (80-280 seconds). Our method attained similar results, recovering 100% of the clusters (mean ARI of 0.96, range = 0.88-1), and classifying correctly 98% of the objects (range = 89-100%) very quickly (in at most 0.5 seconds). TwoStep and k-medians also performed well, with 93-95% of the clusters recovered, an average ARI of 0.93-0.94, and 91-93% of the objects classified correctly, also in very short times (averages of 2.2 and 0.7 seconds, respectively). K-means was slightly worse, recovering 87% of the clusters (mean ARI of 0.9) and classifying 85% of the objects correctly, and was roughly as fast as k-medians. Only average linkage performed markedly worse, being able to recover only 39% of the clusters and to classify correctly just 39% of the objects (mean ARI of 0.55) in, on average, 12 seconds. These results indicate that, similarly to state-of-the-art methods like Mclust and TwoStep, our algorithm is able to provide solutions of good quality without prior information on the number of clusters. In addition, it is very fast (with running times several times smaller than those of k-medians or TwoStep), and this speed scales well with increasing numbers of objects or dimensions (see Section II.4.4 below). By contrast, we could not run Mclust on data with more than 5000 objects and 75 dimensions on the computer used in this study.

4.3. High-dimensional real datasets

To complete the assessment of the proposed method, we tested it on several real datasets from the UCI Machine Learning Repository (Asuncion & Newman, 2007).
We searched the repository for multivariate datasets with high dimensionality and numerical-only attributes that were marked as appropriate for classification, and selected the following: the University of Wisconsin Breast Cancer Original Data Set (Mangasarian & Wolberg, 1990) – 683 observations described by nine integer-valued variables, distributed into two classes (444 benign, 239 malignant); the Johns Hopkins University Ionosphere Database (Sigillito et al., 1989) – 351 radar returns described by 34 continuous attributes, distributed into two classes (225 good, 126 bad); the Statlog Landsat Satellite Dataset – 6435 observations with 36 hue values each (range = 0-255), grouped in six classes (e.g. red soil, cotton crop) of sizes 626, 703, 707, 1358, 1508, and 1533; the Statlog Image Segmentation Dataset – 16 color and contrast measures of 2310 regions cropped from outdoor photos, grouped into seven classes (e.g. foliage, cement, sky) of equal size; the SPAM E-mail Database – 4601 e-mails described by 57 continuous attributes (various character and word frequencies), distributed in two classes (1813 spam, 2788 not spam); the Statlog Vehicle Silhouettes Dataset (Siebert, 1987) – 846 observations described by 18 continuous attributes obtained from the processing of standardized photos of cars, grouped into four classes (e.g. bus, van) of sizes 199, 212, 217, 218; and the Multiple Features Digit Dataset – six sets of measurements (with 6-240 features per set, median = 70) of 2000 monochrome images of digits, distributed into ten classes (“0” to “9”) of equal size. Since we had no prior information about the separability of the classes of objects, we opted not to define the maximum number of clusters. Instead, in each case, we set the minimal cluster size threshold at the size of the smallest extant class.
Thus, assuming that the true clusters were not linearly separable, and would therefore likely be subdivided by our algorithm, our parameter choice guaranteed that any fragments smaller than a true cluster would not be partitioned further, as they probably spanned only one cluster. Correspondingly, we relaxed the cluster recovery rule defined in Section II.4.1 to allow for smaller cluster fragments. We associated a found cluster with an original cluster if their intersection contained at least 75% of the objects from the retrieved cluster, and at least 25% of the objects from the true cluster. The performance was rather modest, with a mean ARI of 0.31 (range = 0.14-0.53 over the twelve datasets). On average, at least one fragment was recovered for 52% of the clusters (range = 20-100%), with a mean number of 1.4 fragments per cluster, and 38% of the objects (range = 10-91%) were correctly classified. When allowing for more heterogeneity within the retrieved clusters, by lowering the 75% requirement mentioned above to 50%, the proportion of recovered clusters increased to 76% (range = 50-100%), while the fraction of correctly classified objects increased to 50% (range = 32-91%). These results suggest that the considered datasets contained significantly overlapping, linearly non-separable clusters. Indeed, errors were typically larger when the algorithm had to discriminate between very similar images, e.g. damp gray soil and very damp gray soil, cement patches and paths, or the digits “6” and “9”. For comparison, we also clustered the twelve test datasets using the k-means, TwoStep, and Mclust algorithms. For k-means and TwoStep we specified the correct number of clusters beforehand, while for Mclust we employed the default automatic model selection. The performances were similar to that of our method, albeit more varied across the datasets.
They had slightly lower rates of cluster recovery (averages: 38-43%, range: 0-100%) and correct classification of objects (averages: 32-35%, range: 0-96%), but somewhat larger ARI values (averages: 0.4-0.44, range: 0-0.88) due to the reduced cluster fragmentation.

4.4. Theoretical and empirical algorithm complexity

To calculate the theoretical complexity, let n be the total number of objects, d the number of attributes, s the average cluster subspace size, m the size of a group of objects analyzed at some point, and q the number of true clusters contained in this group. The assessment of the multimodality of the distributions of this group on all dimensions involves the calculation of O(d) sums over objects, and hence has complexity O(md). On average, O(sq) dimensions will be employed in the subsequent search for optimal projections. The estimation of the correlations between the selected attributes, based on a sample of the objects, has complexity at most O(ms²q²). The determination of the principal components for the dimensions involved in the largest O(sq) correlations involves an eigenvalue decomposition, and hence has a complexity of O((sq)³). The subsequent calculation of the projections of the objects on the O(sq) principal directions is of order O(msq). The construction of histograms is of order at most O(m(d + sq)), when all coordinate projections are also analyzed. The analysis of local minima depends quadratically on the number of bins b per histogram, and hence has complexity at most O(b²(d + sq)). Aggregating, the expected analysis time for one cluster is O(m(d + sq) + (sq)³).

Let now k be the final number of clusters found by the algorithm. Then O(k) clusters of progressively smaller size and number of children are analyzed in total. The decrease in size and number of subclusters should be approximately linear with respect to the index. Therefore, on average, analyzing additional clusters will amount to a sub-linear increase in computation time.
Summarizing, the overall expected algorithm complexity is approximately O(n(d + sq)), with s the average cluster subspace size, i.e. close to linear in both the number of objects and the number of dimensions.

To verify the above calculations empirically, we performed three sets of tests (consisting of 160-250 tests each) using randomly generated data with very small overlap between the clusters (generated as explained in Section II.4.1). In each set, we varied one of the following three parameters – number of objects (10000 to one million), number of dimensions (1-500), and number of clusters (2-240) – while keeping the other two restricted. The results (see Figure II-8A-C) confirmed our expectations: the running time was linear with respect to the number of objects (exponent = 0.997 with a 95%-confidence interval of ± 0.045, adjusted R2 = 0.989), slightly superlinear with respect to the number of dimensions (exponent = 1.067 ± 0.023, adjusted R2 = 0.995), and sublinear with respect to the number of clusters found (exponent = 0.774 ± 0.074, adjusted R2 = 0.932). Furthermore, the total time was generally very low, in the order of seconds to tens of seconds on a regular computer (all tests were performed on a computer with a 2 GHz dual-core Intel processor and 2 GB of RAM).

Figure II-8: Timing performance. A – C: Running time, as function of the number of objects (A; 5-6 clusters in 20 dimensions), the number of dimensions (B; 50000-55000 objects in 5-6 clusters), and the number of clusters found (C; 50000-55000 objects in 20 dimensions). Clusters were defined in 1-5 dimensions, with an average degree of overlap of at most 5%. Dashed lines represent the best approximations by power functions (adjusted R2 > 0.93 in all cases). The exponents are 0.997 for the number of objects, 1.067 for the dimensionality, and 0.774 for the number of clusters found.

We conducted an additional set of 100 tests on synthetic data of similar structure as above (24000-81000 objects, 10-90 dimensions, 3-133 clusters found), where we monitored the time spent by the algorithm in the various sections.
On average, approximately 45% of the time was used for constructing histograms, 23% for calculating projections, 19% for the modality tests, 5% for estimating correlations, and 5% for analyzing local minima (with the remaining 3% taken up by overhead). The low value registered for the correlation estimation step is an effect of the attribute selection procedure, and is consistent with the observed low-order dependence on the number of dimensions (Figure II-8B); without prior feature selection, the dependence would have been quadratic.

III. Generation of synthetic datasets with clusters

1. State of the art

The only effective technique for the evaluation of clustering algorithms relies on the analysis of a large number of synthetic datasets (Milligan, 1996). Since the exact cluster structure of computer-generated datasets is known, the performance of any clustering algorithm can be measured efficiently by calculating the agreement between the partition found by the algorithm and the true partition. Moreover, depending on the method employed, it is possible to generate datasets with variable degrees of clustering difficulty, for example by adding measurement error, by appending different amounts of noise, or by varying the degree of separation between the clusters. Correspondingly, clustering algorithms that perform poorly on datasets with well-separated clusters can be considered of low quality, while algorithms that perform well on closely-spaced clusters can be considered of high quality (Qiu & Joe, 2006). Even when using simulated datasets, generalizability remains questionable, since the results of the performance analysis may only be valid for datasets with similar structures (Milligan, 1996). Many of the early approaches for constructing artificial datasets (see Steinley & Henson, 2005, for a review) have relatively simplistic implementations of cluster separation and/or overlap.
The procedure of McIntyre and Blashfield (1980) generates mixtures of multivariate normal distributions with random covariance matrices and uniformly distributed means. The overlap between clusters is varied in a heuristic manner, by scaling the covariance matrices of all clusters with a common factor. Consequently, although the overall degree of overlap increases with the scaling factor, some pairs of clusters may overlap much more simply because their centers lie close to each other. This is also a problem for the method of Overall and Magee (1992), which constructs mixtures of diagonal multivariate normal distributions with normally distributed means using a similar implementation of cluster overlap. The widely used algorithm of Milligan (1985) generates datasets consisting of well-separated clusters sampled from independent truncated normal distributions on each dimension. Measurement error as well as outliers can be added to the datasets, however with unpredictable effects on the intended cluster separation (Atlas & Overall, 1994). Many of the generators of synthetic datasets employed by researchers to validate their clustering algorithms have similar drawbacks. Datasets constructed with the method of Zhang et al. (1997) consist of simple clusters with diagonal multivariate normal distributions and with centers selected uniformly or placed on a regular grid. While clustering difficulty can be increased by adding uniform noise, the effects of the potential overlap between clusters are not taken into account. The algorithm described by Zaït and Messatfa (1997) builds datasets with hyper-rectangular clusters sampled from multivariate uniform distributions, featuring very narrow ranges on selected dimensions. The degree of separation between clusters depends on the number of such special dimensions and on the widths of the clusters on these dimensions, all of which must be specified by the user.
A more refined method for creating datasets containing this type of clusters has been developed by Procopiuc et al. (2002). In the latter approach, clusters are sampled from independent normal or uniform distributions on selected dimensions, and from uniform distributions spanning the entire data space on all other dimensions. Each successive cluster shares some of the special dimensions with the previous cluster, and it is placed such that it overlaps moderately with the previous cluster on their common dimensions. However, since all these decisions are random, it is difficult to quantify the resulting degree of cluster overlap. More recently, significant theoretical work has addressed the production of artificial datasets having deterministic levels of separation or overlap between the clusters. The algorithm of Waller et al. (1999) can produce a variety of multivariate clusters with different volumes and covariance matrices, and with objects sampled from normal-like distributions with different degrees of skewness and kurtosis. The positioning of the clusters and their relative degrees of overlap are calculated using what the authors define as cluster indicator validities: measures related to the total between-clusters variance. The method of Qiu and Joe (2006) constructs mixtures of multivariate normal distributions with arbitrary covariance matrices, centered at the vertices of a multi-dimensional grid of equilateral simplexes. In order to increase the complexity of such datasets, normally distributed noise variables or uniformly distributed noise objects can be appended to the dataset. Furthermore, the entire dataset can be subjected to a random rotation, in order to ensure that clusters are not distinguishable from each other in low-dimensional coordinate projections. To adjust the degree of overlap, the clusters are rescaled such that the geometrical gap between each pair of clusters satisfies the user-prescribed minimal requirements. 
By contrast, the method developed by Steinley and Henson (2005) relies on a more precise definition of the degree of overlap between any two clusters, which requires the analytical estimation of the intersection of the marginal probability distributions of the clusters. This latter approach can produce variable datasets, containing clusters with different covariance matrices and objects sampled from different probability distributions, but it depends heavily on user input. For example, it is necessary to specify the desired degree of overlap or separation between each pair of clusters, with the restriction that no cluster can overlap with more than two other clusters. The procedure of Maitra and Melnykov (2010) uses a similar measure of exact overlap: for any two clusters, their overlap score is calculated as the combined probability of misclassification. This measure offers a very accurate description of the clustering difficulty of a dataset, but it is rather difficult to calculate for generic clusters. Consequently, the authors restrict their approach to mixtures of multivariate normal distributions with random covariance matrices.

2. Proposed method

As the methods outlined above show, there is a trade-off between the capacity to generate datasets containing clusters of different types or shapes, and the ability to specify the degree of separation or overlap between the clusters. In the present work, we propose a novel procedure for the creation of synthetic datasets, which offers an optimal balance between these two aspects. Similarly to the method of Steinley and Henson (2005), our algorithm can construct clusters with different sizes, shapes, and orientations (see Section III.3.3), consisting of objects sampled from different probability distributions (Section III.3.2). As in the method developed by Procopiuc et al. (2002), some or all clusters can be well-defined in restricted subsets of dimensions.
However, unlike previous methods, our proposed algorithm allows some clusters to also have curvilinear shapes (see Section III.2.2) or multimodal distributions (Section III.3.3). Despite this flexibility, which allows users a wide variety of choices in the construction of clusters, the proposed generator has a relatively low number of inputs, which are intuitive and accessible through a user-friendly interface (see Sections III.3.1-III.3.2). The clusters are distributed within the data space using a greedy geometrical procedure (Section III.2.3). The overall degree of cluster overlap is adjusted by scaling the clusters, as in Qiu and Joe (2006). Although heuristic, our approach is highly effective in prescribing the cluster overlap (see Section III.4.2). Furthermore, it can be extended to allow for the production of datasets with non-overlapping clusters with defined degrees of separation.

2.1. Dataset generation algorithm

We propose a bottom-up approach for generating synthetic datasets containing clusters (see Figure III-1), where clusters are broadly defined as groups of objects that are similar to each other, but dissimilar to the other objects in the dataset. In the present work, we consider several specific types of clusters. Unimodal or center-defined clusters have density functions with a unique, global maximum, and can therefore be described in reference to that point. By contrast, multimodal clusters have density functions with multiple maxima, and can be interpreted as collections of overlapping unimodal subclusters centered at the local maxima. Rotated or ellipsoidal clusters have their principal axes rotated with respect to the coordinate system, either orthogonally or obliquely, and therefore have non-diagonal covariance matrices. Nonlinear or curvilinear clusters have one or more principal axes curvilinear or mapped to a non-trivial manifold.
Subspace or projected clusters are well-defined, and hence distinguishable from the rest of the data, in only a subset of dimensions, while on the remaining, non-relevant dimensions they follow noise distributions spanning the entire data space. With the exception of unimodal and multimodal, all types of clusters presented above are non-exclusive. The proposed algorithm starts by randomly choosing the size, modality, dimensionality, shape, and orientation of each cluster from the set of possible values prescribed by the user. In the next step, the algorithm constructs the clusters sequentially according to the selected parameters. For each unimodal cluster, the values taken by all its objects on each of the attributes are sampled from independent probability distributions, and scaled such that the value ranges conform to the desired aspect ratio and volume of the cluster (Section III.3.3). Multimodal clusters are built as sub-datasets consisting of several overlapping unimodal subclusters (Figure III-1). The subclusters are generated similarly to unimodal clusters, and combined with the same procedure as that employed for assembling the entire dataset (see below). Next, the current cluster may be subjected to a nonlinear distortion (see Section III.2.2), followed by a random rotation, either orthogonal or oblique. To generate arbitrary rotations, the algorithm uses a variation of the method developed by Anderson et al. (1987) (see Section III.3.3). After all clusters have been created, the final dataset is assembled by placing the clusters in the data space. Here, we use a custom procedure that allows for the precise control of the degrees of cluster overlap and of the spatial heterogeneity of the dataset (Section III.2.3).

Figure III-1: Schematic representation of the proposed procedure for generating synthetic datasets with clusters.
Rectangular cells represent calculation steps, rounded cells represent optional calculation steps, and hexagonal cells represent conditional nodes. Unimodal clusters (blue cells), and subclusters of multimodal clusters (purple cells), are constructed consecutively through the application of several linear and nonlinear transformations to blocks of data sampled from independent distributions on each dimension. Subclusters are combined to form multimodal clusters, which can then be subjected to further alterations. The dataset is then assembled by combining the generated clusters with a prescribed degree of overlap. Optionally, noise objects or dimensions can be added during dataset construction. In addition, the order of the objects or dimensions can be randomized (yellow cells).

In the next stage, noise objects and/or dimensions sampled from predefined probability distributions with range equal to the size of the data space can be appended to the data in the desired proportions. In addition, if the dataset contains subspace clusters, the coordinates of their objects on the non-relevant dimensions are also generated at this step. In the last stage, the dataset is formatted for output, e.g. by randomly permuting the objects and/or dimensions, or by normalizing the values recorded on each attribute. With this procedure, it is possible to produce datasets of varied degrees of clustering difficulty, containing clusters with highly variable characteristics (see Section III.4).

2.2. Construction of nonlinear clusters

Clusters with nonlinear principal axes can be obtained from generic, axis-parallel clusters through the application of various space-distorting transformations, such as functions which send lines to curves or planes to convex surfaces. Such transformations involve the mapping of subspaces to manifolds of equal dimensionality embedded in higher-dimensional subspaces.
The coordinate system of the transformed subspace is mapped to the locally defined set of directions tangent to the manifold. Correspondingly, the dimensions from the embedding subspace that do not belong to the transformed subspace are mapped to the set of directions normal to the manifold. Dimensions outside the embedding subspace, if any, remain unchanged. In the present work, we consider transformations involving up to three dimensions, and for which the target manifolds have fixed curvatures. Under these conditions, three different types of distortions are possible: mapping planes to cylindrical surfaces, by transforming lines into either circles or helices; mapping planes to convex surfaces; and mapping planes to concave surfaces. In the following, each of these cases is described in more detail.

Let C be a cluster considered for nonlinear deformation. Denote by m the number of objects contained, and by s the dimensionality of the cluster. Assume the cluster is represented in numerical matrix form, with objects on rows and attributes on columns. Then C(i, ·) are the values that the i-th object takes on all attributes, while C(·, j) is the list of the values taken by all objects on the j-th attribute. Let R_j denote the range of the cluster C on dimension j, i.e. R_j = max(C(·, j)) − min(C(·, j)). Without loss of generality, we assume the cluster is centered on each dimension: |min(C(·, j))| = |max(C(·, j))| = R_j / 2 for all j in {1, …, s}.

The simplest nonlinear transformation we consider is the deformation of parallel lines into concentric circles. Let j and k in {1, …, s} be two randomly selected dimensions, and assume we intend to curve the x_j axis in the (x_j, x_k) plane. Let φ be the desired angle that the cluster will subtend after distortion. To ensure that C does not intersect with itself after transformation, the angle must not exceed 2π. The corresponding curvature radius of the principal x_j axis of C is r = max(R_j / φ, R_k / 2), where the latter term is included to ensure injectivity of the nonlinear transform over the convex hull of C.
To make larger angles possible, the involved dimensions should be selected such that R_j ≥ R_k. The desired distortion can be achieved by mapping the objects in C to circles centered at (0, −r) in the (x_j, x_k) plane, with radii equal to their distance from this reference point on the x_k axis, and angles proportional to the ratio between their x_j values and the curvature radius r:

f(x) = (x_1, …, (x_k + r) sin(x_j / r), …, (x_k + r) cos(x_j / r) − r, …, x_s),

where the two modified coordinates are the j-th and the k-th, respectively. According to this transformation, lines parallel to the x_j axis in the original setting are mapped to circles centered at (0, −r), while lines parallel to the x_k axis are mapped to radial spokes passing through this reference point. Furthermore, the mapping is injective over the region {(x_j, x_k) : |x_j| < πr, x_k > −r}, and hence over the convex hull of C. The projection of the transformed cluster on the (x_j, x_k) plane will be an annular or, more rarely, a circular sector. Correspondingly, three-dimensional projections including the x_j and x_k axes will show C as a cylindrical fragment (see Figure III-2B, C). Since lines with constant x_k values that lie closer to (0, −r) are mapped to circles of smaller radii, and hence smaller lengths, than lines further away, different areas are compressed or enlarged depending on their position relative to the reference point. Therefore, the object densities are changed, and no longer reflect the original probability distributions of C in the new coordinate system. For example, if C is uniformly distributed on both the x_j and x_k directions, the transformed cluster will exhibit a uniform angular distribution of objects, but its density will decrease with the radius. To maintain the original distributions, the area changes must be compensated by modifications of opposite magnitude in object density. Here, we eliminate a fraction of the objects lying in the compressed regions, proportional to the observed changes in the areas of surface elements, i.e. proportional to the Jacobian determinant of the distortion map.
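The circular distortion can be sketched as follows. This is an illustrative Python reconstruction (the thesis package itself is MATLAB code); the function name `bend_axis` is ours, and the radius rule r = max(R_j/φ, R_k/2) follows the reconstruction above for a centered cluster:

```python
import math

def bend_axis(points, j, k, phi):
    """Bend axis j of a centered cluster into a circular arc in the (j, k)
    plane: each object is mapped to a circle around (0, -r), with angle
    proportional to its x_j value.  Sketch, not the thesis's exact code."""
    Rj = 2 * max(abs(p[j]) for p in points)  # range on dimension j (centered)
    Rk = 2 * max(abs(p[k]) for p in points)  # range on dimension k
    r = max(Rj / phi, Rk / 2)                # second term keeps the map injective
    out = []
    for p in points:
        q = list(p)
        xj, xk = p[j], p[k]
        q[j] = (xk + r) * math.sin(xj / r)
        q[k] = (xk + r) * math.cos(xj / r) - r
        out.append(q)
    return out

# Bend a small centered cluster by a half-turn (phi = pi):
pts = [[-1.0, 0.0], [1.0, 0.0], [0.0, 0.25], [0.0, -0.25]]
bent = bend_axis(pts, 0, 1, math.pi)
```

With φ = π the endpoints of the principal axis land at angles ±π/2, i.e. the cluster wraps around a half-circle of radius 2/π.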
Let dA(x_j, x_k) be the surface element near the point (x_j, x_k) from the original coordinate system, and dA′(ρ, θ) be the surface element near the image point lying at radius ρ = x_k + r and with angle θ = x_j / r in the new polar system. Then the relative area change, J(x_j, x_k), is:

J(x_j, x_k) = dA′(ρ, θ) / dA(x_j, x_k) = |∂(f_j, f_k) / ∂(x_j, x_k)| = (x_k + r) / r.

Let us set the probability of objects being retained after the application of the nonlinear deformation to the following:

p(x) = J(x_j, x_k) / max J = (x_k + r) / (R_k / 2 + r), x in C.

With this adjustment, the down-sampled, transformed cluster will correctly display the probability distributions of C in the new coordinates.

A variation of the curvilinear transformation outlined above is the mapping of lines to three-dimensional curves. We consider here the special case of helices, which have constant curvature and constant pitch. The corresponding nonlinear deformation is a composition of the circular distortion described above and a shearing function. Let j, k, and l in {1, …, s} be three randomly selected attributes, and assume we intend to curve the x_j axis in the (x_j, x_k) plane, while shearing with respect to x_l. Let φ in (0, 2π] be the desired angle that cluster C will subtend after distortion in projection. As above, the corresponding curvature radius of the principal x_j axis of C is r = max(R_j / φ, R_k / 2). Furthermore, let h be the pitch of the helix, i.e. the displacement along x_l per unit angle; the lower limits for r and h ensure that the mapping will be injective over the convex hull of C. The desired nonlinear function is then:

f(x) = (x_1, …, (x_k + r) sin(x_j / r), …, (x_k + r) cos(x_j / r) − r, …, x_l + h x_j / r, …, x_s).

Depending on the relative sizes of r, h, and the cluster ranges, the transformed cluster will have the appearance of a true helix, a spiral ramp, or a sheared cylinder fragment (e.g. Figure III-2D).
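The Jacobian-based down-sampling can be illustrated with a short sketch (Python, with our own helper names; it assumes the centered-cluster notation used above, where the area change of the circular bend is J = (r + x_k)/r):

```python
import random

def retain_probability(xk, r, Rk):
    """Retention probability for the circular bend: the local area change is
    J = (r + x_k) / r; dividing by its maximum over the centered cluster,
    (r + Rk/2) / r, normalizes it into [0, 1]."""
    return (r + xk) / (r + Rk / 2.0)

def downsample(points, k, r, Rk, seed=0):
    """Keep each object with probability proportional to the local area
    change, so the bent cluster reproduces the original distribution."""
    rng = random.Random(seed)
    return [p for p in points if rng.random() < retain_probability(p[k], r, Rk)]
```

Objects on the outermost circle (x_k = R_k/2) are always kept; objects closer to the bend's center are discarded with correspondingly higher probability.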
As before, the correct transfer of the original probability distribution of C to the new cylindrical coordinate system can be achieved by weighted down-sampling; since the shear component preserves volume, the retention probability is the same as for the circular distortion:

p(x) = (x_k + r) / (R_k / 2 + r), x in C.

The second class of nonlinear transformations that can be performed in a three-dimensional space consists of functions that send two-dimensional planes to convex surfaces. Such mappings can be constructed from two successive cylindrical distortions. Let j, k, and l in {1, …, s} be three randomly selected dimensions, and assume we intend to curve the x_j and x_k axes with respect to the x_l axis. Without loss of generality, assume that the distortion of the x_j axis is performed first. Let φ_j and φ_k be the desired curvature angles. To prevent the cluster from intersecting with itself after deformation, neither angle may exceed 2π. Additionally, in order to ensure that the distorted cluster will not have concave regions, the radius of the second bend must account for the extent of the already-curved cluster. The corresponding curvature radii, r_j and r_k, for the principal x_j and x_k axes of C are chosen accordingly; the minimal values dependent on φ_j are introduced to make sure that the reference points of the two cylindrical distortions, (0, −r_j) and (0, −r_k), are located outside the convex hull of the cluster. Combining the two nonlinear mappings yields the following composite (writing z for the intermediate x_l coordinate):

y_j = (x_l + r_j) sin(x_j / r_j),
z = (x_l + r_j) cos(x_j / r_j) − r_j,
y_k = (z + r_k) sin(x_k / r_k),
y_l = (z + r_k) cos(x_k / r_k) − r_k,

with all other coordinates unchanged. The deformation function presented above compresses the regions of cluster C lying at negative x_l values and large absolute x_j values, and stretches the regions with positive x_l coordinates and x_j values close to zero (see Figure III-2G, H). In order to map the probability distribution of C correctly, a proportion of the objects belonging to the compressed regions must be discarded. This can be achieved by setting the probability of object retention proportional to the relative change of the local volume.
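The two-bend composition can be sketched as follows. This is an illustrative reconstruction with our own helper names (`bend`, `convex_map`), not the thesis's MATLAB code, and the radii are assumed to be supplied by the caller:

```python
import math

def bend(xj, xk, r):
    """One cylindrical bend: curve coordinate xj towards xk with radius r."""
    return (xk + r) * math.sin(xj / r), (xk + r) * math.cos(xj / r) - r

def convex_map(x1, x2, x3, r1, r2):
    """Map a point to a convex surface by two successive bends: first x1
    towards x3, then x2 towards the updated x3.  Sketch of the composition
    described in the text; r1 and r2 are assumed given."""
    y1, z = bend(x1, x3, r1)   # bend axis 1 towards axis 3
    y2, z = bend(x2, z, r2)    # bend axis 2 towards the new axis 3
    return y1, y2, z
```

Performing the two bends in the opposite order gives a slightly different surface, as noted in the text (compare Figure III-2G and H).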
Given that the two deformations are applied sequentially, the ratio J(x) between the final volume element dV′ and the original volume element dV(x_j, x_k, x_l) can be calculated as the product of the two successive alterations in the areas of surface elements:

J(x) = J_1 · J_2 = [(x_l + r_j) / r_j] · [(z + r_k) / r_k],

where z = (x_l + r_j) cos(x_j / r_j) − r_j is the intermediate x_l coordinate. For fixed x_j, the ratio J(x) is increasing in x_l, and is therefore largest at x_l = R_l / 2; with x_l fixed at this value, J(x) is maximal where cos(x_j / r_j) is maximal, i.e. at x_j = 0. Thus, the probability of objects not being discarded can be calculated as:

p(x) = J(x) / max J, x in C.

The final convex shape of the transformed cluster depends on the order in which the two deformations are performed. The differences are particularly noticeable at large distortion angles and similar curvature radii (see Figure III-2G, H). As an alternative, we developed a method for mapping planes to a spherical surface in a more symmetrical manner. Let r = max(max(R_j, R_k) / φ, R_l / 2) be the curvature radius of the principal x_j and x_k axes of the cluster C. The following function performs both distortions at the same time, mapping (x_j, x_k) planes in C directly onto spheres centered at the point (0, 0, −r). With θ = √(x_j² + x_k²) / r:

y_j = (x_l + r) sin(θ) · x_j / √(x_j² + x_k²),
y_k = (x_l + r) sin(θ) · x_k / √(x_j² + x_k²),
y_l = (x_l + r) cos(θ) − r,

with all other coordinates unchanged. Under this transformation, lines passing through the origin in the (x_j, x_k) plane, as well as circles centered at the origin, are mapped to circles lying on spheres centered at (0, 0, −r). Correspondingly, lines parallel to the x_l axis are mapped to radial spokes passing through this reference point. The distortion function compresses the regions of cluster C lying at negative x_l values and large absolute x_j and x_k values, and stretches the regions with positive x_l coordinates and x_j and x_k values close to zero (see Figure III-2I).
To maintain the original probability distributions in the new coordinate system, it is necessary to eliminate some of the objects, proportional to the observed variations in local volume elements. Let dV(x_j, x_k, x_l) be the volume element near the point (x_j, x_k, x_l) from the original coordinate system, and dV′(ρ, θ, ψ) be the volume element near the image point lying at radius ρ = x_l + r, inclination θ = √(x_j² + x_k²) / r, and azimuth angle ψ in the new spherical coordinate system. Then the relative volume change, J(x), is:

J(x) = dV′(ρ, θ, ψ) / dV(x_j, x_k, x_l) = ((x_l + r) / r)² · sin(θ) / θ.

In this case, J(x) is a product of two independent terms: a quadratic function of x_l, and a multiple of the function sin(θ) / θ, where θ is defined as above. The former function is monotonically increasing over the range of x_l, while the latter has a global maximum at θ = 0. The corresponding probability of object retention is:

p(x) = J(x) / max J = ((x_l + r) / (R_l / 2 + r))² · sin(θ) / θ, x in C.

In addition, to ensure that the cluster does not intersect with itself at large distortion angles, all objects outside the cylindroid {x : √(x_j² + x_k²) ≤ πr} are discarded. With these adjustments, the down-sampled, transformed cluster correctly exhibits the original probability distributions of C in the new spherical coordinate system.

The third class of three-dimensional nonlinear transformations considered here consists of functions that map planes to concave surfaces with constant curvatures.

Figure III-2: Effects of different nonlinear transformations on a sample three-dimensional cluster. Wire frames represent the example cluster, while black lines denote its principal axes. Several types of transformation can be applied, including cylindrical, concave, and convex. A: Original, undistorted cluster. B, C: Examples of cylindrical deformations, with one axis curved towards one of the other axes.
D: Example of helicoidal deformation, with one axis curved towards one of the other axes, and the third axis sheared with respect to the curved one. E: Example of saddle deformation, with two axes curved successively towards opposite directions of the third axis. F: Example of concave deformation, with one axis curved towards one of the other axes, and the third axis curved towards the previously curved one. G, H: Examples of convex deformations, with two axes curved successively towards the same direction of the third axis. Note the visible effect of performing the distortions of the two axes in reverse order (compare panels H and G). I: Example of spherical deformation, with two axes curved simultaneously towards the same direction of the third axis.

Similarly to the mappings of planes to convex surfaces discussed above, deformations of planes into concave surfaces can also be constructed from two successive cylindrical distortions. Let j, k, and l in {1, …, s} be three randomly selected dimensions, and assume we intend to curve the x_j and x_k axes with respect to the x_l axis, but in opposite directions. Let φ_j and φ_k be the desired angles that the cluster will subtend after deformation, with corresponding curvature radii r_j and r_k for the principal x_j and x_k axes, where lower limits are enforced in order to ensure that the distortion map is injective over C. Without loss of generality, assume that the distortion of the x_j dimension with respect to x_l is performed first:

y_j = (x_l + r_j) sin(x_j / r_j),
z = (x_l + r_j) cos(x_j / r_j) − r_j.

The second cylindrical deformation sends lines which are parallel to the x_k axis to circles centered at the point (0, r_k) in the (x_k, x_l) projection, curving in the direction opposite to the first bend:

y_k = (r_k − z) sin(x_k / r_k),
y_l = r_k − (r_k − z) cos(x_k / r_k),

with all other coordinates unchanged. Following this deformation, the projection of the cluster on the curved directions is a saddle-like shape with thickness equal to the cluster's range on x_l (e.g. Figure III-2E, Figure III-8E).
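The opposite-direction composition can be sketched as follows; as with the convex case, this is an illustrative reconstruction with our own helper names, and the sign flips on the second bend produce the opposite curvature:

```python
import math

def bend(xj, xk, r):
    """One cylindrical bend: curve coordinate xj towards xk with radius r."""
    return (xk + r) * math.sin(xj / r), (xk + r) * math.cos(xj / r) - r

def saddle_map(x1, x2, x3, r1, r2):
    """Sketch of a concave (saddle) deformation: bend x1 towards x3, then
    bend x2 towards the *opposite* direction of the new x3.  The sign
    flips around the second bend reverse its curvature."""
    y1, z = bend(x1, x3, r1)     # first bend, as in the convex case
    y2, z2 = bend(x2, -z, r2)    # flip x3, bend, ...
    return y1, y2, -z2           # ...and flip back: opposite curvature
```

Whereas the convex composition curves both axes towards the same side of the third axis, here the second axis curves away from it, yielding the saddle of Figure III-2E.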
However, since different regions of C are distorted to different degrees, the original probability distributions are distorted as well. As in the case of convex deformations, it is possible to calculate the point-by-point change of the volume element, J(x), as the product of the two changes in the areas of surface elements induced by the two successive cylindrical transformations:

J(x) = J_1 · J_2 = [(x_l + r_j) / r_j] · [(r_k − z) / r_k],

where z = (x_l + r_j) cos(x_j / r_j) − r_j is the intermediate x_l coordinate. As in the case of mappings to convex surfaces discussed above, J(x) does not depend on x_k. For fixed x_l, J(x) is largest where z is smallest, i.e. at large |x_j|; for fixed x_j, it is a quadratic function of x_l whose maximum over the range [−R_l / 2, R_l / 2] is attained either at an endpoint or at the critical point, depending on the relative sizes of r_j, r_k, and R_l. The corresponding probabilities of objects being retained after distortion are again:

p(x) = J(x) / max J, x in C.

In the nonlinear transformation presented above, both elementary deformations are performed about the same axis. However, that is not compulsory in order to construct concave surfaces. Let j, k, and l in {1, …, s} be three randomly selected dimensions, and assume we first deform the x_j axis with respect to the x_l axis. Next, instead of curving the x_k axis towards the opposite direction, we distort x_l with respect to x_k. Let φ_j and φ_k be the desired curvature angles, with corresponding curvature radii r_j and r_k for the principal x_j and x_l axes, respectively.
Similarly to previous transformations, the enforced lower bounds ensure that the distortion map is injective over the convex hull of C. The first distortion maps lines parallel to the x_j axis to circles centered at (0, −r_j) in the (x_j, x_l) projection, while the second sends lines parallel to the x_l axis to circles centered at (0, −r_k) in the (x_l, x_k) projection. Combining the two nonlinear mappings yields the following composite (writing z for the intermediate x_l coordinate):

y_j = (x_l + r_j) sin(x_j / r_j),
z = (x_l + r_j) cos(x_j / r_j) − r_j,
y_l = (x_k + r_k) sin(z / r_k),
y_k = (x_k + r_k) cos(z / r_k) − r_k,

with all other coordinates unchanged. As before, the original probability distributions of C can be preserved by down-sampling proportional to the observed changes in the size of volume elements, J(x):

J(x) = J_1 · J_2 = [(x_l + r_j) / r_j] · [(x_k + r_k) / r_k].

J(x) is increasing in both x_l and x_k, so its maximum over the cluster is attained at x_l = R_l / 2 and x_k = R_k / 2. The corresponding object retention probability is:

p(x) = J(x) / max J, x in C.

The transformed cluster has the appearance of a toroidal segment (Figure III-2F), but of a different type than if it had been deformed with the previous method (Figure III-2E).

2.3. Construction of the dataset

The process of synthetic data generation serves the primary purpose of providing testing material for the evaluation of clustering and classification algorithms: sample datasets with variable size, structure, and clustering complexity. Consequently, a fundamental concern is the ability to control in a reliable manner the clustering difficulty of the simulated data, usually defined as a measure of the overlap between the different clusters. Existing approaches (Section III.1) range from very simple solutions, such as setting the distances between cluster centers depending on the cluster variances (e.g.
Procopiuc et al., 2002), to probabilistic methods that iteratively scale the clusters until the degree of overlap between each pair of clusters falls in a prescribed interval (Maitra & Melnykov, 2010). Here, we rely on a greedy algorithm that tries to minimize the separation between clusters while preventing their overlap. For each cluster c already appended to the dataset, the algorithm stores its minimal and maximal values on each of the d dimensions, min(c(·, j)) and max(c(·, j)), as well as the position of its midpoint within the data space, (min(c(·, j)) + max(c(·, j))) / 2 for j in {1, …, d}. In addition, for each cluster and oriented dimension +j or −j, the algorithm stores the size, location, and contact face of the hyper-rectangular cell of maximal volume which extends from the cluster in that direction and does not intersect any of the other already placed clusters (e.g. Figure III-3A–F). For each new cluster c, the algorithm examines all currently stored cells, checking whether c would fit in the available space. If so, it then calculates the increase in the volume of the data space, ΔV(c), and the corresponding deviation from the desired aspect ratio of the data space, ΔA(c), that would result from placing c within this cell, adjacent to the contact face. Consequently, the algorithm inserts the cluster in the cell which minimizes the objective function

g(c) = (1 − w_k) ΔV(c) + w_k ΔA(c),

where the weight w_k in [0, 1] is proportional to the index k of the cluster. Thus, for earlier clusters the emphasis is on limiting the increase in volume of the data space, while for later clusters the emphasis shifts to meeting the targeted aspect ratio of the data space. If c is the first cluster, then its midpoint is positioned randomly within the unit cube. After the placement of c, any cell that is associated with the previous clusters and that intersects c is downsized correspondingly.
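A much-simplified sketch of the greedy placement step follows (Python, with our own helper names). It attaches each new cluster's bounding box flush to a face of an existing box and keeps the candidate minimizing the growth of the overall data-space volume; the aspect-ratio term and the free-cell bookkeeping of the full algorithm are omitted:

```python
def overlaps(a, b):
    """True if the axis-aligned boxes a = (lo, hi) and b = (lo, hi) intersect."""
    return all(a[0][j] < b[1][j] and b[0][j] < a[1][j] for j in range(len(a[0])))

def bbox_volume(boxes):
    """Volume of the bounding box enclosing all cluster boxes."""
    d = len(boxes[0][0])
    v = 1.0
    for j in range(d):
        v *= max(b[1][j] for b in boxes) - min(b[0][j] for b in boxes)
    return v

def place_next(boxes, size):
    """Greedy placement sketch: try attaching the new box flush to each face
    of each existing box; keep the non-overlapping candidate that least
    increases the overall data-space volume."""
    d = len(size)
    best, best_vol = None, float("inf")
    for lo, hi in boxes:
        for j in range(d):
            for new_lo_j in (hi[j], lo[j] - size[j]):  # positive / negative face
                new_lo = list(lo)
                new_lo[j] = new_lo_j
                cand = (new_lo, [new_lo[t] + size[t] for t in range(d)])
                if any(overlaps(cand, b) for b in boxes):
                    continue
                vol = bbox_volume(boxes + [cand])
                if vol < best_vol:
                    best, best_vol = cand, vol
    return best
```

For example, placing a unit box next to an existing unit box at the origin yields an adjacent, non-overlapping position that keeps the data space as compact as possible.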
In addition, a new set of cells, extending from the newly placed cluster to infinity, is created. The intersections of these new cells with the previously placed clusters are similarly trimmed. The process outlined above is repeated until all clusters are positioned in the data space. The resulting dataset generally exhibits compactly distributed clusters (e.g. Figure III-3G). The spatial variability of the dataset can be increased through the random interspersion of a number of empty spacers with sizes similar to those of the actual clusters. Thus, some of the spaces reserved for clusters will not be used, and the dataset will have a more heterogeneous appearance (e.g. Figure III-3H). Depending on their shapes, orientations, and probability distributions, some clusters may have common borders, but they will never overlap. To produce datasets of increased clustering complexity, some or all of the clusters can be expanded such that they exceed their assigned space and thus overlap with their immediate neighbors. Alternatively, the same effect can be achieved by scaling down the information about the ranges of the clusters passed to the cluster placing algorithm. The larger the scaling factors, the more difficult it is to distinguish between neighboring clusters, and conversely. Furthermore, different clusters can be assigned different scaling factors, both supra- and sub-unitary, thus yielding heterogeneous datasets with large spatial variability in cluster overlap. Following the observations of Steinley and Henson (2005) on the asymptotic relationship between marginal and joint overlap, we use maximal scaling factors proportional to the number of dimensions. Thus, datasets of different dimensionalities that are constructed with the same overlap settings have similar degrees of full-dimensional cluster overlap. If the dataset contains subspace clusters which are only defined on some of the dimensions, the algorithm described above cannot be used directly.
The main impediment is that, because such clusters follow noise distributions that span the entire data space on the non-relevant dimensions, it is not possible to know all their value ranges beforehand. Moreover, subspace clusters can intersect with each other, e.g. if they have no relevant dimension in common. Therefore, the cluster placement algorithm is modified such that the subspace of each new cluster is treated as a separate setting. Instead of storing a global list of unoccupied, hyper-rectangular cells which is updated after each cluster addition, the modified algorithm constructs a different list of cells before placing each new cluster c. Previously placed clusters that do not have relevant dimensions in common with c are ignored. For all other clusters, and for all relevant dimensions j that each of these clusters shares with c, the modified algorithm defines cells extending from the cluster in both the positive and the negative directions of j within the subspace of c (see Figure III-4A–F).

Figure III-3: Algorithm for placing full-dimensional clusters in the data space. A–F: The placement of the first six clusters in a sample two-dimensional dataset. Clusters are represented by dashed ellipses. New clusters (orange) are placed adjacent to previously existing ones (purple). Continuous lines represent the contact regions of spaces available for the addition of further clusters. The center of the first cluster placed in the data space is marked with a black cross. G: The final configuration of the sample dataset, after 15 iterations. Note the compact placement of the clusters within the data space. H: The corresponding dataset, superimposed on the configuration illustrated in panel G (light blue). Objects are represented by crosses. Note that several empty clusters were randomly inserted in the list passed to the placing algorithm.
The spaces assigned to these clusters remain unoccupied, increasing the spatial variability of the dataset. Also, note that clusters do not overlap unless they are expanded beyond their assigned spaces.

Figure III-4: Algorithm for placing subspace clusters in the data space. A–F: The placement of the first six clusters in a sample two-dimensional dataset. Clusters are represented by dashed ellipses. New clusters (orange) are added adjacent to already existing ones (purple). The center of the first cluster is marked with a black cross. The contact regions of spaces that are available for the placement of new clusters are indicated by continuous lines. Note that new subspace clusters can only be added along their relevant dimensions, because they grow with the dataset in any other direction. G: The final configuration of the sample dataset, after 15 iterations. H: The corresponding dataset, superimposed on the configuration illustrated in panel G (light blue). Objects in full-dimensional clusters are indicated by crosses. Objects in clusters which are defined only on the horizontal axis are represented by circles. Squares denote objects in clusters defined only on the vertical axis. As in the example shown in Figure III-3, spaces assigned to empty clusters remain unoccupied, thus increasing the spatial variability of the dataset.

The intersections of these cells with the projections of the relevant clusters within the considered subspace are subsequently removed. For this step, the minimal and maximal values on the irrelevant dimensions of the selected clusters are set to −∞ and +∞, respectively, in order to ensure that the new cluster will not overlap with any of them within its subspace. Consequently, the algorithm examines each of the cells, measuring its suitability for hosting the new cluster, and then places it in the optimal location.
If no other previously placed cluster was defined on any of the dimensions relevant for the new cluster, then its midpoint is positioned randomly within the unit cube. This procedure is repeated for every cluster added to the dataset. In general, the resulting dataset consists of compactly placed clusters (Figure III-4G). Indeed, if all clusters were full-dimensional, the cluster placement produced by the modified algorithm would be very similar to that produced by the basic, unaltered version, although the process would be slower. The only noticeable differences are the presence of clusters that span the entire data range on some dimensions, and the fact that such subspace clusters can intersect each other if they are defined in different subspaces (e.g. Figure III-4G). The methods described above for adjusting the spatial heterogeneity and clustering difficulty of the dataset can be employed with no further modifications. Interspersing empty cluster spacers can create less uniform-looking datasets (e.g. Figure III-4H), while expanding some or all of the clusters results in datasets with various degrees of cluster overlap.

3. Practical implementation

The developed methodology is implemented in MATLAB 7 (R2010a; The MathWorks, Inc., Natick, MA), in the form of a collection of “.m” files. In addition to the main function, which carries out the complete dataset generating procedure outlined in Figure III-1, the package includes several stand-alone functions which perform the different steps of the algorithm. The basic functions are: creating unimodal clusters or subclusters (see Section III.3.3), anisotropic scaling (Section III.3.3), applying nonlinear transformations (Section III.2.2), applying orthogonal or oblique rotations (Section III.3.3), and placing full-dimensional or subspace clusters in the data space (Section III.2.3). In addition, we have implemented a graphical interface for the dataset generating function (see Figure III-5).
The interface facilitates the configuration of a variety of parameters for the construction of clusters and datasets, from common inputs such as the number of clusters, to more advanced, optional settings such as the degree of distortion of nonlinear clusters (Section III.3.1). For numerical parameters, the interface allows for the setting of both minimal and maximal limits, as well as choosing how values are selected within that interval (see Section III.3.2). Related parameters, such as the dataset dimensionality, the proportion of noise dimensions, and the cluster subspace size, are dynamically linked in order to prevent the occurrence of incompatibilities. Basic information about each of the parameters is displayed in tooltips, and more details are available through a companion help tool (Figure III-5). Furthermore, the interface allows for the saving and loading of parameter configurations to and from disk, as well as the batch production of synthetic datasets.

3.1. Configurable program parameters

The set of configurable parameters is distributed within the graphical interface over four panels with different background colors, based on their complexity and importance (Figure III-5). The main panel, which is located in the upper left part of the interface, includes several necessary parameters pertaining to the construction of the clusters and the dataset. This panel allows users to define the numbers of objects, dimensions, and clusters in the dataset, as well as the degrees of cluster overlap and of spatial variability in the placement of clusters within the data space. The degree of cluster overlap controls the scaling of clusters with respect to the spaces assigned by the cluster placing algorithms (Section III.2.3). It is measured on an ordinal scale, and can take values from –50 for well-separated clusters, to 0 for adjacent but non-overlapping clusters, and up to 50 for highly overlapping clusters.
The corresponding scaling factors applied to the clusters increase with this setting and depend on the average cluster dimensionality d, following the dimensional scaling discussed in Section III.2.3. The degree of spatial variability influences the number of cluster spacers used by the cluster placing algorithms to produce less uniform-looking datasets. It can range from 0 for compactly-placed clusters to 100 for using a number of spacers equal to the number of clusters in the dataset. The main panel also permits the selection of the probability distributions employed in the construction of clusters (see Section III.3.2), as well as in the determination of the relative sizes, scales, and aspect ratios of the clusters (Section III.3.3). In addition, the main panel includes measures of the overall scale and aspect ratio of the dataset. These parameters are sufficient for the generation of highly variable synthetic datasets containing full-dimensional clusters with principal axes parallel to the coordinate axes. The second and third panels contain advanced settings that allow for the presence of more specific types of clusters and of noise. All parameters present in these panels are optional. They are deactivated by default, and appear grayed out in the interface. They can be switched on and off via the checkboxes located near their names in the interface (see Figure III-5). The second panel, located in the upper right part of the interface, includes two groups of parameters. The inputs in the first group allow users to specify the proportions of noise objects and dimensions, as well as to select the probability distribution of the noise. The second group of parameters in this panel pertains to cluster rotations. Users can define the proportions of rotated and obliquely rotated clusters, expressed as percentages, as well as the corresponding degrees of rotation and of oblique distortion, measured on ordinal scales from 0 (no rotation/distortion) to 100 (maximal rotation/distortion).
The degree of rotation is proportional to the number of elementary two-dimensional rotations applied to each cluster, and to the average angle of these rotations. Correspondingly, the degree of obliqueness is proportional to the number of distorted elementary rotations, and to the average angular deviation from orthogonality of these rotations (see Section III.3.3).

Generation of synthetic datasets

Figure III-5: Graphical interface for the proposed synthetic dataset generator. The interface allows users to configure a variety of settings for the construction of clusters and datasets. Based on their complexity and relevance, the parameters are grouped in several panels: parameters concerning the structure of the dataset and the sizes and shapes of clusters (upper left corner); optional parameters pertaining to noise and cluster rotations (upper right corner); optional specifications pertaining to particular cluster types (lower left corner); and options related to the output format (lower right corner). Program functions such as saving the current settings for later use, or retrieving help information about the parameters, are also accessible from the interface (lower panels).

The third panel, located in the lower left part of the interface, contains three groups of parameters, pertaining to subspace, multimodal, and nonlinear clusters, respectively. For subspace clusters, the interface allows users to specify the proportion of such clusters and their subspace dimensionality. For multimodal clusters, it is possible to define the number, dimensionality, and type of subclusters within each cluster (Section III.3.3), as well as to specify the proportion of clusters in the dataset that are multimodal. For nonlinear clusters, the panel permits the selection of the types and degrees of nonlinear distortions applied to the clusters (Section III.2.2), in addition to the specification of the proportion of distorted clusters.
In the latter case, the degree of distortion is proportional to the curvature of the nonlinear distortion applied to the cluster, and can take values from 0 for no distortion to 100 for distortions to closed shapes, e.g. cylinders, spheres, or tori. The fourth and last panel, located in the lower right corner of the interface, contains several basic parameters concerning the format of the generated datasets. Up to three different randomization procedures can be applied to each dataset: the random permutation of objects, dimensions, and clusters. For normalization, the coordinates of all objects can be centered at zero on each dimension, have their lower bound at zero on each dimension, or occupy different ranges with arbitrary offsets from zero on the different dimensions. Furthermore, the data can be discretized with a prescribed step size. Additionally, the last panel allows users to set the number of datasets to be generated, and to define the names of the variables storing the datasets and the corresponding lists specifying the cluster membership of each object. For simplicity, a numerical index is automatically appended to the variable names if more than one dataset is generated.

3.2. Generation of random numbers

For the sampling of random data from specific probability distributions, our algorithms rely on the two built-in functions "rand" and "randn" from the core MATLAB package "randfun". Starting with MATLAB R2008b, the "rand" function uses the Mersenne Twister algorithm (Matsumoto & Nishimura, 1998) for generating uniformly distributed pseudorandom numbers. Correspondingly, the "randn" function relies on uniformly distributed numbers to produce normally distributed values with the ziggurat method (Marsaglia & Tsang, 2000). The Mersenne Twister generator has a very long period of 2^19937 – 1, and can be initialized from different seeds.
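The clock-based seeding strategy can be illustrated outside MATLAB as well. Python's standard `random` module also uses the Mersenne Twister, so a minimal sketch of a per-session generator (the function name `make_session_rng` is hypothetical, not from the thesis) looks like this:

```python
import random
import time

def make_session_rng():
    """Create a pseudorandom generator seeded from the system clock,
    so that different sessions produce independent streams. Python's
    random module uses the Mersenne Twister, like MATLAB's 'rand'."""
    return random.Random(time.time_ns())

rng = make_session_rng()
u = rng.random()         # uniform on [0, 1), analogous to MATLAB's rand
z = rng.gauss(0.0, 1.0)  # standard normal, analogous to MATLAB's randn
```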
To ensure independence between different sessions, each of the functions from our package (see above) changes the seed to a value dependent on the time indicated by the system clock. The method developed here uses the MATLAB "rand" function for the uniform selection of various weights or parameter values within specified limits (e.g. Figure III-6C), as well as for generating uniformly distributed clusters and noise. Similarly, the "randn" function is employed in the generation of normally distributed clusters and noise, as well as in the Gaussian selection of parameter values within a given range. For convenience, the employed normal distribution is restricted to the middle 99% (Figure III-6A). Any values that lie outside this finite support are repeatedly resampled until they fall within the desired range. Two additional, narrower Gaussian distributions (see Figure III-6A) are used to generate weights centered at 1, when defining the relative sizes and scales of clusters, or the relative ranges of clusters along the different attributes (see Section III.3.3).

Figure III-6: Probability density functions of the distributions employed by the proposed synthetic dataset generator. All distributions are mapped to the [–1, 1] interval. Note the different scales of the vertical axes. A: Truncated Gaussian distributions with different degrees of kurtosis, as measured by the standard deviation (σ). B: Truncated lognormal distributions with various degrees of skewness, as measured by the standard deviation (σ) of the respective underlying Gaussian distributions. C: Uniform distribution. D: Linear distributions with different ratios (ρ) between the maximal and minimal probabilities. E: Truncated Laplace (double exponential) distribution. F: Exponential distributions, truncated to different degrees. The distributions illustrated in A (dotted line), B–E, and F (dotted line) are used in the production of clusters and noise.
The distributions depicted in A, C, and with continuous and dashed lines in F are used in the selection of weights and parameter values.

In order to increase the variability within and between the generated datasets, we also use values sampled from several other probability distributions that can be accessed indirectly through the "rand" and "randn" functions. An array of lognormal distributions with various degrees of skewness is available for the construction of clusters and noise. To sample values from these distributions, we simply apply the exponential function to normally distributed numbers. Here, we set the mean of the underlying Gaussian to zero, and we select the standard deviation within a prescribed interval. Figure III-6B illustrates three such lognormal distributions, having the minimum, average, and maximum degrees of skewness supported by our methods. As in the case of the Gaussian distributions discussed above, we truncate the lognormal distributions to the leftmost 99%, in order to ensure finite support. In addition, a collection of linearly varying distributions with different degrees of skewness (see Figure III-6D), constructed from the uniform distribution, is also available for the creation of both clusters and noise. To sample numbers from linear distributions, we apply a square root transformation to uniformly distributed values. Moreover, we rely on the inverse probability integral transform to generate exponentially and Laplace distributed values from the uniform distribution. Taking the negative logarithm of values sampled from the uniform distribution on (0, 1] yields exponentially distributed numbers with scale parameter 1. Further multiplication of these values by a randomly chosen sign, either –1 or +1, produces numbers distributed according to the double-exponential Laplace distribution with mean 0 and scale parameter 1.
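The transformations described above (exponentiation for lognormal values, square root for linear densities, negative logarithm for exponential values, and a random sign for Laplace values) can be sketched in a few lines. This is an illustrative Python version, not the thesis' MATLAB code; the function names are hypothetical.

```python
import math
import random

def sample_exponential(rng):
    """Exp(1) via the inverse transform: -log(U) for U uniform on (0, 1]."""
    u = 1.0 - rng.random()  # rng.random() is in [0, 1); 1 - u lies in (0, 1]
    return -math.log(u)

def sample_laplace(rng):
    """Laplace(0, 1): an Exp(1) value multiplied by a random sign."""
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return sign * sample_exponential(rng)

def sample_linear(rng):
    """Linearly increasing density on [0, 1]: square root of a uniform value."""
    return math.sqrt(rng.random())

def sample_lognormal(rng, sigma):
    """Lognormal value: exponential of a Normal(0, sigma) draw."""
    return math.exp(rng.gauss(0.0, sigma))
```

The truncation to the leftmost or middle 99% described in the text would then be applied on top of these raw draws by resampling out-of-range values.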
The exponential and Laplace distributions are truncated to the leftmost 99% (Figure III-6F) and the middle 99% (Figure III-6E), respectively, and are then used in the production of clusters, as well as for the generation of noise. In addition, several more severely truncated exponential distributions (e.g. Figure III-6F) are used in the selection of various weights or parameter values within specified limits (see Section III.3.3).

3.3. Construction of different types of clusters

Within the proposed synthetic dataset generator, the construction of unimodal clusters is a process comprising several steps (see Figure III-1). In the first stage, each cluster is constructed by sampling the coordinates of all objects independently on each dimension from one of the six supported probability distributions (Section III.3.2). The distribution used can be the same for all dimensions of a given cluster (e.g. Figure III-7A–F), or it can be randomly selected for each dimension. For subspace clusters, only relevant dimensions are included; their nonrelevant dimensions are filled with noise at a later stage (see Section III.2.1). The number of objects belonging to each cluster is calculated in advance, and depends on a random weight selected for the cluster from one of the available probability distributions (see Figure III-6A, C, and F). However, for clusters that will undergo a nonlinear distortion, the initial number of objects is increased by a factor proportional to the distortion type, in order to offset the subsequent density-correcting down-sampling (Section III.2.2). For convenience, the sampled values are scaled to a common interval; consequently, each generated cluster has equal range on all relevant dimensions after the first step.

In the second stage, each cluster is scaled anisotropically to a randomly defined aspect ratio.
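The truncated Gaussian sampling used in the first stage (Section III.3.2 restricts the normal distribution to its middle 99% and resamples out-of-range values) can be sketched as follows. This is an illustrative Python stand-in for the MATLAB implementation; the 99.5th-percentile constant 2.5758 is a standard normal quantile, and the function name is hypothetical.

```python
import random

Z99 = 2.5758  # 99.5th percentile of the standard normal; |z| <= Z99 is the middle 99%

def sample_truncated_normal(rng, n):
    """Draw n standard-normal values restricted to the middle 99% of the
    distribution, resampling any value that falls outside this range."""
    values = []
    while len(values) < n:
        z = rng.gauss(0.0, 1.0)
        if abs(z) <= Z99:
            values.append(z)
    return values
```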
The distribution used for sampling the scaling weights of each dimension depends on the specific shape assigned to the cluster from those supported by our program. Spheroidal clusters have ranges distributed normally around 2 with standard deviation 0.1. Ellipsoidal and eccentric clusters also have ranges distributed normally around 2, however with standard deviations 0.2 and 0.4, respectively (Figure III-6A). To create clusters having a few long principal axes and the remaining axes short, the target ranges are sampled from the standard exponential distribution, truncated to an interval covering approximately the median 25% of the distribution (Figure III-6F). With this choice of minimal and maximal scaling weights, clusters with ratios between the longest and the shortest principal axes of up to 20 can be produced. Such elongated clusters are ideal candidates for the cylindrical and helicoidal nonlinear distortions outlined in Section III.2.2 (e.g. Figure III-7H). Correspondingly, to generate clusters having the majority of the principal axes longer than the average, the scaling weights are obtained by taking the reciprocals of values sampled from the same truncated exponential distribution. Such flattened clusters are most appropriate for convex and concave deformations (Figure III-7I). Finally, to obtain clusters with more general aspect ratios, the scaling weights are sampled from the uniform distribution (Figure III-6C). The resulting, transformed clusters can take a variety of shapes (e.g. Figure III-7A–F), but they have similar volumes. For increased dataset variability, each cluster is scaled by an additional factor on all dimensions. Similarly to cluster sizes, the values of these scaling factors are sampled from one of the six probability distributions available for weight generation (see Figure III-6A, C, F). If desired, the scaling procedure can be followed by a nonlinear distortion of specified type and angle (e.g. Figure III-7G).
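The second-stage anisotropic scaling can be sketched as follows: each coordinate column is linearly rescaled so that its range matches a randomly drawn target range, e.g. Normal(2, 0.2) for ellipsoidal clusters as described above. This is a minimal Python illustration under those assumptions; the helper names are hypothetical.

```python
import random

def rescale_dimension(column, target_range):
    """Linearly rescale one coordinate column so that max - min equals
    target_range (range-based scaling of a single dimension)."""
    lo, hi = min(column), max(column)
    scale = target_range / (hi - lo)
    return [(x - lo) * scale for x in column]

def scale_cluster(points, rng, mean_range=2.0, sd=0.2):
    """Give a cluster an ellipsoidal aspect ratio by drawing a target
    range per dimension from Normal(mean_range, sd) and rescaling."""
    dims = list(zip(*points))  # transpose: one tuple per dimension
    scaled = [rescale_dimension(list(col), rng.gauss(mean_range, sd))
              for col in dims]
    return [list(p) for p in zip(*scaled)]  # transpose back to points
```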
In the last, optional stage, clusters undergo rotations of various amplitudes and degrees of obliqueness. To generate random rotations with specific average amplitudes, we use an approach similar to that described by Anderson et al. (1987). We select a random subset of pairs of dimensions, and for each selected pair we uniformly choose a rotation angle and a direction of rotation (clockwise or anticlockwise). The minimal and maximal allowed angles are chosen such that the expected value of the average rotation angle across all possible pairs of dimensions is proportional to the prescribed degree of rotation. Thus, clusters with small degrees of rotation are produced by using few such elementary rotations with small angles, while clusters with large degrees of rotation are created by using a large number of elementary rotations with large angles. To generate obliquely rotated clusters, a proportion of the employed elementary rotations are modified in such a way that the orthogonality of the involved axes is not preserved. The fraction of altered rotations and the magnitudes of the changes are chosen such that the average angular deviation from orthogonality across all pairs of dimensions is proportional to the desired degree of obliqueness. The sign of the angular deviation is chosen such that the correlation between the two directions involved is increased by the distortion process. For any given pair of dimensions and corresponding rotation angle, the largest allowed oblique distortion is obtained when rotating only one of the axes and keeping the other one fixed.

Figure III-7: Examples of clusters produced by the proposed synthetic dataset generator. Objects within the clusters are represented by crosses. Level sets of the corresponding object densities (orange contours) are calculated using smoothed histograms. A: Rotated cluster with eccentric ellipsoidal shape and normally distributed objects.
B: Obliquely rotated cluster with ellipsoidal shape and uniformly distributed objects. C: Spheroidal cluster with objects sampled from a Laplace distribution. D: Mirrored ellipsoidal cluster with objects sampled from lognormal distributions. E: Rotated spheroidal cluster with objects sampled from linear distributions. F: Obliquely rotated cluster with eccentric shape and objects distributed exponentially. G: Multimodal cluster consisting of several highly overlapping subclusters with objects distributed normally. H: Elongated cluster with normally distributed objects, nonlinearly distorted to a helix. I: Flattened multimodal cluster transformed through a cylindrical distortion. Similarly to the example shown in G, the subclusters have normally distributed objects, but they overlap to a smaller degree.

To ensure that clusters subjected to rotations with larger angles consistently appear more rotated to human observers than clusters subjected to rotations with smaller angles, we limit the used angles to π/4. Any elementary rotation with a larger angle can be decomposed into a rotation with angle equal to a multiple of π/2, obtainable via axis reflections and/or permutations, and a rotation with angle within [–π/4, π/4]. Thus, to ensure that all possible rotations of a given cluster can be achieved, we combine the product of the elementary rotations with a random non-orientation-preserving permutation of the dimensions. The resulting transformation is applied to the cluster objects, yielding the final, rotated cluster (e.g. Figure III-7A, B, E, F). As mentioned in Section III.2.1, the construction of multimodal clusters is a more complex process, carried out recursively. Each multimodal cluster is built as a sub-dataset consisting of several overlapping subclusters (Figure III-1). The procedure employed is the same as for the actual datasets, however with several restrictions.
The selection of subcluster shapes is limited to three alternatives: spheroidal subclusters, with similar numbers of objects and scaling factors; ellipsoidal subclusters, with variable relative sizes and scales; and arbitrary subclusters, with uniformly selected relative sizes, scales, and aspect ratios. Most importantly, the overlap between subclusters must be larger than the overlap between clusters, in order to be able to distinguish between multimodal clusters and groups of overlapping unimodal clusters. Additionally, subclusters must exist in the same subspace if they are not full-dimensional (e.g. Figure III-7I), and cannot undergo nonlinear transformations, although the resulting multimodal cluster may be subjected to a curvilinear distortion (e.g. Figure III-7I).

3.4. Computational complexity of the algorithm

Let n be the total number of objects, d the total number of dimensions, and k the total number of clusters. Without loss of generality, assume that the data has no noise objects or dimensions, since these are not involved in the more complex procedures such as the rotation of clusters. Also, for simplicity, assume that all clusters are unimodal and full-dimensional. The cumulative computational complexity of the different data generation and rescaling processes, such as creating clusters and noise or normalizing the final dataset, is O(nd). The further application of nonlinear distortions has maximal complexity O(nd), attained when all clusters are transformed (Section III.3.3). Generating a random rotation is an O(d³) process (Anderson et al., 1987), while rotating a cluster consisting of n_i objects is of order O(n_i·d²). Assuming that all clusters are rotated, the combined computational cost is of order O(k·d³) + O(n·d²). Computing the positions of full-dimensional clusters is of order O(k·d).
Increasing the spatial variability of the dataset by the addition of cluster spacers at this step, as outlined in Section III.2.3, modifies this cost by a proportional factor. Overall, the maximal complexity for constructing datasets containing unimodal, full-dimensional clusters is O(n·d²) + O(k·d³). This formula remains valid also when the dataset contains noise objects and/or dimensions. Correspondingly, the cost of generating a full-dimensional multimodal cluster consisting of n_i objects distributed in m subclusters is at most O(n_i·d²) + O(m·d³), compared to O(n_i·d²) + O(d³) if it were unimodal. Thus, constructing multimodal clusters instead of unimodal ones leads to an increase in the computational cost by a factor proportional to the number of subclusters. Therefore, the overall complexity for constructing datasets containing combinations of rotated unimodal and multimodal full-dimensional clusters remains O(n·d²) + O(k·d³), with an additional factor proportional to the average number of subclusters. If some of the clusters are not full-dimensional, then the modified, slower algorithm for placing subspace clusters in the data space is used (see Section III.2.3). Its average computational complexity is O(k²·l), where l is the average subspace size across all clusters. In the worst case, when most clusters are full-dimensional, the computational cost of this process reaches its maximum of O(k²·d). Correspondingly, the computational complexity of generating a dataset that contains all supported types of clusters is O(n·d²) + O(k·d³) + O(k²·d). If none of the clusters is rotated, the computational cost is reduced to O(n·d) + O(k²·d). If, in addition to that, all clusters are unimodal and full-dimensional, the complexity is further reduced to O(n·d).

4. Experimental validation

The developed generator can produce a wide variety of synthetic datasets. Several low-dimensional and high-dimensional examples are illustrated in Figures III-8 and III-9.
Clusters can be constructed using several different probability distributions (e.g. Figure III-7A–F) or combinations thereof (Figure III-8A, F). Moreover, different clusters can have different sizes, aspect ratios, and orientations (Figure III-8B), or can be distorted into various nonlinear shapes (Figure III-8D, E). Clusters can be either unimodal or multimodal, the latter consisting of several overlapping subclusters (e.g. Figure III-7G). The spatial distribution of clusters can range from well-separated (Figure III-8A), through adjacent (e.g. Figure III-8B) and moderately overlapping (Figure III-8C), to highly overlapping (Figure III-9D). Some of the clusters can be well-defined only within specific subspaces, usually in the case of high-dimensional datasets (Figure III-9), but not exclusively so (e.g. Figure III-8F). The datasets can also contain a variable proportion of noise objects (Figure III-9A, D) or dimensions (Figure III-9A, B).

Figure III-8: Examples of low-dimensional datasets created with the proposed synthetic dataset generator. A: Dataset containing 15 well-separated clusters with various probability distributions and volumes (crosses). The dataset also contains 5% uniformly distributed noise objects (circles). B: Dataset containing 15 non-overlapping clusters of uniformly distributed objects (crosses). The clusters have various aspect ratios and orientations. The dataset also includes 10% Gaussian noise (circles). C: Dataset containing 15 moderately overlapping clusters with normally distributed objects. The clusters have various densities and orientations. D: Dataset containing unimodal clusters (crosses), multimodal clusters (squares), and nonlinearly distorted clusters (circles). E: Dataset containing three-dimensional nonlinear clusters. Such clusters can be deformed into cylindrical (crosses), convex (circles), or saddle shapes (squares). Black lines indicate the coordinate axes.
F: Dataset containing one full-dimensional cluster (crosses), three two-dimensional subspace clusters (circles), and one one-dimensional subspace cluster (squares). All clusters have normal distributions within their subspaces and uniform distributions on their nonrelevant dimensions. Black lines indicate the coordinate axes.

Figure III-9: Examples of high-dimensional datasets produced with the proposed synthetic dataset generator. Image plots of sample datasets, with objects shown on rows and dimensions shown on columns. Each dataset consists of 10000 objects described on 20 dimensions, and contains 10 subspace clusters (separated by black lines). The clusters have normal distributions on all dimensions, both relevant and non-relevant. A: Dataset containing well-separated clusters with similar subspace dimensionality (3–7). The dataset includes 20% noise dimensions and 20% noise objects. B: Dataset containing non-overlapping clusters which share the same subspace. The dataset includes 20% noise dimensions. C: Dataset containing moderately overlapping clusters with similar subspace dimensionality (7–11). D: Dataset containing overlapping clusters with variable subspace dimensionality (1–11). The dataset includes 20% noise objects.

4.1. Cluster rotations

The first validation experiment assessed the effectiveness of the proposed implementation of cluster rotations (see Section III.3.3). We constructed 2070 clusters with different degrees of rotation and obliqueness, and calculated the average absolute correlation between all possible pairs of dimensions for each cluster. Each cluster consisted of 1000 objects defined in 5 to 20 dimensions, sampled from normal distributions. The clusters were scaled to arbitrary aspect ratios (Section III.3.3) and subsequently rotated, either orthogonally or obliquely.
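The orthogonal rotations used in this experiment are built from elementary two-dimensional rotations applied to randomly chosen pairs of dimensions (Section III.3.3). A minimal Python sketch of the orthogonal case is shown below; each elementary rotation is applied directly to the object coordinates, the angle bound defaults to π/4, oblique distortions are omitted, and the helper name `rotate_cluster` is hypothetical.

```python
import math
import random

def rotate_cluster(points, rng, n_rotations, max_angle=math.pi / 4):
    """Apply n_rotations elementary two-dimensional rotations, each on a
    randomly selected pair of dimensions with a uniformly chosen angle
    in [-max_angle, max_angle]. Returns a rotated copy of the points."""
    d = len(points[0])
    points = [list(p) for p in points]
    for _ in range(n_rotations):
        i, j = rng.sample(range(d), 2)
        theta = rng.uniform(-max_angle, max_angle)
        c, s = math.cos(theta), math.sin(theta)
        for p in points:
            p[i], p[j] = c * p[i] - s * p[j], s * p[i] + c * p[j]
    return points
```

Because every elementary rotation is orthogonal, the composite transformation preserves distances from the origin; the oblique variant would deliberately break this property to induce correlations.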
The degrees of rotation were selected uniformly between 0, corresponding to no rotation, and 100, corresponding to the successive application of two-dimensional rotations with angles of π/4 for all possible pairs of dimensions in a given cluster. Furthermore, for each cluster, the degree of oblique distortion was set randomly to one of the following three values: 0 for orthogonal rotations, half of the assigned degree of rotation for moderately oblique rotations, or equal to the degree of rotation for highly oblique rotations. Since rotations, in particular oblique ones, introduce linear dependencies between the coordinate axes, we expected the average correlations to be proportional to the prescribed degrees of rotation and obliqueness. Thus, to quantify the results, we calculated the Pearson correlation coefficient between the prescribed degrees of rotation and the observed average absolute within-cluster correlations for each of the three degrees of obliqueness (N = 690 clusters within each category). Additionally, we fitted linear regression models without constant terms to each of the three sets of measurements. The results (Figure III-10A) indicated that the employed cluster rotation method performs as intended. The average within-cluster correlation increased with the prescribed degree of rotation, reaching maximal values of approximately 0.6 for clusters with large degrees of rotation and obliqueness. This linear relationship was less pronounced for clusters undergoing orthogonal rotations, where the rate of increase in within-cluster correlations diminished at larger degrees of rotation. For moderately oblique rotations, the increase in within-cluster correlations resulting from the deviations from orthogonality contributed to the maintenance of a linear association with the degree of rotation.
Correspondingly, the dependence between the average within-cluster correlations and the prescribed degrees of rotation reached the largest proportionality factor for clusters undergoing highly oblique rotations. Overall, the within-cluster correlations were found to increase both with the degree of rotation applied to the cluster, and with the degree of oblique distortion applied to the rotation, as intended. Interestingly though, for low degrees of rotation there was virtually no difference between the average within-cluster correlations of clusters with different degrees of distortion (Figure III-10A). The most likely reason is that, in our implementation, oblique distortion degrees cannot exceed the actual rotation degrees. Thus, for low degrees of rotation, the maximal possible deviations from orthogonality are also small, and hence do not affect the within-cluster correlations to a noticeable degree.

4.2. Cluster overlap

In a second experiment, we tested whether expanding clusters with respect to the spaces assigned to them by the cluster placing algorithm has the intended effect of altering the overlap between clusters in a consistent manner. We generated 2070 datasets consisting of different types of convex clusters scaled to different degrees of overlap, and measured the resulting overlap between the clusters in each case. Here, we calculated the actual overlap as the average proportion of objects from each cluster that were located inside the convex hulls of other clusters. For subspace clusters, only their relevant dimensions were included in this analysis. To compute the convex hulls of clusters, we used the MATLAB function "convhulln", which implements the Quickhull algorithm of Barber et al. (1996).
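The overlap measurement itself (the proportion of one cluster's objects falling inside another cluster's convex hull) can be illustrated with a self-contained two-dimensional stand-in for "convhulln": Andrew's monotone-chain hull plus a point-in-convex-polygon test. This is a Python sketch for the 2D case only; the thesis used Quickhull in up to 5 dimensions, and the function names here are hypothetical.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside_hull(point, hull):
    """True if point lies inside or on the boundary of the CCW hull."""
    n = len(hull)
    for k in range(n):
        ax, ay = hull[k]
        bx, by = hull[(k + 1) % n]
        if (bx-ax)*(point[1]-ay) - (by-ay)*(point[0]-ax) < 0:
            return False
    return True

def overlap_fraction(cluster_a, cluster_b):
    """Proportion of cluster_a's objects inside cluster_b's convex hull."""
    hull = convex_hull(cluster_b)
    return sum(inside_hull(p, hull) for p in cluster_a) / len(cluster_a)
```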
Because the computational cost of calculating the convex hull of a cluster grows exponentially with the number of dimensions, we restricted our analysis to small datasets consisting of 1000 objects grouped in 5 to 15 clusters with subspace dimensionality at most 5. Clusters were either full-dimensional in a space with 1–5 dimensions, or they inhabited subspaces of dimensionality 1–5 in a 20-dimensional space. In one third of the datasets, the clusters were unimodal with ellipsoidal shapes and similar sizes and scales, and did not undergo rotations or nonlinear transformations (see Section III.3.3). In another third of the datasets, the clusters had more variable sizes, scales, and aspect ratios, and were rotated to medium degrees. In the remaining third of the datasets, the clusters could be both unimodal and multimodal, and were all rotated obliquely. For each dataset, the number of cluster spacers employed by the cluster placing algorithm for increasing the spatial variability of the dataset (Section III.2.3) was set randomly to either 0, half of the number of clusters, or equal to the number of clusters in the dataset. Furthermore, the degree of overlap of each dataset was selected uniformly between –5, corresponding to clusters slightly contracted with respect to the spaces assigned by the cluster placing algorithm, and 50, corresponding to clusters expanded two to three times over the assigned space on each dimension. Within each dataset, the overlap scaling factors assigned to the different clusters were allowed to vary within a fixed percentage of the average value. The results indicated that our approach was effective in creating datasets with increased clustering difficulty. The measured average cluster overlap increased linearly with the prescribed degree of overlap, and the proportionality factor was independent of the proportion of cluster spacers (Figure III-10B), as well as of the type of the clusters present in the dataset (data not shown).
The correlation between the intended and achieved degrees of overlap was very strong across the entire sample of 2070 datasets, suggesting that our implementation of cluster overlap is relatively accurate. The linearity of this relationship was further confirmed by a linear regression analysis. The resulting regression coefficient had a value of 0.014, while the intercept equaled –0.05. Correspondingly, the observed average proportion of objects from each cluster lying within the convex hulls of the other clusters in the dataset increased by approximately 7% for every increase of 5 points in the prescribed degree of overlap, up to a maximum of 65% at the largest degree of overlap implemented (Figure III-10B).

Figure III-10: Evaluation of the effectiveness of the implementation of cluster rotations and cluster overlap. A: Average within-cluster correlations, in absolute values, for clusters with various degrees of rotation and oblique distortion. Clusters undergoing orthogonal rotations (purple) exhibit moderate correlation values. Clusters which undergo moderate oblique distortions, with angular deviations from orthogonality equal to half the rotation angles (green), have higher correlations. Rotated clusters with large degrees of obliqueness, having angular deviations from orthogonality equal to their rotation angles (orange), show the highest correlations. Note that for low degrees of rotation there is virtually no difference in the values of the correlations obtained for the three cluster categories shown. Each point of the graphs represents an average over 15 clusters with arbitrary aspect ratios, each containing 1000 objects in 5–25 dimensions. Error bars represent standard errors of the means. B: Observed cluster overlap as a function of the prescribed degree of overlap.
Datasets with increased degrees of cluster overlap were obtained by expanding all clusters with a factor proportional to the desired degree. The actual overlap was calculated as the average proportion of objects from each cluster that are located within the convex hulls of other clusters. Note that the effectiveness of the implementation is maintained regardless of the proportion of cluster spacers employed to increase the spatial variability of the datasets. Each point on the graphs represents an average over 6–31 datasets, each consisting of 1000 objects distributed in 5–15 clusters of various types and dimensionalities. Error bars denote standard errors of the means.

4.3. Timing performance

To verify the theoretical dependencies of the time requirements on the input parameters outlined in Section III.3.4, we performed three sets of tests. For the first set of tests, we constructed 600 different datasets with both the number of clusters and the number of dimensions set to 20, and the number of objects selected uniformly between 1000 and one million. For the second set of tests, 600 datasets with 10000 objects, 20 clusters, and between 1 and 1000 dimensions were generated. For the third set of tests, we constructed 600 datasets consisting of 10000 objects in 20 dimensions, distributed into 1–100 clusters. In each of the three sets of tests, the 600 datasets were further partitioned into six subsets of 100 datasets. In each of the subsets, at most one of the following advanced options (see Section III.3.1) was enabled: cluster rotations, subspace clusters, multimodal clusters, nonlinear clusters, addition of cluster spacers, noise. For each dataset in each test session, we measured the total time needed for its production.
Subsequently, for each subset of 100 datasets, we assessed the degree of dependence between the required processing time and the altered parameter by fitting power functions of the form t(x) = a·x^b with real-valued coefficients a and b. To confirm the inferred degree of dependence, we subsequently fitted the measurements with polynomial functions of degrees equal to the integer nearest to the coefficient b. All tests were run on a computer with a 2.4 GHz dual-core Intel processor and 4 GB of RAM. The time requirements increased linearly with the number of objects, independently of the types of clusters included in the dataset (Figure III-11A, B). The exponents of the best fitting power functions were very close to 1, ranging from 0.99 for datasets with unimodal clusters to 1.03 for datasets with multimodal clusters with 10 modes each. The observed linear dependence of the time needed to generate the datasets on the number of objects was in general of very small amplitude. For simple unimodal clusters defined in 20 dimensions, an increase in the number of objects from 1000 to one million corresponded to an increase in the required time of approximately 4 seconds (simple linear regression). As expected, using only a proportion of the objects in the construction of clusters and treating the remainder as noise had limited impact. However, if all clusters were nonlinearly distorted, the slope increased approximately five-fold, while the intercept remained virtually unchanged (simple linear regression). The observed increase in slope was approximately equal to the implemented increase in the number of objects assigned initially to each cluster in order to offset the down-sampling associated with nonlinear transformations (see Section III.2.2). Therefore, the computational cost of the actual application of nonlinear distortions is dominated by the cost of sampling more objects.
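The power-function fitting step can be illustrated concisely: t(x) = a·x^b becomes a straight line in log-log space, so a and b can be estimated by ordinary least squares on the logged measurements. A minimal sketch (the variable names are illustrative; the thesis used its own fitting routines):

```python
import math

def fit_power_law(xs, ys):
    """Fit t = a * x**b by least squares in log-log space.
    Returns (a, b); requires strictly positive xs and ys."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx = sum(lx) / n
    my = sum(ly) / n
    # slope of the log-log regression line is the exponent b
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / \
        sum((u - mx) ** 2 for u in lx)
    a = math.exp(my - b * mx)
    return a, b
```

An exponent near 1 then indicates a linear dependence, near 2 a quadratic one, and so on, matching the rounding-to-nearest-integer rule used for the confirmatory polynomial fits.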
By contrast, if all clusters were multimodal or if they were treated as subspace clusters, there was no variation in the slope, but the constant terms increased from 0.9 for unimodal clusters to 6.7 and 4.5, respectively (simple linear regression). For multimodal clusters, the increase was caused by the additional steps of combining the subclusters within each cluster, which depend only on the number and dimensionality of the subclusters, and not on the number of objects. Correspondingly, for subspace clusters the additional processing was an effect of using the slower cluster placing algorithm (see Section III.2.3), which is also independent of the number of objects in the dataset. Interestingly, for rotated clusters there was also no change in the slope due to the additional calculations needed to apply the rotations to each object. It is likely that these calculation steps are dominated by the main process of creating clusters through sampling objects from different probability distributions, as observed for nonlinear clusters. The constant term did however increase to 1.9 (simple linear regression), most likely due to the additional cost of generating rotation matrices for each of the clusters in the dataset.

Figure III-11: Experimental evaluation of time requirements for the proposed generator of synthetic datasets. A, B: Time requirements increase linearly as a function of the number of objects included in the dataset. The observed linear relationship is independent of the type of clusters present in the dataset and of the presence of noise objects (squares in B). Each point represents a dataset comprising 20 clusters in 20 dimensions. Note the different scales on the time axis.
C, D: The time requirements increased quadratically with the number of dimensions when the cluster dimensionality was equal to or proportional to the total dimensionality, and linearly (circles in D) or super-linearly (squares in D) when cluster dimensionality was fixed. Each point represents a dataset consisting of 10000 objects grouped in 20 clusters. Note the different scales. E, F: The time requirements depended cubically on the number of clusters if they were subspace clusters (squares in E), and quadratically if all clusters were full-dimensional. Using cluster spacers to obtain a less uniform distribution of clusters within the dataset (circles in E), or including multimodal clusters (circles and squares in F), increased the time requirements proportionally, but did not affect the degree of the relationship. Each point represents a dataset consisting of 1000 objects defined on 20 dimensions.

The relationship between the required time and the dimensionality of the dataset depended on the average cluster subspace size (Figure III-11C, D). If the dimensionality of the clusters was equal to the total number of dimensions, the dependence was quadratic (Figure III-11C), with the best fitting power functions having exponents ranging between 1.89 and 1.93. To assess the exact dependencies, we fitted the measurements with linear regression models with three predictors: a constant term, the number of dimensions, and the square of the number of dimensions. For unimodal full-dimensional clusters, the fitted leading coefficient corresponded to construction times ranging from 1 s for three-dimensional datasets to approximately 15 s for 100-dimensional datasets comprising 10000 objects and 20 clusters.
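The three-predictor regression just described (constant term, d, d²) can be sketched via the normal equations. This is an illustrative implementation with hypothetical timing data, not the analysis code used in the thesis:

```python
def fit_quadratic(ds, ts):
    """Least-squares fit t = c0 + c1*d + c2*d**2 via the normal equations."""
    rows = [[1.0, d, d * d] for d in ds]
    # Build X^T X and X^T t for the design matrix X = [1, d, d^2]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, ts)) for i in range(3)]
    # Solve the 3x3 system by Gaussian elimination with partial pivoting
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(ata[r][col]))
        ata[col], ata[piv] = ata[piv], ata[col]
        atb[col], atb[piv] = atb[piv], atb[col]
        for r in range(col + 1, 3):
            f = ata[r][col] / ata[col][col]
            for c in range(col, 3):
                ata[r][c] -= f * ata[col][c]
            atb[r] -= f * atb[col]
    coeffs = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        coeffs[r] = (atb[r] - sum(ata[r][c] * coeffs[c]
                                  for c in range(r + 1, 3))) / ata[r][r]
    return coeffs  # [c0, c1, c2]
```

The leading coefficient `c2` is the quantity compared across cluster configurations in the text.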
If all clusters were considered to be subspace clusters, and therefore a more complex cluster placing algorithm was used (see Section III.2.3), the leading coefficient increased approximately five-fold (linear regression). Similarly, if all clusters underwent rotations, the leading coefficient increased as well (linear regression). This increase shows that, for datasets with sufficiently large dimensionality, the computational cost of rotating clusters is no longer dominated by the cost of generating the clusters (see above). However, the expected third-order dependency of time on the number of dimensions arising from the process of generating random rotations (see Section III.3.4) was not detected. Most likely, as long as the clusters have many more objects than dimensions, the time needed to calculate the rotation matrices is much smaller than that needed to apply the rotations to the clusters. We further examined the dependence of the dataset construction time on the number of dimensions for different configurations of subspace clusters (Figure III-11D). If a constant proportion of dimensions were designated as noise, the dependence remained quadratic, with the best fitting power function having an exponent of 1.57. Compared to the set of full-dimensional clusters, the leading coefficient of the corresponding optimal polynomial fit decreased markedly when 95% of the dimensions were noise (linear regression). Further limiting the number of relevant dimensions to a maximal value changed the dependence to linear. If the number of non-noise dimensions was set to 20, the exponent of the best fitting power function decreased to 1.28. Accordingly, the measurements could be approximated by a linear fit (simple linear regression), with the slope reflecting the increased cost of creating objects with higher dimensionality.
Interestingly, if the cluster dimensionality was set to 20 but the clusters were allowed to inhabit independent subspaces, the dependence became non-monotonic (see Figure III-11D). The time requirements increased with the dataset dimensionality, reached a maximum at approximately 100 dimensions, and then started to decrease asymptotically. The best fit for the recorded measurements was a rational function with cubic numerator and quadratic denominator. This dependence can be decomposed into a linear term, corresponding to the process of generating the objects in the dataset, and an inverse quadratic term, corresponding to the cluster placing procedure (Section III.2.3). In our implementation, the placement of each subspace cluster requires a number of computations proportional to the square of the number of clusters that share dimensions with it. The latter number of clusters is in turn inversely proportional to the total number of dimensions if the subspace size is fixed. In general, time requirements increased quadratically with the number of clusters (Figure III-11E, F). The exponents of the best fitting power functions ranged from 1.44 for datasets with multimodal clusters to 1.83 for datasets with unimodal clusters placed non-uniformly in the dataset. In the most basic parameter configuration examined, with simple unimodal clusters distributed compactly in a 20-dimensional space, the time needed to construct a dataset with 10000 objects increased quadratically from 0.1 s if the objects were grouped in 2 clusters, to 1 s for 20 clusters and 15 s for 100 clusters. The corresponding coefficient of the quadratic term was calculated via linear regression. When cluster spacers were used to distribute the clusters less uniformly in the data space, the time requirements increased proportionally.
If the number of spacers equaled the number of clusters, and therefore the cluster placing algorithm had to process twice as many potential clusters, the leading coefficient of the polynomial fit increased approximately four-fold (linear regression). By contrast, the coefficient of the quadratic term did not change significantly if all clusters were multimodal. When each cluster consisted of 5 subclusters, the leading coefficient remained essentially unchanged (linear regression), while when each cluster consisted of 10 subclusters the coefficient increased only slightly (linear regression). However, the coefficients of the linear term changed more strongly, increasing from 0.03 for unimodal clusters to 0.13 and 0.34 for multimodal clusters with 5 and 10 subclusters, respectively. The extents of these increases were roughly proportional to the square of the average number of subclusters per cluster, and corresponded to the additional steps of placing the subclusters within each cluster using the cluster placing algorithm (Section III.2.3). If all clusters were rotated to large degrees, the results were very similar to those observed when the clusters were multimodal. For rotated clusters, the coefficient of the linear term increased to 0.08, while the coefficient of the quadratic term did not change as compared to the value recorded for simple unimodal clusters (linear regression). Most probably, this increase corresponds to the additional steps of constructing rotation matrices for each cluster. The only exception to the above pattern of quadratic dependence was the set of measurements performed on datasets containing subspace clusters, where the best fitting power function had an exponent of 2.63.
Consequently, the measurements in this set could be optimally described by a polynomial function of the third degree (linear regression), as expected given that a different, slower algorithm was used for the placement of clusters in the data space.

IV. Applications of cluster analysis in automatic image classification

1. Background

The World Wide Web is home to a constantly increasing collection of images. For fast access and efficient usage of this wealth of visual data, new indexing and retrieval methods are needed, particularly since most of the existing approaches do not exploit all available information. The prevalent approach in current web search engines is to associate images with text, e.g. file names and captions, other metadata such as manually-added tags, or keywords extracted from the surrounding articles. This method restricts the set of searchable images to those associated with text, and can lead to errors if the existing associations are incomplete or incorrect. To enable more efficient queries based on visual information, alternative procedures relying on image processing have been proposed. In semantic search (e.g. Schober et al., 2005), images are annotated automatically by object or scene recognition algorithms. While this strategy is certainly appealing, progress within the field of image understanding remains limited, with algorithms being successful only in restricted settings such as face recognition (e.g. Bolme et al., 2003). Another candidate is purely visual search, which relies on low-level image features such as color or textures (e.g. Hermes et al., 2005). However, it is of limited usefulness in textual queries, unless the extracted features represent real concepts. Our goal is to combine the advantages of these different methods into an integrated framework for automatic image search and retrieval that would support both visual queries – e.g.
finding images similar to, or keywords relevant for, a given image – as well as standard text-based queries – e.g. finding images associated with the input keyword, either directly or through similar images. Consequently, the focal point of the work presented here is the establishment of a dual-layered linkage system between different images, based on common keywords and on similar visual features, or, equivalently, the definition of a co-occurrence-based association scheme between keywords and visual features. Such association rules between textual and visual concepts can indeed be exploited in queries (Jacobs et al., 2007), but, more importantly, they can be used as an annotation system. Thus, images whose textual information does not match the represented objects or visual features can have their associations corrected. Furthermore, images with no captions or tags can be automatically labeled, and therefore made available for text-based queries. If a large, heterogeneous set of images with sufficiently accurate textual descriptions is available for training, such a linkage scheme can be easily constructed (Figure IV-1). In the first stage, relevant keywords are extracted from the associated text of each training image by using natural language processing tools. This can be done with specialized concept detectors such as named entity recognizers (Drozdzynski et al., 2004), or by using term relevance measures such as TF-IDF (Salton & Buckley, 1988). In the latter case, however, it may be necessary to use dimensionality reduction techniques such as singular value decomposition, which can properly correct for synonymy or multi-word expressions (Deerwester et al., 1990). In the second stage, visual equivalents of keywords are extracted from the training images using image processing algorithms. This can be achieved using specialized detectors, such as face recognition algorithms (e.g.
Wiskott et al., 1997), as shown in our previous work (Jacobs et al., 2008). Alternatively, visual words can be built through vector quantization of low-level features found by generic detectors (Lowe, 1999), as described in the visual vocabulary approach of Sivic and Zisserman (2003). In the last step (see Figure IV-1), the original associations between images and their captions or labels are translated into links between visual and textual concepts.

Figure IV-1: Construction of a linkage scheme between textual and visual concepts. Keywords are extracted from captions (step 1) and visual features are extracted from images (step 2) by using concept detectors, and clustered into groups of related words or similar features if necessary. The associations existent in the training data are translated into a set of relations between textual and visual concepts (step 3) which can be subsequently employed as an automatic image classifier. Purple cells represent data, while yellow and blue cells represent algorithms.

2. Methodology and data

2.1. Image classification system

We propose an automatic system for classifying novel images into text-derived categories, even when there is no co-occurring text available for extrapolation (see Figure IV-2). The classifier relies on a set of associations between textual concepts and prototypical visual features inferred from a sufficiently large set of representative images and their related text, e.g. captions. In the first stage, the most relevant concepts and features are extracted from the training data (Section IV.2.2). The image captions are analyzed with natural language processing tools focused on the detection of keywords or concepts; for the work reported here, we use the named entity recognizer of Drozdzynski et al. (2004).
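As an aside, the TF-IDF weighting mentioned above as an alternative keyword-extraction route can be sketched in a few lines. This illustrative version uses length-normalized term frequency and an unsmoothed inverse document frequency; it is not the pipeline used in the thesis, which relies on the named entity recognizer instead:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists (one per caption). Returns per-document
    {term: tf-idf} maps, with tf normalized by document length and
    idf = log(N / document frequency)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    idf = {t: math.log(n / df[t]) for t in df}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: tf[t] / len(doc) * idf[t] for t in tf})
    return scores
```

Terms occurring in every caption receive a weight of zero, which is why such raw scores are typically combined with the dimensionality reduction step (e.g. singular value decomposition) noted in the text.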
Similarly, the training images are examined using image understanding techniques that search for salient visual features; here, we use the scale-invariant feature transform of Lowe (1999). The extracted features are subsequently grouped into similarity-based clusters (Section IV.2.3); each such cluster represents a prototypical visual feature, or visual word (Sivic & Zisserman, 2003). In the next step, the linkage degrees between the resulting textual concepts and clusters of features are calculated. These represent the central component of the proposed classifier. The natural, co-occurrence-based associations between captions and textual concepts are transferred directly to the related images, and further to the visual features found in these images. The linkage degrees between each cluster and the different textual concepts are then obtained by aggregating the association values of the basic visual features belonging to that cluster (Section IV.2.4). Once the cluster-concept associations are calculated, this process can also be executed in the opposite direction: by directly transferring the linkage degrees with textual concepts from the clusters of features to any visual feature assigned to a cluster, and then by averaging across all features belonging to an image to find out which concepts are associated with that image (Section IV.2.5). Consequently, the constructed classifying system is able to annotate new images with text-derived concepts using only their visual features (Figure IV-2, right).

Figure IV-2: Constructing and evaluating the proposed image classifier. Visual words are obtained by clustering the visual features extracted from training images (yellow cells). Keywords are extracted from the corresponding training captions, and then sequentially associated with images, their visual features, and visual words (blue cells).
Conversely, the associations can be transferred from visual words to new visual features, and then to the test images from which these features were extracted (purple cells).

2.2. Data collection and preprocessing

The employed data set consisted of 1054 images and associated text captions published on the “www.faz.de” and “www.tagesschau.de” German news websites between 2006 and 2007. Such sites feature strongly structured articles that have well defined relations between images and co-occurring text, and can therefore be parsed automatically. The images were harvested and preprocessed as described by Jacobs et al. (2007). The SIFT algorithm (Lowe, 1999) was used to identify and quantify interest points in the images: local, low-level visual features such as edges, corners, and other regions of variable contrast. Each such interest point was described by a 128-dimensional vector representing the local image gradient distribution at eight uniformly-spaced angles, starting from the direction with the largest gradient. For each orientation, the image gradients were measured at the nodes of a four-by-four regular grid using pixel differences, after normalization for color and scale. In total, 174370 interest point descriptors were extracted from the available image data set, with a median of 110 (range = 27–1336) descriptors per image. In addition, a named entity detector was used on the image captions (Drozdzynski et al., 2004). More than 50 distinct person names were identified in the considered set of images, including personalities from politics (e.g. George W. Bush, Angela Merkel, Tony Blair) and sports (e.g. Jan Ullrich, Patrik Sinkewitz). Every image was associated with at least one and at most four names, with 9% of the images having more than one person mentioned in the caption. The number of images associated with each person ranged from 1 to 98 (median = 18).
For simplicity, the least frequent persons were merged into one category. This procedure resulted in 51 different concepts: 50 named persons, and “other”. The resulting image-concept associations (Figure IV-3A) were verified manually, and found to be fairly accurate: compared to the ground truth (Figure IV-3B; data from Jacobs et al., 2008), the proportion of correct associations, i.e. the precision, was 81%, while the proportion of actual associations recovered, i.e. the recall, was 87%.

Figure IV-3: Associations between images and concepts within the employed data set. A: Concepts detected in the image captions. B: Manual image annotations.

2.3. Visual vocabulary construction

The visual vocabulary is an image representation system proposed by Sivic and Zisserman (2003), focused on structuring images in such a way that they can be analyzed with established information retrieval algorithms, such as bag-of-words classifiers (e.g. Joachims, 1998). The system allows for the coding of images or portions of images as word frequency vectors, similarly to text documents. To construct the vocabulary, the interest point descriptors extracted from the set of images available for training are subjected to cluster analysis. Each resulting cluster is considered to be a prototypical visual feature, or visual word, and is represented by its average descriptor. The optimal vocabulary size is, however, strongly dependent on the problem addressed, and could be treated as a trainable parameter were it not for the significant processing power required to find clusters in high-dimensional data (Sivic & Zisserman, 2003).
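To make the vocabulary construction concrete, the following sketch clusters toy two-dimensional "descriptors" with a plain Lloyd's k-means and encodes an image's descriptors as a visual-word frequency vector. It is a deliberately simplified stand-in for the 128-dimensional SIFT setting, not one of the implementations compared in the thesis (k-means of Hartigan & Wong, k-medians, TwoStep, or the projection-based partitioning of Chapter II):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means with squared Euclidean distance; a k-medians
    variant would swap in Manhattan distance and per-coordinate medians."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sum((a - b) ** 2
                                                for a, b in zip(p, centroids[i])))
            groups[j].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old centroid if the group emptied
                centroids[i] = tuple(sum(c) / len(g) for c in zip(*g))
    return centroids

def word_histogram(descriptors, centroids):
    """Encode an image's descriptors as visual-word relative frequencies."""
    k = len(centroids)
    counts = [0] * k
    for p in descriptors:
        j = min(range(k), key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
        counts[j] += 1
    return [c / len(descriptors) for c in counts]
```

Each centroid plays the role of a visual word's average descriptor; the histogram is the word frequency vector that makes images amenable to bag-of-words methods.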
To find the optimal representation for our setting, we varied the number of clusters in several steps between 100 and 2500, and tested four different clustering algorithms: the commonly used k-means (Hartigan & Wong, 1979); k-medians (Mulvey & Crowder, 1979), a similar method that employs Manhattan distances and is therefore more suitable for high-dimensional data such as the 128-dimensional descriptors produced by the SIFT algorithm; TwoStep cluster analysis, a fast algorithm from the SPSS software package based on the method of Zhang et al. (1997); and the fast projection-based partitioning algorithm described in Chapter II.

2.4. Forward concept propagation

Taking advantage of the additional information provided by the image captions, we extended the visual vocabulary framework by calculating linkage degrees between visual words and the concepts extracted by the named entity recognizer (Figure IV-2, left). Firstly, the natural associations between each concept and the captions it was found in were transferred directly to the corresponding training images, and further to all visual features found in those images. Secondly, we calculated the average relative frequencies of the different concepts within each cluster of visual features, as well as across the entire training set. Lastly, the association probabilities between each feature cluster and each textual concept were calculated as a function of the corresponding local and global concept frequencies. For this purpose, we defined and implemented four different calculation methods (see Figure IV-4). In general, pairs of prototypical visual features and textual concepts with local frequencies similar to the global ones were assigned association probabilities near 0.5, pairs with local frequencies less than one half of the global ones received linkage degrees close to 0, while pairs with local frequencies more than twice as large as the global ones had linkage scores close to 1.
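The first two steps of this procedure, propagating caption concepts to individual features and then tallying local (per-cluster) and global relative frequencies, can be sketched as follows; the data-structure choices are illustrative:

```python
from collections import Counter, defaultdict

def concept_frequencies(feature_concepts, feature_clusters):
    """feature_concepts: one set of concepts per visual feature (inherited
    from the caption of the feature's source image).
    feature_clusters: cluster index of each feature.
    Returns (global_freqs, local_freqs_per_cluster) as relative frequencies."""
    n = len(feature_concepts)
    global_counts = Counter()
    cluster_counts = defaultdict(Counter)
    cluster_sizes = Counter()
    for concepts, cl in zip(feature_concepts, feature_clusters):
        cluster_sizes[cl] += 1
        for c in concepts:
            global_counts[c] += 1
            cluster_counts[cl][c] += 1
    global_freqs = {c: k / n for c, k in global_counts.items()}
    local_freqs = {cl: {c: k / cluster_sizes[cl] for c, k in counts.items()}
                   for cl, counts in cluster_counts.items()}
    return global_freqs, local_freqs
```

The ratio of a cluster's local frequency to the corresponding global frequency is the quantity to which the four linkage-calculation methods are applied.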
The first method (Figure IV-4A) was a heuristic principle that was observed to perform well during preliminary testing (Ilies et al., 2009). For each cluster-concept pair, it truncated the ratio between the local and global concept frequencies to the [0.5, 1.3] interval, then simply mapped the result linearly to the interval [0, 1]. The second method relied on a logistic sigmoid transform (Figure IV-4B): it applied a logistic function to the ratio between the local and global frequencies of each examined pair, with the steepness coefficient of the sigmoid acting as a trainable parameter. The third method (Figure IV-4C) employed a rational sigmoid-like transformation of the considered frequency ratios, with a trainable exponent controlling the steepness. The fourth method (Figure IV-4D) was a statistical approach based on Pearson's chi-square test, the only criterion to take into account the sizes of the feature clusters. It first calculated the significance level p of obtaining the local concept frequency, assuming that the expected rate was the global frequency of that concept. The corresponding linkage degree was then set to p/2 if the local frequency was smaller than the global one, and to 1 − p/2 otherwise.

Figure IV-4: Proposed methods for calculating the association degrees between clusters of visual features and textual concepts. A: Linear, heuristic-based criterion. B: Sigmoid transform. C: Rational transform. D: Statistical criterion using the chi-square test. All methods take into account the relative frequencies of the concepts over the entire set of visual features (horizontal axis), as well as within each cluster (vertical axis). Note that the scale on both axes is logarithmic.
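Two of the four mappings can be sketched directly. The linear heuristic follows the truncation rule stated above; the logistic variant is shown with an assumed argument (a steepness-scaled, centered frequency ratio), since the exact functional form used in the thesis is not reproduced here:

```python
import math

def heuristic_linkage(local_freq, global_freq):
    """Method A: truncate the local/global frequency ratio to [0.5, 1.3],
    then map that interval linearly onto [0, 1]."""
    if global_freq == 0:
        return 0.0
    ratio = max(0.5, min(1.3, local_freq / global_freq))
    return (ratio - 0.5) / 0.8

def sigmoid_linkage(local_freq, global_freq, steepness=4.0):
    """Method B (illustrative form only): logistic function of the centered
    frequency ratio; `steepness` plays the role of the trainable coefficient."""
    ratio = local_freq / global_freq
    return 1.0 / (1.0 + math.exp(-steepness * (ratio - 1.0)))
```

Both functions return values near 0 for under-represented concepts and near 1 for strongly over-represented ones, matching the qualitative behavior described for all four methods.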
2.5. Reverse concept propagation

The procedure outlined above for transferring associations with textual concepts from images to visual words could be easily adapted for execution in the opposite direction, therefore providing the desired classifier functionality. For every image from the test set, the interest point descriptors extracted by the visual feature detecting algorithm were mapped to the feature cluster with the nearest average descriptor. The metric employed depended on the clustering algorithm used for the construction of the visual vocabulary: Euclidean distance for k-means, Manhattan distance for k-medians, and log-likelihood for TwoStep clustering (Section IV.2.3). For cluster systems constructed with the partitioning method described in Chapter II, the test descriptors were assigned to their clusters using the decision trees provided by the algorithm. Each test visual feature inherited the linkage degrees from the most similar prototypical feature, i.e. from the cluster it was assigned to. These values were subsequently averaged across all visual features belonging to each test image, yielding association probabilities between that image and the different concepts. To obtain crisp associations, the resulting averages were compared against a minimal threshold.

2.6. Algorithm validation

To test the classifier outlined above, as well as its underlying concept propagation techniques, we employed a hybrid cross-validation design. The available data set was partitioned into six random subsets of similar sizes (range = 163–197). Images were distributed to the subsets in a stratified manner, based on their ground-truth associations with the 51 concepts (Figure IV-5). The first five subsets were employed in a standard five-fold cross-validation scheme, while the sixth was reserved for testing. Thus, for each of the examined classifier variants, e.g.
using different visual vocabularies (Section IV.2.3) or different concept propagation strategies (Section IV.2.4), five separate test runs were performed. In each of the runs, four of the data subsets, or approximately two thirds of the images and their associated text, were used for training, while the remaining one third of images was used for testing. The number of trainable parameters depended on the forward concept propagation method employed by the examined classifier. If using the heuristic or the chi-square criteria, the only trainable parameter was the minimal probability threshold applied to the associations between test images and textual concepts (Section IV.2.5), with possible values between 0.5 and 1. If relying on the sigmoid or rational transforms, then the steepness coefficient of the respective function (Section IV.2.4) was also trainable, within a fixed support interval. For each run, the values of the parameters were chosen such that the F1 score, the harmonic average of precision and recall, was maximized on the training images, either for each concept separately, or for all concepts simultaneously. Since the F1 score is a discrete-valued function, a fact especially noticeable on the relatively small sets of images used in this study, we employed an optimization procedure that can handle discontinuities – the simplex search method of Lagarias et al. (1998). To ensure that the algorithm always reached the global optimum, it was initialized from the best position found on an equidistant grid with 25 points per dimension covering the parameter ranges. The performance of the resulting classifier was subsequently measured by calculating the overall F1 score on the corresponding set of test images.

Figure IV-5: Concept frequencies within the cross-validation subsets. The available images were distributed into six stratified subsets of similar sizes, based on their ground-truth concept associations.
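The test-time pipeline of Sections IV.2.5 and IV.2.6 can be sketched compactly: descriptors inherit linkage degrees from their nearest visual word, the degrees are averaged per image, and the crisp-association threshold is chosen by a coarse grid search on the F1 score. All names are illustrative, and the simplex refinement of Lagarias et al. is omitted here:

```python
def image_concept_probs(descriptors, centroids, linkage):
    """Average, over an image's descriptors, the concept linkage degrees of
    each descriptor's nearest visual word; linkage[i] maps concepts to the
    linkage degrees of centroid i."""
    concepts = {c for row in linkage for c in row}
    sums = {c: 0.0 for c in concepts}
    for p in descriptors:
        j = min(range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
        for c, degree in linkage[j].items():
            sums[c] += degree
    return {c: s / len(descriptors) for c, s in sums.items()}

def f1_score(predicted, actual):
    """Harmonic mean of precision and recall over sets of (image, concept) pairs."""
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)

def best_threshold(pair_probs, actual, grid_points=25):
    """Pick the crisp-association threshold in [0.5, 1] that maximizes F1 on
    training data, using only the grid-initialization step."""
    best = (0.5, -1.0)
    for i in range(grid_points):
        t = 0.5 + 0.5 * i / (grid_points - 1)
        pred = {pair for pair, p in pair_probs.items() if p >= t}
        f = f1_score(pred, actual)
        if f > best[1]:
            best = (t, f)
    return best
```

In the thesis, the grid search only seeds the simplex search, which then refines the threshold (and, for the sigmoid and rational methods, the steepness parameter) continuously.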
A–F: The number of images (vertical axis) from the corresponding subset which were associated with each of the 51 concepts (horizontal axis). Subsets A–E (purple) were employed in a five-fold cross-validation scheme, while subset F (green) was used only for testing.

3. Experimental results

3.1. Associations between visual words and concepts

Several preliminary tests were conducted in order to assess whether the visual vocabulary approach is applicable to our data. For the first test, we considered the simplified task of discriminating between two categories. To maximize the chances of observing significant effects, we selected two persons that differ in both gender and clothing: the politician Angela Merkel, and the cyclist Patrik Sinkewitz. Consequently, we restricted our analysis to the interest-point descriptors extracted from images associated with either of the persons. We partitioned the reduced set of descriptors into 100 clusters with the k-means algorithm. Within each resulting cluster, we calculated the proportions of descriptors associated with either person. We then compared the obtained within-cluster person frequencies to the overall frequencies using the chi-square test. The differences were significant for 6 (16) of the clusters (with and without Bonferroni correction for multiple comparisons, respectively), thus showing that it could be possible to distinguish between persons by using only the visual information stored in interest-point descriptors. In order to verify that the above observations hold in a more general context, we repeated the test using an extended dataset, consisting of all descriptors found in images associated with the 13 most frequent persons. As above, we partitioned the selected set of descriptors into 100 clusters using the k-means algorithm, and measured the within-cluster and global frequencies of each of the 13 persons.
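The comparison underlying these preliminary tests can be sketched as a Pearson chi-square statistic; obtaining the p-value (and applying the Bonferroni adjustment) would additionally require the chi-square distribution, which is omitted in this minimal illustration:

```python
def chi_square_stat(observed_counts, global_freqs):
    """Pearson chi-square statistic comparing a cluster's concept counts with
    the counts expected under the global concept frequencies.
    observed_counts: {concept: count within the cluster};
    global_freqs: {concept: relative frequency over all descriptors}."""
    n = sum(observed_counts.values())
    stat = 0.0
    for concept, p in global_freqs.items():
        expected = n * p
        if expected > 0:
            diff = observed_counts.get(concept, 0) - expected
            stat += diff * diff / expected
    return stat
```

A statistic of zero indicates a cluster whose concept profile exactly matches the global one; large values flag the biased clusters that make person discrimination from visual words possible.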
The observed person frequency distributions were significantly different from the global ones for 86 (96) out of the 100 clusters (with and without Bonferroni correction for multiple comparisons, respectively). Essentially, almost every cluster exhibited a bias towards one or several of the 13 persons considered for this preliminary test.

3.2. Differences between classification procedures

The first experiment was designed to assess the effectiveness of the different classification procedures that were developed. For each of the five cross-validation sets, we clustered the visual features found in the training images into a simple vocabulary of 100 visual words using the k-means algorithm. For classifier construction, we employed both ground-truth and caption-based image-concept associations, in combination with each of the four forward concept propagation methods outlined in Section IV.2.4. In each case, the parameter values were optimized locally for each individual concept, globally for all concepts simultaneously, or were fixed to specific values based on the results of preliminary tests. For each cross-validation set and each classification procedure, we computed the F1 scores of the concept associations of the training and the test sets of images with respect to the ground truth. The obtained values were compared through a repeated-measures ANOVA, with three between-subjects factors – type of training associations, forward concept propagation method, and parameter optimization strategy – and one within-subjects factor – training vs. test images. Interestingly, there was only a modest difference in performance between classifiers using the different concept propagation methods, indicating that the exact formula of the probability calculation function was actually less important than simply having the correct monotonicity. Consequently, the results of the four concept propagation methods were aggregated (see Figure IV-6).
Classifiers constructed from ground-truth associations were better than those relying on the image captions ( ). This result was expected, considering that the captions were only approximately 84% correct (Section IV.2.2). The difference between the two categories of classifiers was, however, less prominent when measuring the accuracy on test images (F1 scores of 30% and 28%, respectively) than when measuring the fit to the training data (F1 scores of 49% and 44%, respectively; Wilks’ ). Classifiers that optimized parameters for each concept separately were better than those that optimized the parameters globally, which were in turn better than those with fixed parameters ( , with Scheffé correction for multiple comparisons). This effect was larger on the training images (F1 scores of 53%, 44%, and 42%, respectively) than on the test ones (F1 scores of 33%, 27%, and 26%, respectively; Wilks’ ). Overall, the classifiers that performed best were those using ground-truth associations for training and having their parameters optimized for each concept separately. They achieved an average F1 score of 56% on the training images, and were able to predict the associations of test images with 34% accuracy.
Figure IV-6: Performance of different classification procedures, with a fixed visual vocabulary of 100 feature clusters. Classifiers constructed from ground-truth associations were better than those constructed from caption-based associations, both on training (green) and on test images (yellow). Moreover, optimizing training parameters for each concept separately yielded better results than doing so globally or setting the parameters to fixed values. Values indicate average F1 scores (N = 20 measurements per data point). Error bars represent standard deviations.
3.3.
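The F1 scores used throughout these comparisons combine the precision and recall of the predicted image-concept associations with respect to the ground truth. A minimal sketch, treating each association set as a set of (image, concept) pairs; the helper name is hypothetical:

```python
def f1_score_sets(predicted, truth):
    """F1 score of a predicted set of (image, concept) pairs
    against the ground-truth set of pairs."""
    tp = len(predicted & truth)  # true positives: pairs in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)
```

For example, a classifier that recovers one of two true associations while also producing one spurious association has precision 0.5 and recall 0.5, hence an F1 score of 0.5.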
Differences between visual vocabularies
In the second experiment, we investigated the effects of the size and structure of the visual vocabulary (see Section IV.2.3) on the efficiency of the proposed image classification system. We fixed the training data to the more realistic caption-based associations, and the parameter optimization strategy to the method that performed best in the previous experiment, i.e. choosing the optimal values for each concept separately. We then varied the clustering algorithm and the number of clusters. In the first stage, we complemented the 100-cluster vocabularies used in the first experiment with cluster systems of the same size produced by the k-medians and TwoStep algorithms. While the resulting classifiers performed rather similarly (see Figure IV-7A), k-means and k-medians were markedly slower than TwoStep: the average processing times per cross-validation set were approximately 6 hours, 10 hours, and 10 minutes, respectively, on a computer with a 2 GHz dual-core Intel processor and 2 GB of RAM (see Figure IV-7B). Therefore, we used only the TwoStep method in the construction of larger vocabularies, having 500, 1000, and approximately 1900 clusters. For each of these six clustering systems and for each cross-validation set, we constructed four different classifiers, using the four different concept propagation methods outlined in Section IV.2.4. We subsequently calculated the F1 scores of each classifier on both training and test images. The resulting values were analyzed using a repeated-measures ANOVA with two between-subjects factors – clustering system and forward concept propagation method – and one within-subjects factor – training vs. test images.
The results (Figure IV-7A; values aggregated over cross-validation sets and concept propagation methods) showed that the performance of the classifiers increased significantly with the number of clusters, while the employed algorithm had only a minimal effect ( , Scheffé post-hoc comparisons). The average F1 scores improved from 50-51% on training images and 27-32% on test images for the 100-cluster vocabularies, up to 83% and 65%, respectively, for the 1900-cluster vocabulary. This trend was paralleled by the processing times needed for the construction of the TwoStep clustering systems (Figure IV-7B), which increased from 10 minutes per dataset at 100 clusters to 2 hours at 1900 clusters. These values remained below the levels registered for k-means and k-medians in the initial 100-cluster setting. However, the time requirements for assigning the test visual features to their nearest clusters were similar (Figure IV-7B), suggesting that using a complex, log-likelihood based distance measure may not be particularly efficient in a real application. Nevertheless, the TwoStep algorithm proved able to prevent over-fitting of the training data. In the 1900-cluster setup, the algorithm was not able to generate the desired number of 2000 clusters for all cross-validation sets, due to the insufficient variance of the corresponding sets of visual feature descriptors. The number of clusters produced ranged between 1850 and 2000, with an average of 1920. With these visual vocabularies, the training, caption-based associations were reproducible with an accuracy of 97%. The large number of clusters (1900) needed for the optimal representation of the 51 classes is most likely an effect of the complexity of the task. Since each person is defined by a combination of visual features, the corresponding clusters are probably highly non-linear. Such clusters are subdivided by the clustering algorithms used here, with the number of fragments proportional to the dimensionality.
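Assigning test visual features to their nearest visual words, as timed in Figure IV-7B, amounts to a nearest-centroid search over the vocabulary. A minimal Euclidean sketch (the TwoStep variant discussed above uses a log-likelihood based distance instead; all names here are illustrative):

```python
import numpy as np

def assign_visual_words(descriptors, centroids):
    """Assign each feature descriptor to its nearest visual word
    (Euclidean nearest centroid) and return the word histogram."""
    descriptors = np.asarray(descriptors, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # (n, k) matrix of squared distances, computed via broadcasting
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centroids))
    return words, hist
```

The resulting histogram is the bag-of-visual-words representation of an image, from which concept associations can then be derived.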
3.4. Differences between classifiers with an optimal visual vocabulary
In the third experiment, we reevaluated all the proposed classification procedures, using the optimal visual vocabulary size found in the second experiment (Section IV.3.3). For each of the five cross-validation sets, we used the maximal vocabularies of between 1850 and 2000 visual words, which were previously constructed using the TwoStep algorithm. As in the first experiment (Section IV.3.2), we examined every possible combination of training association type, concept propagation method, and parameter optimization strategy. We then computed the F1 scores for training and test images for each such classifier and each cross-validation set, and analyzed the results using a repeated-measures ANOVA. Similarly to the first experiment, the differences between the results of classifiers using the various concept propagation procedures were very small (within 2%), even if significant ( ). Therefore, the reported values were averaged over the cross-validation sets and over the concept propagation procedures.
Figure IV-7: Assessment of different visual vocabularies. All classifiers were constructed from caption-based associations, and used parameters optimized for each concept separately. A: The performance, as measured by the F1 scores on training (green) and test images (yellow), increased with the number of clusters, but was not affected by the algorithm used. B: The time necessary to construct the clusters (green) and to assign test objects (yellow) followed a similar trend, although the clustering algorithm had a greater impact. Note the logarithmic scale of the time axis. All values are shown as means ± standard deviations (N = 20 measurements per data point).
Figure IV-8: Performance of different classification procedures, with an optimal visual vocabulary of 1900 feature clusters.
The best classifiers (left-most columns) were able to fit the training data almost perfectly (F1 score of 97%), and to predict the associations of test images with 70% accuracy. In general, classifiers constructed from ground-truth associations were better than those constructed from caption-based associations, but less so on test images (yellow) than on training images (green). Optimizing training parameters for each concept separately was only marginally better than doing so globally or using fixed parameters. Values indicate means ± standard deviations (N = 20 measurements per data point).
The results (Figure IV-8) followed a similar pattern as in the first experiment, although the recorded F1 scores were noticeably larger. All classifiers fitted their training data almost perfectly, but since the initial caption-based associations were only 84% accurate, classifiers constructed using ground-truth associations performed better ( ). Nonetheless, this difference was significantly smaller on test images (average F1 scores of 69% when using ground-truth associations for training, and 65% when using caption-based associations) than on training images (F1 scores of 95% and 82%, respectively; Wilks’ ). The best parameter optimization strategy was choosing values for each concept separately (F1 scores of 90% at training and 68% at testing), followed by choosing values globally (F1 scores of 89% and 68%, respectively), while using predefined values resulted in the worst performance (F1 scores of 88% and 66%, respectively; overall , Scheffé post-hoc test). The differences between parameter optimization strategies were, however, diminished within test images (Wilks’ ), as well as in classifiers constructed from caption-based concept associations ( ). In both categories, there was no difference in accuracy between the local and global parameter optimization strategies.
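The local vs. global threshold strategies compared above can be illustrated with a simplified classifier that aggregates cluster-concept linkage probabilities over an image's visual words and thresholds the result. The mean aggregation and all names are assumptions for illustration, not the thesis's exact propagation formula:

```python
import numpy as np

def predict_concepts(word_concept_prob, image_words, thresholds):
    """Associate an image with every concept whose aggregated probability
    over the image's visual words exceeds the concept's threshold.
    `thresholds` may be a scalar (global strategy) or a per-concept
    vector (local strategy)."""
    # average the cluster-concept linkage probabilities of the
    # visual words present in the image
    probs = word_concept_prob[image_words].mean(axis=0)
    return np.flatnonzero(probs >= thresholds)
```

With per-concept thresholds, rarely occurring concepts can be given lower cut-offs than frequent ones, which is one plausible reason why local optimization helped most with small vocabularies.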
For example, in the setup most similar to a real application, caption-based classifiers that were trained globally achieved an average accuracy of 66% on test images, while those trained for each concept separately had an accuracy of 65.5%. In comparison, the best performing classifiers – based on ground-truth associations, and with parameters optimized for each concept separately – reached an average F1 score of 70%.
3.5. Differences between visual vocabularies with optimal size
In the fourth experiment, we reassessed the influence of the structure of the visual vocabulary on the performance of the proposed image classifier, while keeping the number of visual words close to the optimal value found in the second experiment (see Section IV.3.3). We fixed the training data to the more realistic caption-based associations, and limited the parameter optimization strategies to those with trainable parameters. We then complemented the 1900-cluster vocabularies constructed with the TwoStep algorithm with cluster systems of similar size produced with the fast projection-based partitioning method of Ilies and Wilhelm (2010). We used a weak cluster split criterion and allowed the algorithm to successively divide all clusters with more than 125 descriptors. The resulting clustering systems consisted of 1890 to 1960 clusters, with an average of 1935 over the five cross-validation sets. Compared to the TwoStep algorithm, the process was significantly faster, taking less than one minute per cross-validation set. To further confirm that the vocabulary sizes suggested by TwoStep are indeed optimal, we constructed another set of vocabularies of increased size using the same partitioning method. By decreasing the maximal cluster size to 100 descriptors, we obtained vocabularies ranging from 2320 to 2410 clusters, with an average of 2380.
For each of these three clustering systems and for each cross-validation set, we constructed eight different classifiers, using the four different concept propagation methods with the parameters optimized either locally or globally. We subsequently calculated the F1 scores of each classifier on both training and test images. The resulting values were analyzed using a repeated-measures ANOVA with three between-subjects factors – clustering system, concept propagation method, and parameter optimization strategy – and one within-subjects factor – training vs. test images. The results (Figure IV-9; values aggregated over concept propagation methods and cross-validation sets) showed that using different clustering procedures for constructing the visual vocabulary has a limited effect on the performance of the image classifier if the vocabularies are of optimal size. The average F1 scores recorded in the two 1900-cluster settings were roughly equal, with values of 83% for training images and 66% for test images. Classifiers based on visual vocabularies of 2400 clusters showed a minor but significant increase in performance, reaching F1 scores of 83% on training images and 67% on test images (overall , Scheffé post-hoc test). This result confirms that, for the analyzed cross-validation datasets, the ideal visual vocabulary size is approximately 2000, and that further increasing the number of visual words may only lead to over-fitting of the training data. Furthermore, as also observed in the previous experiment (see Section IV.3.4), the differences between training strategies are greatly diminished for large visual vocabularies. Optimizing parameters for each concept separately rather than globally had no effect on the performance on test images (average F1 score of 66% for both strategies), and only minimally increased the fit to the training images (average F1 scores of 82.5% when optimizing globally and 83% when optimizing locally; overall ).
Figure IV-9: Assessment of different visual vocabularies with large numbers of feature clusters. All classifiers were constructed from caption-based associations. There was no difference between classifiers based on vocabularies with the optimal size of 1900 words that were constructed with different clustering algorithms. Further increasing the number of clusters had a minimal effect on the performance. Optimizing training parameters for each concept separately was only marginally better than doing so globally. Values indicate means ± standard deviations (N = 20 measurements per data point).
3.6. Summary
The experiments presented here indicated that the developed image classification procedure is feasible, being able to annotate new images with textual concepts with an accuracy of up to 70% under optimal training conditions. The most influential factor proved to be the size of the visual vocabulary. Classifiers constructed with an optimal number of visual words made approximately half as many errors as those with an arbitrarily selected number (Section IV.3.3). By contrast, the differences in performance between classifiers built with the different concept propagation functions and following different training strategies were of small amplitude. These differences were even smaller when the number of visual words appropriately represented the variability of the set of visual features (Figure IV-8). In this latter setup (Sections IV.3.4, IV.3.5), the F1 scores measured on the sets of test images did not differ between the classifiers optimized for each concept separately and those trained for all concepts simultaneously. The effect of using the less accurate caption-based associations as training data was also significantly diminished (Figure IV-8). These results are especially relevant in the context of a real application, given that more complex training procedures require significantly more processing resources (Figure IV-10).
A reliable image classifier can thus be built more efficiently by using a simpler concept propagation procedure, such as the statistical chi-square criterion, and by optimizing the association probability thresholds globally rather than for each concept separately. If the caption-based concept associations are known to be sufficiently accurate, and if the clustering algorithm chosen for building the visual vocabulary can select the number of clusters automatically, as e.g. TwoStep or the algorithm described in Chapter II.2, then the entire classifier construction process can be executed without external input.
Figure IV-10: Timing differences between different training strategies. Classifiers with two parameters were markedly slower to train than those with only one parameter. Furthermore, optimizing the parameters for each concept separately necessitated more time than doing so globally, or than setting the parameters to fixed values. By contrast, the differences between classifiers with large visual vocabularies (green) and those with small vocabularies (yellow) were of smaller amplitude. Note the logarithmic scale of the time axis. Values indicate means ± standard deviations (N = 40 measurements per data point).
V. Discussion
1. Cluster analysis for high-dimensional datasets
In Chapter II, we described a novel algorithm for partitioning large, high-dimensional data, combining an innovative feature selection technique and fast correlation and density estimators into a projection pursuit approach. Results indicate that our method can quickly and efficiently recover groups of objects that are well defined (i.e. separable from the other clusters) in their corresponding subspace. The method’s performance does not decrease if different groups share dimensions. Clusters that are clearly separated from the other objects in their subspace, or that lie in larger subspaces, will typically be extracted more accurately.
The basis for this effect is that the quality of the separation on the high-variance projections increases with the number of relevant dimensions employed in the PCA. Considering only a subset of the associated dimensions may impede the accurate recovery of a cluster, as such lower-dimensional projections may overlap with the rest of the data. This suggests that redundant attributes should not necessarily be discarded, contrary to what is recommended in the literature (e.g. Becher et al., 2000; Hennig, 2004). Nonetheless, feature selection is necessary when working in a high-dimensional context if aiming for low processing times and memory requirements, especially when employing PCA. The implemented feature selection rule (Section II.2.2) guarantees that the number of dimensions included in the analysis is proportional to the average cluster subspace dimensionality and the number of subclusters, rather than to the total number of attributes. This “effective dimensionality” term decreases rapidly when descending in the cluster tree, and hence the algorithm may need to calculate a large correlation matrix only for the first several steps. Overall, the time and memory requirements for calculating correlations are very low (Section II.4.4). This is especially important when analyzing datasets with high dimensionality relative to the number of objects (e.g. gene expression data), where, even if the datasets fit into memory, attempting to perform a PCA on the full set of attributes may generate out-of-memory errors. Importantly, our method does not require the user to provide the number of clusters. If no parameters are specified, the algorithm will produce a cut through the cluster tree at a level defined by the default split score threshold. This constitutes a preview of the structure in the data, and will typically contain several large clusters with visible characteristics (e.g.
narrow value ranges on some dimensions), and a number of considerably smaller clusters representing grid cells with sparse populations (e.g. Figure II-7). Such small clusters could be either discarded as noise (as suggested by Hinneburg & Keim, 1999), or retained as expressions of interesting patterns in the data (e.g. subgroups of objects with additional internal cohesion). In the former case, a second analysis of the data with the inferred parameters (optimal number of clusters, and minimal cluster size) would yield a superior solution. Alternatively, the production of small clusters could be inhibited by modifying the scoring system such that asymmetric cases are penalized more heavily, e.g. by using the minimum of the excess masses (Stuetzle & Nugent, 2010). In addition, the order of the objects has no disruptive effect on performance, since object sampling is randomized. The algorithm makes no prior assumptions about the object index. Even if the ordering may expose data structures (e.g. if all objects in a cluster have consecutive indices), it does not affect the final solution. Furthermore, since it is based on correlations, our method is invariant to scaling: the data can have different ranges on the various attributes with no impact on the solution provided. On the downside, since our algorithm relies on hyper-planes for partitioning the data, it is intrinsically linear, and cannot separate interlocked clusters. Such clusters will, however, be subdivided into smaller groups (Figure II-6C), and hence could be recovered at post-processing, for example by allowing neighboring groups with continuous subspace densities to merge (Hinneburg & Keim, 1999). Alternatively, the algorithm could be modified to incorporate non-linear projections that could separate clusters better, e.g. kernel PCA (Schölkopf et al., 1998) or other support vector machine techniques.
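The hyperplane-based, intrinsically linear character of such partitioning can be illustrated with a schematic principal-direction split. This is a sketch in the spirit of the method only, not the thesis implementation, which additionally uses feature selection and fast density estimators; the gap-based cut point is an assumed simplification:

```python
import numpy as np

def principal_direction_split(X):
    """Split a data matrix X (n objects x d attributes) into two groups
    by thresholding the first principal component at the widest gap
    between projected values, i.e. a single separating hyperplane."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # first principal direction via SVD of the centered data
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ vt[0]
    # cut at the midpoint of the largest gap in the sorted projections
    order = np.sort(proj)
    gaps = np.diff(order)
    i = gaps.argmax()
    cut = (order[i] + order[i + 1]) / 2
    return proj <= cut
```

Because the split is a hyperplane orthogonal to one projection direction, interlocked or strongly non-convex clusters cannot be isolated in a single step; they are instead fragmented over several recursive splits, as noted above.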
Such additional processing steps would of course be rather costly, and may not offer many advantages: in high-dimensional spaces, even if clusters of such irregular shapes exist, they are more difficult to interpret, and decomposition into smaller groups that are easier to describe may be preferable.
2. Generation of synthetic datasets
In Chapter III, we described a new method for constructing synthetic datasets with clusters. Compared to previously developed methods, our procedure can generate a wider variety of datasets, featuring clusters of different types and with different degrees of separation or overlap (e.g. Figure III-8). Like most other procedures, it permits the addition of noise objects and dimensions. However, our method is one of the few allowing for the usage of probability distributions other than normal and uniform in the construction of clusters or noise (other examples are Waller et al., 1999; Steinley & Henson, 2005), as well as one of the few allowing for the presence of subspace clusters (e.g. Zaït & Messatfa, 1997; Procopiuc et al., 2002). To our knowledge, it is the first dataset generator to feature multimodal clusters, nonlinearly distorted clusters, and obliquely rotated clusters (Figure III-6). As a trade-off, we rely on a more simplistic definition of cluster overlap (Section III.2.3), compared to other methods using more precise implementations (e.g. Maitra & Melnykov, 2010; Steinley & Henson, 2005). Nevertheless, our approach is effective in generating datasets with specific degrees of cluster overlap (see Figure III-10B). Moreover, it permits the precise definition of the degrees of cluster separation for datasets or parts of datasets with non-overlapping clusters (as in Qiu & Joe, 2006). The proposed generator of synthetic datasets has relatively low computational complexity (see Section III.3.4).
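The kind of dataset such a generator produces can be sketched minimally as Gaussian clusters plus uniformly distributed noise objects. This toy version omits the multimodal, distorted, rotated, and subspace clusters described above, and all parameter names are illustrative:

```python
import numpy as np

def generate_dataset(n_per_cluster, centers, spread=1.0, noise_frac=0.1, seed=0):
    """Generate spherical Gaussian clusters around the given centers,
    plus a fraction of uniform noise objects labeled -1."""
    rng = np.random.default_rng(seed)
    centers = np.asarray(centers, dtype=float)
    dim = centers.shape[1]
    clusters = [c + spread * rng.standard_normal((n_per_cluster, dim))
                for c in centers]
    X = np.vstack(clusters)
    # noise objects drawn uniformly over the bounding box of the clusters
    n_noise = int(noise_frac * len(X))
    noise = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n_noise, dim))
    labels = np.concatenate([np.repeat(np.arange(len(centers)), n_per_cluster),
                             -np.ones(n_noise, dtype=int)])
    return np.vstack([X, noise]), labels
```

Keeping the ground-truth labels alongside the objects is what makes such data useful for evaluating and calibrating clustering algorithms, as argued below.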
The time necessary to construct a dataset depends linearly on the number of objects, at most quadratically on the number of dimensions, and at most cubically on the number of clusters (Figure III-11). Large datasets, consisting of thousands of objects defined in dozens of dimensions and grouped in dozens of clusters, can be generated in less than one minute on a personal computer. As an additional advantage, our generator is easy to use: all input parameters are accessible through an intuitive graphical interface (Figure III-5), and groups of related advanced options can be enabled and disabled concurrently. Considering the short dataset generation times and the variety of cluster configurations that can be produced, our method should prove an excellent tool for the evaluation and calibration of both traditional (e.g. k-means; Hartigan & Wong, 1979) and modern clustering algorithms (e.g. Zhang et al., 1997; Procopiuc et al., 2002; Ilies & Wilhelm, 2010). Additionally, it may be used for assessing the robustness of techniques for determining the number of clusters in a dataset (e.g. Atlas & Overall, 1994) in the presence of various proportions of noise, and for clusters with different degrees of separation or overlap.
3. Automatic image classification
In Chapter IV, we described a novel system for classifying images, which relies on a linkage scheme between keywords found in file captions and representative visual features appearing in the actual images (Section IV.2.1). We first presented a framework for propagating associations with textual concepts from the set of training images to the visual vocabulary of prototypical features, and then to the images from the test set (Sections IV.2.4–IV.2.5). Subsequently, we employed the developed methodology in a complex person classification task (Sections IV.3.2–IV.3.4) conducted on a non-standardized, heterogeneous data set of images harvested from news websites (Section IV.2.2).
The top classifiers were highly effective, with almost perfect characterization and no over-fitting of the training data (average F1 score of 97%), and could be used to annotate novel images with 70% accuracy. While the results may have been affected by the presence of duplicate images, a likely occurrence in such a realistic setting, many of the copies found had different scales; therefore, from an image processing point of view, they were different images, with different sets of detectable and quantifiable visual features. In general, classifiers that optimized the minimal probability thresholds applied to image-concept associations for each concept separately performed better than those that used a global optimization strategy (Figure IV-6), but they took longer to construct (Figure IV-10). The increased effectiveness of locally trained classifiers was likely an effect of the different concepts being represented to different extents in the data set. Indeed, during preliminary testing, the optimal association probability thresholds appeared to be moderately correlated with the global concept frequencies. Interestingly, the differences between parameter optimization strategies diminished as the vocabulary size increased (Figure IV-8). Presumably, the impact of having a visual vocabulary of appropriate size is so high (Figure IV-7A) that, when the vocabulary is appropriately sized, the influence of the other parameters is minimized, and otherwise it is accentuated. This is made especially evident by the fact that the choice of the clustering algorithm employed in the construction of the visual vocabulary has no particular importance (Sections IV.3.3, IV.3.5). Therefore, faster algorithms such as the partitioning method described in Chapter II (Ilies & Wilhelm, 2010) can be used, even if the provided solution is not optimal. As Stommel & Herzog (2009) have observed, it may even be possible to use clusters centered at uniformly distributed random points.
Unfortunately, this flexibility translates into low interpretability of the visual words: for a human observer, it may be difficult to relate the visual feature clusters to recognizable objects, or to understand why some images are classified correctly and others are not (see examples in Figure V-1 below).
Figure V-1: Example images from the dataset employed in Chapter IV. The images are grouped by subset (training vs. test) and by the accuracy of the associations generated by one of the tested classifiers. Visual features detected by the image processing algorithm are marked with overlaid contours at the corresponding locations. The size of each contour is proportional to the scale of the corresponding feature. Visual features associated with the person shown in the images (Angela Merkel) are represented by squares, while those associated with other persons are represented by circles.
4. Future developments
4.1. Projection-based partitioning algorithm
While fully functional, the proposed partitioning method could be extended in a number of directions. A necessary improvement would be to include a post-processing stage for fine-tuning the solution, e.g. by revising the assignment of objects near cluster boundaries, removing outliers, and/or creating a noise category. Equally important would be to add support for missing values and for more varied data formats. Besides continuous numerical data, discrete attributes could be employed after conversion to a numerical scale (if ordinal) and restricted noising. The latter step is necessary to avoid over-binning: the occurrence of numerous empty bins in the histograms, caused by the data having only a limited number of possible values (as compared to the group-size dependent number of bins). A low-cost solution would be to displace each object on all attributes by a random fraction (uniformly distributed) of the corresponding step size.
By construction, the proposed algorithm is able to process large datasets in relatively short times. For example, partitioning a dataset of 150000 objects and 128 dimensions into approximately 2000 clusters takes just under one minute on a normal computer, compared with several hours for TwoStep or k-means (see Section IV.3). The method can be easily adapted to working with datasets larger than the available system memory, as it needs only a fraction of the data at any given time. If the data is stored on the hard disk, the algorithm will need to perform only several (three to five) block reads per cluster analyzed, and therefore O(log(K)) reads of the entire dataset (where K is the total number of clusters found), a process which does not entail significant computational costs. Performance can be further increased by parallelization: if the data and the current cluster system are stored in a common repository, any number of different processors could work in parallel on different subsets, as there is no object redistribution involved.
4.2. Synthetic datasets generator
While the presented method for generating synthetic datasets is fully functional and should therefore constitute a significant contribution to the field, there are several computational aspects that would benefit from further investigation. The algorithm for placing subspace clusters in the data space (Section III.2.3) requires increased processing resources for datasets containing large numbers of clusters with high subspace dimensionality. The process could be made more efficient, for example by exploiting the fact that calculations made for a given subspace are also valid for its sub-subspaces.
In addition, the nonlinear distortion of clusters (Section III.2.2) requires the initial generation of significantly more objects, in order to counteract the subsequent down-sampling necessary for preserving the probability density functions of distorted clusters. This is a resource-costly process, especially for clusters with skewed distributions, which are particularly affected by the down-sampling. The computational cost could be reduced by pre-computing the dimensions involved in the nonlinear distortion, and only generating additional values on these dimensions. Other possible improvements include more general cluster deformations, the implementation of user-defined subspace overlap for subspace clusters (as in Procopiuc et al., 2002), and the use of distributions with prescribed degrees of skewness and kurtosis (as in Waller et al., 1999).
4.3. Automatic image annotation system
A generalization of the proposed image classification system to nonspecific concepts obtained via latent semantic analysis (Deerwester et al., 1990) of captions, conducted on a significantly extended set of images, is currently in progress. Other potential developments include refining the final cluster-concept linkage probabilities by relying on a more selective transfer process of associations from images to visual features, and enabling the system to work in an online setting, e.g. as part of a web crawler that continuously processes and labels image files from the Internet. The former objective could be achieved geometrically, by calculating linkage degrees between textual concepts and visual sentences, i.e. groups of visual features co-occurring at similar relative positions (Leibe & Schiele, 2008).
Alternatively, the cluster-concept linkage probabilities could be improved analytically, by first emphasizing the associations of training visual features when both the source image and the cluster agree on the represented concept, and then recalculating the cluster-concept linkage degrees. Since all preliminary processing and classifier training steps are already automatic, the only component that would need alteration to enable the system to work in an online setting is the visual vocabulary constructor. Consequently, we aim to extend the partitioning algorithm presented in Chapter II such that it can process data streams.

VI. References

Aggarwal, C. C., & Yu, P. S. (2000). Finding generalized projected clusters in high dimensional spaces. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (pp. 70-81). New York, NY: ACM.
Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional spaces. Proceedings of the 8th International Conference on Database Theory (pp. 420-434). Berlin, Germany: Springer.
Aggarwal, C. C., Wolf, J. L., Yu, P. S., Procopiuc, C., & Park, J. S. (1999). Fast algorithms for projected clustering. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (pp. 61-72). New York, NY: ACM.
Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (pp. 94-105). New York, NY: ACM.
Anderson, T. W., Olkin, I., & Underhill, L. G. (1987). Generation of random orthogonal matrices. SIAM Journal on Scientific and Statistical Computing, 8(4), 625-629.
Ankerst, M., Breunig, M. M., Kriegel, H.-P., & Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (pp. 49-60).
New York, NY: ACM.
Asuncion, A., & Newman, D. J. (2007). UCI Machine Learning Repository. http://www.ics.uci.edu/~mlearn/MLRepository.html. Irvine, CA: University of California at Irvine, School of Information and Computer Science.
Atlas, R., & Overall, J. (1994). Comparative evaluation of two superior stopping rules for hierarchical cluster analysis. Psychometrika, 59(4), 581-591.
Barber, C. B., Dobkin, D. P., & Huhdanpaa, H. T. (1996). The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software, 22(4), 469-483.
Becher, J., Berkhin, P., & Freeman, E. (2000). Automating exploratory data analysis for efficient data mining. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 424-429). New York, NY: ACM.
Berkhin, P. (2006). A survey of clustering data mining techniques. In J. Kogan, C. Nicholas, & M. Teboulle (Eds.), Grouping Multidimensional Data: Recent Advances in Clustering (pp. 25-71). Berlin, Germany: Springer.
Bishop, C. M. (2006). Pattern recognition and machine learning. Berlin, Germany: Springer.
Boley, D. L. (1998). Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4), 325-344.
Bolme, D., Beveridge, R., Teixeira, M., & Draper, B. (2003). The CSU face identification system: Its purpose, features and structure. International Conference on Vision Systems (pp. 304-311).
Breunig, M., Kriegel, H.-P., Kröger, P., & Sander, J. (2001). Data bubbles: Quality preserving performance boosting for hierarchical clustering. Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (pp. 79-90). New York, NY: ACM.
Candillier, L., Tellier, I., Torre, F., & Bousquet, O. (2005). SSC: Statistical subspace clustering. Machine Learning and Data Mining in Pattern Recognition: 4th International Conference (pp. 100-109). Berlin, Germany: Springer.
Chang, W. C. (1983).
On using principal components before separating a mixture of two multivariate normal distributions. Journal of the Royal Statistical Society, Series C (Applied Statistics), 32(3), 267-275.
Cheng, C., Fu, A., & Zhang, Y. (1999). Entropy-based subspace clustering for mining numerical data. Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 84-93). San Diego, CA.
Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B (Methodological), 39(1), 1-38.
Drozdzynski, W., Krieger, H. U., Piskorski, J., Schäfer, U., & Xu, F. (2004). Shallow processing with unification and typed feature structures - foundations and applications. Künstliche Intelligenz, 18(1), 17-23.
Faloutsos, C., & Lin, K. (1995). FastMap: A fast algorithm for indexing, data mining and visualization of traditional and multimedia datasets. Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (pp. 163-174). New York, NY: ACM.
Fern, X. Z., & Brodley, C. E. (2003). Random projection for high dimensional data clustering: A cluster ensemble approach. Proceedings of the 20th International Conference on Machine Learning (pp. 186-193). Menlo Park, CA: AAAI Press.
Fraley, C., & Raftery, A. E. (1999). MCLUST: Software for model-based cluster analysis. Journal of Classification, 16(2), 297-306.
Ganti, V., Ramakrishnan, R., Gehrke, J., Powell, A., & French, J. (1999). Clustering large datasets in arbitrary metric spaces. Proceedings of the 15th International Conference on Data Engineering (pp. 502-511). Washington, DC: IEEE Computer Society.
Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large databases.
Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (pp. 73-84). New York, NY: ACM.
Hartigan, J. A. (1981). Consistency of single linkage for high-density clusters. Journal of the American Statistical Association, 76(374), 388-394.
Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82(397), 267-270.
Hartigan, J. A., & Hartigan, P. M. (1985). The dip test of unimodality. The Annals of Statistics, 13(1), 70-84.
Hartigan, J., & Wong, M. (1979). Algorithm AS 136: A k-means clustering algorithm. Applied Statistics, 28(1), 100-108.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2003). The elements of statistical learning: Data mining, inference and prediction (3rd ed.). Berlin, Germany: Springer.
Hennig, C. (2004). Asymmetric linear dimension reduction for classification. Journal of Computational and Graphical Statistics, 13(4), 930-945.
Hermes, T., Miene, A., & Herzog, O. (2005). Graphical search for images by PictureFinder. International Journal of Multimedia Tools and Applications, 27(2), 229-250.
Hinneburg, A., & Keim, D. A. (1998). An efficient approach to clustering large multimedia databases with noise. Proceedings of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 58-65). New York, NY.
Hinneburg, A., & Keim, D. A. (1999). Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. Proceedings of the 25th International Conference on Very Large Data Bases (pp. 506-517). San Francisco, CA: Morgan Kaufmann.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13-30.
Huber, P. J. (1985). Projection pursuit. Annals of Statistics, 13(2), 435-475.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193-218.
Ilies, I., & Wilhelm, A.
(2008). Projection-based clustering for high-dimensional data sets. COMPSTAT 2008: Proceedings in Computational Statistics. Heidelberg, Germany: Physica Verlag.
Ilies, I., & Wilhelm, A. (2010). Projection-based partitioning for large, high-dimensional datasets. Journal of Computational and Graphical Statistics, 19(2), 474-492.
Ilies, I., Jacobs, A., Wilhelm, A., & Herzog, O. (2009). Classification of news images using captions and a visual vocabulary. Technical Report No. 50, Universität Bremen, TZI, Bremen, Germany.
Jacobs, A., Hermes, T., & Wilhelm, A. (2007). Automatic image annotation by association rules. Electronic Imaging and the Visual Arts EVA 2007 (pp. 108-112). Berlin, Germany.
Jacobs, A., Herzog, O., Wilhelm, A., & Ilies, I. (2008). Relaxation-based data mining on images and text from news web sites. Proceedings of IASC2008 (pp. 736-743). Yokohama, Japan.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323.
Joachims, T. (1998). Text categorization with support vector machines. Proceedings of the European Conference on Machine Learning. Graz, Austria.
Kim, M., Yoo, H., & Ramakrishna, R. S. (2004). Cluster validation for high-dimensional datasets. Proceedings of the 11th International Conference on Artificial Intelligence: Methodology, Systems, Application (pp. 178-187).
Kogan, J. (2007). Introduction to clustering large and high-dimensional data. New York, NY: Cambridge University Press.
Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464-1480.
Kriegel, H.-P., Kröger, P., & Zimek, A. (2009). Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data, 3(1), 1-58.
Lagarias, J. C., Reeds, J., Wright, M. H., & Wright, P. E. (1998). Convergence properties of the Nelder-Mead simplex method in low dimensions.
SIAM Journal on Optimization, 9(1), 112-147.
Leibe, B., & Schiele, B. (2004). Scale-invariant object categorization using a scale-adaptive mean-shift search. Proceedings of the 26th DAGM Symposium on Pattern Recognition (pp. 145-153). Tübingen, Germany.
Liu, B., Xia, Y., & Yu, P. S. (2000). Clustering through decision tree construction. Proceedings of the 9th International Conference on Information and Knowledge Management (pp. 20-29). New York, NY: ACM.
Lowe, D. G. (1999). Object recognition from local scale-invariant features. Proceedings of the 7th IEEE International Conference on Computer Vision (pp. 1150-1157). Kerkyra, Greece.
Maitra, R., & Melnykov, V. (2010). Simulating data to study performance of finite mixture modelling and clustering algorithms. Journal of Computational and Graphical Statistics, 19(2), 354-376.
Mangasarian, O. L., & Wolberg, W. H. (1990). Cancer diagnosis via linear programming. SIAM News, 23(5), 1-18.
Marsaglia, G., & Tsang, W. W. (2000). The ziggurat method for generating random variables. Journal of Statistical Software, 5, 1-7.
Matsumoto, M., & Nishimura, T. (1998). Mersenne twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Transactions on Modeling and Computer Simulation, 8(1), 3-30.
McCallum, A., Nigam, K., & Ungar, L. H. (2000). Efficient clustering of high-dimensional data sets with application to reference matching. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 169-178). Boston, MA.
McIntyre, R. M., & Blashfield, R. K. (1980). A nearest-centroid technique for evaluating the minimum-variance clustering procedure. Multivariate Behavioral Research, 15(2), 225-238.
Milenova, B. L., & Campos, M. M. (2002). O-Cluster: Scalable clustering of large high dimensional data sets. Proceedings of the 2002 IEEE International Conference on Data Mining (pp. 290-297). Washington, DC: IEEE Computer Society.
Milligan, G. W. (1985).
An algorithm for generating artificial test clusters. Psychometrika, 50(1), 123-127.
Milligan, G. W. (1996). Clustering validation: Results and implications for applied analyses. In P. Arabie, L. J. Hubert, & G. De Soete (Eds.), Clustering and Classification (pp. 341-375). River Edge, NJ: World Scientific.
Mulvey, J. T., & Crowder, H. P. (1979). Cluster analysis: An application of Lagrangian relaxation. Management Science, 25(4), 329-340.
Nagesh, H., Goil, S., & Choudhary, A. (2001). Adaptive grids for clustering massive data sets. Proceedings of the First SIAM International Conference on Data Mining. Philadelphia, PA: SIAM.
Overall, J. E., & Magee, K. N. (1992). Replication as a rule for determining the number of clusters in hierarchical cluster analysis. Applied Psychological Measurement, 16(2), 119-128.
Parsons, L., Haque, E., & Liu, H. (2004). Subspace clustering for high dimensional data: A review. ACM SIGKDD Explorations Newsletter, 6(1), 90-105.
Procopiuc, C. M., Jones, M., Agarwal, P. K., & Murali, T. M. (2002). A Monte Carlo algorithm for fast projective clustering. Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (pp. 418-427). New York, NY: ACM.
Qiu, W., & Joe, H. (2006). Generation of random clusters with specified degree of separation. Journal of Classification, 23(2).
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.
Raftery, A. E., & Dean, N. (2006). Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473), 168-178.
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24, 513-523.
Savaresi, S. M., Boley, D. L., Bittanti, S., & Gazzaniga, G. (2002). Cluster selection in divisive clustering algorithms. Proceedings of the 2nd SIAM International Conference on Data Mining (pp. 299-314). Philadelphia, PA: SIAM.
Schober, J. P., Hermes, T., & Herzog, O.
(2005). PictureFinder: Description logics for semantic image retrieval. Proceedings of the 2005 IEEE International Conference on Multimedia and Expo (pp. 1571-1574). Amsterdam, The Netherlands.
Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299-1319.
Scott, D. W. (1992). Multivariate density estimation: Theory, practice, and visualization. New York, NY: Wiley.
Siebert, J. P. (1987). Rule-based vehicle recognition. Research Memorandum, Turing Institute, Glasgow, UK.
Sigillito, V. G., Wing, S. P., Hutton, L. V., & Baker, K. B. (1989). Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest, 10(3), 262-266.
Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. Proceedings of the 9th IEEE International Conference on Computer Vision (pp. 1470-1477). Nice, France.
Steinley, D., & Henson, R. (2005). OCLUS: An analytic method for generating clusters with known overlap. Journal of Classification, 22(2), 221-250.
Stommel, M., & Herzog, O. (2009). SIFT-based object recognition with fast alphabet creation and reduced curse of dimensionality. Image and Vision Computing New Zealand (pp. 136-141).
Stuetzle, W. (2003). Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification, 20(1), 25-47.
Stuetzle, W., & Nugent, R. (2010). A generalized single linkage method for estimating the cluster tree of a density. Journal of Computational and Graphical Statistics, 19(2), 397-418.
Teboulle, M., & Kogan, J. (2005). Deterministic annealing and a k-means type smoothing optimization algorithm for data clustering. Proceedings of the Workshop on Clustering High Dimensional Data and its Applications (pp. 13-22). Philadelphia, PA: SIAM.
Ultsch, A., & Herrmann, L. (2006). Automatic clustering with U*C.
Technical Report, Philipps-Universität Marburg, Department of Mathematics and Computer Science, Marburg, Germany.
Waller, N. G., Underhill, J. M., & Kaiser, H. A. (1999). A method for generating simulated plasmodes and artificial test clusters with user-defined shape, size, and orientation. Multivariate Behavioral Research, 34(2), 123-142.
Wann, C. D., & Thomopoulos, S. A. (1997). A comparative study of self-organizing clustering algorithms Dignet and ART2. Neural Networks, 10(4), 737-743.
Wiskott, L., Fellous, J. M., Krüger, N., & von der Malsburg, C. (1997). Face recognition by elastic graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 775-779.
Woo, K.-G., Lee, J.-H., Kim, M.-H., & Lee, Y.-J. (2004). FINDIT: A fast and intelligent subspace clustering algorithm using dimension voting. Information and Software Technology, 46, 255-271.
Zaït, M., & Messatfa, H. (1997). A comparative study of clustering methods. Future Generation Computer Systems, 13(2-3), 149-159.
Zhang, T., Ramakrishnan, R., & Livny, M. (1997). BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2), 141-182.