BULETINUL INSTITUTULUI POLITEHNIC DIN IAŞI
Published by the "Gheorghe Asachi" Technical University of Iaşi
Tome LVIII (LXII), Fasc. 3, 2012
Automatic Control and Computer Science Section

SURVEY OF DATA CLUSTERING ALGORITHMS

BY CORINA CÎMPANU* and LAVINIA FERARIU

"Gheorghe Asachi" Technical University of Iaşi, Faculty of Automatic Control and Computer Engineering

Received: August 1, 2012
Accepted for publication: September 15, 2012

* Corresponding author; e-mail: [email protected]

Abstract. Unsupervised grouping of patterns remains a challenging problem, especially because this task should be performed without rich a priori information, and sometimes even without knowing the number of categories. Given the intense research effort in the field, this paper aims to provide an updated survey of the most important clustering algorithms. Recent approaches are introduced in comparison with traditional methods that involve similar categories of clustering policies, and key open issues are outlined. The paper also discusses the opportunity of evolutionary algorithms within the context of data clustering, together with helpful exemplifications.

Key words: data analysis, clustering algorithms, classification, evolutionary algorithms.

2010 Mathematics Subject Classification: 62H30, 68T10.

1. Introduction

Grouping samples into relevant categories represents a key issue for many applications. Basically, it provides a preliminary structure for large databases (e.g., recorded during specific experimental trials), thus enabling faster data mining, as well as an abstract interpretation of the samples. The categories should be depicted even though the dependencies between the variables recorded within the database are not completely understood beforehand. Besides, the process can be carried out in a supervised or an unsupervised manner. The former, called classification, exploits known target classes associated with the training samples, whilst clustering delimits the groups of patterns based on the analysis performed on the training data set, without a priori information regarding the class each training sample belongs to. One of the most common applications of classification/clustering is pattern recognition (Duin & Pekalska, 2005). In this case, each cluster is associated with a category of patterns; hence, once a sample is included in a cluster, that sample may be interpreted as an instance of the corresponding category and linked to a specific concept. Due to the nonlinear landscape of the input space and the large number of features characterizing each sample (not necessarily all of them relevant and independent), the delimitation of clusters represents a challenging task. Obviously, the problem becomes even more difficult when working with a large, unbalanced training data set (i.e. containing very different numbers of samples in each cluster and/or samples non-uniformly distributed within each group).

This paper aims to provide an insightful overview of clustering algorithms. The basic taxonomy is reviewed in Section 2, whilst Section 3 outlines the recent trends in the field and browses through the main clustering algorithms. Evolutionary clustering is briefly discussed in Section 4, as an extension of non-evolutionary clustering. Finally, several summarizing remarks are presented in Section 5.
2. Clustering Taxonomy

Let us assume that a pattern, $x_i \in S \subseteq \mathbb{R}^d$, represents a sample recorded within the available data set, whilst a feature is an individual attribute describing the patterns. Hence, the data collection can be considered as a pattern set, $X = \{x_1, x_2, \ldots, x_N\}$, where each pattern is characterized by a vector of feature values, $x_i = [x_{ij}]_{j=1,\ldots,d} \in S$ (Jain et al., 1999). Any distance metric stated within the feature space $S \subseteq \mathbb{R}^d$ can be used for quantifying pattern similarities. Once a dissimilarity distance metric is adopted, $D : S \times S \to \mathbb{R}_+$, a cluster can be described as a group of objects that are more similar to one another than to objects from other clusters: $D(x_i, x_j) > D(x_i, x_m)$ for any $x_i, x_j, x_m \in X$, with $x_m$ and $x_i$ belonging to the same cluster and $x_j$ belonging to another cluster. Please note that a dissimilarity distance computed between two patterns is high if the samples are very different and small if the samples are similar.

As already mentioned, clustering represents the unsupervised classification of patterns into clusters. Intuitively, this grouping should be done such that the similarities between patterns within a cluster are maximized, while the similarities of objects from distinct clusters are minimized. However, if the number of clusters is not known in advance, the clustering algorithm should also evaluate the relevance of each category accepted within the final map. In this respect, clustering is called supervised whenever the number of clusters is preset, and unsupervised otherwise (Luchian, 1999).

Some clustering algorithms decide without any doubt (by means of Boolean variables) whether a sample belongs to a specific category. This type of grouping is called hard. The alternative lies in fuzzy clustering, which provides numerical values in [0, 1] for describing the membership degrees of a sample to the depicted clusters. Atypically, some algorithms accept non-disjoint clusters, depicted via rough clustering. If not otherwise explicitly specified, hard clustering is assumed in the following.

Most of the existing clustering algorithms are polythetic, namely they simultaneously use all the available d features for deciding the category to which a sample belongs. This strategy allows the configuration of any cluster map, although it leads to complex optimization problems involving high-dimensional search spaces. Otherwise, the problem could be split into a sequence of simpler, monothetic clustering processes, each one performed in relation to a single feature (Jain et al., 1999). In this latter case, the main risk consists in obtaining a large number of clusters or in the impossibility of providing a correct grouping for a predefined number of categories.

The main challenges of clustering are related to noisy, sparse training data and/or high-dimensional feature spaces (Breaban, 2010). In this context, some preliminary pre-processing of the feature space can be helpful. Feature mapping projects the original feature set into a new one that provides a more convenient landscape and/or a reduced dimension, but which is physically meaningless and difficult to interpret (Liu et al., 2003). Instead, feature selection (Dhillon et al., 2003) chooses a relevant subset from the original feature set, retaining the original physical significance, which supports a better understanding of the subsequent learning process (Boutsidis et al., 2009).
However, the selection of the data attributes relevant for clustering represents a difficult problem, due to the lack of class labels (Law et al., 2003).

In relation to the number of cluster maps available at the end, the algorithms can be hierarchical or partitioning. Hierarchical clustering builds a hierarchy of groups, which gathers multiple maps of clusters designed for different granularity levels. The most common representation of this hierarchy is a tree-like nested structure called a dendrogram, $H = \{H_1, \ldots, H_p\}$. A cluster in the hierarchy is divided into sub-clusters (child nodes placed on the lower level of the tree) or merged with other clusters to form a bigger cluster (a parent node placed on the upper tree level). The hierarchy can be constructed by means of top-down (divisive) algorithms or bottom-up (agglomerative) algorithms. Afterwards, depending on its specific goals, the application can tune the trade-off between generality and detail solely by selecting a convenient level of granularity in the depicted hierarchy. Having a cluster $C_i$ of partition $H_m$ and another cluster $C_j$ of partition $H_l$, the relation $m > l$ means that $H_m$ provides a grouping into fewer, more general clusters than $H_l$. In hard clustering, this implies that $C_i$ and $C_j$ are disjoint, or $C_j$ is included in $C_i$. On the other hand, partitioning clustering algorithms offer a single cluster map, usually built by means of an iterative, deterministic or stochastic optimization technique. A hard partitioning clustering procedure attempts to seek a k-partitioning of $X$ into the clusters $C = \{C_1, \ldots, C_k\}$, such that $C_i \cap C_j = \emptyset$ for any $C_i, C_j \in C$ with $i \neq j$, any $C_i \in C$ contains at least one element of $X$, and $X = \bigcup_{i=1}^{k} C_i$. Although the result seems to be less flexible, almost all of the algorithms belong to this latter category, given the computational complexities involved.

The dissimilarity between two different samples can be measured by means of distance metrics tailored to the type of attributes (Rokach & Maimon, 2005). For numeric continuous attributes, the most common are the Minkowski distances (e.g. Euclidean, Manhattan and Chebychev) (Han & Kamber, 2001; Xu & Wunsch II, 2009), the Mahalanobis distance (Jain et al., 1999; Xu & Wunsch II, 2009) and the Gaussian affinity. In the case of binary attributes, whether symmetric (with states uniformly distributed) or asymmetric, the distance can be computed based on the Jaccard coefficient (Xu & Wunsch II, 2009) or on simple matching coefficients associated with contingency tables (Rokach & Maimon, 2005). For nominal attributes, the distance is determined via simple matching or by introducing a new binary attribute for each potential state of an original, nominal attribute. Besides, for nominal attributes, clustering can also be carried out via the dissimilarity analysis suggested by Ng and by (Cao et al., 2012). Unlike simple, hard matching, which discriminates between identical and different patterns only, this dissimilarity metric exploits the landscape of the clusters in order to provide rough non-membership; moreover, (Cao et al., 2012) takes into account the distribution of attribute values over the whole input space. Lastly, ordinal attributes can be treated as numerical after their mapping to [0, 1]. Obviously, mixed types of attributes can also be accepted, by considering the weighted sum of the squared corresponding distances (Rokach & Maimon, 2005).
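As a concrete illustration of the metrics mentioned above, the following minimal Python sketch computes a Minkowski dissimilarity for numeric attributes and a Jaccard-based dissimilarity for binary attributes; the function names and the toy vectors are conveniences of this example rather than part of any surveyed algorithm.

```python
import numpy as np

def minkowski(x, y, p=2.0):
    """Minkowski dissimilarity between two numeric feature vectors."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def jaccard_dissimilarity(a, b):
    """1 - Jaccard coefficient for binary (0/1) attribute vectors."""
    both = np.sum((a == 1) & (b == 1))     # attributes present in both patterns
    either = np.sum((a == 1) | (b == 1))   # attributes present in at least one pattern
    return 1.0 - both / either if either else 0.0

x = np.array([1.0, 2.0, 0.5])
y = np.array([0.0, 1.5, 1.0])
print(minkowski(x, y, p=2))        # Euclidean distance
print(minkowski(x, y, p=1))        # Manhattan distance

a = np.array([1, 0, 1, 1])
b = np.array([1, 1, 0, 1])
print(jaccard_dissimilarity(a, b)) # 0.5
```

Setting p = 2 yields the Euclidean distance and p = 1 the Manhattan distance, matching the Minkowski family referenced above.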
An alternative concept to the dissimilarity distance metric is the similarity metric, which reveals the resemblance between patterns, i.e. it provides large values whenever the compared patterns are similar. The most popular similarity metrics are based on the cosine function (Xu & Wunsch II, 2009), the Pearson correlation (Rokach & Maimon, 2005), the extended Jaccard index (Strehl et al., 2000) and the Dice coefficient.

The quality of the designed cluster map can be assessed by means of validity measures, which basically check whether the clustering structure is meaningful. Unfortunately, most of the popular existing validations ignore the areas with low densities, hence they become unsuitable for clusters with different densities or large sizes. The external validity measures - such as the Rand index (Wagner & Wagner, 2007), the Jaccard coefficient (Xu & Wunsch II, 2009), the FM index (Ferragina & Manzini, 2001) and the Γ statistics (Xu & Wunsch II, 2009) - exploit some available information additional to X. Meanwhile, the internal validity measures make use of X only; this is the case of the cophenetic correlation coefficient CPCC (Xu & Wunsch II, 2009), the Goodman-Kruskal γ statistic (Taheri & Hesamian, 2011), Kendall's τ statistic (Jain & Dubes, 1988) and the V measure. The latter explicitly assesses the homogeneity and the completeness of the clusters (Zhao & Karypis, 2001; Zhao & Karypis, 2006). Moreover, the relative validity measures compare the results provided by different clustering algorithms, or the results produced by a single clustering procedure for different values of its control parameters.

3. Important Clustering Algorithms

Given the large variety of clustering algorithms proposed in the related literature, it is difficult to separate them into firm, disjoint categories based on the policy employed for cluster delimitation. Several classifications are available in (Jain et al., 1999; Berkhin, 2006; Periklis, 2003; Rokach & Maimon, 2005; Rokach, 2010); however, those categories are neither disjoint nor consistent. This survey considers four main types of clustering algorithms (distance-based, density-based, model-based and grid-based algorithms), which can gather almost all of the available approaches. A summary of this classification is presented in Table 1, together with relevant references where the reader can find various exemplifications.

3.1. Distance-based Clustering Techniques

Distance-based algorithms analyze the dissimilarity between samples by means of a distance metric and assess the most representative pattern of each cluster, called the centroid. Afterwards, the class is decided by assigning the sample to the closest centroid. With this in mind, centroids are found targeting small dissimilarity distances to the samples of their own cluster and large dissimilarity distances to the samples of the other clusters. Obviously, there are situations when it becomes unclear how to assign a distance measure to a data set and how to associate the weights of the features.
Table 1
Main Types of Clustering Algorithms (cluster policy - key idea - algorithm - references)

Distance-based:
- mean centroids: K-means (Jain & Dubes, 1988; Wagstaff et al., 2001; Kanungo et al., 2002; Tan et al., 2006; Boutsidis et al., 2009); Fuzzy K-means (Jang et al., 1996; Chang et al., 2011)
- medoid centers: K-medoids (Jain & Dubes, 1988); PAM (Kaufmann & Rousseeuw, 1990; Van der Laan et al., 2010; Vijayarani & Nithya, 2011); CLARA (Kaufmann & Rousseeuw, 1990; Vijayarani & Nithya, 2011); CLARANS (Ng & Han, 2002; Vijayarani & Nithya, 2011)
- median centroids: K-medians (Guha et al., 2003; Indyk & Price, 2011)

Density-based:
- ε vicinity of fixed size: DBSCAN (Moreira et al., 2005; Tan et al., 2006; Mumtaz & Duraiswamy, 2010; Shah et al., 2012); SNN (Moreira et al., 2005; Moellic et al., 2008)
- ε vicinity of variable size: OPTICS (Ankerst et al., 1999; Alzaalan et al., 2012)
- vicinity of adaptive size: Subtractive clustering (Jang et al., 1996; Ertoz et al., 2002; Bataineh et al., 2011)

Model-based:
- tree-based and other model templates: AutoClass (Hanson et al., 1991; Beitzel et al., 2007); Decision trees (Liu et al., 2000; Lin et al., 2009); BIRCH (Garg et al., 2006; Hai Zhou & Yong Bin, 2010); COBWEB (Scanlan et al., 2006; Xia & Xi, 2007); neural networks, SOM (Vesanto & Alhoniemi, 2000; Kohonen et al., 2001; Seiffert & Jain, 2001)

Grid-based:
- single grid: O-CLUSTER (Milenova & Campos, 2002; Ilango & Mohan, 2010); STING (Wang et al., 1997; Ilango & Mohan, 2010; Han & Kamber, 2012)
- multiple grids: Wave Cluster (Ilango & Mohan, 2010); ASGC (Chang, 2009; Ilango & Mohan, 2010)
- adaptive grid: MAFIA (Goil et al., 1999; Parsons et al., 2004; Ilango & Mohan, 2010)
- combined with density-based policies: CLIQUE (Agrawal et al., 2005; Ilango & Mohan, 2010); Mountain clustering (Jang et al., 1996; Yang & Wu, 2005)

Usually, the mean centroid is employed. It results by averaging the feature values of the samples located in the cluster; hence, it is not necessarily included in the training data set (Xu & Wunsch II, 2009). Another alternative is to use the median centroid, defined by the median in every feature dimension over the samples of the cluster. Again, the medians are not necessarily instances of the data set. On the other hand, a medoid-type centroid is selected from the training data set, such that its average dissimilarity to the other samples in the cluster is minimal.

K-means is the most popular partitioning clustering algorithm. It assumes a predefined number of clusters and selects their mean centroids via an iterative process, aimed at minimizing the within-cluster sum of squares (i.e., the sum of squared dissimilarity distances computed from each sample to its centroid). At every iteration, the samples are reassigned to the newly located centroids and the centroids are then updated according to the mean feature values resulting in the updated clusters (Jain & Dubes, 1988; Tan et al., 2006; Boutsidis et al., 2009). The procedure continues until all the centroids are stable. Unfortunately, the convergence towards the global optimum is not guaranteed and the final solution strongly depends on the centroids' initialization. Moreover, it is worth mentioning that K-means is characterized by a high computational complexity. However, given its intuitive operation, this algorithm is still extensively used, in a large variety of enhanced versions.
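To make the iteration explicit, the sketch below implements the canonical K-means loop just described (random initial centroids, nearest-centroid assignment, mean update, stop when the centroids are stable); it is only an illustrative sketch under a squared-Euclidean dissimilarity, not the exact procedure of any of the enhanced versions cited in this survey.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Canonical K-means: X is an (N, d) array, k the preset number of clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initial mean centroids
    for _ in range(max_iter):
        # assign each sample to its closest centroid (squared Euclidean dissimilarity)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update every centroid as the mean of the samples assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # centroids are stable
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [5, 5]])
labels, centroids = k_means(X, k=2)
```

In practice, the sensitivity to initialization mentioned above is usually mitigated by restarting the loop from several random seeds and keeping the solution with the smallest within-cluster sum of squares.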
Among these enhanced versions, GSA-KM, for instance, mixes K-means with gravitational search algorithms in order to increase the convergence speed and to reduce the risk of remaining trapped in local optima (Hatamlou et al., 2012). Fuzzy K-means (Jang et al., 1996; Chang et al., 2011) extends the canonical K-means to clusters represented as fuzzy sets. The objective function is defined in compliance with the fuzzy formalism, and a membership matrix, indicating the membership degree of each sample to each cluster, is updated iteratively. The algorithm stops if the objective value is acceptable or if no improvement can be achieved during several successive iterations.

K-means based approaches can be configured for hierarchical clustering, too. Hierarchical Data Divisive Soft Clustering (H2D-SC) is an adaptive, hierarchical, fuzzy, quality-driven clustering algorithm. H2D-SC evaluates multiple objective functions for constructing the hierarchy. For each node, a hierarchy of sub-nodes is generated based on the characteristics of the corresponding cluster (cohesion, size, mass, fuzziness) and the entropy of the partition (Bordogna & Pasi, 2012). As expected, the clusters in the upper levels describe more general classes.

Increased robustness to noise and outliers can be provided by the K-medoids algorithm, which, basically, groups data around medoids. The algorithm starts with k initial random training samples assigned as medoids. Then, iteratively, for every medoid, the replacement with another non-medoid sample is performed whenever this change is advantageous in relation to the adopted criterion. The process stops when no replacements occur in successive iterations. Partitioning Around Medoids (PAM) (Van der Laan et al., 2010) is the most common version of K-medoids. It minimizes the sum of dissimilarities computed pair-wise between each training sample and its medoid. PAM operates on the dissimilarity matrix defined on the entire data set, in order to provide a graphical display, called the silhouette plot, which allows the preliminary selection of an optimal number of clusters. When large data sets are involved, PAM becomes impractical due to its large computational complexity. The easiest option lies in Clustering LARge Applications (CLARA), which works similarly to PAM, yet on a reduced data subset. This approximation may be improved by extracting multiple subsets of samples, on which CLARA is separately applied (Vijayarani & Nithya, 2011). However, a more efficient approach is the CLustering Algorithm based on RANdomized Search (CLARANS). Unlike CLARA, which draws the whole subset of samples at the beginning of its search, CLARANS chooses another subset of neighbor samples at each step, thus offering higher quality clustering within a very small number of iterations (Ng & Han, 2002).

The basic K-means idea can also be translated to median centroids. The K-medians algorithm begins with the random selection of k centroids and improves them iteratively, until stabilization (Indyk & Price, 2011).
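As an illustration of the medoid concept shared by K-medoids and PAM, the following sketch selects, for one cluster, the sample whose summed dissimilarity to the other members is minimal; the helper name and the Euclidean dissimilarity are assumptions of this example, not part of the original algorithms' specifications.

```python
import numpy as np

def cluster_medoid(cluster):
    """Return the index (within `cluster`) of the sample minimizing the
    sum of dissimilarities to all other samples of the same cluster."""
    # pairwise Euclidean dissimilarity matrix restricted to the cluster
    diff = cluster[:, None, :] - cluster[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))
    total = dists.sum(axis=1)          # summed dissimilarity of each candidate
    return int(total.argmin())

cluster = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0]])
print(cluster_medoid(cluster))         # index of the most central sample
```

Unlike a mean or median centroid, the value returned here is always an actual training sample, which is what provides the robustness to outliers mentioned above.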
3.2. Density-based Clustering Algorithms

Density-based algorithms assign each sample to all the clusters, with different probabilities. The clusters are then defined as the areas of higher density, meaning that some samples, called noisy or border points, may remain "outside" the clusters, namely in the areas that separate the groups. The main idea of this clustering policy is to enlarge an existing cluster as long as the density in its neighborhood exceeds a certain threshold. For this reason, density-based clustering algorithms are suitable for finding non-linearly shaped clusters.

The most widely used algorithm is Density Based Spatial Clustering of Applications with Noise (DBSCAN) (Mumtaz & Duraiswamy, 2010; Shah et al., 2012). Let us assume that a point p is density-reachable from another point q if p is within the ε distance from q and q has a sufficient number of points in its neighborhood, within the specified distance. Let us also consider that the points p and q are density-connected if there exists a point r which has a sufficient number of points in its neighborhood and both p and q are within the ε distance of r. Using these definitions, DBSCAN can be simply explained. It starts with an arbitrarily selected point that has not been visited before, extracts its ε neighborhood and marks the point as visited. Then it checks whether the neighborhood of the selected sample contains more than the minimum number of points required to form a cluster. If so, a cluster is created; otherwise, the sample is labeled as noise. If an item belongs to a cluster, then its ε neighborhood is also part of that cluster. The neighborhood search continues until all the items of a cluster are determined. A new, unvisited sample is then selected and the procedure is repeated until all the samples are marked as scanned. DBSCAN is capable of identifying noisy data and arbitrarily shaped clusters, but fails on high-dimensional data or on clusters of varying density.

The Ordering Points To Identify the Clustering Structure (OPTICS) algorithm is an extension of DBSCAN allowing variable sizes of the ε neighborhoods (Ankerst et al., 1999; Alzaalan et al., 2012), i.e. ε ∈ [0, Eps]. The smallest distance between a core pattern and another pattern in its ε neighborhood is called the core distance, while the smallest distance between a core pattern and a density-reachable pattern is called the reachability distance. OPTICS does not assign cluster memberships; instead, it builds a priority queue according to the resulting core distances and reachability distances.

Increased robustness can also be achieved by means of the Shared Near Neighbor (SNN) clustering algorithm (Wu et al., 2007). Clustering starts with the construction of a similarity graph, which associates the patterns to nodes and the similarities between patterns to the weights of the edges. Then, the corresponding similarity matrix is processed by keeping only the k most similar neighbors of each node. Based on the resulting simplified graph, the density of each pattern is computed, and the points with density higher than a predefined threshold are set as core points. A pair of core points belongs to the same cluster if they share a minimum ε radius area. All non-core points that are not within the ε neighborhood of a core point are classified as noise. The remaining points are assigned to the cluster that contains the most similar core points.

In Subtractive Clustering, the training patterns stand as nominees for the cluster representatives (Jang et al., 1996; Bataineh et al., 2011). Iteratively, the patterns having the highest Gaussian-based densities are selected as centers; then, the centers which are spatially too close are reanalyzed and the unnecessary ones are rejected, by adjusting the density of the previously selected centers.
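The ε-neighborhood query and the core-point test on which DBSCAN relies can be illustrated with the minimal sketch below; the parameter names (eps, min_pts) and the brute-force neighborhood search are assumptions of this example, not a full DBSCAN implementation.

```python
import numpy as np

def region_query(X, i, eps):
    """Indices of all samples lying within distance eps of sample i (its eps-neighborhood)."""
    dists = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
    return np.where(dists <= eps)[0]

def is_core_point(X, i, eps, min_pts):
    """A sample is a core point if its eps-neighborhood holds at least min_pts samples."""
    return len(region_query(X, i, eps)) >= min_pts

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(is_core_point(X, 0, eps=0.2, min_pts=3))   # True: three samples nearby
print(is_core_point(X, 3, eps=0.2, min_pts=3))   # False: isolated sample (noise candidate)
```

Growing a cluster then amounts to repeatedly running this query from every core point reached so far, which is exactly the neighborhood expansion described in the paragraph above.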
The Automatic Classification (AutoClass) algorithm employs a Bayesian technique for determining the most probable number of classes and the probabilities of membership to the classes assigned to each pattern (Beitzel et al., 2007). In addition to the training data set, AutoClass requires explicit functional models of the classes.

3.3. Model-based Clustering

Model-based clustering algorithms assume that the pattern set was generated by a model and try to recover it (Manning et al., 2009). This model inherently defines the clusters and the assignment of patterns to clusters. The approach can also generate multiple clustering models, which may be probabilistically mixed or insightfully compared. Given the stochastic nature of X (McLachlan & Peel, 2000), the quality of a clustering model can be assessed by means of Akaike's Information Criterion, AIC (Xu & Wunsch II, 2009), the Bayesian Information Criterion, BIC (Pelleg & Moore, 2000), the Minimum Description Length, MDL (Xu & Wunsch II, 2009), or the Non-coding Information Theoretic Criterion, ICOMP (Xu & Wunsch II, 2009). The large variety of available model templates leads to a multitude of model-based clustering techniques.

Many model-based approaches support hierarchical clustering. In this respect, decision trees are the most popular ones, due to their intuitive interpretation. They use a purity function to separate the data space into different regions (Liu et al., 2000). A cluster tree captures the natural distribution of the data; the tree is then refined to highlight the most meaningful clusters. More successful in overcoming the noise issues and the propagation of errors from the lower levels throughout the whole hierarchy is the Localized Diffusion Folders (LDF) algorithm. The key idea of LDF is to utilize the local geometry of the data and the local neighborhood of each data point in order to construct a new local geometry at each step (David & Averbuch, 2012). The local metric is the diffusion distance. Assuming a graph-based representation which associates the nodes to the patterns and the weights of the edges to the dissimilarities between patterns (normalized in compliance with a Markov process), the diffusion distance between two patterns illustrates the probability of transition between the corresponding nodes. The hierarchy is constructed bottom-up, by partitioning the data into local diffusion folders, based on the localized affinity specific to the simplified geometry.

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) associates a Clustering Feature (CF) to each dense region of the input space. Basically, a CF gathers the main characteristics of a sub-cluster, i.e. the number of its data points, the sum of the feature vectors and the sum of the squared feature vectors for the samples located within the sub-cluster (Garg et al., 2006; Hai Zhou & Yong Bin, 2010). Based on the CFs, the algorithm builds several tree-based structures by means of agglomerative techniques. In each resulting tree, a leaf absorbs the patterns belonging to a region; hence, these representations are compact.

COBWEB is a well-known clustering algorithm that incrementally organizes the observations into a classification tree. Each node represents a class and is labeled by a probabilistic term summarizing the attribute distributions for the patterns classified under that node. Hence, the clusters are described by a list of nominal values and probabilities (Scanlan et al., 2006; Xia & Xi, 2007). COBWEB maximizes the probability that two patterns in the same cluster have common attribute values, as well as the probability that patterns from different clusters have different attribute values.
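As a small illustration of the clustering-feature idea used by BIRCH, the sketch below maintains the (N, LS, SS) triple described above for a sub-cluster and merges two such triples; the class and method names are assumptions of this example, not the actual BIRCH data structures.

```python
import numpy as np

class ClusteringFeature:
    """CF triple of a sub-cluster: number of points N, linear sum LS, squared sum SS."""
    def __init__(self, d):
        self.n = 0
        self.ls = np.zeros(d)   # sum of the feature vectors
        self.ss = np.zeros(d)   # sum of the squared feature vectors (per feature)

    def add(self, x):
        self.n += 1
        self.ls += x
        self.ss += x ** 2

    def merge(self, other):
        """CF triples are additive, so merging two sub-clusters only adds their triples."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

cf = ClusteringFeature(d=2)
for x in np.array([[1.0, 2.0], [3.0, 4.0]]):
    cf.add(x)
print(cf.centroid())   # [2. 3.]
```

Because the triples are additive, merging two sub-clusters never requires revisiting the raw samples, which is what keeps these summaries compact.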
The hybrid model proposed in (Fan et al., 2011) integrates data clustering, fuzzy decision trees and genetic algorithms in order to construct a medical classification system. The fuzzy decision tree performs a high-level discrimination of the samples. Additional details are obtained with a supplementary weighted clustering for each case treated within the fuzzy decision tree. The optimal fuzzy membership functions and the optimal structure of the fuzzy tree are designed by means of genetic algorithms.

Many types of neural networks based on unsupervised training can be used for clustering, including neural networks with radial basis functions and Self-Organizing Maps (SOM). For instance, the latter can be used for building the dendrogram in agglomerative clustering. The SOM learning process aims to generate low-dimensional representations of the input space, called maps (Vesanto & Alhoniemi, 2000; Kohonen, 2001). Each map unit is represented by a prototype vector. The training is an iterative procedure that randomly chooses a sample vector and computes the distances to all the prototype vectors. The map unit with the closest prototype and its topological neighbors are moved closer to the selected sample.

For any other model template, the design can be performed via a general-purpose learning method. For instance, the Expectation Maximization (EM) algorithm alternates two steps, namely the expectation and the maximization. The first one estimates the distribution over the classes given a certain model, while the second one chooses new parameters of the model so as to maximize the expected log-likelihood of the hidden variables and of the observed data (Gupta & Chen, 2011).

3.4. Grid-based Clustering

A different clustering policy consists in dividing the feature space into a finite number of grid cells (usually, hyper-rectangles). Obviously, the computational time strongly depends on the granularity of the grid.

Orthogonal partitioning CLUSTERing (O-CLUSTER) (Milenova & Campos, 2002) identifies continuous areas of high density by means of a partitioning active sampling technique. Borders parallel to the axes split the input space and define the structure of the corresponding clustering tree. The partitions iteratively depicted within the feature space can be frozen (no additional splitting needed) or active (supplementary splitting required).

STatistical INformation Grid (STING) is a multi-resolution grid-based clustering approach working with hyper-rectangular cells. STING exploits a hierarchical structure of resolutions, which stores the statistical information of each cell, such as the number of samples or the mean/minimum/maximum/spread of the attribute values. The hierarchy can be scanned starting from a level containing any number of cells. The relevancy of each cell is computed and the useless cells are removed (Wang et al., 1997; Han & Kamber, 2012).

When very large databases are involved, multi-resolution clustering can be performed by means of the Wave Cluster algorithm. It obtains a new search space for the dense regions by applying the wavelet transform to the original feature space (Ilango & Mohan, 2010). The wavelet transform allows a natural identification of the clusters within the search space, whilst ignoring the isolated patterns.
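To illustrate the basic grid idea shared by these algorithms, the following sketch bins the samples into hyper-rectangular cells of a fixed step and reports the cells whose number of samples reaches a density threshold; the step value and the dictionary-based cell index are conveniences of this illustration, not the data structures of any particular grid-based method.

```python
from collections import Counter
import numpy as np

def dense_cells(X, step, min_count):
    """Map each sample to its grid cell and keep the cells containing
    at least min_count samples (the 'dense' cells)."""
    cells = Counter(tuple(idx) for idx in np.floor(X / step).astype(int))
    return {cell: n for cell, n in cells.items() if n >= min_count}

X = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [5.1, 5.2]])
print(dense_cells(X, step=1.0, min_count=2))   # {(0, 0): 3}
```

Connected groups of such dense cells are what the grid-based algorithms above report as clusters, with the cell step controlling the trade-off between accuracy and computational time.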
Adaptive grids for fast subspace clustering are configured by the Merging of Adaptive Intervals Approach (MAFIA). A histogram gives the minimum number of cells along each feature, and the nearby cells with similar histogram values are merged (Goil et al., 1999; Parsons et al., 2004; Ilango & Mohan, 2010).

The Axis-Shifted Grid Clustering (ASGC) algorithm is a clustering technique that combines density-based and grid-based methods (Chang, 2009; Ilango & Mohan, 2010). It uses two grid-based structures in order to reduce the impact of the cell borders. The second grid is constructed by shifting the first grid by half of the cell width along each axis. ASGC starts by splitting the data set according to the first grid structure. Then, the cells having density values over a given threshold are identified and grouped together to form the clusters. Lastly, the second grid is applied in order to improve the previously determined clusters.

The combination of density-based and grid-based approaches is also used by CLustering In QUEst (CLIQUE). Here, a unit represents a hypercube depicted in the feature space in terms of a predefined step, and the selectivity of a unit is defined as the number of patterns located in that unit. A unit is assumed to be dense if its selectivity is equal to or greater than a threshold value. Within this context, a cluster results as a maximal set of connected dense units (Ilango & Mohan, 2010).

Mountain clustering approximates the cluster centers using a density function, too (Jang et al., 1996; Yang & Wu, 2005). The mountain density function may be pictured as a density measure which increases with the number of points located nearby. The first step consists in constructing a uniform or non-uniform grid over the feature space. The intersections of the grid lines represent the initial centers of the clusters. Iteratively, the list of centers is reduced to the points with high mountain values, each reconfiguration of the centers being followed by an update of the mountain densities. This method is recommended for setting the initial centers required by a more sophisticated clustering algorithm, or as a standalone, fast, approximate clustering.
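A minimal sketch of the mountain density function is given below: for a grid vertex v, the function sums Gaussian contributions of the training samples, so its value grows with the number of nearby points. The particular Gaussian form and the sigma parameter follow the usual textbook formulation (Jang et al., 1996) and are assumptions of this illustration.

```python
import numpy as np

def mountain_value(v, X, sigma=1.0):
    """Mountain density at grid vertex v: a sum of Gaussian contributions,
    larger when many training samples lie close to v."""
    sq_dists = ((X - v) ** 2).sum(axis=1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2)).sum()

X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0]])
grid = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
print([round(mountain_value(v, X), 3) for v in grid])  # the vertex near two samples scores higher
```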
4. Evolutionary Clustering

Evolutionary algorithms are universal optimization methods, capable of dealing with noisy, non-linear, time-variant objective functions and with large, high-dimensional search spaces. Needing no a priori information about the problem to solve, they are suitable for tackling unsupervised grouping. Besides, the flexibility of evolutionary computation lies in its compatibility with any objective function and any type of features (Baeck et al., 2000; Langdon & Poli, 2002; Cox, 2005; Coello Coello et al., 2007; Corchado et al., 2011). Moreover, given their population-based execution, the evolutionary approaches exhibit increased robustness and a reduced risk of stagnating in local optima (a risk which is critical for canonical, gradient-based optimization methods). These characteristics enable the use of evolutionary techniques in almost any type of clustering algorithm, if their inherent drawbacks are acceptable for the application; it is worth mentioning here the large amount of requested computational resources, the sensitivity to the control parameters (e.g., population size, number of generations etc.) and the lack of convergence guarantees. However, when addressing the optimizations claimed by the clustering algorithms, a simple replacement of other techniques with the evolutionary ones is not the most fruitful approach; besides, in the case of very large data sets, this strategy can often become impractical. It is more recommendable to re-design the clustering algorithm in better agreement with the capabilities of the evolutionary techniques. Within this context, it is of great interest to notice that the evolutionary approaches can be configured to handle multiple objectives and/or equality/inequality constraints. Hence, they allow optimization statements which are better tailored to real environments.

Owing to their large popularity, genetic algorithms are the ones most frequently addressed in clustering research. They can be used in three different ways: as a standalone technique; as a preliminary clustering meant to tune the initial parameters of a subsequent clustering (e.g. the initial centroids of K-means); or as a technique embedded in a hybrid approach. Standalone genetic clustering works on chromosomes which encode the cluster map. For instance, in partitioning distance-based algorithms, the chromosome can encode a set of centroids describing a potential cluster map, whilst decision tree-based clustering asks for the encoding of a decision tree. Obviously, in the latter case, genetic programming may be preferred, as it brings implicit tools for tree-based chromosomal representations. For instance, Genetic Programming with Unconstrained Decision Trees (GP-UDT) provides the evolutionary configuration of clustering decision trees. Instead of a simple comparison, the decisional nodes of the hierarchical structure can embed complex mathematical equations, shaped by means of crossover and mutation (Wang et al., 2011).

Other evolutionary clustering approaches are presented in (Luchian, 1999). One of them provides distance-based unsupervised clustering via variable-length chromosomes. More specifically, a chromosomal string encodes the centroids of the clusters for a potential partition of X. In order to avoid deception, the centroids are sorted before recombination. Obviously, the approach calls for genetic operators compatible with variable-length representations. The objective function is customized to consider both within-cluster and inter-cluster distances.

Hybrid approaches consider the symbiosis of evolutionary techniques with other clustering algorithms. Most of them are model-based. For instance, in (Gharavian et al., 2012), the model-based clustering is performed with a genetically designed fuzzy ARTMAP neural network, and the feature selection is accomplished with a correlation-based filter. The system is used for recognizing several emotion categories in recorded speech. Another hybrid approach has already been presented in Section 3.3.

Evolutionary algorithms have also been successfully applied to feature selection, especially for applications involving a large number of features and reduced a priori expertise. An innovative genetic unsupervised feature selection is proposed in (Shamsinedadbabki & Saraee, 2011); it is driven by the assessment of feature specialization degrees. Once a feature subset is depicted, the clustering procedure is accomplished with K-means or K-nearest neighbors (KNN). Genetic weighted KNN classifiers for network anomaly detection are used in (Su, 2011). In this case, the genetic algorithm is in charge of the selection of the feature weights, and a clustering algorithm is used for building a reduced training data set, with fewer, representative patterns. This training data set enables a faster computation, making the employment of evolutionary algorithms valuable. Compared to the traditional weighted KNN classifiers, the genetic WKNN offers better accuracy.
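As a sketch of the standalone genetic clustering idea mentioned above, the snippet below encodes a candidate cluster map as a flat chromosome of k centroids and evaluates it by the within-cluster sum of squared distances (smaller is fitter); the encoding, the fitness choice and the function names are assumptions of this illustration, not the exact scheme of any cited approach.

```python
import numpy as np

def decode(chromosome, k, d):
    """A chromosome is a flat vector of k*d genes interpreted as k centroids."""
    return chromosome.reshape(k, d)

def fitness(chromosome, X, k):
    """Within-cluster sum of squared distances of X under the encoded centroids
    (to be minimized by selection, crossover and mutation)."""
    centroids = decode(chromosome, k, X.shape[1])
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).sum()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
population = [rng.uniform(X.min(), X.max(), size=2 * 2) for _ in range(20)]  # k=2, d=2
best = min(population, key=lambda c: fitness(c, X, k=2))
print(decode(best, 2, 2))   # best candidate centroids of the initial population
```

A full genetic algorithm would then evolve such a population through selection, crossover and mutation of the chromosomes, keeping the fitness defined above as the minimized objective.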
In many empirical investigations, the hybridization with evolutionary techniques led to better accuracy than the hybridization with other direct optimization techniques. The explanation could be mainly related to the use of multiple solutions per iteration and to the advantageous exploitation of the parental genetic material during offspring production. However, it is worth mentioning another stochastic search technique intensively used in clustering, namely simulated annealing; its version with the Boltzmann trick is preferred. In this case, the survival of worse solutions is accepted with a certain probability, in order to reduce the risk of locking into local optima.

5. Conclusions

This paper surveys the theoretical background of unsupervised pattern classification and presents the basic necessary taxonomy, the most popular metrics for pattern similitude analysis, as well as some common clustering validity measures.

The most important clustering algorithms are reviewed, separated into four broad categories: distance-based, density-based, model-based and grid-based. In this attempt, the overview outlines the key approaches in the field, focusing on the clustering policies they employ. The main advantages of the evolutionary clustering algorithms are discussed in comparison with the non-evolutionary approaches.

REFERENCES

Agrawal R., Gehrke J., Gunopulos D., Raghavan P., Automatic Subspace Clustering of High Dimensional Data. Data Mining and Knowledge Discovery, 11, 1, 5−33, Springer Science and Business Media, Netherlands, 2005.
Alzaalan M.E., Aldahdooh R.T., Ashour W., EOPTICS: Enhanced Ordering Points to Identify the Clustering Structure. International Journal of Computer Applications, 40, 17, IJCA Journal, Foundation of Computer Science, New York, 2012.
Ankerst M., Breunig M., Kriegel H.P., Sander J., OPTICS: Ordering Points To Identify the Clustering Structure. Proceedings of the ACM SIGMOD International Conference on Management of Data SIGMOD '99, 49−60, Philadelphia, 1999.
Baeck T., Fogel D.B., Michalewicz Z., Evolutionary Computation 2: Advanced Algorithms and Operators. Institute of Physics Publishing, United Kingdom, 2000.
Bataineh K.M., Naji M., Saqer M., A Comparison Study Between Various Fuzzy Clustering Algorithms. Jordan Journal of Mechanical and Industrial Engineering JJMIE, 5, 4, 335−343, 2011.
Beitzel S.M., Jensen E.C., Lewis D.D., Chowdhury A., Frieder O., Automatic Classification of Web Queries Using Very Large Unlabeled Query Logs. ACM Transactions on Information Systems, 25, 2, Article 9, New York, 2007.
Berkhin P., A Survey of Clustering Data Mining Techniques. In Kogan J., Nicholas C., Teboulle M. (Eds.), Grouping Multidimensional Data, Recent Advances in Clustering, 25−71, Springer-Verlag, 2006.
Bordogna G., Pasi G., A Quality Driven Hierarchical Data Divisive Soft Clustering for Information Retrieval. Knowledge-Based Systems, 26, 9−19, Elsevier, 2012.
Boutsidis C., Mahoney M.W., Drineas P., Unsupervised Feature Selection for the K-means Clustering Problem. In Annual Advances in Neural Information Processing Systems, Proceedings of the NIPS'09 Conference, 2009.
Breaban M., Optimized Ensembles for Clustering Noisy Data. In Blum C., Battiti R. (Eds.), Learning and Intelligent Optimization, Lecture Notes in Computer Science, 6073, 220−223, Springer, Berlin, 2010.
Cao F., Liang J., Li D., Bai L., Dang C., A Dissimilarity Measure for the K-Modes Clustering Algorithm. Knowledge-Based Systems, 26, 120−127, Elsevier, 2012.
Chang C., Lin N.P., Jan N.Y., An Axis-Shifted Grid Clustering Algorithm. Tamkang Journal of Science and Engineering, 12, 2, 183−192, 2009.
Chang C.T., Lai J.C.Z., Jeng M.D., A Fuzzy K-Means Clustering Algorithm Using Cluster Center Displacement. Journal of Information Science and Engineering, 27, 995−1009, 2011.
Coello Coello C.A., Lamont G.B., van Veldhuizen D.A., Evolutionary Algorithms for Solving Multi-Objective Problems, 2nd Edition. In Goldberg D.E., Koza J.R. (Eds.), Genetic and Evolutionary Computation, Springer Science and Business Media, 2007.
Corchado E., Kurzynski M., Wozniak M., Hybrid Artificial Intelligent Systems. Proceedings of the 6th International Conference HAIS'11, Part I, Lecture Notes on Artificial Intelligence, 6678, Springer, Poland, 2011.
Cox E., Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration. Morgan Kaufmann Publishers, Elsevier, 2005.
David G., Averbuch A., Hierarchical Data Organization, Clustering and De-noising via Localized Diffusion Folders. Applied and Computational Harmonic Analysis, 33, 1, 1−23, 2012.
Dhillon I., Kogan J., Nicholas C., Feature Selection and Document Clustering. In Berry M.W. (Ed.), Survey of Text Mining I: Clustering, Classification, and Retrieval, 1, 73−100, Springer, 2003.
Duin R.P.W., Pekalska E., Open Issues in Pattern Recognition. In Kurzynski M., Puchala E., Wozniak M., Zolnierek A. (Eds.), Advances in Soft Computing Series, Computer Recognition Systems, Part I, Proceedings of the 4th International Conference on Computer Recognition Systems CORES '05, 30, 27−42, Springer, Poland, 2005.
Ertoz L., Steinbach M., Kumar V., A New Shared Nearest Neighbor Clustering Algorithm and Its Applications. In Grossman R., Han J., Kumar V., Mannila H., Motwani R. (Eds.), Workshop on Clustering High Dimensional Data and Its Applications, Proceedings of the 2nd SIAM International Conference on Data Mining, 105−115, 2002.
Fan C.Y., Chang P.C., Lin J.J., Hsieh J.C., A Hybrid Model Combining Case-based Reasoning and Fuzzy Decision Tree for Medical Data Classification. Applied Soft Computing, 11, 1, 632−644, 2011.
Ferragina P., Manzini G., The FM-Index: A Compressed Full Text Index Based on the BWT. An Experimental Study of a Compressed Index. Information Sciences, 135, 1-2, 13−28, 2001.
Garg A., Mangla A., Gupta N., Bhatnagar V., PBIRCH: A Scalable Parallel Clustering Algorithm for Incremental Data. The 10th International Database Engineering and Applications Symposium, IDEAS'06, 315−316, 2006.
Gharavian D., Sheikhan M., Nazerieh A., Garoucy S., Speech Emotion Recognition Using FCBF Feature Selection Method and GA-optimized Fuzzy ARTMAP Neural Network. Neural Computing and Applications, 21, 8, 2115−2126, 2012.
Goil S., Nagesh H., Choudhary A., MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets. Technical Report No. CPDC-TR-9906-010, Center for Parallel and Distributed Computing, 1999.
Guha S., Meyerson A., Mishra N., Motwani R., O'Callaghan L., Clustering Data Streams: Theory and Practice. IEEE Transactions on Knowledge and Data Engineering, 15, 3, 515−528, 2003.
Gupta M.R., Chen Y., Theory and Use of the EM Algorithm. Foundations and Trends in Signal Processing, 4, 3, 223−296, 2011.
Hai Zhou D., Yong Bin L., An Improved BIRCH Clustering Algorithm and Application in Thermal Power. International Conference on Web Information Systems and Mining WISM, 1, 53−56, IEEE Society, 2010.
Han J., Kamber M., Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers, Academic Press, 2001.
Han J., Kamber M., Pei J., Data Mining: Concepts and Techniques, Third Edition. Morgan Kaufmann, Elsevier, 2012.
Hanson R., Stutz J., Cheeseman P., Bayesian Classification with Correlation and Inheritance. Proceedings of the 12th International Joint Conference on Artificial Intelligence, 692−698, 1991.
Hatamlou A., Abdullah S., Nezamabadi-pour H., A Combined Approach for Clustering Based on K-Means and Gravitational Search Algorithms. Swarm and Evolutionary Computation, 6, 5, 47−52, Elsevier, 2012.
Ilango M.R., Mohan V., A Survey of Grid Based Clustering Algorithms. International Journal of Engineering Science and Technology, 2, 8, 3441−3446, 2010.
Indyk P., Price E., K-median Clustering, Model-Based Compressive Sensing, and Sparse Recovery for Earth Mover Distance. Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, STOC '11, 627−636, California, 2011.
Jain A.K., Dubes R., Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, 1988.
Jain A.K., Murty M.N., Flynn P.J., Data Clustering: A Review. ACM Computing Surveys, 31, 3, 264−323, 1999.
Jang J.S.R., Sun C.T., Mizutani E., Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Prentice Hall, 1996.
Kaufmann L., Rousseeuw P.J., Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
Kohonen T., Schroeder M.R., Huang T.S., Self-Organizing Maps, 3rd Edition. Springer-Verlag, 2001.
Kanungo T., Mount D.M., Netanyahu N.S., Piatko C.D., Silverman R., Wu A.Y., An Efficient K-means Algorithm: Analysis and Implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 7, 881−892, 2002.
Langdon W.B., Poli R., Foundations of Genetic Programming. Springer, 2002.
Law M.H.C., Jain A.K., Figueiredo M.A.T., Feature Selection in Mixture-Based Clustering. Advances in Neural Information Processing Systems, 15, 609−616, 2003.
Lin P., Lei Z., Chen L., Yang J., Liu F., Decision Tree Network Traffic Classifier via Adaptive Hierarchical Clustering for Imperfect Training Dataset. International Conference on Wireless Communications, Networking and Mobile Computing WiCom'09, 1−6, Beijing, 2009.
Liu B., Xia Y., Yu P.S., Clustering via Decision-Tree Construction. Proceedings of the 9th International Conference on Information and Knowledge Management, CIKM'00, 20−29, New York, 2000.
Liu T., Liu S., Chen Z., Ma W.Y., An Evaluation on Feature Selection for Text Clustering. Proceedings of the 20th International Conference on Machine Learning, ICML'03, Washington, 2003.
Luchian H., Three Evolutionary Approaches to Clustering. Evolutionary Algorithms in Engineering and Computer Science, John Wiley & Sons, 1999.
Manning C.D., Raghavan P., Schutze H., An Introduction to Information Retrieval. Cambridge University Press, 2009.
McLachlan G., Peel D., Finite Mixture Models. John Wiley & Sons, 2000.
Milenova B.L., Campos M.M., O-CLUSTER: Scalable Clustering of Large High Dimensional Data Sets. IEEE International Conference on Data Mining ICDM'02, 290−297, 2002.
Moellic P.A., Haugeard J.E., Pitel G., Image Clustering Based on a Shared Nearest Neighbors Approach for Tagged Collections. Proceedings of the 2008 International Conference on Content-based Image and Video Retrieval CIVR'08, 269−278, ACM, 2008.
Moreira A., Santos M., Carneiro S., Density-based Clustering Algorithms - DBSCAN and SNN. University of Minho, Portugal, 2005.
Mumtaz K., Duraiswamy K., An Analysis on Density Based Clustering of Multi Dimensional Spatial Data. Indian Journal of Computer Science and Engineering, 1, 1, 8−12, 2010.
Ng R.T., Han J., CLARANS: A Method for Clustering Objects for Spatial Data Mining. IEEE Transactions on Knowledge and Data Engineering, 14, 5, 1003−1016, 2002.
Parsons L., Haque E., Liu H., Subspace Clustering for High Dimensional Data: A Review. SIGKDD Explorations, 6, 1, 90−105, 2004.
Pelleg D., Moore A., X-Means: Extending K-means with Efficient Estimation of the Number of Clusters. In Langley P. (Ed.), Proceedings of the 17th International Conference on Machine Learning ICML'00, 727−734, Stanford, 2000.
Periklis A., Tsaparas P., Miller R.J., Sevcik K.C., Clustering Categorical Data Based on Information Loss Minimization. In 2nd Hellenic Data Management Symposium, HDMS'03, Athens, 2003.
Rokach L., Maimon O., Chapter 15: Clustering Methods. In Maimon O., Rokach L. (Eds.), Data Mining and Knowledge Discovery Handbook, 321−352, Springer, Heidelberg, 2005.
Rokach L., Chapter 14: A Survey of Clustering Algorithms. In Rokach L., Maimon O. (Eds.), Data Mining and Knowledge Discovery Handbook, 2nd Edition, Part III, 269−298, Springer, 2010.
Scanlan J., Hartnett J., Williams R., Dynamic Web Profile Correlation Using COBWEB. Artificial Intelligence AI'06, Lecture Notes on Artificial Intelligence, 4304, 1059−1063, Springer-Verlag, 2006.
Seiffert U., Jain L.C., Self-Organizing Maps: Recent Advances and Applications. Springer-Verlag, 2001.
Shah G.H., Bhensdadia C.K., Ganatra A.P., An Empirical Evaluation of Density Based Clustering Techniques. International Journal of Soft Computing and Engineering IJSCE, 2, 1, 216−223, 2012.
Shamsinedadbabki P., Saraee M., A New Unsupervised Feature Selection Method for Text Clustering Based on Genetic Algorithm. Journal of Intelligent Information Systems, 38, 3, 669−684, Springer, 2011.
Strehl A., Ghosh J., Mooney R., Impact of Similarity Measures on Web Page Clustering. In Proceedings of the AAAI Workshop on AI for Web Search, 58−64, 2000.
Su M.Y., Using Classifiers to Improve the KNN Based Classifiers for Online Anomaly Network Traffic Identification. Journal of Network and Computer Applications, 34, 2, 722−730, Elsevier, 2011.
Taheri S.M., Hesamian G., Goodman-Kruskal Measure of Association for Fuzzy-Categorized Variables. Kybernetika, 47, 1, 110−122, Institute of Information Theory and Automation AS CR, 2011.
Tan P.N., Steinbach M., Kumar V., Introduction to Data Mining, Chapter 8: Cluster Analysis - Basic Concepts and Algorithms, 125−146, Pearson Addison Wesley, 2006.
Van der Laan M., Pollard K., Bryan J., A New Partitioning Around Medoids Algorithm. Journal of Statistical Computation and Simulation, 73, 8, 575−584, Taylor and Francis, 2010.
Vesanto J., Alhoniemi E., Clustering of the Self-Organizing Map. IEEE Transactions on Neural Networks, 11, 3, 586−600, 2000.
Vijayarani S., Nithya S., An Efficient Clustering Algorithm for Outlier Detection. International Journal of Computer Applications, 32, 7, 22−27, 2011.
Vinh N.X., Epps J., Bailey J., Information Theoretic Measures for Clusterings Comparison: Is a Correction for Chance Necessary? Proceedings of the 26th International Conference on Machine Learning, Montreal, 2009.
Wagstaff K., Cardie C., Rogers S., Schroedl S., Constrained K-means Clustering with Background Knowledge. Proceedings of the 18th International Conference on Machine Learning, 577−584, Morgan Kaufmann Publishers, 2001.
Wang P., Weise T., Chiong R., Novel Evolutionary Algorithms for Supervised Classification Problems: An Experimental Study. Evolutionary Intelligence, 4, 1, 3−16, Springer, 2011.
Wang W., Yang J., Muntz R., STING: A Statistical Information Grid Approach to Spatial Data Mining. Proceedings of the 23rd Very Large Data Bases Conference, VLDB, 186−195, Morgan Kaufmann, Athens, 1997.
Wu W., Guo W., Tan K.L., Distributed Processing of Moving K-Nearest-Neighbor Query on Moving Objects. ICDE'07, 1116−1125, 2007.
Xia Y., Xi B., Conceptual Clustering Categorical Data with Uncertainty. 19th IEEE International Conference on Tools with Artificial Intelligence, ICTAI'07, 1, 329−336, Paris, 2007.
Xu R., Wunsch II R.C., Clustering. IEEE Press Series on Computational Intelligence, John Wiley and Sons, 2009.
Yang M.S., Wu K.L., A Modified Mountain Clustering Algorithm. Pattern Analysis and Applications, 8, 1, 125−138, Springer, London, 2005.
Zhao Y., Karypis G., Criterion Functions for Document Clustering: Experiments and Analysis. Technical Report, University of Minnesota, Department of Computer Science, Army HPC Research Center, 2001.
Zhao Y., Karypis G., Criterion Functions for Clustering on High-Dimensional Data. In Kogan J., Nicholas C., Teboulle M. (Eds.), Grouping Multidimensional Data, Recent Advances in Clustering, 211−237, Springer-Verlag, 2006.

SURVEY OF UNSUPERVISED CLASSIFICATION ALGORITHMS

(Rezumat)

Unsupervised grouping of patterns is still an open research problem, the main difficulties being related especially to the fact that the grouping must be performed with little a priori information, sometimes without even knowing the number of relevant categories. Considering the large volume of work in the field, we aim to provide an updated survey covering the most important approaches in the domain. Recent algorithms are compared with the classical ones (based on similar grouping techniques), illustrating on this occasion the main aspects that would require further improvement. The paper also discusses the opportunity of using evolutionary techniques for solving unsupervised grouping problems and offers several illustrative examples in this regard.