BULETINUL INSTITUTULUI POLITEHNIC DIN IAŞI
Publicat de
Universitatea Tehnică „Gheorghe Asachi” din Iaşi
Tomul LVIII (LXII), Fasc. 3, 2012
Secţia
AUTOMATICĂ şi CALCULATOARE
SURVEY OF DATA CLUSTERING ALGORITHMS
BY
CORINA CÎMPANU* and LAVINIA FERARIU
“Gheorghe Asachi” Technical University of Iaşi,
Faculty of Automatic Control and Computer Engineering
Received: August 1, 2012
Accepted for publication: September 15, 2012
Abstract. Unsupervised grouping of patterns remains a challenging
problem, especially because this task should be performed without rich a priori
information, and sometimes even without knowing the number of categories.
Given the intense research effort in the field, this paper aims to provide an updated survey of the most important clustering algorithms. Recent approaches are introduced in comparison with traditional methods involving similar categories of clustering policies, and key open issues are outlined. The paper also discusses
the opportunity of evolutionary algorithms within the context of data clustering,
together with helpful exemplifications.
Key words: data analysis, clustering algorithms, classification, evolutionary algorithms.
2010 Mathematics Subject Classification: 62H30, 68T10.
1. Introduction
Grouping the samples in relevant categories represents a key issue for
many applications. Basically, it ensures a preliminary structure for large
databases (e.g., recorded during specific experimental trials), thus enabling
faster data mining, as well as abstract interpretation of samples. The categories
*
Corresponding author; e-mail: [email protected]
should be identified, although the dependencies between the variables recorded within the database are not completely understood beforehand. Besides, the process can be carried out in a supervised or unsupervised manner. The former, called classification, exploits known target classes associated with the training samples, whilst clustering delimits the groups of patterns based on the analysis performed on the training data set, without having a priori information regarding the class each of the training samples belongs to.
One of the most common applications of classification/clustering is
pattern recognition (Duin & Pekalska, 2005). In this case, each cluster is
associated with a category of patterns; hence, once a sample is included in a cluster, that sample may be interpreted as an instance of the corresponding
category and linked to a specific concept.
Due to the nonlinear landscape of the input space and the large
number of features characterizing each sample (not necessarily all of them
relevant and independent), the delimitation of clusters represents a
challenging task. Obviously, the problem becomes even more difficult when
working with a large, unbalanced training data set (i.e., containing very different numbers of samples in each cluster and/or samples non-uniformly
distributed within each group).
This paper aims to provide an insightful overview of clustering
algorithms. The basic taxonomy is reviewed in Section 2, whilst Section 3 outlines recent trends in the field and browses through the main clustering algorithms. Evolutionary clustering is briefly discussed in Section 4, as an extension of non-evolutionary clustering. Finally, several summarizing remarks are presented in Section 5.
2. Clustering Taxonomy
Let us assume that a pattern, $x_i \in S \subseteq \mathbb{R}^d$, represents a sample recorded within the available data set, whilst a feature is an individual attribute describing the patterns. Hence, the data collection can be considered as a pattern set, $X = \{x_1, x_2, \ldots, x_N\}$, where each pattern is characterized by a vector of feature values, $x_i = [x_{ij}]_{j=1,\ldots,d} \in S$ (Jain et al., 1999). Any distance metric stated within the feature space $S \subseteq \mathbb{R}^d$ can be used for quantifying pattern similarities. Once a dissimilarity distance metric is adopted, $D : S \times S \to \mathbb{R}_+$, a cluster can be described as a group of objects that are more similar to one another than to objects from other clusters: $D(x_i, x_j) > D(x_i, x_m)$, for any $x_i, x_j, x_m \in X$ with $x_m$ and $x_i$ belonging to the same cluster and $x_j$ belonging to another cluster. Please note that a dissimilarity distance computed between
two patterns is high if the samples are very different and small if the samples
are similar.
As already mentioned, clustering represents the unsupervised
classification of patterns into clusters. Intuitively, this grouping should be done
such that the similarities between patterns within a cluster are maximized, while
the similarities of objects from distinct clusters are minimized. However, if the
number of clusters is not known in advance, the clustering algorithm should
also evaluate the relevance of each category accepted within the final map. In
this respect, clustering is called supervised whenever the number of clusters is preset, and unsupervised otherwise (Luchian, 1999).
The clustering algorithms can decide without any doubt (by means of
Boolean variables) if a sample belongs to a specific category. This type of
grouping is called hard. The alternative lies in fuzzy clustering, which provides
numerical values in [0, 1] for describing the membership degrees of a sample to
the depicted clusters. Atypically, some algorithms accept non-disjoint clusters
depicted via rough clustering. If not otherwise explicitly specified, hard
clustering is assumed in the following.
Most of the existing clustering algorithms are polythetic, namely they simultaneously use all the available d features for deciding the category to which a sample belongs. This strategy allows the configuration of any cluster map, although it leads to complex optimization problems involving high-dimensional search spaces. Otherwise, the problem could be split into a sequence of simpler, monothetic clustering processes, each one performed in relation to a single feature (Jain et al., 1999). In this latter case, the main risk consists in obtaining a large number of clusters or in the impossibility of providing a correct grouping for a predefined number of categories.
The main challenges of clustering could be related to noisy, sparse training data and/or high-dimensional feature spaces (Breaban, 2010). In this context, several preliminary feature-space pre-processing techniques could be helpful. Feature mapping projects the original feature set into a new one providing a more convenient landscape and/or a reduced dimension, but which is physically meaningless and difficult to interpret (Liu et al., 2003). Instead, feature selection (Dhillon et al., 2003) chooses a relevant subset from the original feature set while retaining the original physical significance, which supports a better understanding of the subsequent learning process (Boutsidis et al., 2009). However, the selection of data attributes relevant for clustering represents a difficult problem, due to the lack of class labels (Law et al., 2003).
In relation to the number of cluster maps available at the end, the
algorithms could be hierarchical or partitioning. Hierarchical clustering builds a
hierarchy of groups, which gathers multiple maps of clusters, designed for
different granularity levels. The most common representation of this hierarchy
is a tree-like nested structure called a dendrogram, $H = \{H_1, \ldots, H_p\}$. A
cluster in the hierarchy is divided into sub-clusters (child nodes placed on the lower level of the tree) or merged with other clusters to form a bigger cluster (parent node placed on the upper tree level). The hierarchy can be constructed by means of top-down (divisive) algorithms or bottom-up (agglomerative) algorithms. Afterwards, depending on its specific goals, the application can tune the trade-off between generality and detail, solely by selecting a convenient level of granularity in the depicted hierarchy. Given a cluster $C_i$ of partition $H_m$ and another cluster $C_j$ of partition $H_l$, the relation $m > l$ means that $H_m$ provides a grouping into fewer, more general clusters than $H_l$. In hard clustering, this implies that $C_i$ and $C_j$ are either disjoint, or $C_j$ is included in $C_i$.
On the other hand, partitioning clustering algorithms offer a single cluster map, usually built by means of an iterative, deterministic or stochastic optimization technique. The hard partitioning clustering procedure attempts to seek a k-partitioning of X into the clusters $C = \{C_1, \ldots, C_k\}$, such that $C_i \cap C_j = \emptyset$ for all $C_i, C_j \in C$, $i \neq j$, any $C_i \in C$ contains at least one element of X, and $X = \bigcup_{i=1}^{k} C_i$. Although the result seems to be less flexible, almost all of the algorithms belong to this latter category, given the involved computational complexities.
The dissimilarity between two different samples could be measured by means of distance metrics, tailored in relation to the type of attributes (Rokach & Maimon, 2005). For numeric continuous attributes, the most common are the Minkowski metrics (e.g., Euclidean, Manhattan and Chebyshev) (Han & Kamber, 2001; Xu & Wunsch II, 2009), the Mahalanobis distance (Jain et al., 1999; Xu & Wunsch II, 2009) and the Gaussian affinity. In the case of binary attributes, whether the attributes are symmetric (with states uniformly distributed) or asymmetric, the distance can be computed based on the Jaccard coefficient (Xu & Wunsch II, 2009) or on simple matching coefficients associated with contingency tables (Rokach & Maimon, 2005). For nominal attributes, the distance is determined via simple matching or by introducing a new binary attribute for each potential state of an original, nominal attribute. Besides, for nominal attributes, clustering can also be carried out via the dissimilarity analysis suggested by Ng and by (Cao et al., 2012). Unlike simple, hard matching - which discriminates between identical and different patterns only - this dissimilarity metric exploits the landscape of clusters in order to provide rough non-membership. Besides, (Cao et al., 2012)
takes into account the distribution of attribute values over the whole input
space. Lastly, the ordinal attributes can be treated as numerical, after their
mapping to [0, 1]. Obviously, mixed types of attributes can be also accepted, by
considering the weighted sum of squared corresponding distances (Rokach &
Maimon, 2005).
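For illustration, the Minkowski family mentioned above can be sketched in a few lines of Python (a minimal sketch; the feature vectors used here are hypothetical examples):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between two feature vectors."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

# Two hypothetical patterns described by d = 3 numeric features.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.5])

print(minkowski(x, y, p=2))       # Euclidean distance
print(minkowski(x, y, p=1))       # Manhattan distance
print(np.max(np.abs(x - y)))      # Chebyshev distance (limit case p -> infinity)
```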
An alternative concept to the dissimilarity distance metric is the
similarity metric, which reveals the similitude between patterns, i.e. provides
large values whenever the compared patterns are similar. The most popular similarity metrics are based on the cosine function (Xu & Wunsch II, 2009), the Pearson correlation (Rokach & Maimon, 2005), the extended Jaccard index (Strehl et al., 2000) and the Dice coefficient.
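For illustration, a minimal sketch of two of these similarity metrics (the feature vectors and the attribute sets below are hypothetical examples):

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors (1 means identical direction)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard_index(a, b):
    """Jaccard index between two sets of nominal attribute values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.5, 0.1, 1.8])
print(cosine_similarity(x, y))
print(jaccard_index({"red", "round"}, {"red", "square"}))  # 1/3
```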
The quality of the designed cluster map could be assessed by means of
validity measures, which basically check if the clustering structure is
meaningful. Unfortunately, most of the popular existing validations ignore the areas with low densities, hence they become unsuitable for clusters with different densities or large sizes. The external validity measures - such as the Rand index (Wagner & Wagner, 2007), the Jaccard coefficient (Xu & Wunsch II, 2009), the FM index (Ferragina & Manzini, 2001) and the Γ statistics (Xu & Wunsch II, 2009) - exploit some available information additional to X. Meanwhile, the internal validity measures make use of X only; this is the case of the cophenetic correlation coefficient CPCC (Xu & Wunsch II, 2009), the Goodman-Kruskal γ statistic (Taheri & Hesamian, 2011), Kendall's τ statistic (Jain & Dubes, 1988) and the V measure. The latter explicitly assesses the homogeneity and the completeness of clusters (Zhao & Karypis, 2001; Zhao & Karypis, 2006).
Moreover, the relative validity measures compare the results provided by
different clustering algorithms or the results produced by a single clustering
procedure for different values of its control parameters.
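For illustration, a minimal sketch of one external validity measure, the Rand index, computed as the fraction of sample pairs on which a clustering agrees with a reference labeling (the two label vectors are hypothetical examples):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """External validity: fraction of sample pairs on which two labelings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agreements += (same_a == same_b)
    return agreements / len(pairs)

# Hypothetical ground-truth classes vs. labels produced by a clustering run.
truth = [0, 0, 1, 1, 2, 2]
found = [1, 1, 0, 0, 0, 2]
print(rand_index(truth, found))   # 0.8 for this toy example
```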
3. Important Clustering Algorithms
Given the large variety of clustering algorithms proposed in the related
literature, it is difficult to separate them into firm, disjoint categories, based on the policy employed for cluster delimitation. Several classifications are available in (Jain et al., 1999; Berkhin, 2006; Periklis, 2003; Rokach & Maimon, 2005; Rokach, 2010); however, those categories are neither disjoint nor consistent. This survey considers four main types of clustering algorithms (distance-based, density-based, model-based and grid-based algorithms), which can cover almost all of the available approaches. A summary of this classification is
presented in Table 1, together with relevant references where the reader can find
various exemplifications.
3.1. Distance-based Clustering Techniques
Distance-based algorithms analyze the dissimilarity between samples
by means of a distance metric and assess the most representative pattern of each
cluster, called centroid. Afterwards, the class is decided by assigning the sample
to the closest centroid. With this in mind, centroids are found targeting small
dissimilarity distances to the samples of their own cluster and large dissimilarity
distances to the samples of the other clusters. Obviously, there are situations
when it becomes unclear how to assign a distance measure to a data set and how
to associate the weights of the features.
Table 1
Main Types of Clustering Algorithms

| Cluster policy | Key idea | Algorithm | References |
|---|---|---|---|
| Distance based | mean centroids | K-means | (Jain & Dubes, 1988; Wagstaff et al., 2001; Kanungo et al., 2002; Tan et al., 2006; Boutsidis et al., 2009) |
| | mean centroids | Fuzzy K-means | (Jang et al., 1996; Chang et al., 2011) |
| | medoid centers | K-medoids | (Jain & Dubes, 1988) |
| | medoid centers | PAM | (Kaufmann & Rousseeuw, 1990; Van der Laan et al., 2010; Vijayarani & Nithya, 2011) |
| | medoid centers | CLARA | (Kaufmann & Rousseeuw, 1990; Vijayarani & Nithya, 2011) |
| | medoid centers | CLARANS | (Ng & Han, 2002; Vijayarani & Nithya, 2011) |
| | median centroids | K-medians | (Guha et al., 2003; Indyk & Price, 2011) |
| Density based | ε vicinity of fixed size | DBSCAN | (Moreira et al., 2005; Tan et al., 2006; Mumtaz & Duraiswamy, 2010; Shah et al., 2012) |
| | ε vicinity of fixed size | SNN | (Moreira et al., 2005; Moellic et al., 2008) |
| | ε vicinity of variable size | OPTICS | (Ankerst et al., 1999; Alzaalan et al., 2012) |
| | vicinity of adaptive size | Subtractive clustering | (Jang et al., 1996; Ertoz et al., 2002; Bataineh et al., 2011) |
| | vicinity of adaptive size | AutoClass | (Hanson et al., 1991; Beitzel et al., 2007) |
| Model based | tree-based | Decision trees | (Liu et al., 2000; Lin et al., 2009) |
| | tree-based | BIRCH | (Garg et al., 2006; Hai Zhou & Yong Bin, 2010) |
| | tree-based | COBWEB | (Scanlan et al., 2006; Xia & Xi, 2007) |
| | neural networks | SOM | (Vesanto & Alhoniemi, 2000; Kohonen et al., 2001; Seiffert & Jain, 2001) |
| Grid based | single grid | O-CLUSTER | (Milenova & Campos, 2002; Ilango & Mohan, 2010) |
| | multiple grids | STING | (Wang et al., 1997; Ilango & Mohan, 2010; Han & Kamber, 2012) |
| | multiple grids | Wave Cluster | (Ilango & Mohan, 2010) |
| | adaptive grid | MAFIA | (Goil et al., 1999; Parsons et al., 2004; Ilango & Mohan, 2010) |
| | combined with density-based policies | ASGC | (Chang, 2009; Ilango & Mohan, 2010) |
| | combined with density-based policies | CLIQUE | (Agrawal et al., 2005; Ilango & Mohan, 2010) |
| | combined with density-based policies | Mountain clustering | (Jang et al., 1996; Yang & Wu, 2005) |
Usually, the mean centroid is employed. It results by averaging the
feature values of the samples located in the cluster, hence it is not necessarily
included in the training data set (Xu & Wunsch II, 2009). Another alternative is
to use the median centroid, defined by the median in every feature dimension
corresponding to the samples of the cluster. Also, the medians are not necessarily instances from the data set. On the other hand, a medoid-type centroid is selected
from the training data set, such that its average dissimilarity to the other
samples in the cluster is minimal.
K-means is the most popular partitioning clustering algorithm. It
assumes a predefined number of clusters and selects their mean centroids via an
iterative process, aimed at minimizing the within-cluster sum of squares (i.e.,
the sum of squared dissimilarity distances computed from each sample to its
centroid). Every iteration, the samples are reassigned to newly located centroids
and the centroids are then updated in accordance to the mean feature values
resulted in the updated clusters (Jain & Dubes, 1988; Tan et al., 2006; Boutsidis
et al., 2009). The procedure continues until all the centroids are stable.
Unfortunately, the convergence towards the global optimum is not guaranteed
and the final solution is strongly dependent to centroids’ initialization.
Moreover, it is worth mentioning that K-means is characterized by a high
computational complexity.
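For illustration, a minimal sketch of the canonical iteration described above (the two-blob data set is a hypothetical example; in practice, a library implementation would typically be preferred):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Canonical K-means: alternate sample reassignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initialization
    for _ in range(n_iter):
        # Assign every sample to its closest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of the samples assigned to it.
        new_centroids = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                                  else centroids[c] for c in range(k)])
        if np.allclose(new_centroids, centroids):               # all centroids are stable
            break
        centroids = new_centroids
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = k_means(X, k=2)
print(centroids)
```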
However, given its intuitive tracing, this algorithm is still extensively used, in a large variety of enhanced versions. For instance, GSA-KM mixes K-means with gravitational search algorithms, in order to increase the convergence speed and reduce the risk of remaining trapped in local optima (Hatamlou et al., 2012). Fuzzy K-means (Jang et al., 1996; Chang et al., 2011) extends the canonical K-means to clusters represented as fuzzy sets. The objective function is defined in compliance with the fuzzy formalism and, iteratively, a membership matrix is updated to indicate the membership degree of each sample to each cluster. The algorithm stops if the objective value is acceptable or if no improvement can be achieved during several successive iterations.
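The membership matrix update mentioned above can be sketched as follows (a minimal sketch; the fuzzifier m, the data and the centroids are illustrative assumptions, not the exact formulation used by the cited authors):

```python
import numpy as np

def fuzzy_memberships(X, centroids, m=2.0):
    """Update the membership matrix U (samples x clusters) for fuzzy K-means."""
    # Distance from every sample to every centroid (small epsilon avoids division by zero).
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0]])
centroids = np.array([[0.0, 0.1], [5.0, 5.0]])
U = fuzzy_memberships(X, centroids)
print(U)   # each row sums to 1; entries are membership degrees in [0, 1]
```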
K-means based approaches can be configured for hierarchical
clustering, too. The Hierarchical Data Divisive Soft Clustering (H2D-SC) is an
adaptive, hierarchical, fuzzy quality-driven clustering algorithm. H2D-SC
evaluates multiple objective functions for constructing the hierarchy. For each
node, a hierarchy of sub-nodes is generated based on the characteristics of the
corresponding cluster (cohesion, size, mass, fuzziness) and the entropy of the
partition (Bordogna & Pasi, 2012). As expected, clusters in the upper levels
describe more general classes.
Increased robustness to noise and outliers can be provided by the K-medoids algorithm, which, basically, groups data around medoids. The
algorithm starts with k initial random training samples, assigned as medoids.
Then, iteratively, for every medoid, the replacement with another non-medoid
sample is performed, whenever this change is advantageous in relation to the
adopted criterion. The process stops when no replacements occur at successive
iterations. Partitioning Around Medoids (PAM) (Van der Laan et al., 2010) is
the most common version of K-medoids. It minimizes the sum of dissimilarities
computed pair-wise between each training sample and its medoid. PAM
operates on the dissimilarity matrix defined on the entire data set, in order to
provide a graphical display, called silhouette plot, which allows the preliminary
selection of an optimal number of clusters.
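For illustration, a highly simplified sketch of the medoid-swapping idea on a precomputed dissimilarity matrix (it follows the swap logic only, not the full PAM build/swap procedure; the data are hypothetical):

```python
import numpy as np

def k_medoids(D, k, n_iter=50, seed=0):
    """Greedy medoid swapping on a precomputed dissimilarity matrix D (n x n)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))

    def total_cost(meds):
        # Each sample contributes its dissimilarity to the closest medoid.
        return D[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(n_iter):
        improved = False
        for mi in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids.copy()
                trial[mi] = candidate
                trial_cost = total_cost(trial)
                if trial_cost < cost:        # keep a swap only if it reduces the cost
                    medoids, cost, improved = trial, trial_cost, True
        if not improved:                     # stop when no advantageous replacement exists
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

# Hypothetical data and its pairwise Euclidean dissimilarity matrix.
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 4])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
print(k_medoids(D, k=2)[0])
```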
When large data sets are involved, PAM becomes impractical due to its
large computational complexity order. The easiest option lies in Clustering
LARge Applications (CLARA), which works similarly to PAM, yet on a
reduced data subset. This approximation may be improved by extracting
multiple subsets of samples, on which CLARA is separately applied (Vijayarani
& Nithya, 2011). However, a more efficient approach is the CLustering Algorithm based on RANdomized Search (CLARANS). Unlike CLARA, which
draws the whole subset of samples at the beginning of its search, CLARANS
chooses another subset of neighbor samples in each step, thus offering higher
quality clustering for a very small number of iterations (Ng & Han, 2002).
The basic K-means idea can also be translated to median centroids. The K-medians algorithm begins with the random selection of k centroids and improves them iteratively, until stabilization (Indyk & Price, 2011).
3.2. Density-based Clustering Algorithms
Density-based algorithms assign each sample to all the clusters with
different probabilities. Then, the clusters are defined as the areas of higher
density, meaning that some samples, called noisy or border points, could remain
“outside” the clusters, namely in the areas that separate the groups. The main idea of this clustering policy is to enlarge the existing cluster as long as the density in the neighborhood exceeds a certain threshold. For this reason, density-based clustering algorithms are suitable for finding non-linearly shaped clusters.
The most widely used algorithm is Density Based Spatial Clustering of
Applications with Noise (DBSCAN) (Mumtaz & Duraiswamy, 2010; Shah et
al., 2012). Let us assume that a point p is density reachable from another point q if p is within the ε distance from q, and q has a sufficient number of points in its neighborhood, within the specified distance. Let us also consider that points p and q are density connected if there exists a point r which has a sufficient number of points in its neighborhood, and both p and q are within the ε distance from r. Using these definitions, DBSCAN can be simply explained. It starts with an arbitrarily selected point that has not been visited before, extracts its ε neighborhood and marks it as visited. Then it checks if the neighborhood of the selected sample
contains more than the minimum number of points required to form a cluster. If
true, a cluster is created, else the sample is labeled as noisy. If an item belongs
to a cluster, then its ε neighborhood is also part of that cluster. The
neighborhood search continues until all the items in a cluster are determined. A
new, unvisited sample is selected and the procedure is repeated until all the
samples are marked as scanned. DBSCAN is capable of identifying noisy data
and arbitrarily shaped clusters, but it fails on high-dimensional data or on clusters with varying densities.
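For illustration, DBSCAN is available in common libraries; a minimal usage sketch with scikit-learn (assuming the library is installed; the values of ε and of the minimum number of points are hypothetical):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few isolated samples acting as noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(4, 0.3, size=(50, 2)),
               rng.uniform(-2, 6, size=(5, 2))])

model = DBSCAN(eps=0.5, min_samples=5).fit(X)   # eps = ε neighborhood radius
print(set(model.labels_))                        # label -1 marks noisy points
```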
The Ordering Points to Identify the Clustering Structure (OPTICS) is an
extension of DBSCAN allowing variable sizes of ε neighborhoods (Ankerst et
al., 1999; Alzaalan et al., 2012), i.e. ε ∈ [0, Eps]. The smallest distance between a core pattern and another pattern in its ε neighborhood is called the core-distance, while the smallest distance between a core and a density-reachable pattern is called the reachability distance. OPTICS does not assign cluster memberships, yet it builds a priority queue, according to the resulting core-distances and reachability distances.
Increased robustness can also be achieved by means of the Shared Nearest Neighbor (SNN) clustering algorithm (Wu et al., 2007). Clustering starts with the construction of a similarity graph, which associates patterns to nodes and similarities between patterns to the weights of the edges. Then, the corresponding similarity matrix is processed, by keeping only the k most similar neighbors of each node. Based on the resulting simplified graph, the density of each pattern is computed and the points with density higher than a predefined threshold are set as core points. A pair of core points is in the same cluster if they share a minimum ε-radius area. All non-core points that are not within an ε neighborhood of a core point are classified as noise. The remaining points are assigned to the cluster that contains the most similar core points.
In Subtractive Clustering, the training patterns stand as nominees for
cluster representatives (Jang et al., 1996; Bataineh et al., 2011). Iteratively,
the patterns having the highest Gaussian-based densities are selected as centers; then, the centers which are spatially too close are reanalyzed and the unnecessary ones are rejected, by adjusting the density of the previously selected centers.
The Automatic Classification (AutoClass) algorithm employs a Bayesian technique for determining the most probable number of classes and the probabilities of membership to the classes assigned to each pattern (Beitzel et al.,
2007). In addition to the training data set, AutoClass requires explicit functional
models of classes.
3.3. Model-based Clustering
Model based clustering algorithms assume that the pattern set was
generated by a model and try to recover it (Manning et al., 2009). This model
inherently defines the clusters and the assignment of patterns to clusters. The
approach can also generate multiple clustering models, which may be
probabilistically mixed or insightfully compared. Given the stochastic nature of
X (McLachlan & Peel, 2000), the quality of a clustering model can be assessed
by means of Akaike’s Information Criterion -AIC (Xu & Wunsch II, 2009),
Bayesian Information Criterion - BIC (Pelleg & Moore, 2000), Minimum
Description Length - MDL (Xu & Wunsch II, 2009) or Non coding Information
Theoretic Criterion - ICOMP (Xu & Wunsch II, 2009).
The large variety of available model templates leads to a multitude of model-based clustering techniques. Many model-based approaches support
hierarchical clustering. In this attempt, decision trees are the most popular ones,
due to their intuitive interpretation. They use a purity function to separate the
data space into different regions (Liu et al., 2000). A cluster tree captures the
natural distribution of the data. The tree is then refined to highlight the most
meaningful clusters.
More successful in overcoming the noise issues and the propagation of
errors from lower levels within the whole hierarchy is the Localized Diffusion
Folders (LDF) algorithm. The key idea of LDF is to utilize the local geometry
of the data and the local neighborhood of each data point in order to construct a
new local geometry at each step (David & Averbuch, 2012). The local metric
consists in the diffusion distance. Assuming a graph-based representation which
associates the nodes to the patterns and the weights of edges to the
dissimilarities between patterns (normalized in compliance with a Markov
process), the diffusion distance between two patterns illustrates the probability
of transition among the corresponding nodes. The hierarchy is constructed
bottom-up, by partitioning the data into local diffusion folders, based on the
localized affinity specific to the simplified geometry.
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
associates a Clustering Feature (CF) to each dense region of the input space.
Basically, a CF gathers the main characteristics of a sub-cluster, i.e. the number
of its data points, the sum of the feature vectors and the sum of squared feature
vectors for the samples located within the sub-cluster (Garg et al., 2006; Hai
Zhou & Yong Bin, 2010). In relation to CFs, the algorithm builds several tree-based structures, by means of agglomerative techniques. In each resulting tree, a leaf absorbs the patterns belonging to a region, hence these representations are compact.
COBWEB is a well-known clustering algorithm that incrementally
organizes the observations into a classification tree. Each node represents a
class and is labeled by a probabilistic term summarizing the attribute
distributions for the patterns classified under that node. Hence, the clusters are
described by a list of nominal values and probabilities (Scanlan et al., 2006; Xia
& Xi, 2007). COBWEB maximizes the probability that two patterns in the same
cluster have common attribute values, as well as the probability that patterns
from different clusters have different attribute values.
The hybrid model proposed in (Fan et al., 2011) integrates data
clustering, fuzzy decision tree and genetic algorithms to construct a medical
classification system. The fuzzy decision tree performs a high level
discrimination of samples. Additional details are obtained with supplementary
weighted clustering for each case treated within the fuzzy decision tree. Optimal
fuzzy membership functions and optimal structure of the fuzzy tree are designed
by means of genetic algorithms.
Many types of neural networks based on unsupervised training can be used for clustering, including neural networks with radial basis functions and Self-Organizing Maps (SOM). For instance, the latter can be used for building the dendrogram in agglomerative clustering. The SOM learning process aims to generate low-dimensional representations of the input space, called maps (Vesanto & Alhoniemi, 2000; Kohonen, 2001). Each map unit is represented by a prototype vector. The training is an iterative procedure that randomly chooses a sample vector and computes the distances to all the prototype vectors. The map unit with the closest prototype and its topological neighbors are moved closer to the selected sample.
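A minimal sketch of one such training step (the map size, learning rate and neighborhood radius are illustrative assumptions):

```python
import numpy as np

def som_step(prototypes, grid, sample, lr=0.5, radius=1.0):
    """Move the best-matching unit and its topological neighbors towards the sample."""
    # Best-matching unit: the prototype closest to the selected sample.
    bmu = np.argmin(np.linalg.norm(prototypes - sample, axis=1))
    # Neighborhood factor decays with the grid distance to the BMU.
    grid_dist = np.linalg.norm(grid - grid[bmu], axis=1)
    h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
    return prototypes + lr * h[:, None] * (sample - prototypes)

# A 3x3 map: one prototype vector per map unit, plus the unit coordinates on the grid.
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
prototypes = np.random.rand(9, 2)
sample = np.array([0.9, 0.1])
prototypes = som_step(prototypes, grid, sample)
```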
In relation to any other model template, the design can be performed via
a general-purpose learning method. For instance, Expectation Maximization
Algorithm (EM) alternates two steps, namely the expectation and the
maximization. The first one estimates the distribution over classes given a
certain model, while the second chooses new parameters of the model in terms
of maximizing the expected log-likelihood of hidden variables and observed
data (Gupta & Chen, 2011).
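For illustration, EM applied to a Gaussian mixture model is available in common libraries; a usage sketch with scikit-learn (assuming the library is installed; the number of components is a hypothetical choice):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5])

# EM alternates the expectation step (class responsibilities) and the maximization step
# (means, covariances, mixing weights) until the log-likelihood stabilizes.
gmm = GaussianMixture(n_components=2, covariance_type="full").fit(X)
print(gmm.means_)                 # estimated cluster centers
print(gmm.predict_proba(X[:3]))   # soft membership of the first samples
print(gmm.bic(X))                 # BIC, one of the model-selection criteria cited above
```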
3.4. Grid-based Clustering
A different clustering policy consists in dividing the feature space into a
finite number of grid cells (usually, hyper-rectangles). Obviously, the computational time strongly depends on the granularity of the grid.
Orthogonal partitioning CLUSTERing (O-CLUSTER) (Milenova &
Campos, 2002) identifies continuous areas of high density, by means of a partitioning active sampling technique. Borders parallel to the axes split the
input space and define the structure of the corresponding clustering tree. The
partitions iteratively depicted within the feature space can be frozen (no
additional splitting needed) or active (supplementary splitting required).
STatistical INformation Grid (STING) is a multi-resolution grid-based
clustering approach working with hyper-rectangular cells. STING exploits a
hierarchical structure of resolutions, which stores the statistical information of
each cell, such as the number of samples or the mean/ minimum/ maximum/
spread of the attribute values. The hierarchy can be scanned from a level
containing any number of cells. The relevancy of each cell is computed and the
useless ones are removed (Wang et al., 1997; Han & Kamber, 2012).
When very large databases are involved, multi-resolution clustering
could be done by means of Wave Cluster algorithm. It obtains a new search
space for dense regions, by applying the wavelet transform to the original
feature space (Ilango & Mohan, 2010). The wavelet transform allows a
natural identification of clusters within the search space, whilst ignoring the
isolated patterns.
Adaptive grids for fast subspace clustering are configured by
Merging of Adaptive Intervals Approach (MAFIA). A histogram gives the minimum number of cells along each feature, while nearby cells with similar histogram values are merged (Goil et al., 1999; Parsons et al., 2004; Ilango & Mohan, 2010).
The Axis Shifted Grid Clustering (ASGC) algorithm is a clustering
technique that combines density and grid based methods (Chang, 2009; Ilango
& Mohan, 2010). It uses two grid-based structures in order to reduce the impact
of cell borders. The second grid is constructed by shifting the first grid by half
of cell width along each axis. ASGC starts with data set splitting, according to
the first grid structure. Then, the cells having density values over a given
threshold are identified and grouped together to form the clusters. Lastly, the
second grid is applied in order to improve the previously determined clusters.
The combination between density and grid-based approaches is also
used by CLustering in QUEst (CLIQUE). Here, a unit represents a hypercube
depicted in the feature space in terms of a predefined step and the selectivity of
a unit is defined as the number of patterns located in that unit. A unit is considered dense if its selectivity is greater than or equal to a threshold value. Within this
context, a cluster results as the maximal set of connected dense units (Ilango &
Mohan, 2010).
Mountain clustering approximates the cluster centers using a density
function, too (Jang et al., 1996; Yang & Wu, 2005). The mountain density
function may be pictured as a density measure which increases with the number
of points located nearby. The first step consists in constructing a uniform or
non-uniform grid over the feature space. The grid lines’ intersections represent
the initial centers of the clusters. Iteratively, the list of centers is reduced to the
points with high mountain values; each reconfiguration of centers being
followed by an update of mountain densities. This method is recommended for
setting the initial centers required by a more sophisticated clustering algorithm
or for a standalone, fast, approximate clustering.
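A minimal sketch of the mountain function evaluated at the grid intersections (the grid resolution and the smoothing constant σ are illustrative assumptions; this is a simplified form of the density measure described above):

```python
import numpy as np

def mountain_values(grid_points, X, sigma=1.0):
    """Mountain density at each grid vertex: grows with the number of nearby samples."""
    d2 = np.linalg.norm(grid_points[:, None, :] - X[None, :, :], axis=2) ** 2
    return np.exp(-d2 / (2 * sigma ** 2)).sum(axis=1)

# Hypothetical 2-D data and a uniform grid over the feature space.
X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 4])
xs = np.linspace(-2, 6, 9)
grid_points = np.array([[x, y] for x in xs for y in xs])

m = mountain_values(grid_points, X)
print(grid_points[np.argmax(m)])   # vertex with the highest mountain value = first center
```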
4. Evolutionary Clustering
Evolutionary algorithms are universal optimization methods, capable of
dealing with noisy, non-linear, time-variant objective functions and large,
high-dimensional search spaces. Needing no a priori information about the problem to solve, they are suitable for tackling unsupervised grouping. Besides, the
flexibility of evolutionary computation lies in the compatibility with any
objective function and any types of features (Baeck et al., 2000; Langdon &
Poli, 2002; Cox, 2005; Coello Coello et al., 2007; Corchado et al., 2011).
Moreover, given their population-based execution, the evolutionary approaches
exhibit increased robustness and a reduced risk of stagnating in local optima (a risk which is critical for canonical, gradient-based optimization methods).
These characteristics enable the use of evolutionary techniques in
almost any type of clustering algorithm, if their inherent drawbacks are
acceptable for the application. It is worth mentioning here the large amount of
requested computational resources, the sensitivity to control parameters (e.g.,
population size, number of generations etc.) and the lack of convergence
guarantee. However, when addressing the optimizations required by the clustering algorithms, a simple replacement of other techniques with the evolutionary ones is not the most fruitful approach. Besides, in the case of very large data sets, this strategy can often become impractical. It is more advisable to re-design the clustering algorithm in better agreement with the capabilities of the evolutionary techniques. Within this context, it is of great
interest to notice that the evolutionary approaches can be configured to handle
multiple objectives, and/or equality/inequality constraints. Hence, they allow
optimization statements which are better tailored to the real environments.
Based on their large popularity, genetic algorithms are the most
frequently addressed in clustering research. They can be used in three different
ways: as a standalone technique; as a preliminary clustering meant to tune the initial parameters of a subsequent clustering (e.g., the initial centroids of K-means); or as a technique embedded in a hybrid approach.
The standalone genetic clustering works on chromosomes which encode
the cluster map. For instance, in partitioning distance-based algorithms, the
chromosome can encode a set of centroids describing a potential cluster map,
whilst decision tree-based clustering asks for the encoding of a decision tree.
Obviously, in the latter case, genetic programming could be preferred, as it brings implicit tools for tree-based chromosomal representations. For instance,
Genetic Programming with Unconstrained Decision Trees (GP-UDT) provides
the evolutionary configuration of clustering decision trees. Instead of a simple
comparison, the decisional nodes of the hierarchical structure can embed
complex mathematical equations, shaped by means of crossover and mutation
(Wang et al., 2011). Other evolutionary clustering approaches are presented in
(Luchian, 1999). One of them provides distance-based unsupervised clustering,
via variable-length chromosomes. More specifically, a chromosomal string
encodes the centroids of the clusters for a potential partition of X. In order to
avoid deception, the centroids are sorted before recombination. Obviously, the approach calls for genetic operators compatible with variable-length representations. The objective function is customized to consider both within-cluster and inter-cluster distances.
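As an illustration of the centroid-encoding idea, a compact sketch of standalone genetic clustering with fixed-length chromosomes (the population size, selection scheme, mutation scale and within-cluster-distance fitness are illustrative choices, simpler than the variable-length scheme discussed above):

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(chromosome, X, k):
    """Negative within-cluster sum of squares for the centroids encoded in the chromosome."""
    centroids = chromosome.reshape(k, X.shape[1])
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return -np.sum(d.min(axis=1) ** 2)

def genetic_clustering(X, k=2, pop_size=30, generations=100, mut_scale=0.3):
    dim = k * X.shape[1]
    # Each chromosome is a flattened set of k centroids, initialized around the data mean.
    population = np.tile(X.mean(axis=0), k) + rng.normal(0, X.std(), size=(pop_size, dim))
    for _ in range(generations):
        scores = np.array([fitness(c, X, k) for c in population])
        order = np.argsort(scores)[::-1]
        parents = population[order[:pop_size // 2]]            # truncation selection
        # One-point crossover between randomly paired parents.
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, dim)
            children.append(np.concatenate([a[:cut], b[cut:]]))
        children = np.array(children)
        # Gaussian mutation of the offspring.
        children += rng.normal(0, mut_scale, size=children.shape)
        population = np.vstack([parents, children])
    best = population[np.argmax([fitness(c, X, k) for c in population])]
    return best.reshape(k, X.shape[1])

X = np.vstack([rng.normal(0, 0.5, (40, 2)), rng.normal(4, 0.5, (40, 2))])
print(genetic_clustering(X, k=2))
```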
Hybrid approaches consider the symbiosis of evolutionary techniques with other clustering algorithms. Most of them are model-based. For instance,
in (Gharavian et al., 2012), the model-based clustering is performed with a
genetically designed ARTMAP fuzzy neural network, and feature selection is
accomplished with a correlation-based filter. The system is used for recognizing
several emotion categories in recorded speech. Another hybrid approach has already been presented in Section 3.3.
Evolutionary algorithms have been also successfully applied in feature
selection, especially for the applications involving a large number of features
and reduced a priori expertise. An innovative genetic unsupervised feature
selection is proposed in (Shamsinedadbabki & Saraee, 2011). It is driven by the
assessment of feature specialization degrees. Once a feature subset is selected, the clustering procedure is accomplished with K-means or K-nearest neighbors (KNN).
Genetic weighted KNN classifiers for network anomaly detection are
used in (Su, 2011). In this case, the genetic algorithm is in charge of selecting the feature weights, and a clustering algorithm is used for building a reduced training data set, with fewer representative patterns. This training data set enables a faster computation, making the employment of evolutionary algorithms valuable. Compared to the traditional weighted KNN classifiers, the genetic WKNN offers better accuracy.
In many empirical investigations, the hybridization with evolutionary techniques led to better accuracy than the hybridization with other direct optimization techniques. The explanation could be mainly related to the use of multiple solutions per iteration and the advantageous exploitation of parental genetic material during offspring production. However, it is worth mentioning another stochastic search technique intensively used in clustering, namely simulated annealing. Its version with the Boltzmann trick is preferred.
In this case, the survival of worse solutions is accepted with a certain
probability, in order to reduce the risk of locking in local optima.
5. Conclusions
This paper surveys the theoretical background of unsupervised pattern
classification and presents the basic necessary taxonomy, the most popular
metrics for pattern similitude analysis, as well as some common clustering
validity measures.
The most important clustering algorithms are reviewed, separated into four broad categories: distance-based, density-based, model-based and grid-based. In doing so, the overview outlines the key approaches in the field,
focusing on the clustering policies they employ.
The main advantages of the evolutionary clustering algorithms are discussed, in comparison with non-evolutionary approaches.
REFERENCES
Agrawal R., Gehrke J., Gunopulos D., Raghavan P., Automatic Subspace Clustering of
High Dimensional Data. Data Mining and Knowledge Discovery, 11, 1, 5−33,
Springer Science and Business Media, Netherlands, 2005.
Alzaalan M.E., Aldahdooh R.T., Ashour W., EOPTICS Enhanced Ordering Points to
Identify the Clustering Structure. International Journal of Computer
Applications, 40, 17, IJCA Journal, Foundation of Computer Science, New
York, 2012.
Ankerst M., Breunig M., Kriegel H.P., Sander J., OPTICS: Ordering Points To Identify
the Clustering Structure. Proceedings of the ACM SIGMOD International
Conference on Management of Data SIGMOD ‘99, 49−60, Philadelphia, 1999.
Baeck T., Fogel D.B., Michalewicz Z., Evolutionary Computation 2: Advanced
Algorithms and Operators. Institute of Physics Publishing, United Kingdom,
2000.
Bataineh K.M., Naji M., Saqer M., A Comparison Study Between Various Fuzzy
Clustering Algorithms. Jordan Journal of Mechanical and Industrial
Engineering JJMIE, 5, 4, 335−343, 2011.
Beitzel S.M., Jensen E.C., Lewis D.D., Chowdhury A., Frieder O., Automatic
Classification of Web Queries Using Very Large Unlabeled Query Logs. ACM
Transactions on Information Systems, 25, 2, Article 9, New York, 2007.
Berkhin P., A Survey of Clustering Data Mining Techniques. In Kogan J., Nicholas C.,
Teboulle M. (Eds.), Grouping Multidimensional Data, Recent Advances in
Clustering, 25-71, Springer-Verlag, 2006.
Bordogna G., Pasi G., A Quality Driven Hierarchical Data Divisive Soft Clustering for
Information Retrieval. Knowledge-Based Systems, 26, 9−19, Elsevier, 2012.
Boutsidis C., Mahoney M.W., Drineas P., Unsupervised Feature Selection for the K-means Clustering Problem. In Annual Advances in Neural Information
Processing Systems, Proceedings of the NIPS’09 Conference, 2009.
Breaban M., Optimized Ensembles for Clustering Noisy Data. In Blum C., Battiti R.
(Eds.) Learning and Intelligent Optimization, Lecture Notes in Computer
Science, 6073, 220−223, Springer, Berlin, 2010.
Cao F., Liang J., Li D., Bai L., Dang C., A Dissimilarity Measure for the K-Modes
Clustering Algorithm. Knowledge Based Systems, 26, 120−127, Elsevier,
2012.
Chang C., Lin N.P., Jan N.Y., An Axis-Shifted Grid Clustering Algorithm. Tamkang
Journal of Science and Engineering, 12, 2, 183−192, 2009.
Chang C.T., Lai J.C.Z., Jeng M.D., A Fuzzy K Means Clustering Algorithm Using
Cluster Center Displacement. Journal of Information Science and Engineering,
27, 995−1009, 2011.
Coello Coello C.A., Lamont G.B., van Veldhuizen D.A., Evolutionary Algorithms for
Solving Multi-Objective Problems 2nd Edition. In Goldberg D.E., Koza J.R.,
Genetic and Evolutionary Computation. Springer Science and Business Media,
2007.
Corchado E., Kurzynski M., Wozniak M., Hybrid Artificial Intelligent Systems.
Proceedings of the 6th International Conference HAIS’11, Part I, Lecture Notes
on Artificial Intelligence, 6678, Springer, Poland, 2011.
Cox E., Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration.
Morgan Kaufmann Publishers, Elsevier, 2005.
David G., Averbuch A., Hierarchical Data Organization, Clustering and De-noising
via Localized Diffusion Folders. Applied and Computational Harmonic
Analysis, 33, 1, 1−23, 2012.
Dhillon I., Kogan J., Nicholas C., Feature Selection and Document Clustering. In Berry
M.W. (Ed.), Survey of Text Mining I: Clustering, Classification, and Retrieval,
1, 73−100, Springer, 2003.
Duin R.P.W., Pekalska E., Open Issues in Pattern Recognition. In Kurzynski M.,
Puchala E., Wozniak M., Zolnierek A. (Eds.), Advances in Soft Computing
Series, Computer Recognition Systems, Part I, Proceedings of the 4th
International Conference on Computer Recognition Systems CORES ’05, 30,
27−42, Springer, Poland, 2005.
Ertoz L., Steinbach M., Kumar V., A New Shared Nearest Neighbor Clustering
Algorithm and Its Applications. In Grossman R., Han J., Kumar V., Mannila H.,
Motwani R. (Eds.), Workshop on Clustering High Dimensional Data and Its
Applications, Proceedings of the 2nd SIAM International Conference on Data
Mining, 105−115, 2002.
Fan C.Y., Chang P.C., Lin J.J., Hsieh J.C., A Hybrid Model Combining Case-based
Reasoning and Fuzzy Decision Tree for Medical Data Classification. Applied
Soft Computing, 11, 1, 632−644, 2011.
Ferragina P., Manzini G., The FM-Index: A Compressed Full Text Index Based on the
BWT. An Experimental Study of a Compressed Index. Information Sciences,
135, 1-2, 13−28, 2001.
Garg A., Mangla A., Gupta N., Bhatnagar V., PBIRCH A Scalable Parallel Clustering
Algorithm for Incremental Data, The 10th International Database Engineering
and Applications Symposium, IDEAS’06, 315-316, 2006.
Gharavian D., Sheikhan M., Nazerieh A., Garoucy S., Speech Emotion Recognition
Using FCBF Feature Selection Method and GA-optimized Fuzzy ARTMAP
Neural Network. Neural Computing and Applications, 21, 8, 2115−2126, 2012.
Goil S., Nagesh H., Choudhary A., MAFIA: Efficient and Scalable Subspace Clustering
for Very Large Data Sets. Technical Report No CPDC-TR-9906-010, Center
for Parallel and Distributed Computing, 1999.
Guha S., Meyerson A., Mishra N., Motwani R., O’Callaghan L., Clustering Data
Streams: Theory and Practice. Journal IEEE Transactions on Knowledge and
Data Engineering, 15, 3, 515−528, 2003.
Gupta M.R., Chen Y., Theory and Use of the EM Algorithm. Foundations and Trends in
Signal Processing, 4, 3, 223−296, 2011.
Hai Zhou, D., Yong Bin, L. An Improved BIRCH Clustering Algorithm and Application
in Thermal Power. International Conference on Web Information Systems and
Mining WISM, 1, 53-56, IEEE Society, 2010.
Han J., Kamber M., Data Mining: Concepts and Techniques. The Morgan and
Kaufmann Series in Data Management Systems, Morgan Kaufmann
Publishers, Academic Press, 2001.
Han J., Kamber M., Pei J., Data Mining Concepts and Techniques. Third Edition,
Morgan Kaufmann, Elsevier, 2012.
Hanson R., Stutz J., Cheeseman P., Bayesian Classification with Correlation and
Inheritance. Proceedings of 12th International Joint Conference on Artificial
Intelligence, 692−698, 1991.
Hatamlou A., Abdullah S., Nezamabadi-pour H., A Combined Approach for Clustering
Based on K-Means and Gravitational Search Algorithms. Swarm and
Evolutionary Computation, 6, 5, 47−52, Elsevier, 2012.
Ilango M.R., Mohan V., A Survey of Grid Based Clustering Algorithms. International
Journal of Engineering Science and Technology, 2, 8, 3441−3446, 2010.
Indyk P., Price E., K-median Clustering, Model Based Compressive Sensing, and Sparse
Recovery for Earth Mover Distance. Proceedings of the 43rd Annual ACM
Symposium on Theory of Computing, STOC ’11, 627−636, California, 2011.
Jain A.K., Dubes R., Algorithms for Clustering Data. Englewood Cliffs, Prentice Hall,
1988.
Jain A.K., Murty M.N., Flynn P.J., Data Clustering: A Review. ACM Computing
Surveys, 31, 3, 264−323, 1999.
Jang J.S.R., Sun C.T., Mizutani E., Neuro-Fuzzy and Soft Computing. A Computational
Approach to Learning and Machine Intelligence. Prentice Hall, 1996.
Kaufmann L., Rousseeuw P.J., Finding Groups in Data: An Introduction to Cluster
Analysis. John Wiley & Sons, 1990.
Kohonen T., Schroeder M.R., Huang T.S., Self Organizing Maps 3rd Edition. Springer-Verlag, 2001.
Kanungo T., Mount D.M., Netanyahu N.S., Piatko C.D., Silverman R., Wu A.Y., An
Efficient K-means Algorithm Analysis and Implementation. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 24, 7, 881−892, 2002.
Langdon W.B., Poli R., Foundations of Genetic Programming. Springer, 2002.
Law M.H.C., Jain A.K., Figueiredo M.A.T., Feature Selection in Mixture Based
Clustering. Advances in Neural Information Processing Systems, 15, 609−616,
2003.
Lin P., Lei Z., Chen L., Yang J., Liu F., Decision Tree Network Traffic Classifier via
Adaptive Hierarchical Clustering for Imperfect Training Dataset. International
Conference on Wireless Communications Networking and Mobile Computing
WiCom’09, 1−6, Beijing, 2009.
Liu B., Xia Y., Yu P.S., Clustering via Decision-Tree Construction. Proceedings of the
9th International Conference on Information and Knowledge Management,
CIKM’00, 20-29, New York, 2000.
Liu T., Liu S., Chen Z., Ma W.Y., An Evaluation on Feature Selection for Text
Clustering. Proceedings of the 20th International Conference on Machine
Learning, ICML’03, Washington, 2003.
Luchian H., Three Evolutionary Approaches to Clustering. Evolutionary Algorithms in
Engineering and Computer Science, John Wiley & Sons, 1999.
Manning C.D., Raghavan P., Schutze H., An Introduction to Information Retrieval.
Cambridge University Press, 2009.
McLachlan G., Peel D., Finite Mixture Models. John Wiley & Sons, 2000.
Milenova B.L., Campos M.M., O-CLUSTER: Scalable Clustering of Large High
Dimensional Data Sets. IEEE International Conference on Data Mining
ICDM’02, 290−297, 2002.
Moellic P.A., Haugeard J.E., Pitel G., Image Clustering Based on a Shared Nearest
Neighbors Approach for Tagged Collections. Proceedings of the 2008
International Conference on Content based Image and Video Retrieval
CIVR’08, 269−278, ACM, 2008.
Moreira A., Santos M., Carneiro S., Density-based Clustering Algorithms – DBSCAN
and SNN. University of Minho, Portugal, 2005.
Mumtaz K., Duraiswamy K., An Analysis on Density Based Clustering of Multi
Dimensional Spatial Data. Indian Journal of Computer Science and
Engineering, 1, 1, 8−12, 2010.
Ng R.T., Han J., CLARANS: A Method for Clustering Objects for Spatial Data Mining.
IEEE Transactions on Knowledge and Data Engineering, 14, 5, 1003−1016,
2002.
Parsons L., Haque E., Liu H., Subspace Clustering for High Dimensional Data: A
Review. SIGKDD Explorations, 6, 1, 90−105, 2004.
Pelleg D., Moore A., X-Means Extending K-means with Efficient Estimation of the
Number of Clusters. In Langley P. (Ed.), Proceedings of the 17th International
Conference on Machine Learning ICML’00, 727−734, Stanford, 2000.
Periklis A., Tsaparas P., Miller R.J., Sevcik K.C., Clustering Categorical Data Based
On Information Loss Minimization. In 2nd Hellenic Data Management
Symposium, HDMS'03, Athens, 2003.
Rokach L., Maimon O., Chapter 15: Clustering Methods. In Maimon O., Rokach L.
(Eds.), Data Mining and Knowledge Discovery Handbook, 321−352, Springer
Heidelberg, 2005.
Rokach L., Chapter 14: A Survey of Clustering Algorithms. In Rokach L., Maimon O.
(Eds.), Data Mining and Knowledge Discovery Handbook 2nd Edition, Part III,
269−298, Springer, 2010.
Scanlan J., Hartnett J., Williams R., Dynamic Web Profile Correlation Using COBWEB.
Artificial Intelligence AI’06, Lecture Notes on Artificial Intelligence, 4304,
1059−1063, Springer-Verlag, 2006.
Seiffert U., Jain L.C., Self Organizing Maps. Recent Advances and Applications.
Springer-Verlag, 2001.
Shah G.H., Bhensdadia C.K., Ganatra A.P., An Empirical Evaluation of Density Based
Clustering Techniques. International Journal of Soft Computing and
Engineering IJSCE, 2, 1, 216−223, 2012.
Shamsinedadbabki P., Saraee M., A New Unsupervised Feature Selection Method for
Text Clustering Based in Genetic Algorithm. Journal of Intelligent Information
Systems, 38, 3, 669−684, Springer, 2011.
Strehl A., Ghosh J., Mooney R., Impact of Similarity Measures on Web Page
Clustering. In Proceedings of AAAI Workshop on AI for Web Search, 58−64,
2000.
Su M.Y., Using Classifiers to Improve the KNN Based Classifiers for Online Anomaly
Network Traffic Identification. Journal of Network and Computer Applications,
34, 2, 722−730, Elsevier, 2011.
Taheri S.M., Hesamian G., Goodman-Kruskal Measure of Association for Fuzzy
Categorized Variables. Kybernetika, 47, 1, 110−122, Institute of Information Theory and Automation AS CR, 2011.
Tan P.N., Steinbach M., Kumar V., Introduction to Data Mining, Chapter 8, Cluster
Analysis Basic Concepts and Algorithms. 125−146, Pearson Addison Wesley,
2006.
Van der Laan M., Pollard K., Bryan J., A New Partitioning Around Medoids Algorithm.
Journal of Statistical Computation and Simulation, 73, 8, 575−584, Taylor and
Francis, 2010.
Vesanto J., Alhoniemi E., Clustering of the Self Organizing Map. IEEE Transactions on
Neural Networks, 11, 3, 586−600, 2000.
Vijayarani S., Nithya S., An Efficient Clustering Algorithm for Outlier Detection.
International Journal of Computer Applications, 32, 7, 22−27, 2011.
Vinh N.X., Epps J., Bailey J., Information Theoretic Measures for Clustering’s
Comparison: Is a Correction for Chance Necessary?, Proceedings of the 26th
International Conference on Machine Learning, Montreal, 2009.
Wagstaff K., Cardie C., Rogers S., Schroedl S., Constrained K-means Clustering with
Background Knowledge. Proceedings of the 18th International Conference on
Machine Learning, 577−584, Morgan and Kaufmann Publishers, 2001.
Wang P., Weise T., Chiong R., Novel Evolutionary Algorithms for Supervised
Classification Problems. An Experimental Study, Evolutionary Intelligence, 4,
1, 3−16, Springer, 2011.
Wang W., Yang J., Muntz R., STING A Statistical Information Grid Approach to
Spatial Data Mining. Proceedings of the 23rd Very Large Data Base
Environment, VLDB, 186−195, Morgan and Kaufmann, Athens, 1997.
Wu W., Guo W., Tan K.L., Distributed Processing of Moving K Nearest Neighbor
Query on Moving Objects. ICDE’07, 1116−1125, 2007.
Xia Y., Xi B., Conceptual Clustering Categorical Data with Uncertainty. 19th IEEE
International Conference on Tools with Artificial Intelligence, ICTAI’07, 1,
329−336, Paris, 2007.
Xu R., Wunsch II R.C., Clustering. IEEE Press Series on Computational Intelligence,
John Wiley and Sons, 2009.
Yang M.S., Wu K.L., A Modified Mountain Clustering Algorithm. Pattern Analysis
Applications Journal, 8, 1, 125−138, Springer, London, 2005.
Zhao Y., Karypis G., Criterion Functions for Document Clustering - Experiments and
Analysis. Technical Report. University of Minnesota, Department of Computer
Science, Army HPC Research Center, 2001.
Zhao Y., Karypis G., Criterion Functions for Clustering on High-Dimensional Data. In
Kogan J., Nicholas C., Teboulle M. (Eds.), Grouping Multidimensional Data,
Recent Advances in Clustering, 211−237, Springer-Verlag, 2006.
SURVEY OF UNSUPERVISED CLASSIFICATION ALGORITHMS
(Abstract)
Unsupervised grouping of patterns still represents an open research problem, the main difficulties being related especially to the fact that the grouping must be performed using reduced a priori information, sometimes even without knowing
the number of relevant categories. Taking into account the large volume of work in the field, we aim to provide an updated survey covering the most important approaches in the domain. Recent algorithms are compared with the classical ones (based on similar grouping techniques), illustrating on this occasion the main aspects that would require further improvement. The paper also discusses the opportunity of using evolutionary techniques for solving unsupervised grouping problems and offers several illustrative examples in this respect.