gSOM - a new gravitational clustering algorithm based on the self-organizing map
Nejc Ilc
University of Ljubljana, Faculty of Computer and Information Science
Trzaska 25, 1000 Ljubljana, Slovenia
[email protected]
Data clustering is a fundamental data analysis method, widely used in data mining, feature extraction, pattern recognition, function approximation, vector quantization and image segmentation. Numerous clustering algorithms exist, based on various theories and approaches. In this paper we present a novel two-level clustering algorithm (gSOM) based on Kohonen's self-organizing map (SOM) and gravitational clustering. In the first level, the SOM is applied to the input data and prototypes are extracted. In the second level, each prototype acts as a unit-mass point in a feature space in which the presence of a gravitational force is simulated. The proposed approach is able to discover complex cluster shapes, not limited only to spherical ones. It automatically determines the number of clusters in the data and can handle noise and outliers. Experiments with synthetic and real data were conducted to show the performance of the presented method in comparison with two other well-known clustering techniques.
Key words: clustering, data mining, self-organizing map, gravitational force.
1 Introduction
Clustering is the process of organizing data into clusters or natural groups such that data points assigned to the same cluster have high similarity, while the similarity between points assigned to different clusters is low. Unlike in classification, the groups are not predefined and the input data points are unlabelled. Clustering has been widely used in data mining, feature extraction, pattern recognition, function approximation, vector quantization and image segmentation [1].
In this paper, we present gSOM, a new clustering algorithm based on two established data analysis approaches: we combine the self-organizing map with the gravitational clustering approach, exploiting the advantages of both.
The self-organizing map (SOM), proposed by Kohonen [2], is a type of artificial neural network where a set of neurons is arranged in a low-dimensional structure, usually a 2-D grid. Through unsupervised training, each neuron or prototype is attached to a feature vector of the same dimension as the input space, considering the neighborhood relations between neurons. Therefore, the SOM preserves topological relations between points in the input space. By assigning each input data point to the neuron with the nearest feature vector, the SOM is able to divide the input space into regions with a common nearest feature vector, also called the best matching unit (BMU). Several attempts have been made to
cluster data with the SOM using a two-level approach. In the first level, the SOM is used to form a feature map of neurons, which are in the second level divided into as many regions as the predefined number of clusters. Each input data point can then be assigned to a cluster according to its nearest neuron. This approach has been addressed in [3] and [4]. In [3], the SOM is clustered using hierarchical clustering, where the merging criterion is based on inter-cluster and intra-cluster density and inter-cluster distances. In [4], a comparison was made between classical clustering methods and those based on the SOM, showing the greatly reduced computational complexity of the latter. Both hierarchical and partitional approaches were involved.
As mentioned, the SOM is combined with gravitational clustering, which assumes that every point in the data set can be viewed as a mass particle in the input space. If a gravitational force between points exists, they begin to move towards each other with respect to mass and distance. This nature-inspired notion was first adopted in the algorithm proposed by Wright [5]. Several other papers extend his work, proposing an algorithm robust to noise [6] and a solution for handling overlapping clusters with a fuzzy model [7].
In this paper we present a new algorithm based on both the SOM and gravitational clustering. The clustering is carried out in a two-level process: in the first level, the SOM is applied to the input data and best matching units, i.e. prototypes, are obtained. In the second level, each prototype acts as a unit-mass point in a feature space in which we simulate the presence of a gravitational force to group similar prototypes together.
The paper is organized as follows. In Section 2, the proposed algorithm, called gSOM, is described. Section 3 presents the performance of gSOM in comparison with the K-means and EM GMM algorithms on some synthetic and real data sets, along with a discussion. Finally, the conclusion is drawn in Section 4.
2 Proposed algorithm - the gSOM
We developed a clustering algorithm based on a two-level approach, depicted in Figure 1. First, a set of prototypes is created using the SOM as a vector quantization method. The nearest data points are attached to each prototype, which thus becomes a small, first-level cluster. There are a few times more prototypes than the expected number of clusters. In the next step, the prototypes are treated as objects in a feature space, where a force of gravity moves them towards each other. When two prototypes are close enough, they merge into a single prototype with mass unchanged, for reasons explained later in this section.
The main benefit of using the SOM in the first level of the proposed algorithm is to obtain topologically ordered representatives of the data. The prototypes are connected with each other in a grid, and the neighbors of each of them are known. We use this information to bound the influence of the gravitational field around a prototype and therefore stabilize the clustering process in the second step. Another advantage of the SOM is the reduction of noise, which alleviates the problem with outliers. The prototypes are local averages of the data and therefore less sensitive to random
variations than the original data. Finally, the SOM reduces the computational complexity of clustering a data set, especially when the number of input points is huge, as shown in [4].
Fig. 1. Two-level approach scheme. Data points in a) are mapped to the SOM prototypes in b). BMUs are then merged together by the gravitational clustering algorithm, as shown in c).
2.1 The first level - training the SOM
The SOM is a regular two-dimensional grid of M map units, each of them represented by a prototype vector m_i = [m_i1, ..., m_id], where d is the dimension of the input vectors. The units are connected to adjacent ones by a neighborhood relation. The SOM can be viewed as an elastic net that folds onto the shape of the input data during training. Data points lying near each other in the input space are mapped onto nearby map units, thus preserving topology.
The SOM is trained iteratively in a "batch" mode, where the whole data set is presented to the map before any adjustments are made. In each training step, the data set is partitioned according to the Voronoi regions of the map units. In simple terms, each data point x_j belongs to the map unit to which it is closest.
After this, the new map units are calculated as:
m_i(t+1) = \frac{\sum_{j=1}^{n} h_{ic(j)}(t) \, x_j}{\sum_{j=1}^{n} h_{ic(j)}(t)},    (1)
where c(j) is the best matching unit (BMU) of data point x_j and n is the number of points in the data set. The new value of map unit m_i is a weighted average of the data points, where the weight of each data point is the neighborhood kernel value h_ic(j)(t) centered on its BMU c(j). The neighborhood kernel function has a width defined by a parameter σ that decreases monotonically with time.
Before training, the map has to be initialized. Linear initialization is used to fold the map into the subspace spanned by the two eigenvectors with the greatest eigenvalues computed from the input data. The maps are then trained in two
phases: a rough phase with a larger initial neighborhood width σ1 and a fine-tuning phase with a smaller initial neighborhood width σ2. For training the SOM, we use a toolbox for MATLAB - the SOM Toolbox, available under the GNU General Public License at http://www.cis.hut.fi/somtoolbox/.
As the result of the first level of the algorithm, we obtain the SOM prototypes, which represent the original data. Interpolating prototypes - those that are not the BMU of any data point - are eliminated. Only BMUs are taken into the second level, each representing the portion of the original data points attached to it.
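A minimal sketch of this filtering step (our own illustration, with hypothetical names): keep only the prototypes that serve as the BMU of at least one data point, together with the sizes of their first-level clusters.

```python
import numpy as np

def extract_bmus(X, M):
    """Drop interpolating prototypes; return BMUs and their cluster sizes."""
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    bmu = d2.argmin(axis=1)                    # BMU index of each data point
    units, counts = np.unique(bmu, return_counts=True)
    return M[units], counts                    # only prototypes that represent data
```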
2.2 The second level - applying the gravity
The BMUs identified in the first level of the algorithm are now interpreted as d-dimensional particles in the d-dimensional space with mass equal to 1. During the iterations of the algorithm, each particle is moved around according to a simplified version of the Law of Gravitation combined with Newton's second law of motion, as proposed in [6]. The new position of point x influenced by the gravity of point y is:
x(t+1) = x(t) + \frac{G}{\lVert d \rVert} \cdot \frac{d}{\lVert d \rVert^2},    (2)
where d is the vector pointing from point x to point y (so ||d|| is the Euclidean distance between them) and G is the gravitational constant. In order to avoid the big-crunch effect, where all points collapse into a single point, G is reduced in each iteration by a proportion ΔG, so G = (1 − ΔG) · G. When two points are close enough, i.e. ||d|| is lower than a parameter α, they are merged into a single point with mass still equal to 1. This may seem strange, but in this way clusters with greater density do not affect those with smaller density. The experiments presented in Section 3 show that such an approach is beneficial. Obviously, the number of points decreases over the iterations when an appropriate G is chosen. In each iteration of the algorithm, every point x in the remaining set of points, denoted P, is visited once. We then choose another point y and move them according to Equation 2. Point y is either one of the neighbors of x or a random point: with probability p, which is a parameter of the algorithm, a random point is selected; otherwise one of the existing neighbors of x is chosen. The algorithm stops when G has been reduced enough that the movements of points fall below some threshold value. Alternative stopping criteria are reached when only a single point remains in P or when a predefined maximum number of iterations is exceeded. The points in the set P are the final cluster representatives, as illustrated in Figure 1c). The number of finally discovered clusters obviously depends on the data set itself and on the values of the parameters mentioned above.
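Putting the above together, here is a compact Python sketch of the second level. It follows the paper's parameters (G, ΔG, α, p), but simplifies one detail: the "neighbor" of a point is taken to be its nearest remaining point rather than a SOM grid neighbor, so this is an approximation of gSOM's second level, not a faithful re-implementation.

```python
import numpy as np

def gravity_level(P, G=8e-4, dG=0.045, alpha=0.01, p=0.1,
                  move_tol=1e-6, max_iter=1000, seed=0):
    """Move unit-mass points under Eq. (2), merging pairs closer than alpha."""
    rng = np.random.default_rng(seed)
    P = P.astype(float).copy()
    for _ in range(max_iter):
        if len(P) == 1:
            break
        largest_move = 0.0
        i = 0
        while i < len(P):
            d2 = ((P - P[i]) ** 2).sum(axis=1)
            d2[i] = np.inf
            # random partner y with probability p, otherwise the nearest point
            j = int(rng.integers(len(P))) if rng.random() < p else int(d2.argmin())
            if j != i:
                delta = P[j] - P[i]
                dist = float(np.linalg.norm(delta))
                if dist < alpha:
                    P = np.delete(P, j, axis=0)   # merge; mass stays 1 by design
                    if j < i:
                        i -= 1                    # indices shifted after delete
                else:
                    step = (G / dist) * (delta / dist ** 2)   # Eq. (2)
                    P[i] += step
                    largest_move = max(largest_move, float(np.linalg.norm(step)))
            i += 1
        G *= (1.0 - dG)                # shrink G to avoid the big crunch
        if largest_move < move_tol:    # movements below threshold: stop
            break
    return P                           # remaining points = cluster representatives
```

In gSOM itself, each data point would finally receive the label of the representative into which its BMU was merged.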
3 Experiments and results
In the experiments, five synthetic (a-e) and two real (f and g) data sets were used, demonstrating the capabilities of the proposed clustering algorithm in comparison with the well-known K-means and Expectation-Maximization Gaussian Mixture
Model (EM GMM) algorithms [8]. Table 1 summarizes the properties of each data set, and their plots are presented in Figure 2, showing the data already clustered by the gSOM algorithm. Note that data sets f) and g) are plotted using the PCA (principal component analysis) projection due to their higher dimensionality. All variables in each data set were linearly scaled to fit in the range [0, 1] before clustering was carried out.
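The rescaling step can be written in a couple of lines; this generic min-max sketch is ours and not tied to any particular toolbox:

```python
import numpy as np

def minmax_scale(X):
    """Linearly scale each variable (column) of X into the range [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)  # guard constant columns
```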
3.1 Data sets
a) Data set Giant consists of 862 2-D points and has two clusters: one small spherical cluster on the right side (10 points) and one huge spherical cluster (852 points) on the left side. The much greater density of the leftmost cluster, compared to the other one, is the difficulty here, leading algorithms to split the giant instead of finding the dwarf.
b) Data set Ring consists of 800 2-D points forming two concentric rings, each containing 400 points. Non-linearity and sophisticated connectivity are present here to challenge the methods.
c) Data set Wave was generated to measure the algorithms' performance on highly irregular, longitudinal and linearly non-separable clusters. The 2-D data consists of 148 points in the upper and 145 points in the lower wavy curve.
d) Data set Moon is another problem domain with linearly non-separable clusters. Here, four clusters are defined, containing 104, 150, 150 and 110 2-D points, from the topmost to the lowermost cluster, respectively.
e) Data set Flag consists of 640 points that form three clusters. The spherical cluster in the middle contains 100 2-D points; the clusters above and beneath it contain 270 points each. We created this domain to see how the methods deal with different cluster shapes.
f) The Iris data set [10] has been widely used in classification tasks. It has 150 points of four dimensions, divided into three classes of iris plant with 50 points each. The first class is linearly separable from the other two. The second and the third classes overlap and are linearly non-separable.
g) The Wine data set [10] has 178 13-D points with three known classes of wines derived from three different cultivars. The numbers of data points in the classes are 59, 71 and 48, respectively.
Table 1. Data sets used for performance measurements. The number of clusters is given as the ground truth determined by a human.
data set   points  dimensions  clusters
a) Giant      862           2         2
b) Ring       800           2         2
c) Wave       293           2         2
d) Moon       514           2         4
e) Flag       640           2         3
f) Iris       150           4         3
g) Wine       178          13         3
Fig. 2. Data sets used in the experiments. The results of clustering with gSOM are displayed using different marker shapes and colors for different clusters.
3.2 Parameter settings
As described in Section 2, eight parameters need to be set for the gSOM algorithm to work, which is the main disadvantage of the proposed approach. Fortunately, experiments on parameter sensitivity show that constant or automatically selected values can be used for the majority of them. Table 2 displays the experimentally selected parameter values for each problem domain. The size of the SOM was automatically determined for all data sets, except the one called Moon, using the following heuristic: the number of prototypes is M = 5 · n^{1/2}, where n is the number of points in the data set. The ratio between the numbers of prototypes along the two dimensions of the 2-D grid is the square root of the ratio between the two largest eigenvalues of the data in the input space. The lattice that defines the neighborhood of the prototypes is chosen to be rectangular, except in the case of the Wine data set, where a hexagonal one is used. Based on our work, we suggest the following parameter values for the second-level clustering of unknown data: G = 0.0008, ΔG = 0.045, α = 0.01 and p = 0.1. The maximal allowed number of iterations for EM GMM and K-means was 30 and 100, respectively, in order to assure convergence.
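A sketch of this sizing heuristic in Python (our own reading of the description above; the rounding policy is an assumption, so the resulting grids may differ slightly from those in Table 2):

```python
import numpy as np

def som_grid_size(X):
    """Heuristic map size: M = 5 * n**0.5 units, side ratio from eigenvalues."""
    n = len(X)
    M = 5.0 * np.sqrt(n)
    # eigenvalues of the data covariance matrix, in ascending order
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    ratio = np.sqrt(eigvals[-1] / max(eigvals[-2], 1e-12))
    rows = max(1, int(round(np.sqrt(M * ratio))))   # longer side of the grid
    cols = max(1, int(round(M / rows)))             # keep rows * cols close to M
    return rows, cols
```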
3.3 Comparison and evaluation
Clustering of the seven data sets was performed by the proposed gSOM and two other well-known algorithms, EM GMM and K-means. The results were compared with the predefined classes, i.e. the ground truth, and the clustering error was computed - its minimal, maximal and mean value over 100 runs of each algorithm.
Table 2. Selected values of the parameters for the gSOM algorithm. SOM size is the size of the 2-D grid of prototypes; lattice denotes their neighborhood relations (rectangular or hexagonal); σ1 and σ2 are the initial neighborhood widths for the rough and fine-tuning phases of the SOM; G is the initial gravitational constant; ΔG the reduction proportion of G; α the merging distance; and p the probability of choosing a random point instead of a neighbor.
data set   SOM size  lattice  σ1  σ2  G        ΔG     α     p
a) Giant   14 × 10   rect      2   1  0.00010  0.020  0.01  0.0
b) Ring    13 × 11   rect      2   1  0.00085  0.040  0.01  0.1
c) Wave    11 × 8    rect      2   1  0.00080  0.045  0.01  0.1
d) Moon    20 × 10   rect      3   1  0.00080  0.045  0.01  0.1
e) Flag    17 × 8    rect      3   1  0.00080  0.045  0.01  0.1
f) Iris    13 × 5    rect      2   1  0.00080  0.045  0.01  0.1
g) Wine    11 × 6    hexa      2   1  0.00085  0.030  0.07  0.2
Clustering error is a performance criterion defined as the proportion of wrongly labeled points compared to the true class information [9]. The average running time was also measured. The results of the experiments are collected in Table 3.
Considering the minimal and mean clustering errors on data sets a-e, it is clear that gSOM outperformed both EM GMM and K-means. More concerning are the relatively high deviation of its results, which points to unstable behavior in the second level of gSOM, and the longer execution time, by a factor of ten approximately. The latter is an effect of using a two-level algorithm on relatively small data sets. However, only gSOM was able to perfectly cluster the complex data shapes given in sets c) and d), which is quite a success. It was less successful when it came to the real data sets Iris and Wine (sets f and g), where the other two algorithms were slightly better.
4 Conclusion
A new two-level clustering algorithm, gSOM, was presented in this paper. According to the results of the experiments, the performance of the presented method is acceptable, and the method is useful for discovering clusters in complex, linearly non-separable data sets with arbitrary shapes. It also automatically determines the number of clusters in the data and is robust to noise and outliers, thus fully exploiting the advantages of a two-level approach. Our further work will concentrate on improving the algorithm's speed by employing more efficient data structures such as graphs. Furthermore, the relations between the parameters, especially G and ΔG, have to be studied in order to set them automatically according to the features of the input data. Additional comparisons with other two-level methods will be made; preliminary experiments have already shown some promising results.
Table 3. Performance comparison of the gSOM, EM GMM and K-means algorithms.

                        clustering error               elapsed
data set   algorithm    max    min    mean   std       time (s)
a) Giant   gSOM         0.466  0.000  0.041  0.0942    0.3200
           EM GMM       0.095  0.000  0.080  0.0128    0.0342
           K-means      0.463  0.000  0.442  0.0902    0.0076
b) Ring    gSOM         0.388  0.230  0.339  0.0267    0.2598
           EM GMM       0.500  0.498  0.500  0.0007    0.0314
           K-means      0.500  0.498  0.500  0.0006    0.0315
c) Wave    gSOM         0.495  0.000  0.215  0.1909    0.1537
           EM GMM       0.498  0.259  0.492  0.0334    0.0206
           K-means      0.457  0.160  0.422  0.0805    0.0081
d) Moon    gSOM         0.494  0.000  0.058  0.1077    0.4350
           EM GMM       0.523  0.274  0.418  0.0579    0.0390
           K-means      0.502  0.502  0.502  0.0000    0.0151
e) Flag    gSOM         0.000  0.000  0.000  0.0000    0.2496
           EM GMM       0.366  0.000  0.114  0.1634    0.0201
           K-means      0.348  0.000  0.228  0.1325    0.0069
f) Iris    gSOM         0.333  0.033  0.253  0.0890    0.1336
           EM GMM       0.493  0.033  0.106  0.1487    0.0179
           K-means      0.113  0.113  0.113  0.0000    0.0063
g) Wine    gSOM         0.404  0.056  0.209  0.0770    0.1852
           EM GMM       0.461  0.017  0.064  0.0957    0.0268
           K-means      0.051  0.045  0.047  0.0029    0.0110
References
1. Dunham, M.H.: Data Mining: Introductory and Advanced Topics. Prentice Hall (2003)
2. Kohonen, T.: Self-Organizing Maps. Springer (2001)
3. Wu, S., Chow, T.W.: Clustering of the self-organizing map using a clustering validity index based on inter-cluster and intra-cluster density. Pattern Recognition 37, 175–188 (2004)
4. Vesanto, J., Alhoniemi, E.: Clustering of the Self-Organizing Map. IEEE Trans. on Neural Networks 11(3), 586–600 (2000)
5. Wright, W.E.: Gravitational clustering. Pattern Recognition 9, 151–166 (1977)
6. Gomez, J., Dasgupta, D., Nasraoui, O.: A new gravitational clustering algorithm. In: Proceedings of the Third SIAM International Conference on Data Mining (2003)
7. Orhan, U., Hekim, M., Ibrikci, T.: Gravitational Fuzzy Clustering. In: Gelbukh, A., Morales, E.F. (eds.) MICAI 2008. LNAI, vol. 5317, pp. 524–531. Springer, Heidelberg (2008)
8. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
9. Meilă, M., Heckerman, D.: An Experimental Comparison of Model-Based Clustering Methods. Machine Learning 42, 9–29 (2001)
10. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html