August 15, 2013
10:21
International Journal of Geographical Information Science
cng
International Journal of Geographical Information Science
Vol. 00, No. 00, Month 200x, 1–17
RESEARCH ARTICLE
Contextual Neural Gas for Spatial Clustering and Analysis
Julian Hagenauer and Marco Helbich
(Received 00 Month 200x; final version received 00 Month 200x)
This is an Author’s Original Manuscript of an article whose final and definitive form,
the Version of Record, has been published in the International Journal of Geographical
Information Science [26 Apr 2012] [copyright Taylor & Francis], available online at:
http://www.tandfonline.com/doi/abs/10.1080/13658816.2012.667106.
This study aims to introduce and discuss contextual Neural Gas (CNG), a variant of the Neural Gas algorithm, which explicitly accounts for spatial dependencies within spatial data. The main idea of the CNG is to map spatially close
observations to neurons, which are close with respect to their rank distance.
Thus, spatial dependency is incorporated independently from the attributes of
the data and the process of incorporation is less sensitive to local variations
in the input data. To discuss and compare the performance of the CNG and
GeoSOM, this study draws on a series of experiments, which are based on two
artificial and one real-world dataset. The experimental results of the artificial
datasets show that the CNG produces more homogenous clusters, a better ratio
of positional accuracy, and a lower quantization error than the GeoSOM. The
results of the real-world dataset illustrate that the resulting patterns of the
CNG are theoretically more sound and coherent than the GeoSOM’s, which
emphasizes its applicability for geographic analysis tasks.
Keywords: Machine Learning; Self-Organizing Maps; Spatial Cluster Analysis
1. Introduction
The increasing pervasiveness of digital technologies for aggregating, storing, and sharing spatial data (e.g., sensor networks, volunteered geoinformation) has resulted in vast
amounts of spatial data (Miller 2010). The wealth of spatial data cannot be fully tapped,
if implicit knowledge in the data is neglected (Miller and Han 2009). To obtain a comprehensive understanding of spatial processes in particular, it is necessary to extract
simplified and generalized knowledge from spatial data, considering the unique properties of space. The nontrivial process of identifying valid, novel, potentially useful, and
Corresponding author. E-mail: [email protected]
ISSN: 1365-8816 print/ISSN 1362-3087 online
c 200x Taylor & Francis
DOI: 10.1080/1365881YYxxxxxxxx
http://www.informaworld.com
ultimately understandable patterns in data is termed knowledge discovery in databases
(KDD; Fayyad et al. 1996). Geographic knowledge discovery (GKD) is a subdomain of KDD, which
accounts for the unique requirements of spatial data, information, and knowledge (Miller
2010). A central component in the process of GKD is spatial data mining (SDM). SDM
discovers previously unknown patterns and relationships in large spatial databases. For
this purpose, low-level algorithms are usually applied under human control (Klösgen and
Zytkow 1996). Common tasks in SDM include spatial classification and prediction, spatial association rule mining, geovisualization, and spatial clustering (Guo and Mennis
2009). This paper focuses on the latter. Spatial clustering is the organization of spatial
objects into clusters such that objects in the same cluster are similar and objects in
different clusters are unlike each other (Miller 2010).
Spatial data is often subject to spatial autocorrelation, which is a fundamental concept in spatial sciences (Sui 2004), stating that nearby entities are likely to share more
similarities than ones that are far apart (Tobler 1970). Spatial autocorrelation implies
several important consequences for traditional statistical analysis, such as lack of sample independence and the loss of degrees of freedom in test statistics, among others
(Bailey and Gatrell 1995). These consequences carry a high risk of leading to an erroneous
understanding of spatial relationships, processes, and patterns (Openshaw 1999). Therefore, the presence of spatial autocorrelation makes the use of conventional data mining
algorithms (e.g., classification and clustering algorithms), which assume independence
between observations, unfavorable.
The self-organizing map (SOM; Kohonen 1982, 2001) is a flexible artificial neural network algorithm. The learning process of the SOM is competitive and unsupervised. The
SOM maps multi-dimensional input data onto a lower-dimensional map, while the topology of the input space is mostly preserved, which is especially appealing for visualization
(Vesanto 1999). These capabilities have made the SOM popular in GIScience (e.g. Openshaw 1994, Hewitson and Crane 2002, Skupin and Fabrikant 2003, Bação et al. 2005,
Andrienko et al. 2010, Arribas-Bel et al. 2011, Hagenauer et al. 2011). Besides visualization and analysis, SOMs can also be used on their own for clustering data
(Flexer 2001).
A focus of spatial analysis is location and distance relationships (Anselin 1989). For this
study it is assumed that the input vectors and neurons feature geographic coordinates.
These coordinates permit the calculation of spatial distances. Following the assumption,
this study further distinguishes between spatial proximity and similarity in input space,
in which the spatial distances are disregarded.
Common clustering algorithms, including the SOM, do not account for the special
properties of spatial data. A convenient approach for clustering spatial data is to measure similarity between observations as a weighted sum of similarity in input space and
spatial proximity (see e.g. Jiang 2004). However, weighting assumes comparability between spatial location and attribute data, which cannot be expected due to the distinct
nature of space (Anselin 1989).
The GeoSOM (Bação et al. 2005), a spatial variant of the SOM algorithm, maps
spatially close observations to neurons which are close with respect to their distance on
the map. Because the map is an abstraction of the input space and independent of the
actual attribute values, the GeoSOM does not require weighting and scaling of spatial
proximity and attribute similarity. Further, because the map is represented by a set
of local averages of the input data, the process of autocorrelation incorporation is less
sensitive to random variations. The low-dimensional map representation of the GeoSOM
is also beneficial for visualization and analytical tasks. However, dimension reduction and
topology preservation complicate the clustering and quantization of high-dimensional
data and thus decreases the quality of the mapping.
Inspired by the SOM, Martinetz and Schulten (1991) proposed the Neural Gas (NG)
algorithm. In contrast to SOM, NG does not define a topology on the output space nor
does it reduce the dimensionality of the input space, making it a superior alternative to
SOM for clustering and vector quantization tasks (Martinetz et al. 1993). However, NGs
have only rarely been applied in GIScience.
This paper proposes to adapt the NG algorithm for spatial clustering, explicitly accounting for spatial autocorrelation, and demonstrates the advantages of the algorithm
in comparison to the GeoSOM for cluster analysis, vector quantization and practical applications. This paper will refer to the proposed variant as contextual Neural Gas (CNG),
since it maps input samples with respect to their context.
Section 2 briefly introduces the SOM, the GeoSOM, and the NG algorithms. Section
3 discusses the CNG for clustering spatial data. In section 4 the capability of CNG and
its advantages for spatial clustering, in comparison to the GeoSOM, are demonstrated
by means of two artificial and one real-world dataset. Finally, in section 5, the paper
concludes with some remarks and identifies future work.
2. Methodological background
2.1. Self-organizing map
The SOM (Kohonen 1982, 2001) consists of an arbitrary number of neurons, which are
connected to adjacent neurons by a neighborhood relation. Each neuron k is represented
by an n-dimensional prototype vector $m_k$, where n is the dimension of the input space.
During each learning step t, an input vector x is selected from the set of input vectors
and the nearest neuron c of the map, commonly termed the best matching unit (BMU),
is determined by:
$$c = \arg\min_i \left( \lVert x - m_i \rVert_1 \right),$$
where $\lVert \cdot \rVert_1$ denotes the distance in input space, most commonly the Euclidean distance.
The BMU and its neighboring neurons on the map are moved towards the presented input
vector as follows:
$$m_k(t+1) = m_k(t) + \alpha(t)\, h_{c,k}(t)\, (x - m_k(t)),$$
where $\alpha(t)$ is the learning rate and $h_{c,k}(t)$ the neighborhood kernel. The learning rate
controls the magnitude of the neurons' adaptation. The neighborhood kernel determines the
influence of the BMU on the neighboring neurons, taking into account its distance to
neuron k on the map. A common neighborhood function is the Gaussian kernel function:
$$h_{c,k}(t) = \exp\left( -\frac{\lVert c - k \rVert_2^2}{2\delta(t)} \right),$$
where $\lVert \cdot \rVert_2$ is the distance in the map, and $\delta(t)$ a time-dependent parameter reducing
the width of the kernel during the learning process. Both the learning rate and the neighborhood kernel decrease monotonically over time. Thus, the learning process gradually
shifts from an initial rough phase, in which the neurons are coarsely arranged on the
map, to a fine-tune phase, in which only small changes to the map are made.
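The learning step described above can be illustrated with a minimal Python sketch. It assumes NumPy, Euclidean distances in both the input space and the map, and the Gaussian kernel given above. The function name `som_step` and the array layout (one row per neuron) are illustrative choices and are not taken from the original study.

```python
import numpy as np

def som_step(prototypes, grid, x, alpha, delta):
    """One SOM learning step (sketch): find the BMU, then pull neurons towards x.

    prototypes: (K, n) prototype vectors m_k
    grid:       (K, 2) map positions of the neurons
    x:          (n,)   presented input vector
    alpha:      learning rate alpha(t)
    delta:      kernel width parameter delta(t)
    """
    # BMU c: the neuron whose prototype is nearest to x in input space.
    c = int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))
    # Squared map distance between the BMU and every neuron k.
    d2 = np.sum((grid - grid[c]) ** 2, axis=1)
    # Gaussian neighborhood kernel h_{c,k}(t) = exp(-d2 / (2 * delta)).
    h = np.exp(-d2 / (2.0 * delta))
    # Update rule: m_k(t+1) = m_k(t) + alpha * h_{c,k} * (x - m_k(t)).
    updated = prototypes + alpha * h[:, None] * (x - prototypes)
    return updated, c
```

In a full training loop, `alpha` and `delta` would be decreased monotonically over the iterations, which produces the shift from the rough phase to the fine-tuning phase.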
Numerous modifications and extensions of the basic SOM algorithm exist (e.g. Kangas
et al. 1990, Fritzke 1994). The following section explains in detail the modifications which
are most relevant to this study.
2.2. Kangas map and GeoSOM
The Kangas Map (KM; Kangas 1992) is a modification of the SOM for temporal sequence
processing. The BMU is chosen from a subset of the previously found BMU’s neighboring
neurons. Hence, the temporal sequence of input signals determines the mapping. Since
temporally close observations are thus mapped to close regions on the map, the KM
accounts for temporal autocorrelation within the input data.
The GeoSOM, introduced by Bação et al. (2005), is an application of the KM to
spatial observations, in which proximity is considered spatially instead of temporally.
The GeoSOM can be used for determining spatially coherent and homogenous clusters.
In analogy to KMs, the search for a BMU is performed in two steps. First, the neuron
in the output space is determined, which is spatially closest to the input vector:
$$c_{geo} = \arg\min_i \left( \lVert x - m_i \rVert_{geo} \right),$$
where $\lVert \cdot \rVert_{geo}$ denotes the spatial distance. Second, within a fixed radius $k$ of the spatially
closest neuron, the final BMU is searched. Therefore, a set $S_k$ of neurons around $c_{geo}$
with a maximum distance of $k$ is determined:
$$S_k := \{ i : \lVert i - c_{geo} \rVert_2 \le k \},$$
where $\lVert \cdot \rVert_2$ is the distance in the map. The final BMU is then selected from the set $S_k$ by:
$$c = \arg\min_{i \in S_k} \left( \lVert x - m_i \rVert_1 \right),$$
where $\lVert \cdot \rVert_1$ commonly denotes the distance in input space, ignoring spatial locations.¹
The radius k enforces the adaptation of neurons to the input vectors that are spatially
close. If k = 0, then the BMU is necessarily the spatially closest neuron; hence, the
attribute values of the input vector are neglected for the BMU search. If k is increased,
then less spatial autocorrelation is incorporated, because neurons which are more distant
to the spatially closest neuron are also considered for the final BMU search. If k is larger
than or equal to the maximum distance of neurons in the grid, then the spatial locations
of observations are completely neglected, because every neuron on the map, regardless of
its distance to the spatially closest neuron, is a BMU candidate.
¹ Bação et al. (2004) suggest including the weighted spatial distance for the final BMU search to obtain a continuum
of models between the GeoSOM and the basic SOM. However, for this it is necessary to determine adequate weights
for hardly comparable dimensions.
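The two-step BMU search of the GeoSOM can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Bação et al.: it assumes NumPy, Euclidean distances throughout, and an inclusive radius bound so that k = 0 keeps only the spatially closest neuron, as the text describes. The function name `geosom_bmu` and the split of each prototype into a coordinate part and an attribute part are hypothetical.

```python
import numpy as np

def geosom_bmu(coords, attrs, grid, x_coord, x_attr, k):
    """Two-step GeoSOM BMU search (sketch).

    coords: (K, 2) geographic coordinates of the neurons
    attrs:  (K, n) attribute parts of the prototype vectors
    grid:   (K, 2) positions of the neurons on the map
    k:      radius on the map
    """
    # Step 1: the spatially closest neuron c_geo.
    c_geo = int(np.argmin(np.linalg.norm(coords - x_coord, axis=1)))
    # Step 2: candidate set S_k, i.e. neurons within map distance k of c_geo.
    # The inclusive bound is an assumption; it guarantees c_geo is a candidate.
    map_dist = np.linalg.norm(grid - grid[c_geo], axis=1)
    candidates = np.flatnonzero(map_dist <= k)
    # Final BMU: the candidate most similar to x in attribute space.
    d_attr = np.linalg.norm(attrs[candidates] - x_attr, axis=1)
    return int(candidates[np.argmin(d_attr)])
```

With k = 0 the function returns the spatially closest neuron, and with a radius that covers the whole map it reduces to the BMU search of the basic SOM on the attributes.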
Because the map topology of SOMs is mostly two-dimensional, the size of the set $S_k$
increases quadratically with radius k, thereby permitting only a rough control of autocorrelation incorporation into the algorithm. Furthermore, the effects of the GeoSOM’s
radius do not depend merely on the size of the radius but also on the size, dimension,
and shape of the map. Having a fixed radius n/2 and a GeoSOM with a quadratic map
of n × n neurons, no autocorrelation is incorporated into the clustering process because
the area which the radius defines covers the whole map. Increasing the size of the map
gradually increases the importance of spatial location. The same effect occurs if the radius is decreased. Therefore, the ratio between radius and side matters for incorporation
of autocorrelation in quadratic maps. The relationship between map size and radius is
more complex in non-quadratic maps. Additionally, the borders of the map induce edge
effects. The areas defined by the radius around neurons at the border generally cover fewer
neurons than the areas around neurons located in the center of the map. Therefore, at
the border of the map, the spatially closest neuron $c_{geo}$ has a greater chance of being
selected as the final BMU, which results in a stronger incorporation of autocorrelation
at the border of the map.
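The growth of the candidate set with the radius and the edge effect can be made concrete with a small sketch. It counts, for every neuron of a map, how many neurons lie within a given radius. The sketch assumes NumPy, a square grid with unit spacing and Euclidean map distance, which is a simplification (the experiments in this paper use a hexagonal grid); the function name `candidate_counts` is illustrative.

```python
import numpy as np

def candidate_counts(n, k):
    """Number of BMU candidates within map radius k for each neuron of an n x n grid."""
    # Grid positions of all n * n neurons.
    grid = np.array([[i, j] for i in range(n) for j in range(n)], dtype=float)
    # Pairwise Euclidean distances on the map.
    d = np.linalg.norm(grid[:, None, :] - grid[None, :, :], axis=2)
    # Count neurons within radius k (inclusive), reshaped to the map layout.
    return (d <= k).sum(axis=1).reshape(n, n)

counts = candidate_counts(5, 2)
print(counts[2, 2], counts[0, 0])  # center neuron: 13 candidates, corner neuron: 6
```

On this 5 × 5 example, a corner neuron has fewer than half as many candidates as the center neuron for the same radius, so the spatially closest neuron is more likely to be chosen as the final BMU at the border.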
2.3. Neural gas
The NG (Martinetz and Schulten 1991) is closely related to the principles of the SOM.
However, it provides non-linear mapping from a high-dimensional input space to an
output space of the same dimension. The NG consists of an arbitrary number of neurons
n which are, in contrast to SOMs, not connected by any neighborhood relation. The
neurons are dynamically distributed within the input space during the training process
in analogy to physical gas. This distribution is not subject to any topological restrictions.
Each neuron $k$ is represented by a prototype vector $m_k$. During each learning step $t$,
an input pattern $x$ is selected from the set of input vectors and the neurons are moved
in the direction of the input vector.
Let $i_k$ denote the number of neurons in the set $N$ that are closer to $x$ than $m_k$:
$$i_k(N) = \left| \{ j \in N : \lVert x - m_j \rVert_p < \lVert x - m_k \rVert_p \} \right|,$$
where $\lVert \cdot \rVert_p$ is some distance function, mostly the Euclidean metric. $i_k(N)$ defines a
rank order of the neurons in $N$. Then, each neuron in $M$ is moved towards the presented
input vector $x$ with respect to its rank order $i_k(M)$ as follows:
$$m_k(t+1) = m_k(t) + \epsilon(t)\, e^{-i_k(M)/\lambda(t)}\, (x - m_k(t)),$$
where $\epsilon(t)$ is the adaptation rate, and $\lambda(t)$ the range of neighbors to be adapted. Both
parameters commonly decrease with time. Common choices for the parameters are:
$$\lambda(t) = \lambda_{init} \left( \lambda_{final} / \lambda_{init} \right)^{t/t_{max}},$$
$$\epsilon(t) = \epsilon_{init} \left( \epsilon_{final} / \epsilon_{init} \right)^{t/t_{max}}.$$
After a sufficient number of adaptation steps, the neurons cover the input space with
minimum quantization error. The adaptation step of the NG can be interpreted as a
gradient descent on a stochastic cost function (Martinetz et al. 1993). Because the NG
adapts all neurons with respect to their rank, which depends on each neuron's distance to the
input pattern, much more robust convergence can be achieved, compared to the k-means
algorithm, which only adapts the closest neuron (Labusch et al. 2009).
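The rank-based adaptation and the exponential decay schedules of the NG can be sketched in a few lines of Python. The sketch assumes NumPy and the Euclidean metric; the function names `decay`, `ng_ranks` and `ng_step` are illustrative and not part of the original algorithm description.

```python
import numpy as np

def decay(v_init, v_final, t, t_max):
    """Exponential schedule v(t) = v_init * (v_final / v_init) ** (t / t_max)."""
    return v_init * (v_final / v_init) ** (t / t_max)

def ng_ranks(prototypes, x):
    """Rank i_k: number of neurons strictly closer to x than neuron k."""
    d = np.linalg.norm(prototypes - x, axis=1)
    # Entry [k, j] is True if neuron j is closer to x than neuron k.
    return (d[None, :] < d[:, None]).sum(axis=1)

def ng_step(prototypes, x, eps, lam):
    """One NG adaptation step: every neuron moves towards x, weighted by its rank."""
    ranks = ng_ranks(prototypes, x)
    return prototypes + eps * np.exp(-ranks / lam)[:, None] * (x - prototypes)
```

A training loop would draw an input vector at each step t, compute `eps = decay(eps_init, eps_final, t, t_max)` and `lam = decay(lam_init, lam_final, t, t_max)`, and then call `ng_step`.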
A disadvantage of the NG is its lack of an output space topology which limits the
application of the NG to data projection and visualization. However, after training the
NG, the neurons can be projected onto a two-dimensional space using multi-dimensional
scaling (Cox and Cox 2001) or Sammon’s mapping (Sammon 1969).
3. Contextual neural gas
CNG combines the concepts of the GeoSOM with the NG algorithm. While the adaptation phase of the neurons is identical in the NG and CNG algorithm, they differ in
their definition of the rank ordering. The CNG incorporates autocorrelation into the rank
ordering in a two-phase procedure, similar to the BMU search of the GeoSOM.
In the first step of the GeoSOM’s BMU search, the spatially closest neuron is identified,
and in the second step, the final BMU within a certain vicinity around the spatially closest
neuron in the grid is determined. The CNG lacks an output space topology. Therefore,
a rank ordering of the neurons is utilized instead.
First, let $i^{geo}_k(M)$ be the spatial rank ordering of the neurons in $M$ with respect to a
spatial distance metric.¹ Using this rank order, a subset $S_l$ is determined, which contains
the $l$ spatially closest neurons:
$$S_l := \{ k \in M : i^{geo}_k(M) < l \}.$$
In the following, the parameter $l$ is referred to as the spatial neighborhood size (SNS). The final
rank order for the adaptation of the neurons is then determined by:
$$i_k(M) = \begin{cases} i'_k(S_l) & \text{if } k \in S_l, \\ i^{geo}_k(M) & \text{otherwise,} \end{cases}$$
where $i'_k(S_l)$ is commonly a rank ordering based on the similarity of the neurons' attribute
data, disregarding their spatial locations. The SNS determines the strength of spatial
autocorrelation incorporated into the mapping process of the CNG. If $l = 1$, then the
set $S_l$ consists only of the spatially closest neuron. Consequently, the final rank ordering
$i_k(M)$ is equal to $i^{geo}_k(M)$, and the adaptation of the neurons is exclusively determined
by the spatial distances of the neurons to the input vector. The attribute values of the
input data are neglected. If $l$ is increased, then the size of $S_l$ increases as well. The
neurons of $S_l$ are reordered according to $i'_k(S_l)$ within their ranks. Because their spatial
distances are thus less relevant for the final ordering, the adaptation of the neurons is
less determined by spatial proximity. Therefore, less autocorrelation is incorporated into
the CNG. If $l = |M|$, then $S_l$ consists of all neurons in $M$. The spatial rank ordering
$i^{geo}_k(M)$ becomes obsolete, because the final ordering of the neurons is solely determined
by the similarity of their attributes. Therefore, no autocorrelation is incorporated into
the CNG at all.
¹ In general, other distance metrics can be used alternatively to incorporate autocorrelation for other dimensions
such as time.
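The two-phase rank ordering of the CNG can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes NumPy, Euclidean distances for both the spatial and the attribute ordering, and no ties among the distances. The function names `rank_order` and `cng_ranks` and the split of each neuron into coordinates and attributes are hypothetical.

```python
import numpy as np

def rank_order(d):
    """Rank of each entry: the number of entries strictly smaller than it."""
    return (d[None, :] < d[:, None]).sum(axis=1)

def cng_ranks(coords, attrs, x_coord, x_attr, l):
    """Final CNG rank ordering i_k(M) for one input vector (sketch).

    coords: (K, 2) geographic coordinates of the neurons
    attrs:  (K, n) attribute parts of the prototype vectors
    l:      size of the subset S_l, with 1 <= l <= K
    """
    # Phase 1: spatial rank ordering of all neurons.
    geo_rank = rank_order(np.linalg.norm(coords - x_coord, axis=1))
    # S_l: the l spatially closest neurons (spatial ranks 0 .. l-1).
    in_s = geo_rank < l
    final = geo_rank.copy()
    # Phase 2: within S_l, reorder by attribute distance. The neurons in S_l
    # keep the rank slots 0 .. l-1; all other neurons keep their spatial rank.
    final[in_s] = rank_order(np.linalg.norm(attrs[in_s] - x_attr, axis=1))
    return final
```

The resulting ranks can then be used in the NG adaptation rule in place of the purely distance-based ranks. For l = 1 the result equals the spatial ordering, and for l equal to the number of neurons it equals the attribute-based ordering.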
Just like for the GeoSOM, it is not necessary to weight or scale spatial proximity and
attribute similarity for the CNG, because spatial proximity is enforced by means of rank
distances, which are independent of the actual attribute values. The neurons of the CNG
are basically local averages. Therefore, the process of incorporating autocorrelation is
also not sensitive to random variations in the input data.
Although the basic principle of the GeoSOM and CNG is comparable, there are many
important differences. Firstly, because the number of different reasonable SNSs equals the
total number of neurons, the CNG permits more fine-grained control over the strength
of autocorrelation incorporated into the algorithm. Secondly, the lack of an output space
topology avoids edge effects that stem from topological restrictions. In CNGs only the
number of neurons and the SNS matter for incorporation of autocorrelation, which eases the interpretation and understanding of the CNG’s results.
However, because common projection methods do not account for the distinct nature
of space, visualization of the CNG is problematic. Therefore, data projection and visualization remain critical tasks. Nevertheless, as spatial data mostly comprise locational
information, geographic maps can visually display the clusters, neurons, and their
properties.
Despite the differences between the GeoSOM and CNG, their application domain is
mainly the same. In particular, both are useful for clustering data where spatial
dependence between observations is present. Therefore, the following section will evaluate
the performance of the GeoSOM and CNG.
4. Experiments for performance evaluation
This section demonstrates the usefulness of the CNG and compares it to the GeoSOM
by means of multiple experiments, which are based on two artificial and one real-world
dataset. The objective of using artificial data is to produce repeatable experiments,
thereby facilitating an understanding of the algorithms’ different features. The real-world dataset demonstrates the extent to which CNG can be used to appropriately
solve practical geoscientific problems. The first experiment compares the vector quantization performance of the CNG and GeoSOM. The second experiment investigates their
clustering performances. The third experiment exemplifies the different clustering results
of the CNG’s and the GeoSOM’s application to the delimitation of urban patterns within
the urban fringe of Vienna, Austria. The following basic parameters of the GeoSOM and
CNG algorithms are applied in the experiments: In both algorithms the number of neurons N is chosen to match the desired number of clusters. The neurons are randomly
initialized and the training time is set to 100,000 iterations. The neighborhood range
parameters of the CNG are set to $\lambda_{init} = N/2$ and $\lambda_{final} = 0.01$. Following Martinetz
et al. (1993), the adaptation rate parameters of the CNG are set to $\epsilon_{init} = 0.5$ and
$\epsilon_{final} = 0.005$. The parameters $\lambda(t)$ and $\epsilon(t)$ are chosen as described in section 2.3. The
map of the GeoSOM is a two-dimensional hexagonal grid. The learning rate of the GeoSOM decreases linearly from 0.5, and the neighborhood function is a Gaussian kernel
with an initial neighborhood radius spanning the entire map (Skupin and Agarwal 2008).
After training, the set of input vectors mapped to each neuron is determined.
The resulting sets represent the clustering result.¹
Figure 1. Cluster map of spatial association. HH is a statistically significant (p < 0.05) cluster
of high values, whereas LL is a statistically significant (p < 0.05) cluster of low values.
4.1. Vector quantization
To evaluate the quantization performance of the GeoSOM and the CNG, they are applied
to an artificially created dataset, which comprises different continuous areas of local
spatial autocorrelation. The dataset consists of 1,000 data points (x, y, z), where x and y
are randomly chosen from the interval [0, 1]. The z-value is calculated by z = x + y. Data
points located near the corners (0, 0) and (1, 1) exhibit high autocorrelation,
whereas data points near the diagonal from (0, 1) to (1, 0) show low autocorrelation.
Figure 1 visually represents the local Moran statistic (Anselin 1995) in the form of a
cluster map, which classifies observations by the type of spatial association. For ease of
visual perception, the data points are represented by their Voronoi areas.
A GeoSOM and a CNG with 25 neurons are trained for each reasonable radius and
SNS, respectively, with this dataset. The quantization error (QE) is a vector quantization
quality measure. It is calculated by determining the mean distance between each sample
and its BMU or cluster center, by which it is represented. In order to compare the
quantization performance of both algorithms with respect to spatial autocorrelation, the
spatial quantization error (SQE) is introduced. The SQE measures the mean spatial
distance between each sample and its BMU or cluster center, disregarding the attribute
values of the samples. The means of the resulting QEs and SQEs for the CNG and
GeoSOM over 100 runs are shown in Figure 2.
¹ It is also possible to cluster the neurons or the map, respectively, in an additional post-processing step (for
SOM see e.g. Murtagh 1995, Vesanto 1999, Ultsch 2005). However, post-processing may alter the results of the
underlying clustering algorithms and thus makes their comparison difficult.
Figure 2. Mean QEs versus mean SQEs for the GeoSOM and CNG. [Plot axes: QE, lQE; series: GeoSOM, cNG]
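The artificial dataset and the two error measures can be sketched in Python as follows. The sketch assumes NumPy and Euclidean distances, and it assumes that the QE is computed on the attribute values only while the SQE is computed on the coordinates only; the text does not state the former explicitly. The function names `make_dataset` and `quantization_errors`, the seed, and the array layout are illustrative.

```python
import numpy as np

def make_dataset(n=1000, seed=0):
    """n points with random coordinates in the unit square and attribute z = x + y."""
    rng = np.random.default_rng(seed)
    coords = rng.random((n, 2))
    attrs = coords.sum(axis=1, keepdims=True)
    return coords, attrs

def quantization_errors(coords, attrs, proto_coords, proto_attrs, assign):
    """Mean attribute distance (QE) and mean spatial distance (SQE).

    assign[i] is the index of the neuron (BMU or cluster center) representing sample i.
    """
    qe = np.linalg.norm(attrs - proto_attrs[assign], axis=1).mean()
    sqe = np.linalg.norm(coords - proto_coords[assign], axis=1).mean()
    return float(qe), float(sqe)
```

Averaging both values over repeated training runs for each radius or SNS would yield one point per parameter setting in a QE versus SQE plot.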
The radius and SNS determine a trade-off between QE and SQE. Figure 2 shows that
the CNG mostly achieves a better trade-off between the two errors. However, if the radius of
the GeoSOM is set to its lowest value, the SQE is minimized and the GeoSOM performs
almost as well as the CNG with respect to both errors. By setting the radius of the
GeoSOM to 0, quantization solely depends on the spatial locations of the input samples
in the two-dimensional plane. Since the dimension of this plane equals the topological dimension of the
GeoSOM, doing so avoids reduction of dimensionality and eases the quantization of
the input data.
Overall, although no clear clustering exists in the artificial dataset, the CNG mostly
outperforms the GeoSOM. The CNG not only quantizes the data with higher spatial
accuracy but also with a lower QE.
4.2. Clustering
To compare the clustering performance of the GeoSOM and CNG an artificial dataset is
created according to the following procedure. First, a set of 5,000 data points (x, y, z)
is created. The spatial location of each data point is represented by x and y, which
are randomly chosen from the interval [0, 1]. Additionally, the Voronoi areas of the data
points are calculated. A seed of 100 data points is randomly selected as initial clusters.
Then, repeatedly, one of the clusters is randomly selected and an arbitrary data point that does
not yet belong to a cluster and whose Voronoi area is adjacent to the selected cluster is assigned to it,
until no unassigned data points are left. After that, the adjacency
graph of the clusters is colored by means of a greedy algorithm, and each color is coded
as an integer value and assigned as the z-value to the data points. Hence, clusters are
created which are spatially distinct, but feature similar z-values. Clustering algorithms
that do not take the spatial locations of the data points into account are thus likely to
fail to delineate the clusters correctly. Figure 3 displays the 25 clusters of this dataset.
The colors represent the z-values.
Figure 3. Dataset consisting of 25 spatially distinct clusters.
The clustering performance of GeoSOM and CNG depends on the choice of parameters.
In particular, the SNS and radius are important, since they determine the degree of spatial dependence incorporated into the clustering process. Therefore, for each reasonable
radius and SNS, a GeoSOM and CNG, respectively, is applied to the dataset. The CNG
and GeoSOM both consist of 25 neurons, one neuron per existing cluster in the dataset.
The resulting clusterings are then evaluated by normalized mutual information (NMI;
Strehl and Ghosh 2003). Mutual information is a symmetric measure that quantifies the
mutual dependence between two distributions (Cover and Thomas 2006), and hence can
be used to assess the similarities of two different clusterings (Strehl and Ghosh 2003).
The NMI has a range with a lower bound of 0 and an upper bound of 1. The NMI
reaches the value of 1, if two clusterings are identical, and 0, if they have no information
in common and thus are totally independent. Figure 4 shows a plot of the mean NMI for
the GeoSOM and CNG with different radii and SNSs, respectively, over 100 runs.
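The NMI between two clusterings can be computed directly from their contingency table. The following sketch assumes NumPy and the normalization by the geometric mean of the two entropies, which is the variant proposed by Strehl and Ghosh; returning 0 when one clustering consists of a single cluster is a convention chosen here, since the measure is undefined in that case. The function name `nmi` is illustrative.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information I(A;B) / sqrt(H(A) * H(B)) of two labelings."""
    _, a = np.unique(np.ravel(labels_a), return_inverse=True)
    _, b = np.unique(np.ravel(labels_b), return_inverse=True)
    # Joint distribution of cluster memberships (contingency table / n).
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    # Mutual information, summing over non-empty cells only.
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
    # Entropies of the two clusterings.
    ha = -np.sum(pa * np.log(pa))
    hb = -np.sum(pb * np.log(pb))
    if ha == 0.0 or hb == 0.0:
        return 0.0  # convention for a trivial clustering (assumption)
    return float(mi / np.sqrt(ha * hb))
```

Because the measure depends only on the co-occurrence of labels, two clusterings that differ merely in the naming of their clusters obtain an NMI of 1.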
Figure 4 shows that the mean NMI increases immediately with the radius and SNS,
respectively, until a maximum peak is reached, but decreases subsequently. However,
because of the distinct nature of the CNG and GeoSOM, the effects of the radius and
SNS on the NMIs are not strictly comparable. The graph of the GeoSOM suggests that
the mean NMI changes only marginally from radius 4 to 6. The reason for this is that
the area defined by the radius already covers the whole map for most neurons (i.e. on
average 23.08 neurons are located within a distance of 4 on the map). Therefore, only a few
spatially distant neurons are disregarded in the search for the final BMU and only a
small amount of autocorrelation is thus incorporated into the mapping process of the
GeoSOM. A maximum mean NMI of about 0.95 is obtained for the CNG with SNS 4/25,
whereas a maximum mean NMI of about 0.8 is obtained for the GeoSOM with radius 1. The results
for both algorithms are sensitive to the choice of the radius size and SNS, respectively.
Because of the large number of reasonable SNSs of the CNG, the CNG permits more
fine-grained control of the clustering results than the GeoSOM, for which only a few radii
are reasonable. Because in general the NMIs for the CNG’s clusterings are higher than
for the GeoSOM’s, the CNG identifies the clusters in the input data more precisely.
Figure 4. Performance assessments based on the mean NMI for the GeoSOM and the CNG with
different radii and SNSs, respectively. [Plot axes: NMI versus Radius (0 to 6) and NR (1/25 to 25/25); series: cNG, GeoSOM]
4.3. Practical application
Region building and delineation of spatial patterns are essential tasks in spatial planning.
This case study evaluates the practical application of the CNG and GeoSOM for the
delineation of urban patterns in the urban fringe of Vienna, Austria. Data previously
analyzed in a different context by Helbich and Görgl (2010) serve as input. The dataset
encompasses 268 municipalities and comprises a) the average in-migration rate for the
years 2001–2003 (per mil; INM) and six hard location factors, namely b) the land price
index 2005/2006 (in Euro; LPI), c) scenic attractiveness index (ATI), d) accessibility
to the core city of Vienna by public transport (in minutes; PTV) and e) by motorized
individual transport (in minutes; MTV), and finally f,g) the accessibility to central places
at two different hierarchical levels (level 3 and 5) by motorized individual transport (in
minutes; CP3, CP5). Table 1 lists the sources and descriptive statistics of the dataset.
Table 1. Data sources and descriptive statistics of the used dataset.
Abbrev.  Sources                                        Min.    Max.     Mean     St. dev.
INM      Statistics Austria, 2001–2003                  2.94    21.35    7.66     2.53
LPI      GEWINN 7/8/2006                                12.00   550.00   106.25   77.89
ATI      SRTM-DEM (NASA), Tele Atlas, OpenStreetMap     21.69   242.44   116.18   43.93
PTV      Verkehrsverbund Ost Region, 2009               22.00   128.00   66.52    21.99
MTV      Tele Atlas, 2005                               18.00   81.00    40.15    11.60
CP3      Beier et al. (2007)                            0.99    34.85    14.26    6.66
CP5      Beier et al. (2007)                            1.67    66.10    25.25    10.93
Before applying spatial algorithms, it is useful to explore whether spatial autocorrelation is actually present in the input data by using the global Moran’s I statistic (Cliff and
Ord 1973). Applying a first-order queen contiguity to define the municipalities’ configuration of spatial neighborhood, all Moran’s I coefficients show significant positive spatial
autocorrelation (p < 0.001). These results justify the use of spatial regionalization
algorithms such as the CNG or GeoSOM to account for these effects.
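The global Moran's I coefficient used for this check can be computed from an attribute vector and a spatial weights matrix. The sketch below assumes NumPy and a precomputed weights matrix `w` (for example, binary first-order queen contiguity, with zeros on the diagonal); the construction of `w` from the municipality polygons and the significance test are not shown. The function name `morans_i` is illustrative.

```python
import numpy as np

def morans_i(values, w):
    """Global Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, with z = x - mean(x)."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()
    s0 = w.sum()  # sum of all spatial weights
    return float(len(x) / s0 * (z @ w @ z) / (z @ z))

# Four units in a row, each neighboring the next one.
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
print(morans_i([1, 2, 3, 4], w))    # positive: similar values are neighbors
print(morans_i([1, -1, 1, -1], w))  # negative: neighboring values alternate
```

Values above the expectation under spatial randomness indicate positive spatial autocorrelation, which is the situation reported for all attributes of the dataset.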
Henceforth, the location of every municipality is represented by its centroid coordinates. All attributes are normalized to be in the same range. The spatial aspect ratio
of the geographic coordinates is maintained. The training parameters of the CNG and
GeoSOM are set according to section 4.
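The preprocessing described above can be sketched as follows. This is a minimal sketch under assumptions: the text does not state the normalization method or target range, so min-max scaling to [0, 1] is assumed for the attributes, and the coordinates are divided by one common factor (the larger of the two extents) so that their aspect ratio is preserved. All names and values are illustrative.

```python
def minmax(column):
    """Scale one attribute column to [0, 1] (assumed normalization)."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]

def scale_coords(xs, ys):
    """Scale centroid coordinates with ONE common factor to keep the aspect ratio."""
    x0, y0 = min(xs), min(ys)
    extent = max(max(xs) - x0, max(ys) - y0)   # larger of the two extents
    if extent == 0:
        return [0.0] * len(xs), [0.0] * len(ys)
    return ([(x - x0) / extent for x in xs],
            [(y - y0) / extent for y in ys])

# Toy values, not taken from the dataset.
attribute = [2.0, 4.0, 10.0]
xs, ys = [0.0, 40.0], [0.0, 20.0]              # study area twice as wide as high

print(minmax(attribute))
print(scale_coords(xs, ys))                    # y range stays half of the x range
```

Scaling x and y independently would stretch the shorter axis and distort spatial distances, which is why a single factor is used for both coordinates in this sketch.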
For a comparison of GeoSOM and CNG, it is first necessary to determine the number
of clusters. However, determining the number of clusters automatically is a difficult problem in cluster analysis. Therefore, in the majority of real-world studies the number is
estimated pragmatically (Djeraba and Fernandez 2003). This study considers 5 clusters,
because this number empirically provides a sound compromise between specificity and generality.
It is also necessary to determine an SNS or radius, respectively, such that the resulting
clusterings represent spatial patterns adequately. To analyze the sensitivity of the GeoSOM and CNG, both are trained for all reasonable radii and SNSs, respectively, and the
mean QE over 100 runs is calculated. As in section 4.1, the error is reported separately as
QE and SQE. Figure 5 depicts the results. While the QEs of both algorithms’ results are
similar, the CNG outperforms the GeoSOM with respect to the SQE.
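To make the two error measures concrete, the following sketch computes both for a given assignment of observations to neurons. It rests on an assumption that is not confirmed by this section: QE is taken as the mean Euclidean distance between each observation's attribute vector and the attribute part of its assigned prototype, and SQE as the same quantity measured on the geographic coordinates. The exact definitions are those of section 4.1. All names and values are illustrative.

```python
import math

def mean_error(vectors, prototypes, assignment):
    """Mean Euclidean distance between each vector and its assigned prototype."""
    total = 0.0
    for vec, k in zip(vectors, assignment):
        total += math.dist(vec, prototypes[k])
    return total / len(vectors)

# Toy data: two observations, two neurons (not the Vienna dataset).
attrs = [[0.0, 0.0], [1.0, 0.0]]          # attribute vectors
coords = [[0.0, 0.0], [3.0, 4.0]]         # centroid coordinates
proto_attrs = [[0.0, 0.0], [1.0, 1.0]]    # attribute part of each prototype
proto_coords = [[0.0, 0.0], [0.0, 0.0]]   # spatial part of each prototype
assignment = [0, 1]                       # index of the neuron each observation maps to

qe = mean_error(attrs, proto_attrs, assignment)     # error in attribute space
sqe = mean_error(coords, proto_coords, assignment)  # error in geographic space
print(qe, sqe)
```

Averaging these two values over repeated training runs for each radius or SNS would give curves of the kind summarized in Figure 5.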
Further, a GeoSOM with a radius of 1 and a CNG with an SNS of 3/5 are trained
to investigate the differences between the algorithms. With these parameters, the results
exhibit similar locational accuracy and are thus suitable for comparing the resulting spatial
patterns.
Figure 6 visualizes the resulting clusterings. Both methods result in similar clusterings
in the northern parts of Vienna but differ in the south-eastern and southern surroundings,
which represent highly dynamic areas within the Viennese urban fringe that have been
shaped and reorganized by current (post-)suburbanization processes (Helbich and Leitner
2010, Helbich 2012).
The CNG allocates the primary settlement areas along the main traffic routes in the
surroundings, namely Wiener Neustadt and Bruck, to the current suburbs (Cluster 3).
These results conform to the Federal State Government of Lower Austria’s general principle
for spatial planning, which designates development and growth corridors to connect
important urban centers and central places within the urban fringe (Office of the Lower
Austrian Government 2005). In contrast, the GeoSOM assigns these areas to Cluster 2,
which reflects the transition zone from suburban to rural areas. Therefore, the resulting
spatial pattern of the CNG is theoretically more sound and coherent than the GeoSOM’s,
although their clusterings exhibit similar SQEs.

[Figure 5: plot with y-axis "QE and lQE", x-axes "Radius" (0 to 4) and "NR" (1/5 to 5/5), and series QE (GeoSOM), lQE (GeoSOM), QE (cNG), lQE (cNG).]
Figure 5. Mean QEs and mean SQEs of the GeoSOM and CNG for different radii and SNSs,
respectively.
[Figure 6: two maps of the Viennese urban fringe showing Clusters 0 to 4, district capitals, and main highways; scale bar 0 to 20 km; labeled places: Vienna, Baden, Bruck, Gänserndorf, Hollabrunn, Klosterneuburg, Korneuburg, Mistelbach, Mödling, Tulln, Wiener Neustadt.]
Figure 6. The resulting clusters of the GeoSOM (left) and the CNG (right).

5. Conclusions and future work
This study presented the CNG, a variant of the basic NG, and showed that it is particularly
well suited to analyzing spatial data. The CNG accounts for spatial dependency
by mapping spatially close observations to neurons that are close with respect to their
rank distance. In this way, the CNG incorporates autocorrelation independently of the
data’s attributes, and local variations in the input data have less effect on the process of
incorporation. A concept closely related to the CNG is the GeoSOM. It has been shown
that the CNG is superior to the GeoSOM in many aspects. In contrast to the GeoSOM,
incorporation of autocorrelation in the CNG is not subject to topological restrictions,
which makes it easier to understand and avoids edge effects. Additionally, the CNG allows
more fine-grained control over the degree of autocorrelation incorporation. The artificial
experiments revealed that the CNG approximates spatial data with higher positional
accuracy and lower QE than the GeoSOM. Also, the CNG resulted in better clusterings
than the GeoSOM with respect to the obtained NMI. The capability of the CNG to
identify coherent, homogeneous, and reasonable regions was verified by means of a
real-world application, demonstrating the potential of the CNG as a tool for regionalization,
which is an important task in spatial analysis (Miller and Han 2009). However, the results of
the CNG depend strongly on its parametrization, in particular on the choice of SNS and
the number of neurons. In this study, a sensitivity analysis was performed to demonstrate
that the SNS basically represents a trade-off between QE and SQE. Nevertheless, a deeper
understanding of these relationships would be beneficial, and further work will concentrate
on this issue. Furthermore, it is necessary to integrate the
CNG with other computational, visual, and cartographic methods such that analysts
can access a broad set of tools for exploring and understanding spatial and multivariate
patterns in a combined framework.
References
Andrienko, G., et al., 2010. Space-in-time and time-in-space self-organizing maps for
exploring spatiotemporal patterns. Computer Graphics Forum, 29, 913–922.
Anselin, L., 1989. What is special about spatial data? Alternative perspectives on spatial
data analysis. In: Symposium on spatial statistics, past, present and future. Department of Geography, Syracuse University.
Anselin, L., 1995. Local indicators of spatial association - LISA. Geographical Analysis,
27, 93–115.
Arribas-Bel, D., Nijkamp, P., and Scholten, H., 2011. Multidimensional urban sprawl
in Europe: A self-organizing map approach. Computers, Environment and Urban
Systems, 35 (4), 263–275.
Bação, F., Lobo, V., and Painho, M., 2004. Geo-self-organizing map (Geo-SOM) for
building and exploring homogeneous regions. In: M.J. Egenhofer, C. Freksa and
H.J. Miller, eds. Geographic Information Science, Vol. 3234 of Lecture Notes in
Computer Science. Springer Berlin / Heidelberg, 22–37.
Bação, F., Lobo, V., and Painho, M., 2005. The self-organizing map, the Geo-SOM, and
relevant variants for geosciences. Computers & Geosciences, 31, 155–163.
Bailey, T. and Gatrell, A., 1995. Interactive spatial data analysis. Longman Scientific &
Technical.
Beier, R., et al., 2007. Erreichbarkeitsverhältnisse in Österreich 2005. Modellrechnungen
für den ÖPNRV und den MIV (Accessibility in Austria 2005. Model estimations for
motorized individual transport and public transport). Vienna: ÖROK.
Cliff, A. and Ord, J., 1973. Spatial autocorrelation. London, UK: Pion.
Cover, T.M. and Thomas, J.A., 2006. Elements of information theory. New York, NY,
USA: Wiley-Interscience.
Cox, T. and Cox, M., 2001. Multidimensional scaling. No. 1 Monographs on statistics
and applied probability Chapman & Hall/CRC.
Djeraba, C. and Fernandez, G., 2003. Mining image data. In: The handbook of data
mining. Lawrence Erlbaum Associates, 637–656.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P., 1996. From data mining to knowledge
discovery: An overview. In: Advances in knowledge discovery and data mining. MIT
Press, Cambridge.
Flexer, A., 2001. On the use of self-organizing maps for clustering and visualization.
Intelligent Data Analysis, 5 (5), 373–384.
Fritzke, B., 1994. Growing cell structures: A self-organizing network for unsupervised and
supervised learning. Neural Networks, 7 (9), 1441–1460.
Guo, D. and Mennis, J., 2009. Spatial data mining and geographic knowledge discovery
- An introduction. Computers, Environment and Urban Systems, 33 (6), 403–408.
Hagenauer, J., Helbich, M., and Leitner, M., 2011. Visualization of crime trajectories
with self-organizing maps: A case study on evaluating the impact of hurricanes on
spatio-temporal crime hotspots. In: Proceedings of the 25th International Cartographic Conference, July. ISBN: 978-1-907075-05-6, CO-173, Paris, France.
Helbich, M., 2012. Beyond postsuburbia? Multifunctional service agglomeration in Vienna’s urban fringe. Journal of Economic & Social Geography, in press.
Helbich, M. and Görgl, P.J., 2010. Räumliche Regressionsmodelle als leistungsfähige
Methoden zur Erklärung der Driving Forces von Zuzügen in der Stadtregion Wien?
(Spatial regression as a useful technique to explore the driving forces of in-migration
in the Viennese urban region?). Raumforschung und Raumordnung, 68, 103–113.
Helbich, M. and Leitner, M., 2010. Postsuburban spatial evolution of Vienna’s urban
fringe: Evidence from point process modeling. Urban Geography, 31 (8), 1100–1117.
Hewitson, B.C. and Crane, R.G., 2002. Self-organizing maps: Applications to synoptic
climatology. Climate Research, 22 (1), 13–26.
Jiang, B., 2004. Extraction of spatial objects from laser-scanning data using a clustering
technique. In: XXth ISPRS Congress: Geo-Imagery bridging continents, Vol. 3, 7.,
Istanbul, Turkey, 219–224.
Kangas, J., Kohonen, T., and Laaksonen, J., 1990. Variants of self-organizing maps.
IEEE Transactions on Neural Networks, 1 (1), 93–99.
Kangas, J., 1992. Temporal Knowledge in locations of activations in a self-organizing
map. In: I. Aleksander and J. Taylor, eds. Artificial Neural Networks, 2, Vol. 1
Amsterdam, Netherlands: North-Holland, 117–120.
Klösgen, W. and Zytkow, J., 1996. Knowledge Discovery in Databases Terminology. In:
Advances in Knowledge Discovery and Data Mining. MIT Press, Cambridge, 573–592.
Kohonen, T., 1982. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69. doi:10.1007/BF00337288.
Kohonen, T., 2001. Self-organizing maps. 3rd ed. Secaucus, NJ, USA: Springer, New York.
Labusch, K., Barth, E., and Martinetz, T., 2009. Sparse coding neural gas: Learning of
overcomplete data representations. Neurocomputing, 72, 1547–1555.
Martinetz, T. and Schulten, K., 1991. A "Neural-Gas" network learns topologies. Artificial Neural Networks, 1, 397–402.
Martinetz, T., Berkovich, S., and Schulten, K., 1993. "Neural-gas" network for vector
quantization and its application to time-series prediction. IEEE Transactions on
Neural Networks, 4 (4), 558–569.
Miller, H.J., 2010. The data avalanche is here. Shouldn’t we be digging? Journal of
Regional Science, 50, 181–201.
Miller, H.J. and Han, J., 2009. Geographic data mining and knowledge discovery. Bristol,
PA, USA: CRC Press, Boca Raton.
Murtagh, F., 1995. Interpreting the Kohonen self-organizing feature map using
contiguity-constrained clustering. Pattern Recognition Letters, 16 (4), 399–408.
Office of the Lower Austrian Government, ed., 2005. Perspektiven für die Hauptregionen
(Perspectives for the main regions). St. Pölten: Gugler GmbH.
Openshaw, S., 1994. Neuroclassification of spatial data. In: Neural nets: Applications in
geography. Boston, MA, USA: Kluwer, 53–70.
Openshaw, S.O., 1999. Geographical data mining: Key design issues. In: 4th International
Conference on Geocomputation, Fredericksburg.
Sammon, J.W., Jr., 1969. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18 (5), 401–409.
Skupin, A. and Agarwal, P., 2008. Introduction: What is a self-organizing map? In: Self-organising maps: Applications in geographical information science. Chichester, UK:
John Wiley & Sons, Ltd.
Skupin, A. and Fabrikant, S.I., 2003. Spatialization methods: A cartographic research
agenda for non-geographic information visualization. Cartography and Geographic
Information Science, 30, 95–119.
Strehl, A. and Ghosh, J., 2003. Cluster ensembles - a knowledge reuse framework for
combining multiple partitions. Journal of Machine Learning Research, 3, 583–617.
Sui, D.Z., 2004. Tobler’s first law of geography: A big idea for a small world? Annals of
the Association of American Geographers, 94 (2), 269–277.
Tobler, W.R., 1970. A computer movie simulating urban growth in the Detroit region.
Economic Geography, 46, 234–240.
Ultsch, A., 2005. Clustering with SOM: U*C. In: Proceedings of the Workshop on Self-Organizing Maps, Paris, France, 75–82.
Vesanto, J., 1999. SOM-based data visualization methods. Intelligent Data Analysis, 3,
111–126.