August 15, 2013 10:21 International Journal of Geographical Information Science cng

International Journal of Geographical Information Science
Vol. 00, No. 00, Month 200x, 1–17

RESEARCH ARTICLE

Contextual Neural Gas for Spatial Clustering and Analysis

Julian Hagenauer and Marco Helbich

(Received 00 Month 200x; final version received 00 Month 200x)

This is an Author’s Original Manuscript of an article whose final and definitive form, the Version of Record, has been published in the International Journal of Geographical Information Science [26 Apr 2012] [copyright Taylor & Francis], available online at: http://www.tandfonline.com/doi/abs/10.1080/13658816.2012.667106.

Corresponding author. E-mail: [email protected]
ISSN: 1365-8816 print/ISSN 1362-3087 online

This study introduces and discusses contextual Neural Gas (CNG), a variant of the Neural Gas algorithm that explicitly accounts for spatial dependencies within spatial data. The main idea of the CNG is to map spatially close observations to neurons that are close with respect to their rank distance. Thus, spatial dependency is incorporated independently of the attributes of the data, and the process of incorporation is less sensitive to local variations in the input data. To discuss and compare the performance of the CNG and the GeoSOM, this study draws on a series of experiments, which are based on two artificial datasets and one real-world dataset. The experimental results for the artificial datasets show that the CNG produces more homogeneous clusters, a better ratio of positional accuracy, and a lower quantization error than the GeoSOM. The results for the real-world dataset illustrate that the resulting patterns of the CNG are theoretically more sound and coherent than the GeoSOM’s, which emphasizes its applicability for geographic analysis tasks.

Keywords: Machine Learning; Self-Organizing Maps; Spatial Cluster Analysis

1. Introduction

The increasing pervasiveness of digital technologies for aggregating, storing, and sharing spatial data (e.g., sensor networks, volunteered geoinformation) has resulted in vast amounts of spatial data (Miller 2010). The wealth of spatial data cannot be fully tapped if implicit knowledge in the data is neglected (Miller and Han 2009). To obtain a comprehensive understanding of spatial processes in particular, it is necessary to extract simplified and generalized knowledge from spatial data, considering the unique properties of space. The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data is termed knowledge discovery in databases (KDD; Fayyad et al. 1996). Geographic knowledge discovery (GKD) is a subdomain of KDD, which accounts for the unique requirements of spatial data, information, and knowledge (Miller 2010).

A central component in the process of GKD is spatial data mining (SDM). SDM discovers previously unknown patterns and relationships in large spatial databases. For this purpose, low-level algorithms are usually applied under human control (Klösgen and Zytkow 1996). Common tasks in SDM include spatial classification and prediction, spatial association rule mining, geovisualization, and spatial clustering (Guo and Mennis 2009). This paper focuses on the latter. Spatial clustering is the organization of spatial objects into clusters such that objects in the same cluster are similar and objects in different clusters are unlike each other (Miller 2010).
Spatial data is often subject to spatial autocorrelation, which is a fundamental concept in spatial sciences (Sui 2004), stating that nearby entities are likely to share more similarities than ones that are far apart (Tobler 1970). Spatial autocorrelation has several important consequences for traditional statistical analysis, such as a lack of sample independence and the loss of degrees of freedom in test statistics, among others (Bailey and Gatrell 1995). These consequences can easily lead to an erroneous understanding of spatial relationships, processes, and patterns (Openshaw 1999). Therefore, the presence of spatial autocorrelation makes the use of conventional data mining algorithms (e.g., classification and clustering algorithms), which assume independence between observations, unfavorable.

The self-organizing map (SOM; Kohonen 1982, 2001) is a flexible artificial neural network algorithm. The learning process of the SOM is competitive and unsupervised. The SOM maps multi-dimensional input data onto a lower-dimensional map, while the topology of the input space is mostly preserved, which is especially appealing for visualization (Vesanto 1999). These capabilities have made the SOM popular in GIScience (e.g. Openshaw 1994, Hewitson and Crane 2002, Skupin and Fabrikant 2003, Bação et al. 2005, Andrienko et al. 2010, Arribas-Bel et al. 2011, Hagenauer et al. 2011). Besides visualization and analysis, SOMs can also be used on their own for clustering data (Flexer 2001).

A focus of spatial analysis is location and distance relationships (Anselin 1989). For this study it is assumed that the input vectors and neurons feature geographic coordinates. These coordinates permit the calculation of spatial distances. Following this assumption, this study further distinguishes between spatial proximity and similarity in input space, in which the spatial distances are disregarded.
Common clustering algorithms, including the SOM, do not take account of the special properties of spatial data. A convenient approach for clustering spatial data is to measure similarity between observations as a weighted sum of similarity in input space and spatial proximity (see e.g. Jiang 2004). However, weighting assumes comparability between spatial location and attribute data, which cannot be expected due to the distinct nature of space (Anselin 1989).

The GeoSOM (Bação et al. 2005), a spatial variant of the SOM algorithm, maps spatially close observations to neurons that are close with respect to their distance on the map. Because the map is an abstraction of the input space and independent of the actual attribute values, the GeoSOM does not require weighting and scaling of spatial proximity and attribute similarity. Further, because the map is represented by a set of local averages of the input data, the process of autocorrelation incorporation is less sensitive to random variations. The low-dimensional map representation of the GeoSOM is also beneficial for visualization and analytical tasks. However, dimension reduction and topology preservation complicate the clustering and quantization of high-dimensional data and thus decrease the quality of the mapping.

Inspired by the SOM, Martinetz and Schulten (1991) proposed the Neural Gas (NG) algorithm. In contrast to the SOM, the NG neither defines a topology on the output space nor reduces the dimensionality of the input space, making it a superior alternative to the SOM for clustering and vector quantization tasks (Martinetz et al. 1993). However, NGs have only rarely been applied in GIScience.
This paper proposes to adapt the NG algorithm for spatial clustering, explicitly accounting for spatial autocorrelation, and demonstrates the advantages of the algorithm in comparison to the GeoSOM for cluster analysis, vector quantization, and practical applications. This paper refers to the proposed variant as contextual Neural Gas (CNG), since it maps input samples with respect to their context.

Section 2 briefly introduces the SOM, the GeoSOM, and the NG algorithms. Section 3 discusses the CNG for clustering spatial data. In section 4 the capability of the CNG and its advantages for spatial clustering, in comparison to the GeoSOM, are demonstrated by means of two artificial datasets and one real-world dataset. Finally, in section 5, the paper concludes with some remarks and identifies future work.

2. Methodological background

2.1. Self-organizing map

The SOM (Kohonen 1982, 2001) consists of an arbitrary number of neurons, which are connected to adjacent neurons by a neighborhood relation. Each neuron $k$ is represented by an $n$-dimensional prototype vector $m_k$, where $n$ is the dimension of the input space. During each learning step $t$, an input vector $x$ is selected from the set of input vectors and the nearest neuron $c$ of the map, commonly termed the best matching unit (BMU), is determined by:

\[ c = \arg\min_i \left( \|x - m_i\|_1 \right), \]

where $\|\cdot\|_1$ denotes the distance in input space, most commonly the Euclidean distance. The BMU and its neighboring neurons on the map are moved towards the presented input vector as follows:

\[ m_k(t+1) = m_k(t) + \alpha(t)\, h_{c,k}(t)\, (x - m_k(t)), \]

where $\alpha(t)$ is the learning rate and $h_{c,k}(t)$ the neighborhood kernel. The learning rate controls the magnitude of the neurons' adaptation. The neighborhood kernel determines the influence of the BMU on the neighboring neurons, taking into account its distance to neuron $k$ on the map. A common neighborhood function is the Gaussian kernel function:

\[ h_{c,k}(t) = \exp\left( -\frac{\|c - k\|_2^2}{2\delta(t)} \right), \]

where $\|\cdot\|_2$ is the distance in the map, and $\delta(t)$ a time-dependent parameter reducing the width of the kernel during the learning process. Both the learning rate and the neighborhood kernel decrease monotonically over time. Thus, the learning process gradually shifts from an initial rough phase, in which the neurons are coarsely arranged on the map, to a fine-tuning phase, in which only small changes to the map are made.

Numerous modifications and extensions of the basic SOM algorithm exist (e.g. Kangas et al. 1990, Fritzke 1994). The following section explains in detail the modifications which are most relevant to this study.

2.2. Kangas map and GeoSOM

The Kangas Map (KM; Kangas 1992) is a modification of the SOM for temporal sequence processing. The BMU is chosen from a subset of the neurons neighboring the previously found BMU. Hence, the temporal sequence of input signals determines the mapping. Since temporally close observations are thus mapped to close regions on the map, the KM accounts for temporal autocorrelation within the input data.

The GeoSOM, introduced by Bação et al. (2005), is an application of the KM to spatial observations, in which proximity is considered spatially instead of temporally. The GeoSOM can be used for determining spatially coherent and homogeneous clusters. In analogy to KMs, the search for a BMU is performed in two steps. First, the neuron that is spatially closest to the input vector is determined:

\[ c_{geo} = \arg\min_i \left( \|x - m_i\|_{geo} \right), \]

where $\|\cdot\|_{geo}$ denotes the spatial distance. Second, the final BMU is searched within a fixed radius $k$ of the spatially closest neuron. To this end, a set $S_k$ of neurons around $c_{geo}$ with a maximum distance of $k$ is determined:

\[ S_k := \{ i : \|i - c_{geo}\|_2 \le k \}, \]

where $\|\cdot\|_2$ is the distance in the map.
The final BMU is then selected from the set $S_k$ by:

\[ c = \arg\min_{i \in S_k} \left( \|x - m_i\|_1 \right), \]

where $\|\cdot\|_1$ commonly denotes the distance in input space, ignoring spatial locations.¹

The radius $k$ enforces the adaptation of neurons to the input vectors that are spatially close. If $k = 0$, then the BMU is necessarily the spatially closest neuron; hence, the attribute values of the input vector are neglected for the BMU search. If $k$ is increased, then less spatial autocorrelation is incorporated, because neurons which are more distant to the spatially closest neuron are also considered for the final BMU search. If $k$ is larger than or equal to the maximum distance of neurons in the grid, then the spatial locations of observations are completely neglected, because every neuron on the map, regardless of its distance to the spatially closest neuron, is a BMU candidate.

¹ Bação et al. (2004) suggest including the weighted spatial distance for the final BMU search to obtain a continuum of models between the GeoSOM and the basic SOM. However, for this it is necessary to determine adequate weights for hardly comparable dimensions.

Because the map topology of SOMs is mostly two-dimensional, the size of the set $S_k$ increases quadratically with radius $k$, thereby permitting only a rough control of autocorrelation incorporation into the algorithm. Furthermore, the effects of the GeoSOM’s radius do not depend merely on the size of the radius but also on the size, dimension, and shape of the map. For a GeoSOM with a quadratic map of $n \times n$ neurons and a fixed radius of $n/2$, no autocorrelation is incorporated into the clustering process because the area which the radius defines covers the whole map. Increasing the size of the map gradually increases the importance of spatial location. The same effect occurs if the radius is decreased.
Therefore, the ratio between radius and side length matters for the incorporation of autocorrelation in quadratic maps. The relationship between map size and radius is more complex in non-quadratic maps. Additionally, the borders of the map induce edge effects. The areas defined by the radius around neurons at the border generally cover fewer neurons than the areas around neurons located in the center of the map. Therefore, at the border of the map, the spatially closest neuron $c_{geo}$ has a greater chance of being selected as the final BMU, which results in a stronger incorporation of autocorrelation at the border of the map.

2.3. Neural gas

The NG (Martinetz and Schulten 1991) is closely related to the principles of the SOM. However, it provides a non-linear mapping from a high-dimensional input space to an output space of the same dimension. The NG consists of an arbitrary number of neurons which are, in contrast to SOMs, not connected by any neighborhood relation. The neurons are dynamically distributed within the input space during the training process in analogy to a physical gas. This distribution is not subject to any topological restrictions.

Each neuron $k$ is represented by a prototype vector $m_k$. During each learning step $t$, an input pattern $x$ is selected from the set of input vectors and the neurons are moved in the direction of the input vector. Let $i_k(N)$ denote the number of neurons in the set $N$ that are closer to $x$ than $m_k$:

\[ i_k(N) = \left| \{ j \in N : \|x - m_j\|_p < \|x - m_k\|_p \} \right|, \]

where $\|\cdot\|_p$ is some distance function, mostly the Euclidean metric. $i_k(N)$ defines a rank order of the neurons in $N$. Then, each neuron in the set $M$ of all neurons is moved towards the presented input vector $x$ with respect to its rank order $i_k(M)$ as follows:

\[ m_k(t+1) = m_k(t) + \epsilon(t)\, e^{-i_k(M)/\lambda(t)}\, (x - m_k(t)), \]

where $\epsilon(t)$ is the adaptation rate, and $\lambda(t)$ the range of neighbors to be adapted. Both parameters commonly decrease with time.
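The rank-based adaptation step described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: it relies on NumPy, uses the Euclidean metric, and the function names `ng_ranks` and `ng_step` as well as the toy values are hypothetical and not taken from the original study.

```python
import numpy as np

def ng_ranks(x, prototypes):
    """Rank i_k: number of prototypes strictly closer to x than prototype k."""
    d = np.linalg.norm(prototypes - x, axis=1)      # Euclidean distances (assumption)
    # Entry k counts all j with d[j] < d[k], matching the set definition of i_k.
    return (d[None, :] < d[:, None]).sum(axis=1)

def ng_step(x, prototypes, eps, lam):
    """One NG update: move every prototype towards x, damped by its rank."""
    ranks = ng_ranks(x, prototypes)
    h = np.exp(-ranks / lam)                        # rank-based neighborhood weight
    return prototypes + eps * h[:, None] * (x - prototypes)

# Toy example (illustrative values only)
prototypes = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
x = np.array([0.0, 0.0])
updated = ng_step(x, prototypes, eps=0.5, lam=1.0)
```

In this sketch `eps` and `lam` are passed in as plain numbers, so any decay schedule can be applied by the caller from one learning step to the next.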
Common choices for the parameters are:

\[ \lambda(t) = \lambda_{init} \left( \lambda_{final} / \lambda_{init} \right)^{t/t_{max}}, \]

\[ \epsilon(t) = \epsilon_{init} \left( \epsilon_{final} / \epsilon_{init} \right)^{t/t_{max}}. \]

After a sufficient number of adaptation steps, the neurons cover the input space with minimum quantization error. The adaptation step of the NG can be interpreted as a gradient descent on a stochastic cost function (Martinetz et al. 1993). Because the NG adapts all neurons with respect to their rank, which depends on each neuron's distance to the input pattern, much more robust convergence can be achieved compared to the k-means algorithm, which adapts only the closest neuron (Labusch et al. 2009).

A disadvantage of the NG is its lack of an output space topology, which limits the application of the NG to data projection and visualization. However, after training the NG, the neurons can be projected onto a two-dimensional space using multi-dimensional scaling (Cox and Cox 2001) or Sammon’s mapping (Sammon 1969).

3. Contextual neural gas

The CNG combines the concepts of the GeoSOM with the NG algorithm. While the adaptation phase of the neurons is identical in the NG and CNG algorithms, they differ in their definition of the rank ordering. The CNG incorporates autocorrelation into the rank ordering in a two-phase procedure, similar to the BMU search of the GeoSOM. In the first step of the GeoSOM’s BMU search, the spatially closest neuron is identified, and in the second step, the final BMU within a certain vicinity around the spatially closest neuron in the grid is determined. The CNG lacks an output space topology. Therefore, the rank ordering of neurons is utilized instead.

First, let $i^{geo}_k(M)$ be the spatial rank ordering of the neurons in $M$ with respect to a spatial distance metric.¹ Using this rank order, a subset $S_l$ is determined, which contains the $l$ spatially closest neurons:

\[ S_l := \{ k \in M : i^{geo}_k(M) < l \}. \]
In the following, the parameter $l$ is referred to as the (spatial) neighborhood size (SNS). The final rank order for the adaptation of the neurons is then determined by:

\[ i_k(M) = \begin{cases} i'_k(S_l) & \text{if } k \in S_l, \\ i^{geo}_k(M) & \text{otherwise,} \end{cases} \]

where $i'_k(S_l)$ is commonly a rank ordering based on the similarity of the neurons' attribute data, disregarding their spatial locations.

¹ In general, other distance metrics can be used alternatively to incorporate autocorrelation for other dimensions such as time.

The SNS determines the strength of spatial autocorrelation incorporated into the mapping process of the CNG. If $l = 1$, then the set $S_l$ consists only of the spatially closest neuron. Consequently, the final rank ordering $i_k(M)$ is equal to $i^{geo}_k(M)$, and the adaptation of the neurons is exclusively determined by the spatial distances of the neurons to the input vector. The attribute values of the input data are neglected. If $l$ is increased, then the size of $S_l$ increases as well. The neurons of $S_l$ are reordered according to $i'_k(S_l)$ within their ranks. Because their spatial distances are thus less relevant for the final ordering, the adaptation of the neurons is less determined by spatial proximity. Therefore, less autocorrelation is incorporated into the CNG. If $l = |M|$, then $S_l$ consists of all neurons in $M$. The spatial rank ordering $i^{geo}_k(M)$ becomes obsolete, because the final ordering of the neurons is solely determined by the similarity of their attributes. Therefore, no autocorrelation is incorporated into the CNG at all.

Just like for the GeoSOM, it is not necessary to weight or scale spatial proximity and attribute similarity for the CNG, because spatial proximity is enforced by means of rank distances, which are independent of the actual attribute values. The neurons of the CNG are basically local averages.
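The two-phase rank ordering defined above can be illustrated with a short Python sketch. It is a sketch under assumptions rather than the authors' implementation: NumPy and Euclidean distances are assumed for both the spatial and the attribute part, and the names `rank_by_distance` and `cng_ranks` and all toy values are hypothetical.

```python
import numpy as np

def rank_by_distance(d):
    """Rank of each entry: number of entries that are strictly smaller."""
    return (d[None, :] < d[:, None]).sum(axis=1)

def cng_ranks(x_geo, x_attr, m_geo, m_attr, l):
    """Final CNG rank order for one input vector and neighborhood size l."""
    # Phase 1: spatial rank ordering of all neurons (i^geo_k(M)).
    geo_rank = rank_by_distance(np.linalg.norm(m_geo - x_geo, axis=1))
    final = geo_rank.copy()
    # S_l: the l spatially closest neurons.
    in_s = geo_rank < l
    # Phase 2: re-rank the neurons of S_l by attribute similarity (i'_k(S_l)).
    d_attr = np.linalg.norm(m_attr[in_s] - x_attr, axis=1)
    final[in_s] = rank_by_distance(d_attr)
    return final

# Toy example: four neurons on a line, one attribute each (illustrative values)
m_geo = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
m_attr = np.array([[5.0], [1.0], [0.0], [9.0]])
x_geo, x_attr = np.array([0.0, 0.0]), np.array([0.0])

spatial_only = cng_ranks(x_geo, x_attr, m_geo, m_attr, l=1)    # equals the spatial ranks
attribute_only = cng_ranks(x_geo, x_attr, m_geo, m_attr, l=4)  # equals the attribute ranks
```

The resulting ranks can be used in the same exponential adaptation rule as in the basic NG; only the way the ranks are obtained differs.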
Therefore, the process of incorporating autocorrelation is also not sensitive to random variations in the input data.

Although the basic principles of the GeoSOM and the CNG are comparable, there are important differences. Firstly, because the number of different reasonable SNSs equals the total number of neurons, the CNG permits more fine-grained control over the strength of autocorrelation incorporated into the algorithm. Secondly, the lack of an output space topology avoids edge effects that stem from topological restrictions. In CNGs only the number of neurons and the SNS matter for the incorporation of autocorrelation, which facilitates the interpretation and understanding of the CNG’s results. However, because common projection methods do not account for the distinct nature of space, visualization of the CNG is problematic. Therefore, data projection and visualization remain critical tasks. Nevertheless, as spatial data mostly comprise locational information, geographic maps can visually display the clusters, the neurons, and their properties.

Despite the differences between the GeoSOM and the CNG, their application domain is mainly the same. In particular, both are useful for clustering data where spatial dependence between observations is present. Therefore, the following section evaluates the performance of the GeoSOM and the CNG.

4. Experiments for performance evaluation

This section demonstrates the usefulness of the CNG and compares it to the GeoSOM by means of multiple experiments, which are based on two artificial datasets and one real-world dataset. The objective of using artificial data is to produce repeatable experiments, thereby facilitating an understanding of the algorithms’ different features. The real-world dataset demonstrates the extent to which the CNG can be used to appropriately solve practical geoscientific problems. The first experiment compares the vector quantization performance of the CNG and GeoSOM.
The second experiment investigates their clustering performances. The third experiment exemplifies the different clustering results of the CNG and the GeoSOM when applied to the delimitation of urban patterns within the urban fringe of Vienna, Austria.

The following basic parameters of the GeoSOM and CNG algorithms are applied in the experiments: In both algorithms the number of neurons $N$ is chosen to match the desired number of clusters. The neurons are randomly initialized and the training time is set to 100,000 iterations. The neighborhood range parameters of the CNG are set to $\lambda_{init} = N/2$ and $\lambda_{final} = 0.01$. Following Martinetz et al. (1993), the adaptation rate parameters of the CNG are set to $\epsilon_{init} = 0.5$ and $\epsilon_{final} = 0.005$. The parameters $\lambda(t)$ and $\epsilon(t)$ are chosen as described in section 2.3. The map of the GeoSOM is a two-dimensional hexagonal grid. The learning rate of the GeoSOM is linearly decreasing from 0.5 and the neighborhood function is a Gaussian kernel with an initial neighborhood radius spanning the entire map (Skupin and Agarwal 2008). After training, the set of input vectors mapped to each neuron is determined. The resulting sets represent the clustering result.¹

Figure 1. Cluster map of spatial association (legend: HH, LL, not significant). HH is a statistically significant (p < 0.05) cluster of high values, whereas LL is a statistically significant (p < 0.05) cluster of low values.

4.1. Vector quantization

To evaluate the quantization performance of the GeoSOM and the CNG, they are applied to an artificially created dataset, which comprises different continuous areas of local spatial autocorrelation. The dataset consists of 1,000 data points $(x, y, z)$, where $x$ and $y$ are randomly chosen from the interval $[0, 1]$. The $z$-value is calculated by $z = x + y$.
Data points located near the corners at (0, 0) and (1, 1) exhibit high autocorrelation, whereas data points near the diagonal from (0, 1) to (1, 0) show low autocorrelation. Figure 1 visually represents the local Moran statistic (Anselin 1995) in the form of a cluster map, which classifies observations by the type of spatial association. For ease of visual perception, the data points are represented by their Voronoi areas. A GeoSOM and a CNG with 25 neurons are trained with this dataset for each reasonable radius and SNS, respectively.

¹ It is also possible to cluster the neurons or the map, respectively, in an additional post-processing step (for SOM see e.g. Murtagh 1995, Vesanto 1999, Ultsch 2005). However, post-processing may alter the results of the underlying clustering algorithms and thus makes their comparison difficult.

The quantization error (QE) is a vector quantization quality measure. It is calculated by determining the mean distance between each sample and the BMU or cluster center by which it is represented. In order to compare the quantization performance of both algorithms with respect to spatial autocorrelation, the spatial quantization error (SQE) is introduced. The SQE measures the mean spatial distance between each sample and its BMU or cluster center, disregarding the attribute values of the samples.

Figure 2. Mean QEs versus mean SQEs for the GeoSOM and CNG (x-axis: QE; y-axis: SQE, labeled lQE in the plot).

The means of the resulting QEs and SQEs for the CNG and GeoSOM over 100 runs are shown in Figure 2. The radius and the SNS determine a trade-off between QE and SQE. Figure 2 shows that the CNG mostly achieves a better trade-off between QE and SQE. However, if the radius of the GeoSOM is set to its lowest value, the SQE is minimized and the GeoSOM performs almost as well as the CNG with respect to both QEs.
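The two error measures described above can be sketched in Python as follows. The sketch makes several assumptions that the text leaves open: NumPy is used, both distances are Euclidean, the QE is computed on the attribute values only, and `assignment` holds the index of the neuron representing each sample. The function name `quantization_errors` and the toy values are hypothetical.

```python
import numpy as np

def quantization_errors(geo, attr, m_geo, m_attr, assignment):
    """Return (QE, SQE) for samples assigned to neurons.

    geo, attr     : sample coordinates and attribute values, one row per sample
    m_geo, m_attr : neuron coordinates and attribute prototypes, one row per neuron
    assignment    : index of the representing neuron for each sample
    """
    # QE: mean attribute distance between each sample and its neuron (assumption).
    qe = np.linalg.norm(attr - m_attr[assignment], axis=1).mean()
    # SQE: mean spatial distance between each sample and its neuron.
    sqe = np.linalg.norm(geo - m_geo[assignment], axis=1).mean()
    return qe, sqe

# Toy example: two samples represented by a single neuron (illustrative values)
geo = np.array([[0.0, 0.0], [1.0, 1.0]])
attr = np.array([[1.0], [3.0]])
m_geo = np.array([[0.0, 1.0]])
m_attr = np.array([[2.0]])
qe, sqe = quantization_errors(geo, attr, m_geo, m_attr, np.array([0, 0]))
```

Keeping the two measures separate makes the trade-off visible: a mapping can lower the attribute error at the cost of spatial accuracy, and vice versa.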
By setting the radius of the GeoSOM to 0, quantization solely depends on the spatial locations of the input samples in the two-dimensional plane. Since the dimension of this plane equals the topological dimension of the GeoSOM, doing so avoids reduction of dimensionality and facilitates the quantization of the input data. Nevertheless, although no clear clustering exists in the artificial dataset, the CNG mostly outperforms the GeoSOM. The CNG not only quantizes the data with higher spatial accuracy but also with a lower QE.

4.2. Clustering

To compare the clustering performance of the GeoSOM and the CNG, an artificial dataset is created according to the following procedure. First, a set of 5,000 data points $(x, y, z)$ is created. The spatial location of each data point is represented by $x$ and $y$, which are randomly chosen from the interval $[0, 1]$. Additionally, the Voronoi areas of the data points are calculated. A seed of 100 data points is randomly selected as initial clusters. Then, repeatedly, one of the clusters is randomly selected and an arbitrary data point that does not yet belong to a cluster and whose Voronoi area is adjacent to that cluster is assigned to it, until no unassigned data points are left. After that, the adjacency graph of the clusters is colored by means of a greedy algorithm, and each color is coded as an integer value and assigned as the $z$-value to the data points. Hence, clusters are created which are spatially distinct but feature similar $z$-values. Clustering algorithms which do not take the spatial locations of the data points into account are thus likely to fail to delineate the clusters correctly. Figure 3 displays the 25 clusters of this dataset. The colors represent the $z$-values.

Figure 3. Dataset consisting of 25 spatially distinct clusters.

The clustering performance of the GeoSOM and the CNG depends on the choice of parameters.
In particular, the SNS and the radius are important, since they determine the degree of spatial dependence incorporated into the clustering process. Therefore, for each reasonable radius and SNS, a GeoSOM and a CNG, respectively, are applied to the dataset. The CNG and the GeoSOM both consist of 25 neurons, one neuron per existing cluster in the dataset.

The resulting clusterings are then evaluated by normalized mutual information (NMI; Strehl and Ghosh 2003). Mutual information is a symmetric measure that quantifies the mutual dependence between two distributions (Cover and Thomas 2006), and hence can be used to assess the similarity of two different clusterings (Strehl and Ghosh 2003). The NMI has a lower bound of 0 and an upper bound of 1. It reaches the value of 1 if two clusterings are identical, and 0 if they have no information in common and thus are totally independent.

Figure 4. Performance assessments based on the mean NMI for the GeoSOM and the CNG with different radii and SNSs, respectively (lower x-axis: radius, 0 to 6; upper x-axis: SNS, labeled NR, 1/25 to 25/25; y-axis: NMI).

Figure 4 shows a plot of the mean NMI for the GeoSOM and the CNG with different radii and SNSs, respectively, over 100 runs. The mean NMI initially increases with the radius and the SNS, respectively, until a peak is reached, and decreases subsequently. However, because of the distinct nature of the CNG and the GeoSOM, the effects of the radius and the SNS on the NMIs are not strictly comparable. The graph of the GeoSOM suggests that the mean NMI changes only marginally from radius 4 to 6. The reason for this is that the area defined by the radius already covers the whole map for most neurons (i.e. on average 23.08 neurons are located within a distance of 4 on the map).
Therefore, only a few spatially distant neurons are disregarded in the search for the final BMU, and only a small amount of autocorrelation is thus incorporated into the mapping process of the GeoSOM. A maximum mean NMI of about 0.95 is obtained for the CNG with SNS 4/25, whereas a maximum mean NMI of about 0.8 is obtained for the GeoSOM with radius 1. The results for both algorithms are sensitive to the choice of the radius size and SNS, respectively. Because of the large number of reasonable SNSs of the CNG, the CNG permits more fine-grained control of the clustering results than the GeoSOM, for which only a few radii are reasonable. Because the NMIs for the CNG’s clusterings are in general higher than those for the GeoSOM’s, the CNG identifies the clusters in the input data more precisely.

4.3. Practical application

Region building and delineation of spatial patterns are essential tasks in spatial planning. This case study evaluates the practical application of the CNG and the GeoSOM for the delineation of urban patterns in the urban fringe of Vienna, Austria. Data previously analyzed in a different context by Helbich and Görgl (2010) serve as input. The dataset encompasses 268 municipalities and comprises a) the average in-migration rate for the years 2001–2003 (per mil; INM) and six hard location factors, namely b) the land price index 2005/2006 (in Euro; LPI), c) the scenic attractiveness index (ATI), d) accessibility to the core city of Vienna by public transport (in minutes; PTV) and e) by motorized individual transport (in minutes; MTV), and finally f, g) the accessibility to central places at two different hierarchical levels (level 3 and 5) by motorized individual transport (in minutes; CP3, CP5). Table 1 lists the sources and descriptive statistics of the dataset.

Table 1. Data sources and descriptive statistics of the used dataset.

Abbrev.  Sources                                          Min.     Max.     Mean     St. dev.
INM      Statistics Austria, 2001–2003                    2.94     21.35    7.66     2.53
LPI      GEWINN 7/8/2006                                  12.00    550.00   106.25   77.89
ATI      SRTM-DEM (NASA), Tele Atlas, OpenStreetMap       21.69    242.44   116.18   43.93
PTV      Verkehrsverbund Ost Region, 2009                 22.00    128.00   66.52    21.99
MTV      Tele Atlas, 2005                                 18.00    81.00    40.15    11.60
CP3      Beier et al. (2007)                              0.99     34.85    14.26    6.66
CP5      Beier et al. (2007)                              1.67     66.10    25.25    10.93

Before applying spatial algorithms, it is useful to explore whether spatial autocorrelation is actually present in the input data by using the global Moran’s I statistic (Cliff and Ord 1973). Applying first-order queen contiguity to define the municipalities’ configuration of spatial neighborhood, all Moran’s I coefficients show significant positive spatial autocorrelation (p < 0.001). These results justify the use of spatial regionalization algorithms such as the CNG or the GeoSOM to account for these effects.

Henceforth, the location of every municipality is represented by its centroid coordinates. All attributes are normalized to be in the same range. The spatial aspect ratio of the geographic coordinates is maintained. The training parameters of the CNG and the GeoSOM are set according to section 4.

For a comparison of the GeoSOM and the CNG, it is first necessary to determine the number of clusters. However, automatic computation of the number of clusters is a difficult problem in cluster analysis. Therefore, in the majority of real-world studies the number is pragmatically estimated (Djeraba and Fernandez 2003). This study considers five clusters, because this number empirically offers a sound compromise between specificity and generality. It is also necessary to determine an SNS or radius, respectively, such that the resulting clusterings represent spatial patterns adequately. To analyze the sensitivity of the GeoSOM and the CNG, both are trained for all reasonable SNSs and radii, respectively, and the mean QE over 100 runs is calculated.
As in section 4.1, the QE is distinguished between QE and SQE. Figure 5 depicts the results. While the QEs of both algorithms' results are similar, the CNG outperforms the GeoSOM with respect to the SQE (Figure 5).

Figure 5. Mean QEs and mean SQEs of the GeoSOM and CNG for different radii and SNSs, respectively. [Plot not reproduced; horizontal axes: Radius (0 to 4) and NR (1/5 to 5/5); vertical axis: QE and lQE; series: QE and lQE for the GeoSOM and the CNG.]

Further, a GeoSOM with a radius of 1 and a CNG with an SNS of 3/5 are trained to investigate the differences between the algorithms. With these parameters, the results exhibit similar locational accuracy and are thus suitable for comparing their spatial patterns. Figure 6 visualizes the resulting clusterings.

Figure 6. The resulting clusters of the GeoSOM (left) and the CNG (right). [Maps not reproduced; legend: Clusters 0 to 4, district capitals, main highways; labeled places: Vienna, Klosterneuburg, Korneuburg, Tulln, Hollabrunn, Mistelbach, Gänserndorf, Bruck, Mödling, Baden, Wiener Neustadt; scale bar: 0 to 20 km.]

Both methods result in similar clusterings in the northern parts of Vienna but differ in the south-eastern and southern surroundings, which represent highly dynamic areas within the Viennese urban fringe that have been shaped and reorganized by current (post-)suburbanization processes (Helbich and Leitner 2010, Helbich 2012). The CNG allocates the primary settlement areas along the main traffic routes in the surroundings, namely Wiener Neustadt and Bruck, to the current suburbs (Cluster 3). These results conform to the Federal State Government of Lower Austria's general principle for spatial planning, which envisages development and growth corridors that connect important urban centers and central places within the urban fringe (Office of the Lower Austrian Government 2005). In contrast, the GeoSOM assigns these areas to Cluster 2, which reflects the transition zone from suburban to rural areas. Therefore, the resulting spatial pattern of the CNG is theoretically more sound and coherent than the GeoSOM's, although their clusterings exhibit similar SQEs.

5. Conclusions and future work

This study presented the CNG, a variant of the basic NG, and showed that it is particularly well suited to analyzing spatial data. The CNG accounts for spatial dependency by mapping spatially close observations to neurons that are close with respect to their rank distance. In this way, the CNG incorporates autocorrelation independently of the data's attributes, and local variations in the input data have less effect on the process of incorporation.

A concept closely related to the CNG is the GeoSOM. It has been shown that the CNG is superior to the GeoSOM in many aspects. In contrast to the GeoSOM, incorporation of autocorrelation in the CNG is not subject to topological restrictions, which makes it easier to understand and avoids edge effects. Additionally, the CNG allows more fine-grained control over the degree of autocorrelation incorporation. The experiments on the artificial datasets revealed that the CNG approximates spatial data with higher positional accuracy and lower QE than the GeoSOM. Also, the CNG resulted in better clusterings than the GeoSOM with respect to the obtained NMI. The capabilities of the CNG for identifying coherent, homogeneous, and reasonable regions were verified by means of a real-world application, demonstrating the potential of the CNG as a tool for regionalization, which is an important task in spatial analysis (Miller and Han 2009).
However, the results of the CNG depend strictly on its parametrization, in particular on the choice of SNS and the number of neurons. In this study, a sensitivity analysis was performed to demonstrate that the SNS basically represents a trade-off between QE and SQE. Nevertheless, a deeper understanding of these relationships would be beneficial, and further work will concentrate on this issue. Furthermore, it is necessary to integrate the CNG with other computational, visual, and cartographic methods such that analysts can access a broad set of tools for exploring and understanding spatial and multivariate patterns in a combined framework.

References

Andrienko, G., et al., 2010. Space-in-time and time-in-space self-organizing maps for exploring spatiotemporal patterns. Computer Graphics Forum, 29, 913–922.
Anselin, L., 1989. What is special about spatial data? Alternative perspectives on spatial data analysis. In: Symposium on spatial statistics, past, present and future. Department of Geography, Syracuse University.
Anselin, L., 1995. Local indicators of spatial association - LISA. Geographical Analysis, 27, 93–115.
Arribas-Bel, D., Nijkamp, P., and Scholten, H., 2011. Multidimensional urban sprawl in Europe: A self-organizing map approach. Computers, Environment and Urban Systems, 35 (4), 263–275.
Bação, F., Lobo, V., and Painho, M., 2004. Geo-self-organizing map (Geo-SOM) for building and exploring homogeneous regions. In: M.J. Egenhofer, C. Freksa and H.J. Miller, eds. Geographic Information Science, Vol. 3234 of Lecture Notes in Computer Science. Berlin/Heidelberg: Springer, 22–37.
Bação, F., Lobo, V., and Painho, M., 2005. The self-organizing map, the Geo-SOM, and relevant variants for geosciences. Computers & Geosciences, 31, 155–163.
Bailey, T. and Gatrell, A., 1995. Interactive spatial data analysis. Longman Scientific & Technical.
Beier, R., et al., 2007. Erreichbarkeitsverhältnisse in Österreich 2005. Modellrechnungen für den ÖPNRV und den MIV (Accessibility in Austria 2005. Model estimations for motorized individual transport and public transport). Vienna: ÖROK.
Cliff, A. and Ord, J., 1973. Spatial autocorrelation. London, UK: Pion.
Cover, T.M. and Thomas, J.A., 2006. Elements of information theory. New York, NY, USA: Wiley-Interscience.
Cox, T. and Cox, M., 2001. Multidimensional scaling. No. 1 in Monographs on statistics and applied probability. Chapman & Hall/CRC.
Djeraba, C. and Fernandez, G., 2003. Mining image data. In: The handbook of data mining. Lawrence Erlbaum Associates, 637–656.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P., 1996. From data mining to knowledge discovery: An overview. In: Advances in knowledge discovery and data mining. Cambridge: MIT Press.
Flexer, A., 2001. On the use of self-organizing maps for clustering and visualization. Intelligent Data Analysis, 5 (5), 373–384.
Fritzke, B., 1994. Growing cell structures - A self-organizing network for unsupervised and supervised learning. Neural Networks, 7 (9), 1441–1460.
Guo, D. and Mennis, J., 2009. Spatial data mining and geographic knowledge discovery - An introduction. Computers, Environment and Urban Systems, 33 (6), 403–408.
Hagenauer, J., Helbich, M., and Leitner, M., 2011. Visualization of crime trajectories with self-organizing maps: A case study on evaluating the impact of hurricanes on spatio-temporal crime hotspots. In: Proceedings of the 25th International Cartographic Conference, July, Paris, France. ISBN: 978-1-907075-05-6, CO-173.
Helbich, M., 2012. Beyond postsuburbia? Multifunctional service agglomeration in Vienna's urban fringe. Journal of Economic & Social Geography, in press.
Helbich, M. and Görgl, P.J., 2010. Räumliche Regressionsmodelle als leistungsfähige Methoden zur Erklärung der Driving Forces von Zuzügen in der Stadtregion Wien? (Spatial regression as a useful technique to explore the driving forces of in-migration in the Viennese urban region?). Raumforschung und Raumordnung, 68, 103–113.
Helbich, M. and Leitner, M., 2010. Postsuburban spatial evolution of Vienna's urban fringe: Evidence from point process modeling. Urban Geography, 31 (8), 1100–1117.
Hewitson, B.C. and Crane, R.G., 2002. Self-organizing maps: Applications to synoptic climatology. Climate Research, 22 (1), 13–26.
Jiang, B., 2004. Extraction of spatial objects from laser-scanning data using a clustering technique. In: XXth ISPRS Congress: Geo-Imagery bridging continents, Vol. 3, 7, Istanbul, Turkey, 219–224.
Kangas, J., Kohonen, T., and Laaksonen, J., 1990. Variants of self-organizing maps. IEEE Transactions on Neural Networks, 1 (1), 93–99.
Kangas, J., 1992. Temporal knowledge in locations of activations in a self-organizing map. In: I. Aleksander and J. Taylor, eds. Artificial Neural Networks, 2, Vol. 1. Amsterdam, Netherlands: North-Holland, 117–120.
Klösgen, W. and Zytkow, J., 1996. Knowledge discovery in databases terminology. In: Advances in knowledge discovery and data mining. Cambridge: MIT Press, 573–592.
Kohonen, T., 1982. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69. doi:10.1007/BF00337288.
Kohonen, T., 2001. Self-organizing maps. 3rd ed. Secaucus, NJ, USA: Springer, New York.
Labusch, K., Barth, E., and Martinetz, T., 2009. Sparse coding neural gas: Learning of overcomplete data representations. Neurocomputing, 72, 1547–1555.
Martinetz, T. and Schulten, K., 1991. A "Neural-Gas" network learns topologies. Artificial Neural Networks, 1, 397–402.
Martinetz, T., Berkovich, S., and Schulten, K., 1993. "Neural-gas" network for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 4 (4), 558–569.
Miller, H.J., 2010. The data avalanche is here. Shouldn't we be digging? Journal of Regional Science, 50, 181–201.
Miller, H.J. and Han, J., 2009. Geographic data mining and knowledge discovery. Bristol, PA, USA: CRC Press, Boca Raton.
Murtagh, F., 1995. Interpreting the Kohonen self-organizing feature map using contiguity-constrained clustering. Pattern Recognition Letters, 16 (4), 399–408.
Office of the Lower Austrian Government, ed., 2005. Perspektiven für die Hauptregionen (Perspectives for the main regions). St. Pölten: Gugler GmbH.
Openshaw, S., 1994. Neuroclassification of spatial data. In: Neural nets: Applications in geography. Boston, MA, USA: Kluwer, 53–70.
Openshaw, S.O., 1999. Geographical data mining: Key design issues. In: 4th International Conference on Geocomputation, Fredericksburg.
Sammon, J.W., Jr., 1969. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18 (5), 401–409.
Skupin, A. and Agarwal, P., 2008. Introduction: What is a self-organizing map? In: Self-organising maps: Applications in geographical information science. Chichester, UK: John Wiley & Sons, Ltd.
Skupin, A. and Fabrikant, S.I., 2003. Spatialization methods: A cartographic research agenda for non-geographic information visualization. Cartography and Geographic Information Science, 30, 95–119.
Strehl, A. and Ghosh, J., 2003. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3, 583–617.
Sui, D.Z., 2004. Tobler's first law of geography: A big idea for a small world? Annals of the Association of American Geographers, 94 (2), 269–277.
Tobler, W.R., 1970. A computer movie simulating urban growth in the Detroit region. Economic Geography, 46, 234–240.
Ultsch, A., 2005. Clustering with SOM: U*C. In: Proceedings of the Workshop on Self-Organizing Maps, Paris, France, 75–82.
Vesanto, J., 1999. SOM-based data visualization methods. Intelligent Data Analysis, 3, 111–126.