Exploration of Statistical and Textual Information by Means
of Self-Organizing Maps
Teuvo Kohonen
Helsinki University of Technology
Neural Networks Research Centre
P.O. Box 5400
FIN-02015 HUT, Finland
[email protected]
Abstract. The self-organizing map (SOM) converts statistical relationships between high-dimensional data into geometric relationships on a low-dimensional grid. It can thus be regarded as
a projection and a similarity graph of the primary data. As it preserves the most important
topological relationships of the data elements on the display, it may be thought of as producing
some form of abstraction. These two aspects, visualization and abstraction, can be utilized in data
mining, process analysis, machine perception, and organization of document collections.
1. Introduction
Natural information has properties that have not been taken into account in mathematical statistics.
The dimensionalities of such data tend to be immense, a priori statistics are not available, and the
data elements are usually nonlinearly and dynamically dependent. Therefore, in the early 1980s a
new line of computational approaches based on simple "formal neurons" was launched.
Among the neural-network algorithms, the Self-Organizing Map (SOM) (Kohonen, 1995) is
in a special position, because it is able to form ordered representations of large and often high-dimensional data sets. It converts complex, nonlinear statistical relationships between high-dimensional data elements into simple geometric relationships between points on a low-dimensional
display. The central idea in this algorithm is to use a large number of relatively simple and
structurally similar, but interacting statistical models. Each model then describes only a limited
domain of observations, but as the models are able to communicate, they can mutually decide what
and how large a domain belongs to which model. Thereby it becomes possible to span the whole
data space nonlinearly, minimizing the average overall modeling error.
The SOM usually consists of a two-dimensional regular grid of nodes. The SOM algorithms
compute the models so that they describe the domain of observations in the sense of a certain
minimal distortion. The models will also be organized in a two-dimensional order such that similar
models become closer to each other in the grid than the more dissimilar ones. The resulting SOM is
both a similarity graph, and a clustering diagram. Its computation is a nonparametric, recursive
regression process.
2. The Basic SOM Algorithms
Regression of a set of model vectors $m_i \in \Re^n$ into the space of observation vectors $x \in \Re^n$ is often made by the following sequential process, whereupon the resulting models will become ordered:

$$m_i(t+1) = m_i(t) + h_{c(x),i}\,[x(t) - m_i(t)] \,. \qquad (1)$$

Here $t$ is the sample index of the regression step, and the regression is performed recursively for each presentation of a sample of $x = x(t)$. Index $c = c(x)$ (the "winner") is defined by the condition

$$\forall i, \quad \|x(t) - m_c(t)\| \leq \|x(t) - m_i(t)\| \,. \qquad (2)$$
Here $h_{c(x),i}$ is called the neighborhood function; it acts like a smoothing kernel that is time-variable, and its location depends on condition (2). It is a decreasing function of the distance between the $i$th and $c$th models on the map grid. The norm is usually assumed to be Euclidean.
The neighborhood function is often taken to be the Gaussian

$$h_{c(x),i} = \alpha(t)\,\exp\!\left(-\,\frac{\|r_i - r_c\|^2}{2\sigma^2(t)}\right), \qquad (3)$$

where $0 < \alpha(t) < 1$ is the learning-rate factor, which decreases monotonically with the regression steps, $r_i \in \Re^2$ and $r_c \in \Re^2$ are the vectorial locations in the display grid, and $\sigma(t)$ corresponds to the width of the neighborhood function, which also decreases monotonically with the regression steps.
A simpler definition of $h_{c(x),i}$ is the "bubble" neighborhood: $h_{c(x),i} = \alpha(t)$ if $\|r_i - r_c\|$ is smaller than a given radius around node $c$ (whereupon this radius is also a monotonically decreasing function of the regression steps), and otherwise $h_{c(x),i} = 0$.
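To make the recursion concrete, here is a minimal Python sketch of Eqs. (1)-(3) with the Gaussian neighborhood. The random initialization and the linear decay schedules for $\alpha(t)$ and $\sigma(t)$ are illustrative assumptions, not part of the algorithm's definition:

```python
import numpy as np

def train_som(X, grid_h, grid_w, n_steps, alpha0=0.5, sigma0=3.0, seed=0):
    """Sequential SOM regression following Eqs. (1)-(3)."""
    rng = np.random.default_rng(seed)
    # Model vectors m_i, here initialized randomly (an assumed initialization).
    m = rng.standard_normal((grid_h * grid_w, X.shape[1]))
    # Vectorial locations r_i of the nodes on the 2-D display grid.
    r = np.array([(i, j) for i in range(grid_h) for j in range(grid_w)], dtype=float)
    for t in range(n_steps):
        x = X[rng.integers(len(X))]                    # draw a sample x(t)
        # Winner c(x): the model closest to x(t) in the Euclidean norm, Eq. (2).
        c = np.argmin(np.linalg.norm(x - m, axis=1))
        # Assumed monotonically decreasing schedules for alpha(t) and sigma(t).
        frac = 1.0 - t / n_steps
        alpha, sigma = alpha0 * frac, sigma0 * frac + 0.5
        # Gaussian neighborhood function h_{c(x),i}, Eq. (3).
        h = alpha * np.exp(-np.sum((r - r[c]) ** 2, axis=1) / (2.0 * sigma ** 2))
        # Regression step, Eq. (1).
        m += h[:, None] * (x - m)
    return m
```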
Some mathematicians may be more familiar with the so-called "principal curves" of Hastie and Stuetzle (1989) and see a relationship between them and the SOM. However, the SOM was introduced eight years earlier than the principal curves, and it can be computed much more conveniently and effectively. There are also other differences; for instance, the SOM can be generalized in many ways, which is not possible for the principal curves.
Another principal alternative to the SOM is the generative topographic mapping (GTM) (Bishop et al., 1996), in which the mapping directly tends to preserve the topological-metric relations on the output array. It has turned out, however, that numerous shortcut computations can be applied to make very large SOMs, while these methods are not applicable to the GTM.
Assuming that the model vectors converge to some ordered state, we may require that the expectation values of $m_i(t+1)$ and $m_i(t)$ for $t \to \infty$ must be equal. If $h_{c(x),i}$ represents the kernels used during the last phases of learning and $x$ ensues from an ergodic process, in the stationary state the values $m_i^*$ will satisfy the equilibrium equation

$$\forall i, \quad E_t\{h_{c(x),i}\,(x - m_i^*)\} = 0 \,. \qquad (4)$$
In the special case where we have a finite number (batch) of the $x(t)$ with respect to which (4) has to be solved for the $m_i^*$, replacing the expectation in (4) by the sum over the batch and solving for $m_i^*$ yields

$$m_i^* = \frac{\sum_t h_{c(x),i}\,x(t)}{\sum_t h_{c(x),i}} \,. \qquad (5)$$
This, however, is not yet an explicit solution for the $m_i^*$, because the subscript $c(x)$ on the right-hand side still depends on $x(t)$ and all the $m_i^*$. The way of writing (5), however, allows us to apply the contractive-mapping method known from the theory of nonlinear equations: even starting with coarse approximations for the $m_i^*$, (2) is first utilized to find the indices $c(x)$ for all the $x(t)$. On the basis of the approximate $h_{c(x),i}$ values, the improved approximations for the $m_i^*$ are computed from (5), which are then applied to (2), whereafter the computed $c(x)$ are substituted into (5), and so on. The optimal solutions $m_i^*$ are usually obtained in a few iteration cycles, after the discrete-valued indices $c(x)$ have settled down and are no longer changed in further iterations.
This procedure is called the Batch Map principle. An even simpler Batch Map is obtained if $h_{c(x),i}$ is defined in terms of the neighborhood set $N_c$. Further we need the concept of the Voronoi set: it means a domain $V_i$ in the $x$ space, or actually the set of those samples $x(t)$ that lie closest to $m_i^*$. Let $N_i$ be the set of nodes that lie up to a certain radius from node $i$ in the array. The union of the Voronoi sets $V_i$ corresponding to the nodes in $N_i$ shall be denoted by $U_i$. Then (5) can be written

$$m_i^* = \frac{\sum_{x(t) \in U_i} x(t)}{n(U_i)} \,, \qquad (6)$$

where $n(U_i)$ means the number of samples $x(t)$ that belong to $U_i$.

Notice again that the $U_i$ depend on the $m_i^*$, and therefore (6) must be solved iteratively. The procedure can be described as the following steps:
1. Initialize the values of the $m_i^*$ in some proper (possibly random) way.
2. Input the $x(t)$ and list each of them under the model $m_i^*$ that is closest to $x(t)$ according to (2).
3. Let $U_i$ denote the union of the above lists at model $m_i^*$ and its neighbors that constitute the neighborhood $N_i$. Compute the mean of the vectors $x(t)$ in each $U_i$, and replace the old values of the $m_i^*$ by the respective means.
4. Repeat from step 2 until the solution can be regarded as steady.
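A minimal Python sketch of this Batch Map iteration follows; the grid geometry and the fixed neighborhood radius are assumptions for illustration (in practice the radius would also shrink over the iterations):

```python
import numpy as np

def batch_map(X, m, r, radius, n_iters=10):
    """Batch Map iteration solving Eq. (6).

    X: (n_samples, n) data; m: (K, n) initial models (step 1);
    r: (K, 2) node locations on the grid; radius: radius defining N_i.
    """
    for _ in range(n_iters):
        # Step 2: winner index c(x) for every sample according to Eq. (2).
        c = np.argmin(np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2), axis=1)
        new_m = m.copy()
        for i in range(len(m)):
            # N_i: nodes up to the given radius from node i on the array.
            Ni = np.flatnonzero(np.linalg.norm(r - r[i], axis=1) <= radius)
            # U_i: union of the Voronoi sets (the above lists) of the nodes in N_i.
            in_Ui = np.isin(c, Ni)
            if in_Ui.any():
                # Step 3: replace m_i by the mean of the x(t) in U_i, Eq. (6).
                new_m[i] = X[in_Ui].mean(axis=0)
        m = new_m  # step 4: repeat until the c(x) no longer change
    return m
```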
3. A Brief Overview of SOM Applications
The four most promising application areas of the SOM are:
• exploratory data analysis and knowledge discovery in databases (KDD),
• analysis and control of industrial processes and machines,
• various tasks in telecommunications,
• biomedical analyses and applications.
One may also report numerous tasks in finance, ranging from the analysis and prediction of
time series to the classification and evaluation of macroeconomic systems. We have cooperated
with the World Bank, analyzing their socioeconomic data in many ways (Deboeck and Kohonen,
1998). Analyses of financial performance (Back et al., 1997) and bankruptcies (Kiviluoto, 1998) of
companies are being made using the SOM method. The reform of the Finnish forest taxation in
1992, i.e., an option given to the owners to choose between two taxation policies, was based on a
cluster analysis made by the SOM method. It is not possible to survey the whole range of
applications of the SOM method in more detail in this paper. Let it suffice to refer to the list of 3343
research papers on the SOM (Kaski et al., 1998) that is also available at the Internet address
http://www.icsi.berkeley.edu/~jagota/NCS/vol1.html.
The basic SOM carries out a clustering in the Euclidean vector space. We shall point out in
the following that it is also possible to perform the clustering of free-text natural-language
documents, if their contents are described statistically by the usage of different words in them.
The word histograms are usually very sparsely occupied: in one document one may use only,
say, a few dozen to a couple of hundred different words, depending on its length. Therefore a simple
but still effective method to reduce the dimensionality of the representation vectors, without
essentially decreasing their discriminatory power, is to project them randomly onto a much lower-dimensional Euclidean space.
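As a sketch of this random-projection step (the output dimensionality of 300, the dense histogram matrix, and the Gaussian projection matrix are assumptions for illustration; real histograms would be stored sparsely):

```python
import numpy as np

def random_project(histograms, d_out=300, seed=0):
    """Randomly project word histograms onto a d_out-dimensional space.

    histograms: (n_docs, vocab_size) word-count matrix.
    Returns an (n_docs, d_out) matrix of representation vectors x.
    """
    rng = np.random.default_rng(seed)
    # A random Gaussian matrix approximately preserves pairwise similarities
    # with high probability, so the discriminatory power of the histograms
    # is essentially retained.
    R = rng.standard_normal((histograms.shape[1], d_out)) / np.sqrt(d_out)
    return histograms @ R
```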
The document-clustering SOM called the WEBSOM produces the visual display of the document collection in the following steps:
1. Some preprocessing of the texts is first carried out, removing nontextual symbols and very rare words. Optionally, a stemmer is used to transform all word forms into their most probable stem words.
2. The word histogram of each document is projected randomly onto a space of dimensionality 300 to 500, thereby obtaining a representation vector $x$ for each document.
3. A SOM is formed using the $x$ as input data.
4. The models $m_i$ formed at the nodes of the SOM are labeled by all those documents that are mapped onto the said node. In practice, the nodes are linked to the proper document database by address pointers (indexing).
5. Standard browsing software tools are used to read the documents mapped to the SOM nodes.
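Reusing the sketches above, a hypothetical end-to-end run of steps 2-4 might read as follows (the random stand-in for the preprocessed histograms and all size parameters are placeholders, not the settings used in the experiments):

```python
import numpy as np

# Stand-in for step 1: a small random word-count matrix; a real run would
# use the histograms of the preprocessed document collection.
histograms = np.random.default_rng(0).poisson(0.05, (200, 5000)).astype(float)

x = random_project(histograms, d_out=300)                   # step 2
models = train_som(x, grid_h=10, grid_w=10, n_steps=5000)   # step 3
# Step 4: label each node with the documents mapped onto it (address pointers).
winners = np.argmin(np.linalg.norm(x[:, None, :] - models[None, :, :], axis=2), axis=1)
node_to_docs = {i: np.flatnonzero(winners == i).tolist() for i in range(len(models))}
```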
We have already implemented WEBSOM systems for the following applications:
− Internet Usenet newsgroups; the largest experiment comprised 85 newsgroups with a total of over 1 million documents. The size of the SOM in that case was 104 448 nodes.
− News bulletins (in Finnish) of the Finnish News Agency (the Finnish counterpart of Reuters).
− Patent abstracts (in English) that are available in electronic form. The largest demonstration, which was being finished during the writing of this report, consists of seven million patent abstracts from the U.S., Japanese, and European patent offices; its SOM array consists of 1 002 240 nodes.
Demonstrations of various WEBSOM displays are available on the Internet at the address
http://websom.hut.fi/websom/, where they can be browsed with standard www browsers.
REFERENCES
Back, B., Sere, K., and Vanharanta, H. (1997). Analyzing financial performance with self-organizing maps. Proc. WSOM'97, Workshop on Self-Organizing Maps, Espoo, Finland, pp. 356-361.
Bishop, C., Svensen, M., and Williams, C. (1996). GTM: a principled alternative to the self-organizing map. In Artificial Neural Networks - ICANN 96, 1996 Int. Conf. Proc., C. v.d. Malsburg, W. von Seelen, J. Vorbruggen, and B. Sendhoff, Eds. Springer-Verlag, Berlin, pp. 165-170.
Deboeck, G. and Kohonen, T., Eds. (1998). Visual Exploration in Finance with Self-Organizing
Maps. Springer-Verlag, London.
Hastie, T. and Stuetzle, W. (1989). Principal curves. J. Am. Stat. Assoc., 84, 502-516.
Kaski, S., Kangas, J., and Kohonen, T. (1998). Bibliography of self-organizing map (SOM) papers:
1981-1997. Neural Computing Surveys, 1(3&4), 1-176. http://www.icsi.berkeley.edu/~jagota/NCS/vol1.html
Kiviluoto, K. (1998). Predicting bankruptcies with the self-organizing map. Neurocomputing, 21,
191-201.
Kohonen, T. (1995). Self-Organizing Maps. Series in Information Sciences, Vol. 30. Springer,
Heidelberg. 2nd ed. 1997.
RÉSUMÉ
The self-organizing map converts the statistical relationships between data that coexist in a high-dimensional space into geometric relationships on a low-dimensional space (grid). It can be used in data mining, robotic perception, artificial intelligence, and the organization of collections of complex documents.