Exploration of Statistical and Textual Information by Means
of Self-Organizing Maps
Teuvo Kohonen
Helsinki University of Technology
Neural Networks Research Centre
P.O. Box 5400
FIN-02015 HUT, Finland
[email protected]
Abstract. The self-organizing map (SOM) converts statistical relationships between high-dimensional data into geometric relationships on a low-dimensional grid. It can thus be regarded as
a projection and a similarity graph of the primary data. As it preserves the most important
topological relationships of the data elements on the display, it may be thought of as producing
some form of abstraction. These two aspects, visualization and abstraction, can be utilized in data
mining, process analysis, machine perception, and organization of document collections.
1. Introduction
Natural information has properties that have not been taken into account in mathematical statistics.
The dimensionalities of such data tend to be immense, a priori statistics are not available, and the
data elements are usually nonlinearly and dynamically dependent. Therefore, in the early 1980s a
new line of computational approaches based on simple "formal neurons" was launched.
Among the neural-network algorithms, the Self-Organizing Map (SOM) (Kohonen, 1995) is
in a special position, because it is able to form ordered representations of large and often high-dimensional data sets. It converts complex, nonlinear statistical relationships between high-dimensional data elements into simple geometric relationships between points on a low-dimensional
display. The central idea in this algorithm is to use a large number of relatively simple and
structurally similar, but interacting statistical models. Each model then describes only a limited
domain of observations, but as the models are able to communicate, they can mutually decide what
and how large a domain belongs to which model. Thereby it becomes possible to span the whole
data space nonlinearly, minimizing the average overall modeling error.
The SOM usually consists of a two-dimensional regular grid of nodes. The SOM algorithms
compute the models so that they describe the domain of observations in the sense of a certain
minimal distortion. The models will also be organized in a two-dimensional order such that similar
models become closer to each other in the grid than the more dissimilar ones. The resulting SOM is
both a similarity graph, and a clustering diagram. Its computation is a nonparametric, recursive
regression process.
2. The Basic SOM Algorithms
Regression of a set of model vectors $m_i \in \Re^n$ into the space of observation vectors $x \in \Re^n$ is often made by the following sequential process, whereupon the resulting models will become ordered:

$$m_i(t+1) = m_i(t) + h_{c(x),i}\,[x(t) - m_i(t)] \,. \qquad (1)$$

Here $t$ is the sample index of the regression step, and the regression is performed recursively for each presentation of a sample of $x = x(t)$. Index $c = c(x)$ (the "winner") is defined by the condition

$$\forall i, \quad \|x(t) - m_c(t)\| \leq \|x(t) - m_i(t)\| \,. \qquad (2)$$
Here $h_{c(x),i}$ is called the neighborhood function; it acts like a smoothing kernel that is time-variable, and its location depends on condition (2). It is a decreasing function of the distance between the $i$th and $c$th models on the map grid. The norm is usually assumed to be Euclidean.
The neighborhood function is often taken to be the Gaussian

$$h_{c(x),i} = \alpha(t)\,\exp\!\left(-\,\frac{\|r_i - r_c\|^2}{2\sigma^2(t)}\right), \qquad (3)$$

where $0 < \alpha(t) < 1$ is the learning-rate factor, which decreases monotonically with the regression steps, $r_i \in \Re^2$ and $r_c \in \Re^2$ are the vectorial locations in the display grid, and $\sigma(t)$ corresponds to the width of the neighborhood function, which also decreases monotonically with the regression steps.
A simpler definition of $h_{c(x),i}$ is the "bubble" neighborhood: $h_{c(x),i} = \alpha(t)$ if $\|r_i - r_c\|$ is smaller than a given radius around node $c$ (whereupon this radius is also a monotonically decreasing function of the regression steps), and otherwise $h_{c(x),i} = 0$.
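To make the recursion concrete, here is a minimal Python sketch of Eqs. (1)-(3) with the Gaussian neighborhood. The random initialization and the linear decay schedules for $\alpha(t)$ and $\sigma(t)$ are illustrative assumptions, not part of the algorithm's definition:

```python
import numpy as np

def train_som(X, grid_h, grid_w, n_steps, alpha0=0.5, sigma0=3.0, seed=0):
    """Sequential SOM regression following Eqs. (1)-(3)."""
    rng = np.random.default_rng(seed)
    # Model vectors m_i, here initialized randomly (an assumed initialization).
    m = rng.standard_normal((grid_h * grid_w, X.shape[1]))
    # Vectorial locations r_i of the nodes on the 2-D display grid.
    r = np.array([(i, j) for i in range(grid_h) for j in range(grid_w)], dtype=float)
    for t in range(n_steps):
        x = X[rng.integers(len(X))]                    # draw a sample x(t)
        # Winner c(x): the model closest to x(t) in the Euclidean norm, Eq. (2).
        c = np.argmin(np.linalg.norm(x - m, axis=1))
        # Assumed monotonically decreasing schedules for alpha(t) and sigma(t).
        frac = 1.0 - t / n_steps
        alpha, sigma = alpha0 * frac, sigma0 * frac + 0.5
        # Gaussian neighborhood function h_{c(x),i}, Eq. (3).
        h = alpha * np.exp(-np.sum((r - r[c]) ** 2, axis=1) / (2.0 * sigma ** 2))
        # Regression step, Eq. (1).
        m += h[:, None] * (x - m)
    return m
```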
Some mathematicians may be more familiar with the so-called "principal curves" of Hastie and Stuetzle (1989) and see a relationship between them and the SOM. However, the SOM was introduced eight years earlier than the principal curves, and it can be computed much more conveniently and effectively. There are also other differences; for instance, the SOM can be generalized in many ways, which is not possible for the principal curves.
Another principal alternative to the SOM is the generative topographic mapping (GTM) (Bishop et al., 1996), in which the mapping directly tends to preserve the topological-metric relations on the output array. It has turned out, however, that numerous shortcut computations can be applied to make very large SOMs, while these methods are not applicable to the GTM.
Assuming that the model vectors converge to some ordered state, we may require that the expectation values of $m_i(t+1)$ and $m_i(t)$ for $t \to \infty$ must be equal. If $h_{c(x),i}$ represents the kernels used during the last phases of learning and $x$ ensues from an ergodic process, in the stationary state the values $m_i^*$ will satisfy the equilibrium equation

$$\forall i, \quad E_t\{h_{c(x),i}\,(x - m_i^*)\} = 0 \,. \qquad (4)$$
In the special case where we have a finite number (batch) of the $x(t)$ with respect to which (4) has to be solved for the $m_i^*$, replacing the expectation in (4) by the sum over the batch and solving for $m_i^*$ yields

$$m_i^* = \frac{\sum_t h_{c(x),i}\,x(t)}{\sum_t h_{c(x),i}} \,. \qquad (5)$$
This, however, is not yet an explicit solution for the $m_i^*$, because the subscript $c(x)$ on the right-hand side still depends on $x(t)$ and all the $m_i^*$. The way of writing (5), however, allows us to apply the contractive-mapping method known from the theory of nonlinear equations: even starting with coarse approximations for the $m_i^*$, (2) is first utilized to find the indices $c(x)$ for all the $x(t)$. On the basis of the approximate $h_{c(x),i}$ values, the improved approximations for the $m_i^*$ are computed from (5), which are then applied to (2), whereafter the computed $c(x)$ are substituted into (5), and so on. The optimal solutions $m_i^*$ are usually obtained in a few iteration cycles, after the discrete-valued indices $c(x)$ have settled down and are no longer changed in further iterations.
This procedure is called the Batch Map principle. An even simpler Batch Map is obtained if $h_{c(x),i}$ is defined in terms of the neighborhood set $N_c$. Further we need the concept of the Voronoi set: it means a domain $V_i$ in the $x$ space, or actually the set of those samples $x(t)$ that lie closest to $m_i^*$. Let $N_i$ be the set of nodes that lie up to a certain radius from node $i$ in the array. The union of the Voronoi sets $V_i$ corresponding to the nodes in $N_i$ shall be denoted by $U_i$. Then (5) can be written

$$m_i^* = \frac{\sum_{x(t) \in U_i} x(t)}{n(U_i)} \,, \qquad (6)$$

where $n(U_i)$ means the number of samples $x(t)$ that belong to $U_i$.

Notice again that the $U_i$ depend on the $m_i^*$, and therefore (6) must be solved iteratively. The procedure can be described as the following steps:
1. Initialize the values of the $m_i^*$ in some proper (possibly random) way.
2. Input the $x(t)$ and list each of them under the model $m_i^*$ that is closest to $x(t)$ according to (2).
3. Let $U_i$ denote the union of the above lists at model $m_i^*$ and its neighbors that constitute the neighborhood $N_i$. Compute the mean of the vectors $x(t)$ in each $U_i$, and replace the old values of the $m_i^*$ by the respective means.
4. Repeat from step 2 until the solution can be regarded as steady.
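A minimal Python sketch of this Batch Map iteration follows; the grid geometry and the fixed neighborhood radius are assumptions for illustration (in practice the radius would also shrink over the iterations):

```python
import numpy as np

def batch_map(X, m, r, radius, n_iters=10):
    """Batch Map iteration solving Eq. (6).

    X: (n_samples, n) data; m: (K, n) initial models (step 1);
    r: (K, 2) node locations on the grid; radius: radius defining N_i.
    """
    for _ in range(n_iters):
        # Step 2: winner index c(x) for every sample according to Eq. (2).
        c = np.argmin(np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2), axis=1)
        new_m = m.copy()
        for i in range(len(m)):
            # N_i: nodes up to the given radius from node i on the array.
            Ni = np.flatnonzero(np.linalg.norm(r - r[i], axis=1) <= radius)
            # U_i: union of the Voronoi sets (the above lists) of the nodes in N_i.
            in_Ui = np.isin(c, Ni)
            if in_Ui.any():
                # Step 3: replace m_i by the mean of the x(t) in U_i, Eq. (6).
                new_m[i] = X[in_Ui].mean(axis=0)
        m = new_m  # step 4: repeat until the c(x) no longer change
    return m
```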
3. A Brief Overview of SOM Applications
The four most promising application areas of the SOM are:
• exploratory data analysis and knowledge discovery in databases (KDD),
• analysis and control of industrial processes and machines,
• various tasks in telecommunications,
• biomedical analyses and applications.
One may also report numerous tasks in finance, ranging from the analysis and prediction of
time series to the classification and evaluation of macroeconomic systems. We have cooperated
with the World Bank, analyzing their socioeconomic data in many ways (Deboeck and Kohonen,
1998). Analyses of financial performance (Back et al., 1997) and bankruptcies (Kiviluoto, 1998) of
companies are being made using the SOM method. The reform of the Finnish forest taxation in
1992, i.e., an option given to the owners to choose between two taxation policies, was based on a
cluster analysis made by the SOM method. It is not possible to survey the whole range of
applications of the SOM method in more detail in this paper. Let it suffice to refer to the list of 3343
research papers on the SOM (Kaski et al., 1998) that is also available at the Internet address
http://www.icsi.berkeley.edu/~jagota/NCS/vol1.html.
The basic SOM carries out a clustering in the Euclidean vector space. We shall point out in
the following that it is also possible to perform the clustering of free-text natural-language
documents, if their contents are described statistically by the usage of different words in them.
The word histograms are usually very sparsely occupied: in one document one may use only,
say, a few dozen to a couple of hundred different words, depending on its length. Therefore a simple
but still effective method to reduce the dimensionality of the representation vectors, without
essentially decreasing their discriminatory power, is to project them randomly onto a much lower-dimensional Euclidean space.
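As a sketch of this random-projection step (the output dimensionality of 300, the dense histogram matrix, and the Gaussian projection matrix are assumptions for illustration; real histograms would be stored sparsely):

```python
import numpy as np

def random_project(histograms, d_out=300, seed=0):
    """Randomly project word histograms onto a d_out-dimensional space.

    histograms: (n_docs, vocab_size) word-count matrix.
    Returns an (n_docs, d_out) matrix of representation vectors x.
    """
    rng = np.random.default_rng(seed)
    # A random Gaussian matrix approximately preserves pairwise similarities
    # with high probability, so the discriminatory power of the histograms
    # is essentially retained.
    R = rng.standard_normal((histograms.shape[1], d_out)) / np.sqrt(d_out)
    return histograms @ R
```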
The document-clustering SOM called the WEBSOM produces the visual display of the document collection in the following steps:
1. Some preprocessing of the texts is first carried out, removing nontextual symbols and very rare words. Optionally, a stemmer is used to transform all word forms into their most probable stem words.
2. The word histogram of each document is projected randomly onto a space of dimensionality 300 to 500, thereby obtaining a representation vector $x$ for each document.
3. A SOM is formed using the $x$ as input data.
4. The models $m_i$ formed at the nodes of the SOM are labeled by all those documents that are mapped onto the said node. In practice, the nodes are linked to the proper document database by address pointers (indexing).
5. Standard browsing software tools are used to read the documents mapped to the SOM nodes.
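Reusing the sketches above, a hypothetical end-to-end run of steps 2-4 might read as follows (the random stand-in for the preprocessed histograms and all size parameters are placeholders, not the settings used in the experiments):

```python
import numpy as np

# Stand-in for step 1: a small random word-count matrix; a real run would
# use the histograms of the preprocessed document collection.
histograms = np.random.default_rng(0).poisson(0.05, (200, 5000)).astype(float)

x = random_project(histograms, d_out=300)                   # step 2
models = train_som(x, grid_h=10, grid_w=10, n_steps=5000)   # step 3
# Step 4: label each node with the documents mapped onto it (address pointers).
winners = np.argmin(np.linalg.norm(x[:, None, :] - models[None, :, :], axis=2), axis=1)
node_to_docs = {i: np.flatnonzero(winners == i).tolist() for i in range(len(models))}
```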
We have already implemented WEBSOM systems for the following applications:
− Internet Usenet newsgroups; the largest experiment comprised 85 newsgroups with a total of over 1 million documents. The size of the SOM in that case was 104 448 nodes.
− News bulletins (in Finnish) of the Finnish News Agency (the Finnish counterpart of Reuters).
− Patent abstracts (in English) that are available in electronic form. The largest demonstration, which was being finished during the writing of this report, consists of seven million patent abstracts from the U.S., Japanese, and European patent offices; its SOM array consists of 1 002 240 nodes.
Demonstrations of various WEBSOM displays are available on the Internet at the address
http://websom.hut.fi/websom/, where they can be browsed with standard www browsers.
REFERENCES
Back, B., Sere, K., and Vanharanta, H. (1997). Analyzing financial performance with self-organizing maps. Proc. WSOM'97, Workshop on Self-Organizing Maps, Espoo, Finland, pp. 356-361.
Bishop, C., Svensen, M., and Williams, C. (1996). GTM: a principled alternative to the self-organizing map. In Artificial Neural Networks - ICANN 96, 1996 Int. Conf. Proc., C. v.d. Malsburg, W. von Seelen, J. Vorbruggen, and B. Sendhoff, Eds. Springer-Verlag, Berlin, pp. 165-170.
Deboeck, G. and Kohonen, T., Eds. (1998). Visual Exploration in Finance with Self-Organizing
Maps. Springer-Verlag, London.
Hastie, T. and Stuetzle, W. (1989). Principal curves. J. Am. Stat. Assoc., 84, 502-516.
Kaski, S., Kangas, J., and Kohonen, T. (1998). Bibliography of self-organizing map (SOM) papers:
1981-1997. Neural Computing Surveys, 1(3&4), 1-176. http://www.icsi.berkeley.edu/~jagota/NCS/vol1.html
Kiviluoto, K. (1998). Predicting bankruptcies with the self-organizing map. Neurocomputing, 21,
191-201.
Kohonen, T. (1995). Self-Organizing Maps. Series in Information Sciences, Vol. 30. Springer,
Heidelberg. 2nd ed. 1997.
RÉSUMÉ
The self-organizing map converts the statistical relationships between data that coexist in a high-dimensional space into geometric relationships on a low-dimensional space (grid). It can be used in data mining, robotic perception, artificial intelligence, and the organization of collections of complex documents.