Interactive Visualization and Navigation in
Large Data Collections using the Hyperbolic Space
Jörg Walter · Jörg Ontrup · Daniel Wessling · Helge Ritter
Neuroinformatics Group · Department of Computer Science
University of Bielefeld · D-33615 Bielefeld · Germany
E-mail: [email protected]
Abstract
We propose the combination of two recently introduced methods for the interactive visual data mining of large collections of data. Both Hyperbolic Multi-Dimensional Scaling (HMDS) and Hyperbolic Self-Organizing Maps (HSOM) employ the extraordinary advantages of the hyperbolic plane IH 2: (i) the underlying space grows exponentially with the radius around each point – ideal for embedding high-dimensional (or hierarchical) data; (ii) the Poincaré model of the IH 2 exhibits a fish-eye perspective with a focus area and a context-preserving surrounding; (iii) the mouse binding of focus transfer allows intuitive interactive navigation.
The HMDS approach extends multi-dimensional scaling and generates a spatial embedding of the data representing their dissimilarity structure as faithfully as possible. It is very suitable for interactive browsing of data object collections, but calls for batch precomputation for larger collection sizes.
The HSOM is an extension of Kohonen's Self-Organizing Map and generates a partitioning of the data collection assigned to an IH 2 tessellating grid. While the algorithm's complexity is linear in the collection size, the data browsing is rigidly bound to the underlying grid.
By integrating the two approaches we gain a synergetic effect, combining the advantages of both, and the hybrid architecture consistently uses the IH 2 visualization and navigation concept. We present a successful application to a text mining example involving the Reuters-21578 text corpus.
Figure 1 (schematic): Data → Layout Technique → Layout Space → Projection Technique → Display Area.
Figure 1. Displaying larger collections of data with a limited display area requires careful usage of space. The allocation of spatial representation for providing overview and detail is a challenge. After choosing the level of detail, the data layout operates on the canvas ("layout space") before it is suitably projected onto the display area (e.g. after panning and zooming).
Visualizing large data collections has to provide means to use a limited display space effectively and to give the user the overview as well as the details. Since most available data display devices are two-dimensional – paper and screens – the following problem must be solved: finding a meaningful spatial mapping of data onto the display area. One limiting factor is the "restricted neighborhood" around a point in a Euclidean 2D surface. Hyperbolic spaces open an interesting loophole. The extraordinary property of exponential growth of neighborhood with increasing radius around all points enables us to build novel displays.
The "hyperbolic tree viewer", developed at Xerox PARC [6], demonstrated remarkably elegant interactive capabilities. The hyperbolic model appears as a continuously graded, focus+context mapping to the display. See [9, 10] for comparative studies with traditional display types.
Unfortunately, previous usage of direct hyperbolic visualization was limited to hierarchical, tree-like, or "quasi-graph" data. Two remedies were recently introduced, suggesting more general IH 2 layout techniques: one generalizes Kohonen's SOM algorithm to the Hyperbolic Self-Organizing Map algorithm (HSOM) [11]; the other intro-
1. Introduction
The demand for techniques handling large collections of data is rapidly growing. While the power of information systems increases, the amount of information a human user can directly digest does not. The challenge is to provide good clues for the right questions, which is the key to discoveries. The human expert possesses not only valuable background knowledge, intuition and creativity – he is also vested with powerful pattern recognition and processing capabilities, especially for the visual information channel. The design goals for an optimal user–data interaction strongly depend on the given exploration task, but they certainly include an easy and intuitive navigation with strong support for the user's orientation.
duces Hyperbolic Multi-Dimensional Scaling (HMDS) [16]
for a direct construction of a distance preserving embedding
of high-dimensional data into the hyperbolic space.
In Sec. 2 and 3 we review the hyperbolic space and the
three mentioned layout techniques for visualizing data in
IH 2 . Sec. 4 discusses a synthesis for a two step visualization
architecture. Even though the look and feel of an interactive
visualization and navigation cannot be really conveyed in a
static format, we report in Sec. 5 several results and snapshots of an application to text mining. This approach allows
to visualize, navigate and search in a space of documents
appearing in the Reuters news stream.
2. The Hyperbolic Space IH 2
Figure 2. There is literally more room in hyperbolic space than in Euclidean space, as shown in this illustrated embedding of the hyperbolic plane into 3D Euclidean space (courtesy of Jeffrey Weeks). (Left:) The sum of angles in a triangle is smaller than 180°. (Right:) Exponential growth (Eq. 1) of the circumference c(ρ) and area a(ρ) is experienced if a "circle" with radius ρ is drawn in the wrinkling structure.
Historically, the hyperbolic space is a "recent" discovery of the 19th century. Lobachevsky, Bolyai, and Gauss independently discovered the non-Euclidean geometries by negating the "parallel axiom" which Euclid had formulated 2300 years ago. Today we know three geometries with uniform curvature. Our daily experience is governed by the flat or Euclidean geometry with zero curvature. Still familiar is the spherical geometry with positive curvature – describing the surface of a sphere, like the earth or an orange.
Its counterpart with constant negative curvature is known as
the hyperbolic plane IH 2 (with analogous generalizations to
higher dimensions) [2, 14]. Unfortunately, there is no “perfect” embedding of the IH 2 in IR3 , which makes it harder to
grasp the unusual properties of the IH 2 . Local patches resemble the situation at a saddle point, where the neighborhood grows faster than in flat space (see Fig. 2). Standard
textbooks on Riemannian geometry (see, e.g. [7]) show that
the circumference c and area a of a circle of radius ρ in IH 2
are given by
area: a(ρ) = 4π sinh²(ρ/2),  (1)
circumference: c(ρ) = 2π sinh(ρ).  (2)
This bears two remarkable asymptotic properties: (i) for small radius ρ the space "looks flat", since a(ρ) ≈ πρ² and c(ρ) ≈ 2πρ; (ii) for larger ρ both grow exponentially with
the radius. As observed in [6, 5], this trait makes the hyperbolic space ideal for embedding hierarchical structures.
Fig. 2 illustrates the spatial relations by embedding a small
patch of the IH 2 in IR3 .
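To make the two growth regimes of Eqs. 1–2 concrete, a small numeric check can be helpful. This is a minimal sketch in plain Python; the function names are our own:

```python
import math

def hyp_area(rho):
    """Area of a hyperbolic circle of radius rho (Eq. 1)."""
    return 4.0 * math.pi * math.sinh(rho / 2.0) ** 2

def hyp_circumference(rho):
    """Circumference of a hyperbolic circle of radius rho (Eq. 2)."""
    return 2.0 * math.pi * math.sinh(rho)

# For small radii the space "looks flat": a(rho) ~ pi*rho^2, c(rho) ~ 2*pi*rho.
print(hyp_area(0.01) / (math.pi * 0.01 ** 2))          # close to 1
print(hyp_circumference(0.01) / (2 * math.pi * 0.01))  # close to 1

# For larger radii both grow exponentially, roughly like pi * exp(rho).
print(hyp_area(10.0) / (math.pi * math.exp(10.0)))     # close to 1
```

The last ratio illustrates why the space around every point is "ideal for embedding hierarchical structures": available area keeps pace with the exponential growth of a tree's node count.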
To use the visualization potential of the IH 2 we must
solve the two problems displayed before in Fig. 1. Now we
turn to the projection problem, which was solved for the
IH 2 more than a century ago.
2.1. The Projection Solution for IH 2
The perfect projection into the flat display area should preserve length, area, and angles (≈ form). But it lies in the nature of a curved space to resist the attempt to achieve these goals simultaneously. Consequently several projections
or maps of the hyperbolic space were developed; four are especially well examined: (i) the Minkowski, (ii) the upper-half plane, (iii) the Klein-Beltrami, and (iv) the Poincaré or disk mapping. See [2] for more details and the geometric mappings to transform between (i)–(iv).
The Poincaré projection is the most suitable for our purpose. Its main characteristics are:
Display compatibility: The infinitely large area of the IH 2 is mapped entirely into a circle, the Poincaré disk PD. This representation of infinity fascinated Maurits Escher and inspired several of his woodcuts [15].
Circle rim "= ∞": All remote points are close to the rim, without touching it.
Focus+Context: The focus can be moved to each location in IH 2, like a "fovea". The zoom factor is 0.5 in the center and falls off (exponentially) with distance to the fovea. Therefore, the context appears very natural: the more remote things are, the less spatial representation they are assigned in the current display.
Lines become circles: All IH 2-lines¹ appear as circle arc segments or centered straight lines in PD (both belong to the set of so-called "generalized circles"). Their extensions always cross the PD rim perpendicularly on both ends.
Conformal mapping: Angle (and therefore form) relations are preserved in PD; area and length relations obviously are not.
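The focus+context behavior can be made quantitative. In the metric convention behind the paper's distance formulas (Eqs. 7 and 11, where d(0, z) = 2 arctanh|z|), the local display magnification works out to (1 − |z|²)/2. A minimal sketch with a function name of our own choosing:

```python
import math

def zoom_factor(z: complex) -> float:
    """Local display magnification of the Poincare disk: ratio of
    Euclidean display length to hyperbolic length, (1 - |z|^2) / 2.
    Consistent with d(0, z) = 2*arctanh(|z|), cf. Eq. 7/11."""
    return (1.0 - abs(z) ** 2) / 2.0

print(zoom_factor(0j))         # 0.5 at the focus, as stated in the text
print(zoom_factor(0.99 + 0j))  # tiny near the "infinity" rim

# Exponential fall-off: at hyperbolic distance rho from the focus, a point
# sits at |z| = tanh(rho/2), so the factor decays roughly like 2*exp(-rho).
rho = 8.0
z = math.tanh(rho / 2.0)
print(zoom_factor(z) * math.exp(rho))  # approaches 2 for large rho
```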
Regular tessellations with triangles offer richer possibilities than in the IR². It turns out that there is an infinite set of choices to tessellate the IH 2: for any integer n ≥ 7, one can construct a regular tessellation in which n triangles meet at each vertex (in contrast to the plane, which allows only n = 3, 4, 6, and the sphere, which allows only n = 3, 4, 5). Fig. 3 depicts examples for n = 7 and n = 10.
¹ A line is by definition the shortest path between two points.
One way to compute these tessellations algorithmically
is by repeated application of a suitable set of generators of
their symmetry group to a suitably sized “starting triangle”
(see also Eq. 3 and [11]).
2.2. Moving the Focus
For changing the focus point in PD we need a translation operation which can be bound to mouse click and drag
events. In the Poincaré disk model the Möbius transformation T (z) is the appropriate solution. By describing the
Poincaré disk PD as the unit circle in the complex plane,
the isometric transformations for a point z ∈ PD can be
written
z′ = T(z; c, θ) = (θz + c) / (c̄θz + 1),  |θ| = 1, |c| < 1.  (3)
Here the complex number θ describes a pure rotation of PD
around the origin 0. The following translation by c maps
the origin to c and −c becomes the new center 0 (if θ = 1).
The Möbius transformations are also called the "circle automorphisms" of the complex plane, since they map circles to (generalized) circles. Here they serve to translate IH 2 straight lines to lines – both appearing as generalized circles in the PD projection. For further details, see [5, 15].
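Eq. 3 transcribes directly into code. The sketch below (helper names are ours; the distance follows Eq. 7) illustrates both the focus transfer and the isometry property of the transformation:

```python
import math

def moebius(z: complex, c: complex, theta: complex = 1 + 0j) -> complex:
    """Moebius transformation T(z; c, theta) of Eq. 3 on the unit disk:
    a rotation by theta (|theta| = 1) followed by a translation that
    maps the origin to c, so that -c becomes the new center."""
    assert abs(abs(theta) - 1.0) < 1e-12 and abs(c) < 1.0
    return (theta * z + c) / (c.conjugate() * theta * z + 1)

def poincare_dist(z1: complex, z2: complex) -> float:
    """Hyperbolic distance in the Poincare disk (cf. Eq. 7)."""
    return 2.0 * math.atanh(abs(z1 - z2) / abs(1 - z1 * z2.conjugate()))

c = 0.3 + 0.4j
print(moebius(0j, c))   # the origin is mapped to c
print(moebius(-c, c))   # -c becomes the new center 0

# Isometry check: the Poincare distance between two points is preserved.
z1, z2 = 0.1 + 0.2j, -0.5 + 0.1j
print(abs(poincare_dist(z1, z2)
          - poincare_dist(moebius(z1, c), moebius(z2, c))) < 1e-9)  # True
```

Binding `c` to the mouse-drag vector yields exactly the intuitive focus transfer described above.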
Figure 3. Regular IH 2 tessellation with congruent triangles. (Left:) Here, n = 7 triangles meet at each vertex. Due to the angular deficit of the angle sum in IH 2-triangles, the minimal number to complete a circle is 7. (Right:) For n = 10 the triangle side length increases to perfectly fill the plane. Note that all IH 2-lines appear as circle arcs, which extend perpendicular to the "∞-rim".

3. Layout Techniques in IH 2

In the following section we discuss three layout techniques for the IH 2.

3.1. Hyperbolic Tree Layout (HTL) for Tree-Like Graph Data

Now we turn to the question raised earlier: how to accommodate data in the hyperbolic space. A solution to this question for the case of acyclic, tree-like graph data was provided by Lamping and Rao [6, 5]. Mainly by successive applications of the transformation Eq. 3, they developed (and patented) a method to find a suitable layout for this data type in IH 2. Each tree node receives a certain open "pie segment", in which the node chooses the locations of its siblings. For all its siblings i it recursively calls the layout routine after applying the Möbius transformation Eq. 3 in order to center i.

Tamara Munzner developed another graph layout algorithm for the three-dimensional hyperbolic space [8]. While she gains much more space for the layout, the problems of more complex navigation (and viewport control) in 3D and, more seriously, of occlusion appear.

The next two layout techniques are freed from the requirement of hierarchical data.

3.2. Hyperbolic Self-Organizing Map (HSOM)

Figure 4. The "Self-Organizing Map" ("SOM") is formed by a grid of processing units, called formal neurons. Here the usual case, a two-dimensional grid, is illustrated on the right side. Each neuron has a reference, or prototype, vector wa attached, which is a point in the embedding input space X. A presented input x will select the neuron with the wa closest to it. The HSOM uses a hyperbolic grid as displayed in Fig. 3 and the appropriate neighborhood function h(·).
The standard Self-Organizing Map (SOM) algorithm is used in many applications for learning and visualization (Kohonen, e.g. [3]). Fig. 4 illustrates the basic operation. The feature map is built by a lattice of nodes (or formal neurons) a ∈ A, each with a reference vector or "prototype vector" wa attached, projecting into the input space X. The
response of a SOM to an input vector x is determined by
the reference vector wa∗ of the discrete “best-match” node
a∗ , i.e. the node which has its prototype vector wa closest
to the given input
a∗ = argmin_{a∈A} ‖wa − x‖.  (4)
The distribution of the reference vectors wa is iteratively adapted by a sequence of training vectors x. After finding the best-match neuron a∗, all reference vectors are updated, wa(new) := wa(old) + ∆wa, by the adaptation rule

∆wa = h(a, a∗) (x − wa),  with  (5)
h(a, a∗) = exp[−(da,a∗ / λ)²].  (6)
Here h(a, a∗ ) is a bell shaped Gaussian centered at the
“winner” a∗ and decaying with increasing distance da,a∗ =
|ga − ga∗ | in the neuron grid {ga }. Thus, each node or “neuron” in the neighborhood of the “winner” a∗ participates in
the current learning step (as indicated by the gray shading
in Fig. 4.)
The network starts with a given node grid A and a random initialization of the reference vectors. During the course of learning, the width λ of the neighborhood bell function and the learning step size parameter are continuously decreased in order to allow more and more specialization and fine-tuning of the (then increasingly) weakly coupled neurons.
This neighborhood cooperation in the adaptation algorithm has important advantages: (i) it is able to generate topological order between the wa, which means that similar inputs are mapped to neighboring nodes; (ii) as a result, the convergence of the algorithm can be sped up by involving a whole group of neighboring neurons in each learning step.
The structure of this neighborhood is essentially governed by the structure of h(a, a∗) = h(da,a∗) – therefore also called the neighborhood function. Most learning and visualization applications choose da,a∗ as distances in a regular two- or three-dimensional Euclidean lattice.
SOMs in non-Euclidean spaces were suggested by one of the authors [11]. The core idea of the Hyperbolic Self-Organizing Map (HSOM) is to employ an IH 2 grid of nodes. A particularly convenient choice is to take the {ga} ∈ PD of a finite patch of the triangular tessellation grid introduced in Sec. 2.1. The internode distance is computed in the appropriate Poincaré metric as

da,a∗ = 2 arctanh( |ga − ga∗| / |1 − ga ḡa∗| ).  (7)
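A toy HSOM training step along Eqs. 4–7 might look as follows. Note this is an illustrative sketch: the ring-shaped "grid" and all names are our own stand-ins for the paper's triangular tessellation, and the explicit step size `eta` is the learning parameter mentioned in the text but not spelled out in Eq. 5:

```python
import cmath
import math
import random

def poincare_dist(g1: complex, g2: complex) -> float:
    """Internode distance on the Poincare disk, Eq. 7."""
    return 2.0 * math.atanh(abs(g1 - g2) / abs(1 - g1 * g2.conjugate()))

# Toy node grid: a center node plus one ring of 8 nodes (a stand-in
# for a finite patch of the regular IH2 triangle tessellation).
grid = [0j] + [0.5 * cmath.exp(2j * math.pi * k / 8) for k in range(8)]

random.seed(0)
dim = 4
prototypes = [[random.random() for _ in range(dim)] for _ in grid]

def best_match(x):
    """Eq. 4: index of the node whose prototype is closest to input x."""
    return min(range(len(grid)),
               key=lambda a: sum((w - xi) ** 2 for w, xi in zip(prototypes[a], x)))

def train_step(x, eta=0.5, lam=1.0):
    """Eqs. 5-6: move every prototype toward x, weighted by the Gaussian
    neighborhood around the winner, measured on the hyperbolic grid."""
    a_star = best_match(x)
    for a, g in enumerate(grid):
        h = math.exp(-(poincare_dist(g, grid[a_star]) / lam) ** 2)
        prototypes[a] = [w + eta * h * (xi - w) for w, xi in zip(prototypes[a], x)]

for _ in range(100):
    train_step([1.0, 0.0, 0.0, 0.0])
```

After repeated presentation the winner's prototype approaches the input while its hyperbolic neighbors follow more weakly, which is exactly the neighborhood cooperation described above.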
3.3. Hyperbolic Multidimensional Scaling
(HMDS)
Multidimensional scaling refers to a class of algorithms for finding a suitable representation of proximity relations of N objects by distances between points in a low-dimensional – usually Euclidean – space. In the following we represent proximity as dissimilarity values between pairs of objects, mathematically written as a dissimilarity δij ∈ IR⁺₀ between items i and j. As usual we assume symmetry, i.e. δij = δji. Often the raw dissimilarity distribution is not suitable for the low-dimensional embedding and an additional δ-processing step is applied. We model it here as a monotonic transformation D(·) of dissimilarities δij into disparities Dij = D(δij).
The goal of the MDS algorithm is to find a spatial representation xi of each object i in the L-dimensional space, where the pair distances dij ≡ d(xi, xj) match the disparities Dij as faithfully as possible: ∀i≠j Dij ≈ dij. The pair distance is usually measured by the Euclidean distance:

dij = ‖xi − xj‖,  with xi ∈ IR^L, i, j ∈ {1, 2, ..., N}.  (8)
One of the most widely known MDS algorithms was introduced by Sammon [12] in 1969. He formulates a minimization problem for a cost function which sums over the squares of the disparity–distance misfits, here written as

E({xi}) = ∑_{i=1}^{N} ∑_{j>i} wij (dij − Dij)².  (9)
The factors wij are introduced to weight the disparities individually and also to normalize the cost function E to be independent of the absolute scale of the disparities Dij. Depending on the given analysis task, the factors can be chosen to weight all disparities equally – the global variant (w(g)ij = const) – or to emphasize the local structure by reducing the influence of larger disparities (w(l)ij, which we use in the following):

w(g)ij = 1 / (∑_{k=1}^{N} ∑_{l>k} D²kl),   w(l)ij = 2 / (N(N − 1) D²ij).  (10)

Note that the latter is undefined if any pair has zero disparity. In his original work [12] Sammon suggested an intermediate normalization, w(m)ij = (∑_{k=1}^{N} ∑_{l>k} Dkl)⁻¹ D⁻¹ij.
The set of xi is found by a gradient descent procedure, iteratively minimizing the cost or stress function Eq. 9. The reader is referred to [12, 1] for further details on this and other MDS algorithms.

The recently introduced Hyperbolic Multi-Dimensional Scaling (HMDS) [16] combines the concepts of MDS and hyperbolic geometry. The core idea turns out to be very simple: instead of finding an MDS solution in the low-dimensional Euclidean IR^L and transferring it to the IH 2 (which cannot work well), the MDS formalism operates in the hyperbolic space from the beginning. The key point is Eq. 8: the Euclidean distance in the target space is replaced by the appropriate distance metric for the Poincaré model (see, e.g. [7] and compare Eq. 7)

dij = 2 arctanh( |xi − xj| / |1 − xi x̄j| ),  xi, xj ∈ PD.  (11)

While the gradients ∂dij,q/∂xi,q required for the gradient descent are rather simple to compute for the Euclidean geometry, the case becomes complex for Eq. 11.² Details can be found in [16, 15].
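The overall HMDS loop can be sketched with a numerical gradient standing in for the analytic one of [16] (all names are ours; local weights follow Eq. 10, distances follow Eq. 11):

```python
import math
import random

def poincare_dist(z1: complex, z2: complex) -> float:
    """Hyperbolic pair distance in the Poincare model, Eq. 11."""
    return 2.0 * math.atanh(abs(z1 - z2) / abs(1 - z1 * z2.conjugate()))

def stress(pts, D, w):
    """Sammon-type cost of Eq. 9, evaluated with hyperbolic distances."""
    n = len(pts)
    return sum(w[i][j] * (poincare_dist(pts[i], pts[j]) - D[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def hmds(D, steps=2000, eta=0.05, eps=1e-6):
    """Toy HMDS: coordinate-wise finite-difference gradient descent.
    The analytic gradients of [16] are faster; this keeps the sketch short."""
    n = len(D)
    w = [[0.0 if i == j else 2.0 / (n * (n - 1) * D[i][j] ** 2)  # Eq. 10, local
          for j in range(n)] for i in range(n)]
    random.seed(1)
    pts = [complex(random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1))
           for _ in range(n)]
    for _ in range(steps):
        for i in range(n):
            for d in (eps, eps * 1j):          # real and imaginary coordinate
                e0 = stress(pts, D, w)
                pts[i] += d
                g = (stress(pts, D, w) - e0) / eps
                pts[i] -= d
                pts[i] -= eta * g * (d / eps)  # descent step in that coordinate
            if abs(pts[i]) > 0.99:             # keep points inside the disk
                pts[i] *= 0.99 / abs(pts[i])
    return pts

# Three objects with all pairwise disparities 1.0 should settle at
# (nearly) mutual hyperbolic distance 1.0.
D = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
pts = hmds(D)
```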
Disparity preprocessing: Due to the non-linearity of Eq. 11, the preprocessing function D(·) (see Sec. 3.3) has more influence in IH 2. Consider, e.g., a linear rescaling of the dissimilarities, Dij = αδij: in the Euclidean case the visual structure is not affected – only magnified by α. In contrast, in IH 2, α scales the distribution and with it the amount of curvature felt by the data. The optimal α depends on the given task, the dataset, and its dissimilarity structure. We set α manually and choose a compromise between visibility of the entire structure and space for navigation in the detail-rich areas.
4. Hyperbolic Data Viewer:
Combining the advantages
Before we introduce a new hybrid architecture, we first
compare the advantages and disadvantages of the three IH 2
layout methods with respect to several aspects.
Input data type: The HTL requires acyclic graph data and is therefore limited to hierarchically ordered data (preferably balanced, with a branch count of ≈ 4–12).
The HSOM processes only vectorial data representations, while the HMDS uses dissimilarity data. Since a suitable distance function can directly transform any data type into dissimilarity data (but not vice versa) and the handling of missing data is easy, this can be considered the most general data type.
Scaling behavior for the number of objects N: Both HTL and HSOM share the advantage of linear scaling with the number of objects. The HSOM also scales linearly with the input space dimension and the number of nodes. HMDS does not scale well and requires processing N(N − 1)/2 distance pairs. When N grows to several hundred objects, the layout generation becomes slow and the results less convincing. Then precomputation may help to keep interactive exploration unhampered.
Layout result: The HTL returns the IH 2 location determined by the recursive space partitioning.
The HSOM returns the rigid grid, i.e., the triangular tessellation grid. Each object or document is mapped to the node with the best-matching prototype vector (Eq. 4). Each node is associated with two sources of descriptive information: the collected set of assigned objects and the prototype vector representing the group. This information can be transformed into various kinds of graphical attribution and annotation.
In contrast to the former, the layout results of the HMDS
directly carry information on the data level, since the spatial
locations represent the similarity structure of the given pair
distance data. Therefore, the map metaphor of closeness
and proximity is here brought to the detail level.
² Note, no complex gradient information is required in the HSOM approach.
Figure 5 (schematic): Data → HSOM (IH 2 prototypes) → Poincaré projection → display area [navigate, select]; selection → HMDS (IH 2 objects) → Poincaré projection → display area [navigate].
Figure 5. The proposed architecture combines the advantages of the two layout techniques: (upper part) the HSOM for obtaining a coarse map of a large data collection and (lower part) the HMDS for mapping a smaller data set to a spatially continuous representation of the data relationships. The display concept is unified: both employ the extraordinary visualization and navigation features of the hyperbolic plane.
New objects: For the HTL a new object requires a new partial layout of the smallest subtree(s) containing the new objects.
The HSOM maps a new object to the best-matching node, i.e. a location in the map. The mapping time scales with the grid size, since it involves as many comparisons as there are nodes. Furthermore, the architecture may choose to implement online learning in order to adapt to new training data.
The HMDS requires a new minimization of the global cost function. For a speedup, it can employ the previous object locations as the start configuration.
New Hybrid Architecture: The previous discussion of advantages and disadvantages of the layout techniques motivates the synthesis proposed here, built from three core components: (i) the HSOM for building a coarse-grain theme map in a self-organized manner; (ii) the HMDS for detailed inspection of data subsets, where data similarities are continuously reflected as spatial proximities; (iii) a display paradigm that employs in both cases the hyperbolic plane in order to profit from its focus and context technique. Fig. 5 displays the basic architecture.
5. Application to Navigation
in Unstructured Text
In times of exponential growth of digital information the
semantic navigation in datasets – particularly for the case of
unstructured text documents – is a major challenge. In this
experiment we demonstrate the application of the proposed
architecture to this situation.
As an example we use the "Reuters-21578"³ collection of news articles that appeared on the Reuters newswire between 1987/02/26 and 1987/10/20. Most of the documents were manually tagged with 135 different category names such as "earn", "trade" or "jobs". We employed the 9603 documents of the training set from the "ModApte" split to form the HSOM input vectors x, using the standard bag-of-words model with the TFIDF scheme (term frequency times inverse document frequency) and a 5561-dimensional vector space (equal to the number of derived word stems).
³ As compiled by David Lewis from the AT&T Research Lab in 1987. The data can be found at http://www.daviddlewis.com/resources/testcollections/reuters21578
Distances and therewith dissimilarities of two documents are computed with the cosine metric

δij = 1 − cos(f⃗i, f⃗j) = 1 − f⃗′i · f⃗′j,  with f⃗′ = f⃗ / ‖f⃗‖,  (12)

and efficiently implemented by storing the normalized document feature vectors f⃗′.

5.1. Interactive Browsing of the Overview Map

Embedding the 2D Poincaré model in 3D Euclidean space and placing a 3D glyph at each node position allows the simultaneous visualization of several attributes at once. The glyph size, form, color, or height above the PD ground plane might characterize the number of documents, the predominant category in the corresponding node, or the number of new documents mapped.
Depending on the size of the text database to be mined, the number of nodes for the HSOM is chosen. In Figure 6 a HSOM with a total of 1306 nodes is shown. Since the HMDS approach can handle sizes of several hundred documents, such a map could easily contain a million articles.

Figure 6. A HSOM projection of a large collection of newswire articles forms semantically related category clusters (shown as different glyphs).

In the case of the Reuters-21578 collection we show results with a HSOM consisting of a tessellation with 3 "rings" and 8 neighbors per node, resulting in 161 prototype vectors as shown in Figure 7. By mapping the 12902 documents of the Reuters-21578 training and test collection we obtain a mean of 80 documents per node, which are given to the HMDS module for further inspection. This number of documents can be handled in real time by HMDS and allows an on-line interactive text mining process.

Figure 7. The Reuters-21578 corpus coarsely mapped with a HSOM containing 161 nodes. The glyph size corresponds to the number of documents before 1987/04/07, the height above the ground plane to the number of articles after 1987/04/07.
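The feature pipeline of Eq. 12 can be sketched as follows. This is a toy bag-of-words TF-IDF; stemming and the real Reuters preprocessing are omitted, and all names and example strings are ours:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Bag-of-words with a simple TF-IDF weighting (term frequency
    times inverse document frequency, as in the paper's scheme)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    vocab = sorted(df)
    return [[d.split().count(t) * math.log(n / df[t]) for t in vocab]
            for d in docs]

def cosine_dissimilarity(f1, f2):
    """Eq. 12: delta_ij = 1 - cos(f_i, f_j)."""
    norm1 = math.sqrt(sum(x * x for x in f1))
    norm2 = math.sqrt(sum(x * x for x in f2))
    return 1.0 - sum(a * b for a, b in zip(f1, f2)) / (norm1 * norm2)

docs = ["oil price rises", "oil price falls", "wheat harvest grows"]
v = tfidf_vectors(docs)
# Documents sharing vocabulary are less dissimilar than unrelated ones:
print(cosine_dissimilarity(v[0], v[1]) < cosine_dissimilarity(v[0], v[2]))  # True
```

In practice one stores only the normalized vectors f⃗′, so each δij reduces to one dot product, as noted in the text.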
Figure 7 is the initial point for an interactive text mining session which we describe in the following. The global hyperbolic overview map reveals several large clusters which mainly contain the top 10 topics. There is only one larger glyph (at the 3 o'clock position) indicating documents not belonging to the 10 top categories. The area marked with the question mark "?" contains an isolated yellow sphere which is surrounded by green cubes. In Figure 8 the user has adjusted the fovea of the hyperbolic map to inspect this selected region more closely. The figure shows that the relations of the nodes in this region can now be inspected more easily while the global context is still in view. By interactively selecting a node, the HSOM prototype vectors are used to automatically generate a keyword list which annotates and semantically describes the selected glyph. To this end, the words corresponding to the ten largest components of the reference vector are selected. These describe the prototype document, which resembles a non-linear superposition of the texts for which this node is the best-match node. In our example the most prominent words "strike", "union" and "port" indicate that this area of the map probably contains articles describing worker strikes in ports.
By mouse selection, all documents assigned to the node are sent to the HMDS module. Fig. 9 displays the HMDS result. The presentation here is a lean 2D display with minimal occlusion and uses markers to indicate the category of each object. Several clusters can easily be recognized. The group marked "A" is a category mixture, while the others are quite homogeneous.
By turning on the labels the document titles become visible and the semantic homogeneity can be verified. Fig. 10 displays a screenshot after sweeping the navigation focus to the cluster structure "C" in the previously lower left direction.

Figure 8. Inspection of a picked node by adjusting the fovea. The nodes' annotations were generated by evaluation of the corresponding prototype vectors. They indicate the nodes' contents and show the semantic relationship of adjacent nodes.

Figure 9. The screenshot of the HMDS visualizes all documents in the selected node #143 positioned in the IH 2. The legend at the right explains the marker types used for the top 10 categories each document can be labeled with. The cross is a visual indication of the zero point, and the markers A–E are an overlay for explanation.

Figure 10. Navigation to the "C"-marked document cluster in Fig. 9. Now the cluster is focused and the labels are turned on. All documents are related to a strike at Cargill U.K. Ltd's oilseed processing plant at Seaforth in the beginning of 1987. Note how the quartering lines in Fig. 9 are transferred to other IH 2-lines(!), which appear as circle arcs perpendicular to the rim.

5.2. Similarity Search for Queries or New Documents

The hybrid approach can also be used to find documents similar to a query within a large collection. This is achieved by generating a query feature vector which is compared to all prototypes. The corresponding best-match node is then visually highlighted, such that the user can adjust the focus of attention and "zoom" into that region. In order to demonstrate that context plays an important role, we formulate a query containing the word "strike" (which was the most important entry for the node inspected in Figure 8). Figure 11 shows a map where the focus is centered on the winning node of the query: "USA leading the strike in a Gulf war against Iraq?".

Figure 11. A query document was mapped to the HSOM and the fovea moved to the highlighted "best match" node. The automatic annotation scheme provides insightful information about the semantic content in this area of the map.

Fig. 12 presents the drill-down with the HMDS and labels the documents neighboring the new query. We find texts which deal with tensions in the Gulf at that time and which also mention the word "strike" – but in another meaning than in the previously inspected, very distant HSOM node. A further query comes from another news stream: CNN reported a very promising article, "Bush: Ending Saddam's regime will bring stability to Mideast" (03/02/27⁴), which we find in the upper left corner of Fig. 12.
⁴ http://www.cnn.com/2003/WORLD/meast/02/27/sprj.irq.bush.speech/index.html
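The query mechanism of this section reduces to a best-match search over the prototypes plus the keyword annotation described in Sec. 5.1. A self-contained sketch, where the vocabulary, prototypes and query are fabricated toy values rather than Reuters data:

```python
def best_match_node(query, prototypes):
    """Compare a query feature vector against all node prototypes and
    return the index of the best-matching node (cf. Eq. 4)."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(prototypes)), key=lambda a: sqdist(query, prototypes[a]))

def annotate(prototype, vocab, k=10):
    """Keyword list from the k largest prototype components,
    as used for the automatic glyph annotation."""
    order = sorted(range(len(vocab)), key=lambda i: prototype[i], reverse=True)
    return [vocab[i] for i in order[:k]]

vocab = ["strike", "union", "port", "oil", "tanker", "gulf"]
prototypes = [
    [0.9, 0.8, 0.7, 0.0, 0.1, 0.0],  # toy labour-dispute node
    [0.3, 0.0, 0.2, 0.8, 0.7, 0.9],  # toy gulf-shipping node
]
query = [0.5, 0.0, 0.0, 0.6, 0.5, 0.8]  # "strike" in the military sense
node = best_match_node(query, prototypes)
print(node, annotate(prototypes[node], vocab, k=3))  # -> 1 ['gulf', 'oil', 'tanker']
```

The same word "strike" lands on a different node depending on the rest of the feature vector, which is exactly the context effect the experiment demonstrates.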
H2-MDS
Earn
Acquisition
Money-FX
Crude
.GULF_AND_WESTERN_<GW>_UPS_INTEREST_IN_NETWORK.<19380>
.U.S._HOUSE_PASSES_GULF_BILL_DESPITE_OPPOSITION.<18328>
Grain
.U.S._HOUSE_PASSES_MIDEAST_GULF_BILL.<18357>
Trade
.REAGAN_HINTS_U.S._WANTS_HELP_IN_PATROLLING_GULF.<17750>
Interest
Q: Bush: Ending Saddam’s regime...<CNN:2003/02/27>
Wheat
Ship
Corn
.U.S._TO_PROTECT_ONLY_AMERICAN_SHIPS.<18231>
Other
.US_SENATE_CUTS_OFF_STALL_TACTICS_ON_GULF_BILL.<20624>
Search
.US_WARNS_IRAN,_BEGINS_ESCORTING_TANKER_CONVOY.<20464>
.CONVOY_RUNS_GULF_GAUNTLET,_OTHER_SHIPS_STAY_CLEAR.<20774>
[Figure 12 plot, continued: further gulf-related document titles and the manual query “USA leading the strike in a gulf war against iraq?” placed on the Poincaré disk.]

Figure 12. HMDS location of a manual query (4 o’clock)
and another news document from the same days (10 o’clock).
The titles reveal the successful mapping in a meaningful
manner.

6. Discussion and Conclusion

Document visualization efforts like ThemeScapes [17]
or the SOM-based WebSOM [4], as well as Skupin’s cartographic approach [13], have impressively demonstrated the
usefulness of compressed, map-like 2D representations of
massive data collections, even if the data items contain extremely high-dimensional information such as text.

Recent work, such as [6, 11, 16], shows that the task of
information visualization can benefit significantly from the
use of hyperbolic space as a projection manifold. On the
one hand, one gains the exponentially growing space around
each point, which provides extra room for compressing semantic relationships. On the other hand, the Poincaré model
offers superb visualization and navigation properties, which
were found to yield a significant improvement in task time
compared to traditional browsing methods [10]. By simple mouse interaction the focus can be transferred to any
location of interest. The core area close to the center of
the Poincaré disk magnifies the data with a zoom factor of
0.5, which decreases exponentially towards the outer area. By this
means a very natural visualization behavior is obtained:
the fovea is an area of high resolution, while remote areas
are gradually compressed yet still visible as context.

Another advantage is scalability. Due to its favorable
linear scaling of O(N), the HSOM can be used to form
an initial overview map for very large data collections.
This map then offers all the strengths of the hyperbolic focus+context navigation, permitting the user to rapidly narrow down the to-be-investigated data to a much smaller subset, which can then be interactively mapped at the individual
document level with the HMDS technique. Again, the same hyperbolic focus+context navigation is available.

Both approaches produce spatial representations of data
similarity: the HSOM produces on a coarse level the “master map”, providing the thematic overview. Additionally,
the neurons of the HSOM can be regarded as data collecting agents, offering the potential to visualize temporal developments in data streams on the map. In the second stage
the HMDS can represent the semantic closeness of the individual documents and is able to give a much more precise
representation, since it is decoupled from the rigid grid.

While the hybrid scheme may appear conceptually simple, we think that hybrid approaches that strike a flexible
balance between the scaling of computational demands and
the achievable precision can be crucial for making new methods
applicable to massive data collections, an important goal towards which the present research is meant to be a modest
but useful step.

References
[1] T. Cox and M. Cox. Multidimensional Scaling. Chapman
and Hall, 1994.
[2] H.S.M. Coxeter. Non-Euclidean Geometry. University of
Toronto Press, 1957.
[3] T. Kohonen. Self-Organizing Maps, volume 30 of Springer
Series in Information Sciences. Springer, 1995.
[4] T. Kohonen et al. Organization of a massive document
collection. IEEE TNN Spec Issue Neural Networks for Data
Mining and Knowledge Discovery, 11(3):574–585, 2000.
[5] J. Lamping, R. Rao, and P. Pirolli. A focus+context
technique based on hyperbolic geometry for viewing large
hierarchies. In ACM SIGCHI, pages 401–408, 1995.
[6] J. Lamping and R. Rao. Laying out and visualizing large
trees using a hyperbolic space. In ACM Symp User Interface
Software and Technology, pages 13–14, 1994.
[7] F. Morgan. Riemannian Geometry: A Beginner’s Guide.
Jones and Bartlett Publishers, 1993.
[8] T. Munzner. H3: Laying out large directed graphs in 3D
hyperbolic space. In Proc IEEE Symp Info Vis, pages 2–10,
1997.
[9] P. Pirolli, S. Card, and M. M. Van Der Wege. Visual
information foraging in a focus + context visualization. In
CHI, pages 506–513, 2001.
[10] K. Risden, M. Czerwinski, T. Munzner, and D. Cook. An
initial examination of ease of use for 2D and 3D information
visualizations of web content. Int J Human Computer
Studies, 53(5):695–714, 2000.
[11] H. Ritter. Self-organizing maps on non-euclidean spaces. In
Kohonen Maps, pages 97–110. Elsevier, 1999.
[12] J. W. Sammon, Jr. A non-linear mapping for data structure
analysis. IEEE Trans Computers, 18:401–409, 1969.
[13] A. Skupin. A cartographic approach to visualizing conference abstracts. IEEE Computer Graphics and Applications,
pages 50–58, 2002.
[14] J.A. Thorpe. Elementary Topics in Differential Geometry.
Springer, 1979.
[15] J. Walter. H-MDS: a new approach for interactive visualization with multidimensional scaling in the hyperbolic space.
Information Systems, (in press), 2003.
[16] J. Walter and H. Ritter. On interactive visualization of
high-dimensional data using the hyperbolic plane. In ACM
SIGKDD Int Conf Knowledge Discovery and Data Mining,
pages 123–131. 2002.
[17] J. Wise. The ecological approach to text visualization. J Am
Soc Information Sci, 50(13):1224–1233, 1999.