Interactive Visualization and Navigation in
Large Data Collections using the Hyperbolic Space
Jörg Walter · Jörg Ontrup · Daniel Wessling · Helge Ritter
Neuroinformatics Group · Department of Computer Science
University of Bielefeld · D-33615 Bielefeld · Germany
E-mail: [email protected]
Abstract
We propose the combination of two recently introduced methods for the interactive visual data mining of large collections of data. Both Hyperbolic Multi-Dimensional Scaling (HMDS) and Hyperbolic Self-Organizing Maps (HSOM) employ the extraordinary advantages of the hyperbolic plane IH 2: (i) the underlying space grows exponentially with the radius around each point – ideal for embedding high-dimensional (or hierarchical) data; (ii) the Poincaré model of the IH 2 exhibits a fish-eye perspective with a focus area and a context-preserving surrounding; (iii) the mouse binding of focus transfer allows intuitive interactive navigation.
The HMDS approach extends multi-dimensional scaling and generates a spatial embedding of the data representing their dissimilarity structure as faithfully as possible. It is very suitable for interactive browsing of data object collections, but calls for batch precomputation for larger collection sizes.
The HSOM is an extension of Kohonen's Self-Organizing Map and generates a partitioning of the data collection assigned to an IH 2 tessellating grid. While the algorithm's complexity is linear in the collection size, the data browsing is rigidly bound to the underlying grid.
By integrating the two approaches we gain a synergetic effect, combining the advantages of both, and the hybrid architecture consistently uses the IH 2 visualization and navigation concept. We present a successful application to a text mining example involving the Reuters-21578 text corpus.
Figure 1 (schematic): Data → Layout Technique → Layout Space → Projection Technique → Display Area.
Figure 1. Displaying larger collections of data with a limited display area requires careful usage of space. The allocation of spatial representation for providing overview and detail is a challenge. After choosing the level of detail, the data layout operates on the canvas ("layout space") before it is suitably projected onto the display area (e.g. after panning and zooming).
Visualizing large data collections has to provide means to use a limited display space effectively and to give the user the overview as well as the details. Since most available data display devices are two-dimensional – paper and screens – the following problem must be solved: finding a meaningful spatial mapping of data onto the display area. One limiting factor is the "restricted neighborhood" around a point in a Euclidean 2D surface. Hyperbolic spaces open an interesting loophole. The extraordinary property of exponential growth of neighborhood with increasing radius around all points enables us to build novel displays.
The "hyperbolic tree viewer", developed at Xerox PARC [6], demonstrated remarkably elegant interactive capabilities. The hyperbolic model appears as a continuously graded, focus+context mapping to the display. See [9, 10] for comparative studies with traditional display types.
Unfortunately, previous usage of direct hyperbolic visualization was limited to hierarchical, tree-like, or "quasi-graph" data. Two remedies were recently introduced, suggesting more general IH 2 layout techniques: one generalizes Kohonen's SOM algorithm to the Hyperbolic Self-Organizing Map algorithm (HSOM) [11]; the other intro-
1. Introduction
The demand for techniques handling large collections of data is rapidly growing. While the power of information systems increases, the amount of information a human user can directly digest does not. The challenge is to provide good clues for the right questions, which is the key to discoveries. The human expert possesses not only valuable background knowledge, intuition and creativity – he is also vested with powerful pattern recognition and processing capabilities, especially for the visual information channel. The design goals for an optimal user–data interaction strongly depend on the given exploration task, but they certainly include an easy and intuitive navigation with strong support for the user's orientation.
duces Hyperbolic Multi-Dimensional Scaling (HMDS) [16]
for a direct construction of a distance preserving embedding
of high-dimensional data into the hyperbolic space.
In Sec. 2 and 3 we review the hyperbolic space and the
three mentioned layout techniques for visualizing data in
IH 2 . Sec. 4 discusses a synthesis for a two step visualization
architecture. Even though the look and feel of an interactive
visualization and navigation cannot be really conveyed in a
static format, we report in Sec. 5 several results and snapshots of an application to text mining. This approach allows
to visualize, navigate and search in a space of documents
appearing in the Reuters news stream.
2. The Hyperbolic Space IH 2
Figure 2. There is literally more room in hyperbolic space than in Euclidean space, as shown in this illustrated embedding of the hyperbolic plane into 3D Euclidean space (courtesy of Jeffrey Weeks). (Left:) The sum of angles in a triangle is smaller than 180°. (Right:) Exponential growth (Eq. 1) of the circumference c(ρ) and area a(ρ) is experienced if a "circle" with radius ρ is drawn in the wrinkling structure.
Historically, the hyperbolic space is a "recent" discovery of the 19th century. Lobachevsky, Bolyai, and Gauss independently discovered the non-Euclidean geometries by negating the "parallel axiom" which Euclid had formulated 2300 years ago. Today we know three geometries with uniform curvature. Our daily experience is governed by the flat or Euclidean geometry with zero curvature. Still familiar is the spherical geometry with positive curvature – describing the surface of a sphere, like the earth or an orange.
Its counterpart with constant negative curvature is known as
the hyperbolic plane IH 2 (with analogous generalizations to
higher dimensions) [2, 14]. Unfortunately, there is no “perfect” embedding of the IH 2 in IR3 , which makes it harder to
grasp the unusual properties of the IH 2 . Local patches resemble the situation at a saddle point, where the neighborhood grows faster than in flat space (see Fig. 2). Standard
textbooks on Riemannian geometry (see, e.g. [7]) show that
the circumference c and area a of a circle of radius ρ in IH 2
are given by
area: a(ρ) = 4π sinh²(ρ/2),  (1)
circumference: c(ρ) = 2π sinh(ρ).  (2)
This bears two remarkable asymptotic properties: (i) for small radius ρ the space "looks flat", since a(ρ) ≈ πρ² and c(ρ) ≈ 2πρ; (ii) for larger ρ both grow exponentially with
the radius. As observed in [6, 5], this trait makes the hyperbolic space ideal for embedding hierarchical structures.
Fig. 2 illustrates the spatial relations by embedding a small
patch of the IH 2 in IR3 .
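To make the two growth regimes of Eqs. 1–2 concrete, a small numeric check can be helpful. This is a minimal sketch in plain Python; the function names are our own:

```python
import math

def hyp_area(rho):
    """Area of a hyperbolic circle of radius rho (Eq. 1)."""
    return 4.0 * math.pi * math.sinh(rho / 2.0) ** 2

def hyp_circumference(rho):
    """Circumference of a hyperbolic circle of radius rho (Eq. 2)."""
    return 2.0 * math.pi * math.sinh(rho)

# For small radii the space "looks flat": a(rho) ~ pi*rho^2, c(rho) ~ 2*pi*rho.
print(hyp_area(0.01) / (math.pi * 0.01 ** 2))          # close to 1
print(hyp_circumference(0.01) / (2 * math.pi * 0.01))  # close to 1

# For larger radii both grow exponentially, roughly like pi * exp(rho).
print(hyp_area(10.0) / (math.pi * math.exp(10.0)))     # close to 1
```

The last ratio illustrates why the space around every point is "ideal for embedding hierarchical structures": available area keeps pace with the exponential growth of a tree's node count.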
To use the visualization potential of the IH 2 we must
solve the two problems displayed before in Fig. 1. Now we
turn to the projection problem, which was solved for the
IH 2 more than a century ago.
2.1. The Projection Solution for IH 2
The perfect projection into the flat display area should preserve length, area, and angles (≈ form). But it lies in the nature of a curved space to resist the attempt to achieve these goals simultaneously. Consequently several projections
or maps of the hyperbolic space were developed; four are especially well examined: (i) the Minkowski, (ii) the upper-half plane, (iii) the Klein-Beltrami, and (iv) the Poincaré or disk mapping. See [2] for more details and the geometric mappings to transform between (i)–(iv).
The Poincaré projection is the most suitable for our purpose. Its main characteristics are:
Display compatibility: The infinitely large area of the IH 2 is mapped entirely into a circle, the Poincaré disk PD. This representation of infinity fascinated Maurits Escher and inspired several of his woodcuts [15].
Circle rim "= ∞": All remote points are close to the rim, without touching it.
Focus+Context: The focus can be moved to each location in IH 2, like a "fovea". The zoom factor is 0.5 in the center and falls off (exponentially) with distance to the fovea. Therefore, the context appears very natural: the more remote things are, the less spatial representation they are assigned in the current display.
Lines become circles: All IH 2-lines¹ appear as circle arc segments or centered straight lines in PD (both belong to the set of so-called "generalized circles"). Their extensions always cross the PD rim perpendicularly on both ends.
Conformal mapping: Angle (and therefore form) relations are preserved in PD; area and length relations obviously are not.
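The focus+context behavior can be made quantitative. In the metric convention behind the paper's distance formulas (Eqs. 7 and 11, where d(0, z) = 2 arctanh|z|), the local display magnification works out to (1 − |z|²)/2. A minimal sketch with a function name of our own choosing:

```python
import math

def zoom_factor(z: complex) -> float:
    """Local display magnification of the Poincare disk: ratio of
    Euclidean display length to hyperbolic length, (1 - |z|^2) / 2.
    Consistent with d(0, z) = 2*arctanh(|z|), cf. Eq. 7/11."""
    return (1.0 - abs(z) ** 2) / 2.0

print(zoom_factor(0j))         # 0.5 at the focus, as stated in the text
print(zoom_factor(0.99 + 0j))  # tiny near the "infinity" rim

# Exponential fall-off: at hyperbolic distance rho from the focus, a point
# sits at |z| = tanh(rho/2), so the factor decays roughly like 2*exp(-rho).
rho = 8.0
z = math.tanh(rho / 2.0)
print(zoom_factor(z) * math.exp(rho))  # approaches 2 for large rho
```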
Regular tessellations with triangles offer richer possibilities than in the IR². It turns out that there is an infinite set of choices to tessellate the IH 2: for any integer n ≥ 7, one can construct a regular tessellation in which n triangles meet at each vertex (in contrast to the plane, which allows only n = 3, 4, 6, and the sphere, which allows only n = 3, 4, 5). Fig. 3 depicts examples for n = 7 and n = 10.
¹ A line is by definition the shortest path between two points.
One way to compute these tessellations algorithmically
is by repeated application of a suitable set of generators of
their symmetry group to a suitably sized “starting triangle”
(see also Eq. 3 and [11]).
2.2. Moving the Focus
For changing the focus point in PD we need a translation operation which can be bound to mouse click and drag
events. In the Poincaré disk model the Möbius transformation T (z) is the appropriate solution. By describing the
Poincaré disk PD as the unit circle in the complex plane,
the isometric transformations for a point z ∈ PD can be
written
z′ = T(z; c, θ) = (θz + c) / (c̄θz + 1),  |θ| = 1, |c| < 1.  (3)
Here the complex number θ describes a pure rotation of PD
around the origin 0. The following translation by c maps
the origin to c and −c becomes the new center 0 (if θ = 1).
The Möbius transformations are also called the "circle automorphisms" of the complex plane, since they map circles to (generalized) circles. Here they serve to translate IH 2 straight lines to lines – both appearing as generalized circles in the PD projection. For further details, see [5, 15].
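Eq. 3 transcribes directly into code. The sketch below (helper names are ours; the distance follows Eq. 7) illustrates both the focus transfer and the isometry property of the transformation:

```python
import math

def moebius(z: complex, c: complex, theta: complex = 1 + 0j) -> complex:
    """Moebius transformation T(z; c, theta) of Eq. 3 on the unit disk:
    a rotation by theta (|theta| = 1) followed by a translation that
    maps the origin to c, so that -c becomes the new center."""
    assert abs(abs(theta) - 1.0) < 1e-12 and abs(c) < 1.0
    return (theta * z + c) / (c.conjugate() * theta * z + 1)

def poincare_dist(z1: complex, z2: complex) -> float:
    """Hyperbolic distance in the Poincare disk (cf. Eq. 7)."""
    return 2.0 * math.atanh(abs(z1 - z2) / abs(1 - z1 * z2.conjugate()))

c = 0.3 + 0.4j
print(moebius(0j, c))   # the origin is mapped to c
print(moebius(-c, c))   # -c becomes the new center 0

# Isometry check: the Poincare distance between two points is preserved.
z1, z2 = 0.1 + 0.2j, -0.5 + 0.1j
print(abs(poincare_dist(z1, z2)
          - poincare_dist(moebius(z1, c), moebius(z2, c))) < 1e-9)  # True
```

Binding `c` to the mouse-drag vector yields exactly the intuitive focus transfer described above.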
Figure 3. Regular IH 2 tessellation with congruent triangles. (Left:) Here, n = 7 triangles meet at each vertex. Due to the angular deficit of the angle sum in IH 2-triangles, the minimal number to complete a circle is 7. (Right:) For n = 10 the triangle side length increases to perfectly fill the plane. Note that all IH 2-lines appear as circle arcs, which extend perpendicular to the "∞-rim".

3. Layout Techniques in IH 2

In the following section we discuss three layout techniques for the IH 2.

3.1. Hyperbolic Tree Layout (HTL) for Tree-Like Graph Data

Now we turn to the question raised earlier: how to accommodate data in the hyperbolic space. A solution to this question for the case of acyclic, tree-like graph data was provided by Lamping and Rao [6, 5]. Mainly by successive applications of the transformation Eq. 3, they developed (and patented) a method to find a suitable layout for this data type in IH 2. Each tree node receives a certain open "pie segment", in which the node chooses the locations of its siblings. For all its siblings i it recursively calls the layout routine after applying the Möbius transformation Eq. 3 in order to center i.

Tamara Munzner developed another graph layout algorithm for the three-dimensional hyperbolic space [8]. While she gains much more space for the layout, the problems of more complex navigation (and viewport control) in 3D and, more seriously, of occlusion appear.

The next two layout techniques are freed from the requirement of hierarchical data.

3.2. Hyperbolic Self-Organizing Map (HSOM)

Figure 4. The "Self-Organizing Map" ("SOM") is formed by a grid of processing units, called formal neurons. Here the usual case, a two-dimensional grid, is illustrated on the right side. Each neuron has a reference, or prototype, vector wa attached, which is a point in the embedding input space X. A presented input x will select the neuron with the wa closest to it. The HSOM uses a hyperbolic grid as displayed in Fig. 3 and the appropriate neighborhood function h(·).
The standard Self-Organizing Map (SOM) algorithm is used in many applications for learning and visualization (Kohonen, e.g. [3]). Fig. 4 illustrates the basic operation. The feature map is built by a lattice of nodes (or formal neurons) a ∈ A, each with a reference vector or "prototype vector" wa attached, projecting into the input space X. The
response of a SOM to an input vector x is determined by
the reference vector wa∗ of the discrete “best-match” node
a∗ , i.e. the node which has its prototype vector wa closest
to the given input
a∗ = argmin_{a∈A} ‖wa − x‖.  (4)
The distribution of the reference vectors wa is iteratively adapted by a sequence of training vectors x. After finding the best-match neuron a∗, all reference vectors are updated, wa(new) := wa(old) + ∆wa, by the adaptation rule

∆wa = h(a, a∗) (x − wa),  with  (5)
h(a, a∗) = exp[−(da,a∗ / λ)²].  (6)
Here h(a, a∗ ) is a bell shaped Gaussian centered at the
“winner” a∗ and decaying with increasing distance da,a∗ =
|ga − ga∗ | in the neuron grid {ga }. Thus, each node or “neuron” in the neighborhood of the “winner” a∗ participates in
the current learning step (as indicated by the gray shading
in Fig. 4.)
The network starts with a given node grid A and a random initialization of the reference vectors. During the course of learning, the width λ of the neighborhood bell function and the learning step size parameter are continuously decreased in order to allow more and more specialization and fine-tuning of the (then increasingly) weakly coupled neurons.
This neighborhood cooperation in the adaptation algorithm has important advantages: (i) it is able to generate topological order between the wa, which means that similar inputs are mapped to neighboring nodes; (ii) as a result, the convergence of the algorithm can be sped up by involving a whole group of neighboring neurons in each learning step.
The structure of this neighborhood is essentially governed by the structure of h(a, a∗) = h(da,a∗) – therefore also called the neighborhood function. Most learning and visualization applications choose da,a∗ as distances in a regular two- or three-dimensional Euclidean lattice.
SOMs in non-Euclidean spaces were suggested by one of the authors [11]. The core idea of the Hyperbolic Self-Organizing Map (HSOM) is to employ an IH 2 grid of nodes. A particularly convenient choice is to take the {ga} ∈ PD of a finite patch of the triangular tessellation grid introduced in Sec. 2.1. The internode distance is computed in the appropriate Poincaré metric as

da,a∗ = 2 arctanh( |ga − ga∗| / |1 − ga ḡa∗| ).  (7)
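A toy HSOM training step along Eqs. 4–7 might look as follows. Note this is an illustrative sketch: the ring-shaped "grid" and all names are our own stand-ins for the paper's triangular tessellation, and the explicit step size `eta` is the learning parameter mentioned in the text but not spelled out in Eq. 5:

```python
import cmath
import math
import random

def poincare_dist(g1: complex, g2: complex) -> float:
    """Internode distance on the Poincare disk, Eq. 7."""
    return 2.0 * math.atanh(abs(g1 - g2) / abs(1 - g1 * g2.conjugate()))

# Toy node grid: a center node plus one ring of 8 nodes (a stand-in
# for a finite patch of the regular IH2 triangle tessellation).
grid = [0j] + [0.5 * cmath.exp(2j * math.pi * k / 8) for k in range(8)]

random.seed(0)
dim = 4
prototypes = [[random.random() for _ in range(dim)] for _ in grid]

def best_match(x):
    """Eq. 4: index of the node whose prototype is closest to input x."""
    return min(range(len(grid)),
               key=lambda a: sum((w - xi) ** 2 for w, xi in zip(prototypes[a], x)))

def train_step(x, eta=0.5, lam=1.0):
    """Eqs. 5-6: move every prototype toward x, weighted by the Gaussian
    neighborhood around the winner, measured on the hyperbolic grid."""
    a_star = best_match(x)
    for a, g in enumerate(grid):
        h = math.exp(-(poincare_dist(g, grid[a_star]) / lam) ** 2)
        prototypes[a] = [w + eta * h * (xi - w) for w, xi in zip(prototypes[a], x)]

for _ in range(100):
    train_step([1.0, 0.0, 0.0, 0.0])
```

After repeated presentation the winner's prototype approaches the input while its hyperbolic neighbors follow more weakly, which is exactly the neighborhood cooperation described above.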
3.3. Hyperbolic Multidimensional Scaling
(HMDS)
Multidimensional scaling refers to a class of algorithms for finding a suitable representation of proximity relations of N objects by distances between points in a low-dimensional – usually Euclidean – space. In the following we represent proximity as dissimilarity values between pairs of objects, mathematically written as a dissimilarity δij ∈ IR⁺₀ between items i and j. As usual we assume symmetry, i.e. δij = δji. Often the raw dissimilarity distribution is not suitable for the low-dimensional embedding and an additional δ-processing step is applied. We model it here as a monotonic transformation D(·) of dissimilarities δij into disparities Dij = D(δij).
The goal of the MDS algorithm is to find a spatial representation xi of each object i in the L-dimensional space, where the pair distances dij ≡ d(xi, xj) match the disparities Dij as faithfully as possible: ∀i≠j Dij ≈ dij. The pair distance is usually measured by the Euclidean distance:

dij = ‖xi − xj‖,  with xi ∈ IR^L, i, j ∈ {1, 2, ..., N}.  (8)
One of the most widely known MDS algorithms was introduced by Sammon [12] in 1969. He formulates a minimization problem for a cost function which sums over the squares of the disparity–distance misfits, here written as

E({xi}) = ∑_{i=1}^{N} ∑_{j>i} wij (dij − Dij)².  (9)
The factors wij are introduced to weight the disparities individually and also to normalize the cost function E to be independent of the absolute scale of the disparities Dij. Depending on the given analysis task, the factors can be chosen to weight all disparities equally – the global variant (w(g)ij = const) – or to emphasize the local structure by reducing the influence of larger disparities (w(l)ij, which we use in the following):

w(g)ij = 1 / (∑_{k=1}^{N} ∑_{l>k} D²kl),   w(l)ij = 2 / (N(N − 1) D²ij).  (10)

Note that the latter is undefined if any pair has zero disparity. In his original work [12] Sammon suggested an intermediate normalization, w(m)ij = (∑_{k=1}^{N} ∑_{l>k} Dkl)⁻¹ D⁻¹ij.
The set of xi is found by a gradient descent procedure, iteratively minimizing the cost or stress function Eq. 9. The reader is referred to [12, 1] for further details on this and other MDS algorithms.

The recently introduced Hyperbolic Multi-Dimensional Scaling (HMDS) [16] combines the concepts of MDS and hyperbolic geometry. The core idea turns out to be very simple: instead of finding an MDS solution in the low-dimensional Euclidean IR^L and transferring it to the IH 2 (which cannot work well), the MDS formalism operates in the hyperbolic space from the beginning. The key point is Eq. 8: the Euclidean distance in the target space is replaced by the appropriate distance metric for the Poincaré model (see, e.g. [7] and compare Eq. 7)

dij = 2 arctanh( |xi − xj| / |1 − xi x̄j| ),  xi, xj ∈ PD.  (11)

While the gradients ∂dij,q/∂xi,q required for the gradient descent are rather simple to compute for the Euclidean geometry, the case becomes complex for Eq. 11.² Details can be found in [16, 15].
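The overall HMDS loop can be sketched with a numerical gradient standing in for the analytic one of [16] (all names are ours; local weights follow Eq. 10, distances follow Eq. 11):

```python
import math
import random

def poincare_dist(z1: complex, z2: complex) -> float:
    """Hyperbolic pair distance in the Poincare model, Eq. 11."""
    return 2.0 * math.atanh(abs(z1 - z2) / abs(1 - z1 * z2.conjugate()))

def stress(pts, D, w):
    """Sammon-type cost of Eq. 9, evaluated with hyperbolic distances."""
    n = len(pts)
    return sum(w[i][j] * (poincare_dist(pts[i], pts[j]) - D[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def hmds(D, steps=2000, eta=0.05, eps=1e-6):
    """Toy HMDS: coordinate-wise finite-difference gradient descent.
    The analytic gradients of [16] are faster; this keeps the sketch short."""
    n = len(D)
    w = [[0.0 if i == j else 2.0 / (n * (n - 1) * D[i][j] ** 2)  # Eq. 10, local
          for j in range(n)] for i in range(n)]
    random.seed(1)
    pts = [complex(random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1))
           for _ in range(n)]
    for _ in range(steps):
        for i in range(n):
            for d in (eps, eps * 1j):          # real and imaginary coordinate
                e0 = stress(pts, D, w)
                pts[i] += d
                g = (stress(pts, D, w) - e0) / eps
                pts[i] -= d
                pts[i] -= eta * g * (d / eps)  # descent step in that coordinate
            if abs(pts[i]) > 0.99:             # keep points inside the disk
                pts[i] *= 0.99 / abs(pts[i])
    return pts

# Three objects with all pairwise disparities 1.0 should settle at
# (nearly) mutual hyperbolic distance 1.0.
D = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
pts = hmds(D)
```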
Disparity preprocessing: Due to the non-linearity of Eq. 11, the preprocessing function D(·) (see Sec. 3.3) has more influence in IH 2. Consider, e.g., a linear rescaling of the dissimilarities, Dij = αδij: in the Euclidean case the visual structure is not affected – only magnified by α. In contrast, in IH 2, α scales the distribution and with it the amount of curvature felt by the data. The optimal α depends on the given task, the dataset, and its dissimilarity structure. We set α manually and choose a compromise between visibility of the entire structure and space for navigation in the detail-rich areas.
4. Hyperbolic Data Viewer:
Combining the advantages
Before we introduce a new hybrid architecture, we first
compare the advantages and disadvantages of the three IH 2
layout methods with respect to several aspects.
Input data type: The HTL requires acyclic graph data and is therefore limited to hierarchically ordered data (preferably balanced, with a branch count of ≈ 4–12).
The HSOM processes only vectorial data representations, while the HMDS uses dissimilarity data. Since a suitable distance function can directly transform any data type into dissimilarity data (but not vice versa) and the handling of missing data is easy, this can be considered the most general data type.
Scaling behavior for the number of objects N: Both HTL and HSOM share the advantage of linear scaling with the number of objects. The HSOM also scales linearly with the input space dimension and the number of nodes. HMDS does not scale well and requires processing N(N − 1)/2 distance pairs. When N grows to several hundred objects, the layout generation becomes slow and the results less convincing. Then precomputation may help to keep interactive exploration unhampered.
Layout result: The HTL returns the IH 2 location determined by the recursive space partitioning.
The HSOM returns the rigid grid, i.e., the triangular tessellation grid. Each object or document is mapped to the node with the best-matching prototype vector (Eq. 4). Each node is associated with two sources of descriptive information: the collected set of assigned objects and the prototype vector representing the group. This information can be transformed into various kinds of graphical attribution and annotation.
In contrast to the former, the layout results of the HMDS
directly carry information on the data level, since the spatial
locations represent the similarity structure of the given pair
distance data. Therefore, the map metaphor of closeness
and proximity is here brought to the detail level.
² Note, no complex gradient information is required in the HSOM approach.
Figure 5 (schematic): Data → HSOM (IH 2 prototypes) → Poincaré projection → display area [navigate, select]; selection → HMDS (IH 2 objects) → Poincaré projection → display area [navigate].
Figure 5. The proposed architecture combines the advantages of the two layout techniques: (upper part) the HSOM for obtaining a coarse map of a large data collection and (lower part) the HMDS for mapping a smaller data set to a spatially continuous representation of the data relationships. The display concept is unified: both employ the extraordinary visualization and navigation features of the hyperbolic plane.
New objects: For the HTL a new object requires a new partial layout of the smallest subtree(s) containing the new objects.
The HSOM maps a new object to the best-matching node, i.e. a location in the map. The mapping time scales with the grid size, since it involves as many comparisons as there are nodes. Furthermore, the architecture may choose to implement online learning in order to adapt to new training data.
The HMDS requires a new minimization of the global cost function. For a speedup, it can employ the previous object locations as the start configuration.
New Hybrid Architecture: The previous discussion of advantages and disadvantages of the layout techniques motivates the synthesis proposed here, built from three core components: (i) the HSOM for building a coarse-grain theme map in a self-organized manner; (ii) the HMDS for detailed inspection of data subsets, where data similarities are continuously reflected as spatial proximities; (iii) a display paradigm that employs in both cases the hyperbolic plane in order to profit from its focus and context technique. Fig. 5 displays the basic architecture.
5. Application to Navigation
in Unstructured Text
In times of exponential growth of digital information the
semantic navigation in datasets – particularly for the case of
unstructured text documents – is a major challenge. In this
experiment we demonstrate the application of the proposed
architecture to this situation.
As an example we use the "Reuters-21578"³ collection of news articles that appeared on the Reuters newswire between 1987/02/26 and 1987/10/20. Most of the documents were manually tagged with 135 different category names such as "earn", "trade" or "jobs". We employed the 9603 documents of the training set from the "ModApte" split to form the HSOM input vectors x, using the standard bag-of-words model with the TFIDF scheme (term frequency times inverse document frequency) and a 5561-dimensional vector space (equal to the number of derived word stems).
³ As compiled by David Lewis from the AT&T Research Lab in 1987. The data can be found at http://www.daviddlewis.com/resources/testcollections/reuters21578
Distances and therewith dissimilarities of two documents are computed with the cosine metric

δij = 1 − cos(f⃗i, f⃗j) = 1 − f⃗′i · f⃗′j,  with f⃗′ = f⃗ / ‖f⃗‖,  (12)

and efficiently implemented by storing the normalized document feature vectors f⃗′.

5.1. Interactive Browsing of the Overview Map

Embedding the 2D Poincaré model in 3D Euclidean space and placing a 3D glyph at each node position allows the simultaneous visualization of several attributes at once. The glyph size, form, color, or height above the PD ground plane might characterize the number of documents, the predominant category in the corresponding node, or the number of new documents mapped.
Depending on the size of the text database to be mined, the number of nodes for the HSOM is chosen. In Figure 6 a HSOM with a total of 1306 nodes is shown. Since the HMDS approach can handle sizes of several hundred documents, such a map could easily contain a million articles.

Figure 6. A HSOM projection of a large collection of newswire articles forms semantically related category clusters (shown as different glyphs).

In the case of the Reuters-21578 collection we show results with a HSOM consisting of a tessellation with 3 "rings" and 8 neighbors per node, resulting in 161 prototype vectors as shown in Figure 7. By mapping the 12902 documents of the Reuters-21578 training and test collection we obtain a mean of 80 documents per node, which are given to the HMDS module for further inspection. This number of documents can be handled in real time by HMDS and allows an on-line interactive text mining process.

Figure 7. The Reuters-21578 corpus coarsely mapped with a HSOM containing 161 nodes. The glyph size corresponds to the number of documents before 1987/04/07, the height above the ground plane to the number of articles after 1987/04/07.
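The feature pipeline of Eq. 12 can be sketched as follows. This is a toy bag-of-words TF-IDF; stemming and the real Reuters preprocessing are omitted, and all names and example strings are ours:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Bag-of-words with a simple TF-IDF weighting (term frequency
    times inverse document frequency, as in the paper's scheme)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    vocab = sorted(df)
    return [[d.split().count(t) * math.log(n / df[t]) for t in vocab]
            for d in docs]

def cosine_dissimilarity(f1, f2):
    """Eq. 12: delta_ij = 1 - cos(f_i, f_j)."""
    norm1 = math.sqrt(sum(x * x for x in f1))
    norm2 = math.sqrt(sum(x * x for x in f2))
    return 1.0 - sum(a * b for a, b in zip(f1, f2)) / (norm1 * norm2)

docs = ["oil price rises", "oil price falls", "wheat harvest grows"]
v = tfidf_vectors(docs)
# Documents sharing vocabulary are less dissimilar than unrelated ones:
print(cosine_dissimilarity(v[0], v[1]) < cosine_dissimilarity(v[0], v[2]))  # True
```

In practice one stores only the normalized vectors f⃗′, so each δij reduces to one dot product, as noted in the text.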
Figure 7 is the initial point for an interactive text mining session which we describe in the following. The global hyperbolic overview map reveals several large clusters which mainly contain the top 10 topics. There is only one larger glyph (at the 3 o'clock position) indicating documents not belonging to the 10 top categories. The area marked with the question mark "?" contains an isolated yellow sphere which is surrounded by green cubes. In Figure 8 the user has adjusted the fovea of the hyperbolic map to inspect this selected region more closely. The figure shows that the relations of the nodes in this region can now be inspected more easily while the global context is still in view. By interactively selecting a node, the HSOM prototype vectors are used to automatically generate a keyword list which annotates and semantically describes the selected glyph. To this end, the words corresponding to the ten largest components of the reference vector are selected. These describe the prototype document, which resembles a non-linear superposition of the texts for which this node is the best-match node. In our example the most prominent words "strike", "union" and "port" indicate that this area of the map probably contains articles describing worker strikes in ports.
By mouse selection, all documents assigned to the node are sent to the HMDS module. Fig. 9 displays the HMDS result. The presentation here is a lean 2D display with minimal occlusion and uses markers to indicate the category of each object. Several clusters can easily be recognized. The group marked "A" is a category mixture, while the others are quite homogeneous.
By turning on the labels the document titles become visible and the semantic homogeneity can be verified. Fig. 10 displays a screenshot after sweeping the navigation focus to the cluster structure "C" in the previously lower left direction.

Figure 8. Inspection of a picked node by adjusting the fovea. The nodes' annotations were generated by evaluation of the corresponding prototype vectors. They indicate the nodes' contents and show the semantic relationship of adjacent nodes.

Figure 9. The screenshot of the HMDS visualizes all documents in the selected node #143 positioned in the IH 2. The legend at the right explains the marker types used for the top 10 categories each document can be labeled with. The cross is a visual indication of the zero point, and the markers A–E are an overlay for explanation.

Figure 10. Navigation to the "C"-marked document cluster in Fig. 9. Now the cluster is focused and the labels are turned on. All documents are related to a strike at Cargill U.K. Ltd's oilseed processing plant at Seaforth in the beginning of 1987. Note how the quartering lines in Fig. 9 are transferred to other IH 2-lines(!), which appear as circle arcs perpendicular to the rim.

5.2. Similarity Search for Queries or New Documents

The hybrid approach can also be used to find documents similar to a query within a large collection. This is achieved by generating a query feature vector which is compared to all prototypes. The corresponding best-match node is then visually highlighted, such that the user can adjust the focus of attention and "zoom" into that region. In order to demonstrate that context plays an important role, we formulate a query containing the word "strike" (which was the most important entry for the node inspected in Figure 8). Figure 11 shows a map where the focus is centered on the winning node of the query: "USA leading the strike in a Gulf war against Iraq?".

Figure 11. A query document was mapped to the HSOM and the fovea moved to the highlighted "best match" node. The automatic annotation scheme provides insightful information about the semantic content in this area of the map.

Fig. 12 presents the drill-down with the HMDS and labels the documents neighboring the new query. We find texts which deal with tensions in the Gulf at that time and which also mention the word "strike" – but in another meaning than in the previously inspected, very distant HSOM node. A further query comes from another news stream: CNN reported a very promising article, "Bush: Ending Saddam's regime will bring stability to Mideast" (03/02/27⁴), which we find in the upper left corner of Fig. 12.
⁴ http://www.cnn.com/2003/WORLD/meast/02/27/sprj.irq.bush.speech/index.html
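The query mechanism of this section reduces to a best-match search over the prototypes plus the keyword annotation described in Sec. 5.1. A self-contained sketch, where the vocabulary, prototypes and query are fabricated toy values rather than Reuters data:

```python
def best_match_node(query, prototypes):
    """Compare a query feature vector against all node prototypes and
    return the index of the best-matching node (cf. Eq. 4)."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(prototypes)), key=lambda a: sqdist(query, prototypes[a]))

def annotate(prototype, vocab, k=10):
    """Keyword list from the k largest prototype components,
    as used for the automatic glyph annotation."""
    order = sorted(range(len(vocab)), key=lambda i: prototype[i], reverse=True)
    return [vocab[i] for i in order[:k]]

vocab = ["strike", "union", "port", "oil", "tanker", "gulf"]
prototypes = [
    [0.9, 0.8, 0.7, 0.0, 0.1, 0.0],  # toy labour-dispute node
    [0.3, 0.0, 0.2, 0.8, 0.7, 0.9],  # toy gulf-shipping node
]
query = [0.5, 0.0, 0.0, 0.6, 0.5, 0.8]  # "strike" in the military sense
node = best_match_node(query, prototypes)
print(node, annotate(prototypes[node], vocab, k=3))  # -> 1 ['gulf', 'oil', 'tanker']
```

The same word "strike" lands on a different node depending on the rest of the feature vector, which is exactly the context effect the experiment demonstrates.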
H2-MDS
Earn
Acquisition
Money-FX
Crude
.GULF_AND_WESTERN_<GW>_UPS_INTEREST_IN_NETWORK.<19380>
.U.S._HOUSE_PASSES_GULF_BILL_DESPITE_OPPOSITION.<18328>
Grain
.U.S._HOUSE_PASSES_MIDEAST_GULF_BILL.<18357>
Trade
.REAGAN_HINTS_U.S._WANTS_HELP_IN_PATROLLING_GULF.<17750>
Interest
Q: Bush: Ending Saddam’s regime...<CNN:2003/02/27>
Wheat
Ship
Corn
.U.S._TO_PROTECT_ONLY_AMERICAN_SHIPS.<18231>
Other
.US_SENATE_CUTS_OFF_STALL_TACTICS_ON_GULF_BILL.<20624>
Search
.US_WARNS_IRAN,_BEGINS_ESCORTING_TANKER_CONVOY.<20464>
.CONVOY_RUNS_GULF_GAUNTLET,_OTHER_SHIPS_STAY_CLEAR.<20774>
[Figure 12 plot, continued: further gulf-related document titles and the manual query “USA leading the strike in a gulf war against iraq?” placed on the Poincaré disk.]

Figure 12. HMDS location of a manual query (4 o’clock)
and another news document from the same days (10 o’clock).
The titles reveal the successful mapping in a meaningful
manner.

6. Discussion and Conclusion

Document visualization efforts like ThemeScapes [17]
or the SOM-based WebSOM [4], as well as Skupin’s cartographic approach [13], have impressively demonstrated the
usefulness of compressed, map-like 2D representations of
massive data collections, even if the data items contain extremely high-dimensional information such as text.

Recent work, such as [6, 11, 16], shows that the task of
information visualization can benefit significantly from the
use of hyperbolic space as a projection manifold. On the
one hand, one gains the exponentially growing space around
each point, which provides extra room for compressing semantic relationships. On the other hand, the Poincaré model
offers superb visualization and navigation properties, which
were found to yield a significant improvement in task time
compared to traditional browsing methods [10]. By simple mouse interaction the focus can be transferred to any
location of interest. The core area close to the center of
the Poincaré disk magnifies the data with a zoom factor of
0.5, which decreases exponentially towards the outer area. By this
means a very natural visualization behavior is obtained:
the fovea is an area of high resolution, while remote areas
are gradually compressed yet still visible as context.

Another advantage is scalability. Due to its favorable
linear scaling of O(N), the HSOM can be used to form
an initial overview map for very large data collections.
This map then offers all the strengths of the hyperbolic focus+context navigation, permitting the user to rapidly narrow down the to-be-investigated data to a much smaller subset, which can then be interactively mapped at the individual
document level with the HMDS technique. Again, the same hyperbolic focus+context navigation is available.

Both approaches produce spatial representations of data
similarity: the HSOM produces on a coarse level the “master map”, providing the thematic overview. Additionally,
the neurons of the HSOM can be regarded as data collecting agents, offering the potential to visualize temporal developments in data streams on the map. In the second stage
the HMDS can represent the semantic closeness of the individual documents and is able to give a much more precise
representation, since it is decoupled from the rigid grid.

While the hybrid scheme may appear conceptually simple, we think that hybrid approaches that strike a flexible
balance between the scaling of computational demands and
the achievable precision can be crucial for making new methods
applicable to massive data collections, an important goal towards which the present research is meant to be a modest
but useful step.

References
[1] T. Cox and M. Cox. Multidimensional Scaling. Chapman
and Hall, 1994.
[2] H.S.M. Coxeter. Non-Euclidean Geometry. University of
Toronto Press, 1957.
[3] T. Kohonen. Self-Organizing Maps, volume 30 of Springer
Series in Information Sciences. Springer, 1995.
[4] T. Kohonen et al. Organization of a massive document
collection. IEEE TNN Spec Issue Neural Networks for Data
Mining and Knowledge Discovery, 11(3):574–585, 2000.
[5] J. Lamping, R. Rao, and P. Pirolli. A focus+context
technique based on hyperbolic geometry for viewing large
hierarchies. In ACM SIGCHI, pages 401–408, 1995.
[6] J. Lamping and R. Rao. Laying out and visualizing large
trees using a hyperbolic space. In ACM Symp User Interface
Software and Technology, pages 13–14, 1994.
[7] F. Morgan. Riemannian Geometry: A Beginner’s Guide.
Jones and Bartlett Publishers, 1993.
[8] T. Munzner. H3: Laying out large directed graphs in 3D
hyperbolic space. In Proc IEEE Symp Info Vis, pages 2–10,
1997.
[9] P. Pirolli, S. Card, and M. M. Van Der Wege. Visual
information foraging in a focus + context visualization. In
CHI, pages 506–513, 2001.
[10] K. Risden, M. Czerwinski, T. Munzner, and D. Cook. An
initial examination of ease of use for 2D and 3D information
visualizations of web content. Int J Human Computer
Studies, 53(5):695–714, 2000.
[11] H. Ritter. Self-organizing maps on non-euclidean spaces. In
Kohonen Maps, pages 97–110. Elsevier, 1999.
[12] J. W. Sammon, Jr. A non-linear mapping for data structure
analysis. IEEE Trans Computers, 18:401–409, 1969.
[13] A. Skupin. A cartographic approach to visualizing conference abstracts. IEEE Computer Graphics and Applications,
pages 50–58, 2002.
[14] J.A. Thorpe. Elementary Topics in Differential Geometry.
Springer, 1979.
[15] J. Walter. H-MDS: a new approach for interactive visualization with multidimensional scaling in the hyperbolic space.
Information Systems, (in press), 2003.
[16] J. Walter and H. Ritter. On interactive visualization of
high-dimensional data using the hyperbolic plane. In ACM
SIGKDD Int Conf Knowledge Discovery and Data Mining,
pages 123–131. 2002.
[17] J. Wise. The ecological approach to text visualization. J Am
Soc Information Sci, 50(13):1224–1233, 1999.