Visualizing N Dimensional Clustering in 2 Dimensions
Dr. Paul Jeull
Dept. of Computer Science
North Dakota State University
Fargo, North Dakota 58105
[email protected]
W. Jockheck
Dept. of Computer Science
North Dakota State University
Fargo, North Dakota 58105
[email protected]
Abstract
Data mining has presented users with methods for
clustering over a large number of attributes. What is
missing is an intuitive way to visualize the results in
that n-dimensional space. This paper examines some
of the efforts to visualize n-dimensional data and then
proposes a scheme to provide a representation that
can be used to view and interpret the analysis of the
data. The proposed visualization involves projecting
the normalized values onto the sides of an n-sided
regular polygon. These points are then transformed
into a representation of a single point inside the
polygon. While this approach has limitations, such as
the fact that it does not provide a unique mapping to a
two-dimensional space, it does provide a relatively
intuitive way to view the data.
The initial work was discounted due to problems
caused by perturbations in the patterns introduced by
random or uncorrelated variables. After viewing the
Visualizing High Dimensional Datasets and
Relations tutorial by Alfred Inselberg* at KDD200,
it became apparent that the parallel coordinate
systems he used were similar in construction to the
work here. His points of interest were simply those
that we were pruning out for simplicity. After
revisiting this work it became apparent that the
similarity could be capitalized upon to gain useful
visualization of clustering.
1. Background
Increasingly there is a need to represent more and
more complex data in a visual format. As the number
of independent variables exceeds three or four it
becomes awkward to provide graphs in the traditional
sense. This usually results in some combination of
graphs or the exclusion of some of the variables for
the purpose of simplicity. Data mining has extended
this problem by often working in tens or hundreds of
“dimensions.” These matters are further complicated
by the fact that the user interface (a computer screen)
is actually 2-D and human experience is for the most
part limited to three (or four) dimensions.
Mathematically, there is no limitation on the
extension of dimensionality, but a visualization
quickly becomes busy and far from intuitive to the
viewer of the image.
1.1 An n-dimensional Wire Frame
Using a simple wire frame drawing we can
progress from 1-D to n-D. If you can choose an
arbitrary vector to visualize the third dimension,
there's no reason you can't choose another arbitrary
vector to visualize the fourth dimension. The figure
below shows an example of this concept extended to
five dimensions.
Figure 1: A five-dimensional cube. Each drag point indicates the
angle used to represent the dimension in the figure. Drag points 1
and 2 are clearly the x and y coordinates. Drag point 3 is a
traditional projection of a third dimension. Drag points 4 and 5
continue the concept but are arbitrary since people have no
natural experience in viewing four or five dimensions. The result
is sometimes referred to as a 5-dimensional super hypercube [1].
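The wire frame construction can be sketched in a few lines. The following is a minimal illustration, not the applet cited in [1]: it assumes each dimension is assigned one arbitrary 2-D direction (the role played by a drag point), and that a cube vertex is drawn at the sum of the directions of the coordinates that are set to 1. The function name project_hypercube and the five angles are hypothetical choices made for this example.

```python
import math
from itertools import product

def project_hypercube(directions):
    """Project the unit n-cube onto the plane.

    directions: one (dx, dy) vector per dimension (an assumed stand-in
    for the drag points of Figure 1).
    Returns a dict vertex -> (x, y) and a list of edges (vertex pairs).
    """
    n = len(directions)
    vertices = list(product((0, 1), repeat=n))

    def to_2d(vertex):
        # A vertex is drawn at the sum of the directions of its "on" bits.
        x = sum(bit * dx for bit, (dx, _) in zip(vertex, directions))
        y = sum(bit * dy for bit, (_, dy) in zip(vertex, directions))
        return (x, y)

    points = {v: to_2d(v) for v in vertices}
    # Two vertices share an edge when they differ in exactly one coordinate.
    edges = [(v, v[:i] + (1,) + v[i + 1:])
             for v in vertices for i in range(n) if v[i] == 0]
    return points, edges

# Hypothetical angles (degrees) for dimensions 1..5; 0 and 90 give x and y.
angles = [0, 90, 35, 60, 120]
directions = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
              for a in angles]
points, edges = project_hypercube(directions)
```

Adding a sixth dimension only requires appending another angle, which is the sense in which the choice of vector for each new dimension is arbitrary.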
1.2 Other n-dimensional Displays
A number of other methods have been proposed
to convey n-dimensional information. Figure 2 is taken
from a dissertation by Sami Kaski [2]. It displays a
ten-dimensional data item visualized using four
different methods. These include (a) a profile of the
component values, (b) a "star" in which the length of
each ray emanating from the center illustrates one
component, (c) Andrews' curve [3], and (d) a facial
caricature. The first is simply a set of grouped bar
graphs which show the values associated with each
item. There are serious limitations on how many
values can be portrayed with this method. The star is
related to the technique proposed here. However,
since the rays emanate from the center there is a limit
to the number of rays that can be shown before they
overlap. Representing data as a wave, as in (c), allows
comparison of multiple data points but again has a
limit on how many can be viewed and compared in a
meaningful way. In (d), Chernoff's faces [4] use each
dimension of the data to determine the size, location,
or shape of some component of a face. This
technique is based on the concept that human brains
are well adapted for recognizing and remembering
faces.
Figure 2: Methods for conveying additional information
(dimensionality)
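As an illustration of method (c), the sketch below evaluates the usual form of Andrews' curve [3], in which a data item x = (x1, x2, x3, ...) is drawn as the function f(t) = x1/sqrt(2) + x2 sin t + x3 cos t + x4 sin 2t + x5 cos 2t + ... over the interval from -pi to pi. The function name andrews_curve and the sample item are assumptions made for this example and do not come from the cited dissertation.

```python
import math

def andrews_curve(x, t):
    """Value at angle t of the Andrews curve of the data item x."""
    total = x[0] / math.sqrt(2)
    for i, xi in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2          # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            total += xi * math.sin(harmonic * t)
        else:
            total += xi * math.cos(harmonic * t)
    return total

# Sample the curve of one hypothetical ten-dimensional item at 101 angles.
item = [0.3, 0.8, 0.1, 0.5, 0.9, 0.2, 0.7, 0.4, 0.6, 0.0]
ts = [-math.pi + 2 * math.pi * k / 100 for k in range(101)]
curve = [andrews_curve(item, t) for t in ts]
```

Each data item becomes one curve, so comparing many items means overlaying many curves, which is the source of the limit noted above.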
1.3 Parallel Coordinate Scheme [5,6,7].
This method, due to Inselberg and others, plots
the data with all the axes parallel to each other
in a plane. This preserves some of the geometric
structure. It is, however, projected in such a fashion
that most geometric intuition has to be relearned.
Inselberg has used it to achieve impressive results but
the intuition is not clear to a novice user.
Figure 3: A parallel Axis System
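A minimal sketch of the parallel axis layout follows. It assumes the axes are vertical lines at x = 0, 1, 2, ... and that each attribute is scaled to the range [0, 1] along its own axis; a sample then becomes the polyline joining its value on each axis. The function name and the scaling convention are assumptions made for this example and are not taken from [5,6,7].

```python
def parallel_coordinates_polyline(sample, lows, highs):
    """Return the polyline vertices (axis position, scaled value)."""
    points = []
    for i, (v, lo, hi) in enumerate(zip(sample, lows, highs)):
        # Scale each attribute to [0, 1]; a constant attribute sits mid-axis.
        y = (v - lo) / (hi - lo) if hi > lo else 0.5
        points.append((float(i), y))
    return points

# Hypothetical three-attribute sample with per-attribute ranges.
polyline = parallel_coordinates_polyline([5, 20, 7], [0, 10, 7], [10, 30, 7])
```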
2. The Proposed Method of Viewing n-dimensional Information
The method is straightforward but offers many
variations in implementation and tailoring. In the first
version, a set of six variables was used, three of which
were related and three of which were random. This
was simply generated using an Excel spreadsheet.
The results were then normalized and projected onto
the sides of a regular n-sided polygon (in this
case a hexagon). The six projected points were then
simply averaged to create a "center of mass".
Because the outline looks like a cut stone with
twinkling facets it is nicknamed a jewel diagram.
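The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the spreadsheet used for the figures: the normalization is taken to be min-max scaling per attribute, and a normalized value t on side k is assumed to lie at the fraction t of the way from vertex k to vertex k+1 of the polygon. All function names and the sample data are hypothetical.

```python
import math

def polygon_vertices(n, radius=1.0):
    """Vertices of a regular n-sided polygon centred at the origin."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def normalize(samples):
    """Min-max scale each attribute (column) to [0, 1] (assumed scheme)."""
    columns = list(zip(*samples))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.5
             for v, lo, hi in zip(row, lows, highs)]
            for row in samples]

def side_points(values, radius=1.0):
    """Place normalized value k on side k, running from vertex k to k+1."""
    n = len(values)
    verts = polygon_vertices(n, radius)
    points = []
    for k, t in enumerate(values):
        (x0, y0), (x1, y1) = verts[k], verts[(k + 1) % n]
        points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points

def center_of_mass(values, radius=1.0):
    """Average the n side points into one point inside the polygon."""
    pts = side_points(values, radius)
    return (sum(x for x, _ in pts) / len(pts),
            sum(y for _, y in pts) / len(pts))

# Hypothetical data: three samples with six attributes each.
samples = [[1, 20, 3, 0.5, 7, 40],
           [2, 25, 1, 0.9, 5, 10],
           [3, 30, 2, 0.1, 6, 25]]
centers = [center_of_mass(row) for row in normalize(samples)]
```

Joining the six side points of a sample in order gives the connected outline of the jewel diagram, while centers holds the single point plotted for each sample.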
Figure 4: Jewel diagram with six dimensions, with the attributes
for each sample connected.
While interesting to look at, the results are hard to
distinguish due to the numerous colors being used to
identify the different sample sets (thirteen used in the
example). This means the method will have serious
limitations as more and more samples are introduced.
However, notice the similarity of the above
diagram to the parallel coordinate system of
Inselberg [5,6,7]. While Inselberg uses parallel axes,
this set of lines follows the same layout but around the
polygon. The implications of this have not been
fully explored and are left for later work. The lines
may be useful, but pruning them out in the next
step makes the diagram simpler to view, in the hope
of making it useful for much larger data sets.
The representation in Figure 5 provides z-axis
separation of each sample for clarity. The results
displayed are for a set of meteorological data
samples. The close-up shows the sequence.
Figure 5: Jewel diagram with connecting lines removed.
With the lines removed, each side of the polygon
represents the distribution of values for one attribute
(labeled Series 1 through 6 in the diagram). The
cluster in the center shows the “center of mass” for
each sample. Each sample is now represented by a
single point in the center of the polygon. For the
purposes of this example an attempt was made to
retain the identity of the points by using color and
patterns. However, that would not be necessary in
larger data sets.
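Reducing each sample to a single point has a cost that the abstract already notes: the mapping to two dimensions is not unique. The sketch below illustrates this under the same assumed projection rule as before (value t on side k lies at the fraction t of the way from vertex k to vertex k+1). The function name and the three samples are hypothetical.

```python
import math

def center_of_mass(values):
    """Average of the side points of one normalized sample (unit polygon)."""
    n = len(values)
    verts = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
             for k in range(n)]
    sx = sy = 0.0
    for k, t in enumerate(values):
        (x0, y0), (x1, y1) = verts[k], verts[(k + 1) % n]
        sx += x0 + t * (x1 - x0)
        sy += y0 + t * (y1 - y0)
    return (sx / n, sy / n)

# Three clearly different six-dimensional samples.
low = [0.2] * 6
high = [0.8] * 6
mixed = [0.9, 0.1, 0.5, 0.9, 0.1, 0.5]

# Under the assumed rule all three land at the center of the hexagon.
centers = [center_of_mass(v) for v in (low, high, mixed)]
```

In this sketch any sample whose values on opposite sides of the hexagon are equal collapses onto the center, so the distribution of points along the polygon sides is still needed to tell such samples apart.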
To assist with the visibility problem, the idea of
creating a VRML space in which each sample is
separated and can be viewed from different angles
was explored. While this solves some of the
separation problems and enables exploration of the
values on the polygon edges, many of the basic
problems remain.
Figure 6: Detail of the VRML view. Notice the points on the
polygon showing the individual values.
Since the data is in VRML, the viewer can fly
through the space to examine the detail, or view it
from a distance to gain an overview.
Figure 7: Skewed angle view in VRML
This led to the creation of a utility to generate
the VRML. Note that a browser with a VRML-enabled
plug-in is required. The screenshots show Netscape
and Cosmoviewer as the software. Using other
software may produce a slightly different appearance.
Figure 5: Screen shot of the VRML utility
http://jockheck.northern.edu/vrml2/demo12.html
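To give a sense of what such a utility has to produce, the sketch below writes a small VRML 2.0 scene in which each sample's center-of-mass point is drawn as a sphere and successive samples are separated along the z axis. It is an illustration only and not the utility at the address above: the function name vrml_scene, the sphere radius, the color, and the z spacing are all assumptions made for this example.

```python
def vrml_scene(samples_2d, z_step=0.1, radius=0.02):
    """Return VRML 2.0 text with one sphere per (x, y) sample point.

    Sample i is placed at z = i * z_step so that samples are separated
    in depth (an assumed spacing rule).
    """
    lines = ["#VRML V2.0 utf8"]
    for i, (x, y) in enumerate(samples_2d):
        z = i * z_step
        lines.append(
            "Transform {\n"
            f"  translation {x:.4f} {y:.4f} {z:.4f}\n"
            "  children [\n"
            "    Shape {\n"
            "      appearance Appearance {"
            " material Material { diffuseColor 1 0 0 } }\n"
            f"      geometry Sphere {{ radius {radius} }}\n"
            "    }\n"
            "  ]\n"
            "}"
        )
    return "\n".join(lines) + "\n"

# Two hypothetical center-of-mass points.
text = vrml_scene([(0.0, 0.0), (0.1, -0.2)])
```

The returned text can be saved with a .wrl extension and opened in a VRML viewer.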
3. Concerns
There are serious problems in an absolute
mathematical sense with this approach.
3.1 Loss of uniqueness. Points mapped to 2-D space
are not a unique representation of the n-D space. It
may be possible to construct such a mapping but it
would probably require the insertion of additional
variables which combine the existing ones. It may be
necessary to have up to n! sides to the polygon to
produce uniqueness.
3.2 Shrinking viewing space. The points mapped
into 2-D space will converge to a single point (no
distinguishable variation in values) as the number of
sides grows and the polygon approaches a circle. This
is due to the fact that the values are averaged. While it
is possible to increase the diameter of the near circle to
separate the points, when n is very large it is not
possible to view the distribution of the individual
attributes on the sides of the polygon in the same
image as the center of mass points.
3.3 Sequence of variables. The sequence in which the
variables are placed on the polygon has a significant
impact. Note that (in Figure 4) series 1 and 4 increase
in opposite directions (180 degree offset), hence
directly offsetting each other, while 1 and 2 move in
a similar (30 degree offset) direction. Changing the
sequence of the variables changes the diagram
significantly.
3.4 Usefulness. Initial exploration has not
consistently produced useful patterns. It appears that
the pattern is lost due to the perturbations caused by
the random (or uncorrelated) values in some variables,
which then move the points randomly in the
two-dimensional space.
4. Conclusion and Future Directions
This technique aims to provide a method
for viewing n-dimensional data and hence
interpreting such items as clustering in data mining.
Like most tools it has its drawbacks. However, the
method of displaying the information provides a
way for human viewers to gain understanding and
insight.
Figure 8: A distant view of the sequence in VRML.
Despite the drawbacks, this technique at least
provides a method to visualize n-dimensional
clustering. As a result the author intends to continue
to use the technique to represent data mining results
in n-dimensional space.
5. References
[1] Nick Jackiw, October 1997, from
http://www.keypress.com/sketchpad/java_gsp/hypercube.html
[2] Kaski, S. (1997) Data exploration using self-organizing
maps. Acta Polytechnica Scandinavica, Mathematics,
Computing and Management in Engineering Series
No. 82, Espoo, 57 pp. Published by the Finnish
Academy of Technology. ISBN 952-5148-13-0.
ISSN 1238-9803. UDC 681.327.12:159.95:519.2.
Accessible at http://www.cis.hut.fi/~sami/thesis
[3] Andrews, D. F. (1972) Plots of high-dimensional
data. Biometrics, 28:125-136.
[4] Chernoff, H. (1973) The use of faces to represent
points in k-dimensional space graphically. Journal of
the American Statistical Association, 68:361-368.
[5] Inselberg, A. (1998) Visual data mining with
parallel coordinates. Computational Statistics,
13(1):47-63.
[6] Inselberg, A. and Dimsdale, B. (1994)
Multidimensional lines I: Representation. SIAM
Journal on Applied Mathematics, 54(2):559-577.
[7] Parallel coordinate graphics using MATLAB,
http://www.nbb.cornell.edu/neurobio/land/PROJECTS/Inselberg/