Download VISUALIZATIONS OF HIGH

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
VISUALIZATIONS OF HIGH-DIMENSIONAL SPACE
M. P. Canton
Computer Science Department
North Dakota State University
Fargo, NN, 58105, USA
[email protected]
Abstract
Spatial data analyzing can occur at
different levels, namely, pixel, pixel-group, or
band. At the pixel-level, the common attributes
are location (on the coordinate plane or a geolocation) and reflectance values (0-255 encoded
in eight bits). If we group the reflectance
attributes vertically [1] [2], the data analyzing
operations can be described as band operations.
In this paper we define band operations as
functionals and visualize the process using 1D
and 2D representors, namely Jewell and
Mountain Chains diagrams.
1
INTRODUCTION
This paper is organized a follows: first we
briefly describe the massive datasets found in
remotely sensed imagery and present the various
image band formats in use, then introduce the
concept of band operations as functionals and
define the functionals. Discussion of 1D and 2D
representors follow, with our conclusions at the
end.
Over the past three decades, we have
become increasingly concerned with the
possibility of changes in the global environment.
The detection and quantification of climatic and
biological patterns of change in the earth's
biosphere is a difficult task. However, a record
of these changes has been available since 1972 to
date [3]. We have a continuous dataset of the
earth’s continental surface – once every sixteen
days over every point on the planet – that started
with the launching of Landsat 1, an earthobserving satellite [4].
One of the major problems in gleaning
information from this massive dataset is the
sheer amount of data that has been collected over
the past 34 years [5] that will soon reach
petabyte-size [6]. Just one thematic mapper
W. Perrrizo
Computer Science Department
North Dakota State University
Fargo, ND, 58105, USA
[email protected]
image in time, over a 180-km square area,
consists of well over one billion pixels [7].
This increase in data, with the resulting
increases in the sizes of data repositories, has
spurred research in data analyzing and mining.
Database systems are undergoing revolutionary
changes [8]. They are also our main hope to deal
with the information avalanche hitting
individuals, organizations, and all aspects of
human organization [9].
Data mining encounters two types of
scalability issues: row (or database size)
scalability and column (or dimension)
scalability. A data mining system is considered
row scalable if, when the number of rows is
enlarged 10 times, it takes no more than 10 times
to execute the same data mining queries [9]. A
data mining system is considered column
scalable if the mining execution time increases
linearly with the number of columns (or
attributes or dimensions) [9].
This paper addresses the issues of spatial
data scalability by using functionals and
visualizing the process using 1D and 2D
representors.
2
IMAGE BAND FORMATS
There are various image formats in use for
remotely sensed imagery; conventionally, image
data organized by bands is usually structured in
the following band formats: band interleaved by
pixel (BIP), band interleaved by line (BIL), and
band sequential (BSQ). Two other formats, for
vertically structuring this data, have been
developed [10][11]: band interleaved by pixel
(BIp), and bit sequential (bSQ).
In band interleaved by pixel, a pixelconsecutive scheme, the banded data is stored in
pixel-major order. Figure 1 illustrates how a
scene originally sensed in three different
radiation ranges is stored in three corresponding
bands, and how the band data is digitized and
saved in BIP format.
stored in three corresponding bands, and how the
band data is digitized and saved in BSQ format.
0 3 2 4 .. 1 0 0 ..
BAND 1
DIGITIZED AND FORMATTED DATA
0 1 4
3 2 0
2 4 2
4 0 3 ....
BAND 1
0
1
2
3
4
BAND 2
0
1
2
3
4
ANALOG
TO
DIGITAL
SCALE
SCENE
DATA
BAND 3
SCENE
DATA
Figure 3. Band Sequential Format
Figure 1. Band Interleaved by Pixel Format
In the band interleaved by line format, the
image scan line constitutes the organizing base.
When the data is stored in line-major order, the
image appears consecutively in the data. Figure 2
shows a scene originally sensed in three different
radiation ranges and stored in three
corresponding bands, and how the band data is
digitized and saved in BIL format.
DIGITIZED AND FORMATTED DATA
line 1
band 2
line 1
band 3
0 3 2 4 .... 1 2 4 0 .... 4 0 1 3 ....
BAND 1
BAND 2
4 0 1 3 ..
BAND 3
ANALOG
TO
DIGITAL
SCALE
line 1
band 1
1 2 4 0 .. 2 4 3 ..
BAND 2
line 2
band 1
1 0 0 ..
BAND 3
A new organization at the "interleaving
extreme" end of this spectrum of organizations is
Band-Interleaved-by-bit (BIb) [10] in which
there is just one file: the first bit of the first
pixel-value of the first band is followed by the
first bit of the first pixel-value of the second
band, ... , the first bit of the first pixel-value of
the last band, followed by the second bit of the
first pixel-value of the first band, and so on. At
the other end of this organization spectrum is bitSeQential (bSQ) [10] in which each bit of each
band, B11..,18, B21..B28 ... Bn1..Bn8, is a
separate file.
In many cases, a commercial data format is
a variation of one of the standard schemes. For
example, data in BSQ format can be stored in a
single file that includes all bands, or separated
into a different data file for each band. The same
is true about the other formats.
3
BAND OPERATIONS as
FUNCTIONALS
0
1
2
3
4
ANALOG
TO
DIGITAL
SCALE
SCENE
DATA
Figure 2. Band Interleaved by Line
In the band sequential format, the banded
data is stored in band-major order. That is, each
image band appears consecutively in the data
file. Figure 3 shows how a scene originally
sensed in three different radiation ranges is
We introduce band operations in terms of a
16 pixel reduced number, raster ordered
Remotely Sensed Imagery (RSI) dataset. Each
pixel has two structural attributes, x,y, that
identify the row (the id tuples, similar to key
attributes); three feature attributes, R (red), B
(blue), G (green); a class label, Y (yield).; three
functionals – i.e., derived attributes – that we call
RVI (Rough Vegetative Index), NTV
(Normalized Total Variation), and RLTV
(Round Log Total Variation). For the
functionals, we illustrate the contours around a
given pixel (see Figure 5).
x
y
R
G
B
Y
0
0
0
0
1
1
1
1
2
2
2
2
3
3
3
3
0
1
2
3
0
1
2
3
0
1
2
3
0
1
2
3
2
3
2
2
3
2
7
7
1
0
0
0
1
2
1
3
6
6
5
5
1
1
1
1
5
6
7
7
7
7
6
5
1
1
1
1
2
2
0
0
0
0
0
0
0
0
0
0
2
2
1
1
0
0
0
0
1
2
2
2
2
1
2
1
36
2.25
76
4.75
8
0.5
19
1.875
S =
 =
2-6+7
3-6+7
2-5+7
2-5+7
3-1+7
2-1+7
7-1+7
7-1+7
1-5+7
0-6+7
0-7+7
0-7+7
0-7+7
2-7+7
1-6+7
3-5+7
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
RVI
NTV
RLTV
3
4
4
4
9
8
13
13
3
1
0
0
1
2
2
5
30
38
6
6
270
262
590
590
30
110
166
166
110
86
54
14
1
2
1
1
2
2
3
3
1
2
2
2
2
2
2
1
Figure 4. RSI Band Functionals
Next, we define the Derived Attribute or
Functional that we call Rough Vegetative Index,
or RVI. This is defined as G – R + 7. The 7 is a
shift so the values are all nonnegative numbers
between 0 and 15 (i.e., four-bit numbers). We
also show the RVI contours, M  RVIxy-1 and H
 RVIxy-1 [10], with respect to x and y. The sum
of the mean and feature vectors, S and , are
shown on the last two rows of the table. The
Normalized Total Variation functional, NTV, is
NTV(t) = t-2= 16[(tR - R)2 + (tG - G)2 + (tB B)2], where t is a variable over the 16 pixel
tuples. For example, in the first row, where R=2,
G=6, and B=1, then NTV= 16[(2 - 2.25)2 + (6 4.75)2 + (1 - 0.5)2] = 30. Since these NTV values
are large ten-bit numbers, in order to reduce the
bitwidth, we define a more convenient
functional, RLTV, or Round Log Total
Variation. RLTV  Round (log10,0); this gives a
2-bit derived attribute with essentially the same
contours. In order to define contours, let
contours, functional pruning can be used to
prune off non-neighboring tuples.[2]
q  R k be the set of all points x  R n such
that f(x) = q is the preimage of q under f and is
1
(q). Now let [ p, q]  R k be
n
the set of all points x  R , such that
f ( x)  [ p, q] is the preimage of [p,q] under f,
denoted as
f
or the contour of [p,q] under f. Using the
Figure 5. H and M contours with respect to x, y
4
1D and 2D TUPLE VISUALIZATION
We now refer to the RSI dataset of Figure 4
as a function, X, as follows: let f:X  Rn  Rk be
any function, with R = reals.
If k = 1, then we call it a functional, a new
derived attribute (column) with f(x) in this
column for row x, where Total Variation(t) =
TV(t)  xX(t-x)(t-x) and NTV(t) – TV(x) =
t-x2.
If k = 2 or 3, then we call it a diagram and
its range can be viewed as a plot of points, as in
the Jewell and Mountain Chain diagrams. These
are related to Parallel Coordinates [11].
If k = n, then it is a vector field, which can
be thought of as hair on n-space – one strand of
hair sticking attached at each point e.g., the
gradient field of a functional, f, where
f (a)  (f / x1 (a),..., f / xn (a) .
The case where k = 2 or 3 is illustrated
with the following examples. In Figures 6, 7, 8
the attributes are represented by A1 through A8,
and depicted by the straight lines in the
diagrams. In these simple examples, we look at
the distances between pairs of centroids: the red
and blue points should be clustered together, as
similar patters, and the green should be
identifiable as an outlier (different pattern).; then
we compare the examples to these expected
results. The diagrams can be viewed as
functionals to 2D space and their centroids as
functionals to 1D space.
The Parallel Coordinates diagram [11], see
Figure 6, is given as a reference point for the
other diagrams. The intersection points with the
vertical lines represent the attribute’s value in
the range of 0-15. The result, shown by the three
dots, is the derived attribute, or functional. For
its computation, any statistical function can be
used. Here, the red and green centroids are
approximately equidistant from the blue
centroid.
A1
A2
A3
A4
A5
Figure 6. Parallel Coordinates
The Jewell Diagram [12], see Figure 7,
developed by P. Juell of North Dakota State
University, Fargo, represents a 2D view of the
attributes and the resulting functional. It consists
of an n-sided polygon, where n = number of
attributes. Each point of data, or attribute value,
is normalized on a side where one vertex is 0 and
the other is 1. The derived attribute, or
functional, is formed by the centroid; the
distance from the center represents how close the
approximations are to the defined statistical
function. We view the centroids of the Jewell
diagram as examples of functionals. However,
if, for example, the goal is to produce a visual
“outlier screen”, neither the parallel coordinates
diagram nor jewel diagram separate the green
outlier point from the other two points.
This led to the development of another twodimensional diagram called the AUGH diagram
(Always Up Grade Himalayan diagram).
A2
A3
A
A
5
1
A1
A4
A5
Figure 7. Jewell Diagram
The Jewell diagram and the AUGH
diagrams represent multidimensional views.
(See Figures 7,8).
A
A
1
In the AUGH diagrams, the dots “climbing”
the sides represent attribute values in the range
0-15. The derived attribute, a functional, is
shown by the dots of the centroid.
Note that the angle of inclination of each
succeeding dimension is doubled in the AUGH
diagram. This provides an additional measure of
separation for outliers.
Figure 8. Concatenated dimensions Himalayan
diagram (Top) - Always Up Grade Himalayan
(AUGH) diagram (Bottom).
5
In this example, there is clustering of red and
blue points but the green point shows up as an
outlier. We can note that the AUGH diagram is
similar to parallel coordinates in that the vertical
component is just the mean value (as it is in
parallel coordinates), but the horizontal
component is spread out by changing the angle
of inclination.
Also in figure 8 (Top) we show a
Himalayan diagram in which the dimension axes
are simple concatenated. This approach tends to
exhibit the same canceling effect as the jewel
diagram (for outlier visualization anyway).
5
CONCLUSIONS
These diagrams provide a method for
viewing n-dimensional data, and, hence, a
preliminary and rough interpretation of
clustering and outlier detection. This might be
useful in providing a coarse filter for pruning the
data set, addressing scalability issues and
identifying outliers.
Preliminary results, obtained with minimal
datasets, support the use of band operations as
functionals and of the derived tuple
visualizations as a tool for quickly determining
the value ranges that satisfy predefined data
analysis and mining conditions; it should be
noted that these are very preliminary results and
further work with full datasets is necessary
before the advantage of their use can be fully
understood.
6
REFERENCES
[1] W. Perrizo, “The P-tree Algebra,”
Proceedings of ACM Symposium on
Applied Computing, pp. 426-431, Madrid,
Spain, March 2002.
[2] M. Serazi, “A Super-Max Data Mining
Benchmark by Vertically Structuring Data,”
Ph.D. Thesis, North Dakota State
University, Fargo, ND, 2005.
[3] EROS, National Center for Earth
Resources Observation and Science Annual
Report 2004, USGS, Sioux Falls, SD, 2004.
[4] NASA, Landsat,
http://landsat.gsfc.nasa.gov/about/history.ht
ml
[5] N. Short, Remote Sensing Tutorial,
NASA, USA, May 2006.
[6] M. Serazi, A. Perera, W. Perrizo, Multilayered Framework for Distributed Data
Mining, 13th International Conference on
Intelligent and Adaptive Systems and
Software Engineering, Niece, France, 2004.
[7] J. Sanchez and M. Canton, Space Image
Processing, CRC Press, Boca Raton FL,
1999.
[8] J. Gray, The Next Database Revolution.
Keynote address. SIGMOD 2004,San
Francisco, CA, 2004.
[9] J. Han and M. Kamber, Data Mining:
Concepts and Techniques. Morgan
Kaufman, San Francisco, CA, 2001.
[10] W. Perrizo, “Peano Count Tree
Technology Lab Notes,” Technical Report
NDSU-CS-TR-01-1, North Dakota State
University, Computer Science Department,
http://www.cs.ndsu.nodak.edu/~perrizo/clas
ses/785/pct.html, 2001.
[11] A. Inselberg, Visual data mining with
parallel coordinates, COMPUTATION
STAT 13: (1) 47-63, 1998.
[12] P. Juell and W. Jockheck, Visualizing N
Dimensional Clustering, CATA 2002.