Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
VISUALIZATIONS OF HIGH-DIMENSIONAL SPACE M. P. Canton Computer Science Department North Dakota State University Fargo, NN, 58105, USA [email protected] Abstract Spatial data analyzing can occur at different levels, namely, pixel, pixel-group, or band. At the pixel-level, the common attributes are location (on the coordinate plane or a geolocation) and reflectance values (0-255 encoded in eight bits). If we group the reflectance attributes vertically [1] [2], the data analyzing operations can be described as band operations. In this paper we define band operations as functionals and visualize the process using 1D and 2D representors, namely Jewell and Mountain Chains diagrams. 1 INTRODUCTION This paper is organized a follows: first we briefly describe the massive datasets found in remotely sensed imagery and present the various image band formats in use, then introduce the concept of band operations as functionals and define the functionals. Discussion of 1D and 2D representors follow, with our conclusions at the end. Over the past three decades, we have become increasingly concerned with the possibility of changes in the global environment. The detection and quantification of climatic and biological patterns of change in the earth's biosphere is a difficult task. However, a record of these changes has been available since 1972 to date [3]. We have a continuous dataset of the earth’s continental surface – once every sixteen days over every point on the planet – that started with the launching of Landsat 1, an earthobserving satellite [4]. One of the major problems in gleaning information from this massive dataset is the sheer amount of data that has been collected over the past 34 years [5] that will soon reach petabyte-size [6]. Just one thematic mapper W. Perrrizo Computer Science Department North Dakota State University Fargo, ND, 58105, USA [email protected] image in time, over a 180-km square area, consists of well over one billion pixels [7]. This increase in data, with the resulting increases in the sizes of data repositories, has spurred research in data analyzing and mining. Database systems are undergoing revolutionary changes [8]. They are also our main hope to deal with the information avalanche hitting individuals, organizations, and all aspects of human organization [9]. Data mining encounters two types of scalability issues: row (or database size) scalability and column (or dimension) scalability. A data mining system is considered row scalable if, when the number of rows is enlarged 10 times, it takes no more than 10 times to execute the same data mining queries [9]. A data mining system is considered column scalable if the mining execution time increases linearly with the number of columns (or attributes or dimensions) [9]. This paper addresses the issues of spatial data scalability by using functionals and visualizing the process using 1D and 2D representors. 2 IMAGE BAND FORMATS There are various image formats in use for remotely sensed imagery; conventionally, image data organized by bands is usually structured in the following band formats: band interleaved by pixel (BIP), band interleaved by line (BIL), and band sequential (BSQ). Two other formats, for vertically structuring this data, have been developed [10][11]: band interleaved by pixel (BIp), and bit sequential (bSQ). In band interleaved by pixel, a pixelconsecutive scheme, the banded data is stored in pixel-major order. Figure 1 illustrates how a scene originally sensed in three different radiation ranges is stored in three corresponding bands, and how the band data is digitized and saved in BIP format. stored in three corresponding bands, and how the band data is digitized and saved in BSQ format. 0 3 2 4 .. 1 0 0 .. BAND 1 DIGITIZED AND FORMATTED DATA 0 1 4 3 2 0 2 4 2 4 0 3 .... BAND 1 0 1 2 3 4 BAND 2 0 1 2 3 4 ANALOG TO DIGITAL SCALE SCENE DATA BAND 3 SCENE DATA Figure 3. Band Sequential Format Figure 1. Band Interleaved by Pixel Format In the band interleaved by line format, the image scan line constitutes the organizing base. When the data is stored in line-major order, the image appears consecutively in the data. Figure 2 shows a scene originally sensed in three different radiation ranges and stored in three corresponding bands, and how the band data is digitized and saved in BIL format. DIGITIZED AND FORMATTED DATA line 1 band 2 line 1 band 3 0 3 2 4 .... 1 2 4 0 .... 4 0 1 3 .... BAND 1 BAND 2 4 0 1 3 .. BAND 3 ANALOG TO DIGITAL SCALE line 1 band 1 1 2 4 0 .. 2 4 3 .. BAND 2 line 2 band 1 1 0 0 .. BAND 3 A new organization at the "interleaving extreme" end of this spectrum of organizations is Band-Interleaved-by-bit (BIb) [10] in which there is just one file: the first bit of the first pixel-value of the first band is followed by the first bit of the first pixel-value of the second band, ... , the first bit of the first pixel-value of the last band, followed by the second bit of the first pixel-value of the first band, and so on. At the other end of this organization spectrum is bitSeQential (bSQ) [10] in which each bit of each band, B11..,18, B21..B28 ... Bn1..Bn8, is a separate file. In many cases, a commercial data format is a variation of one of the standard schemes. For example, data in BSQ format can be stored in a single file that includes all bands, or separated into a different data file for each band. The same is true about the other formats. 3 BAND OPERATIONS as FUNCTIONALS 0 1 2 3 4 ANALOG TO DIGITAL SCALE SCENE DATA Figure 2. Band Interleaved by Line In the band sequential format, the banded data is stored in band-major order. That is, each image band appears consecutively in the data file. Figure 3 shows how a scene originally sensed in three different radiation ranges is We introduce band operations in terms of a 16 pixel reduced number, raster ordered Remotely Sensed Imagery (RSI) dataset. Each pixel has two structural attributes, x,y, that identify the row (the id tuples, similar to key attributes); three feature attributes, R (red), B (blue), G (green); a class label, Y (yield).; three functionals – i.e., derived attributes – that we call RVI (Rough Vegetative Index), NTV (Normalized Total Variation), and RLTV (Round Log Total Variation). For the functionals, we illustrate the contours around a given pixel (see Figure 5). x y R G B Y 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 2 3 2 2 3 2 7 7 1 0 0 0 1 2 1 3 6 6 5 5 1 1 1 1 5 6 7 7 7 7 6 5 1 1 1 1 2 2 0 0 0 0 0 0 0 0 0 0 2 2 1 1 0 0 0 0 1 2 2 2 2 1 2 1 36 2.25 76 4.75 8 0.5 19 1.875 S = = 2-6+7 3-6+7 2-5+7 2-5+7 3-1+7 2-1+7 7-1+7 7-1+7 1-5+7 0-6+7 0-7+7 0-7+7 0-7+7 2-7+7 1-6+7 3-5+7 = = = = = = = = = = = = = = = = RVI NTV RLTV 3 4 4 4 9 8 13 13 3 1 0 0 1 2 2 5 30 38 6 6 270 262 590 590 30 110 166 166 110 86 54 14 1 2 1 1 2 2 3 3 1 2 2 2 2 2 2 1 Figure 4. RSI Band Functionals Next, we define the Derived Attribute or Functional that we call Rough Vegetative Index, or RVI. This is defined as G – R + 7. The 7 is a shift so the values are all nonnegative numbers between 0 and 15 (i.e., four-bit numbers). We also show the RVI contours, M RVIxy-1 and H RVIxy-1 [10], with respect to x and y. The sum of the mean and feature vectors, S and , are shown on the last two rows of the table. The Normalized Total Variation functional, NTV, is NTV(t) = t-2= 16[(tR - R)2 + (tG - G)2 + (tB B)2], where t is a variable over the 16 pixel tuples. For example, in the first row, where R=2, G=6, and B=1, then NTV= 16[(2 - 2.25)2 + (6 4.75)2 + (1 - 0.5)2] = 30. Since these NTV values are large ten-bit numbers, in order to reduce the bitwidth, we define a more convenient functional, RLTV, or Round Log Total Variation. RLTV Round (log10,0); this gives a 2-bit derived attribute with essentially the same contours. In order to define contours, let contours, functional pruning can be used to prune off non-neighboring tuples.[2] q R k be the set of all points x R n such that f(x) = q is the preimage of q under f and is 1 (q). Now let [ p, q] R k be n the set of all points x R , such that f ( x) [ p, q] is the preimage of [p,q] under f, denoted as f or the contour of [p,q] under f. Using the Figure 5. H and M contours with respect to x, y 4 1D and 2D TUPLE VISUALIZATION We now refer to the RSI dataset of Figure 4 as a function, X, as follows: let f:X Rn Rk be any function, with R = reals. If k = 1, then we call it a functional, a new derived attribute (column) with f(x) in this column for row x, where Total Variation(t) = TV(t) xX(t-x)(t-x) and NTV(t) – TV(x) = t-x2. If k = 2 or 3, then we call it a diagram and its range can be viewed as a plot of points, as in the Jewell and Mountain Chain diagrams. These are related to Parallel Coordinates [11]. If k = n, then it is a vector field, which can be thought of as hair on n-space – one strand of hair sticking attached at each point e.g., the gradient field of a functional, f, where f (a) (f / x1 (a),..., f / xn (a) . The case where k = 2 or 3 is illustrated with the following examples. In Figures 6, 7, 8 the attributes are represented by A1 through A8, and depicted by the straight lines in the diagrams. In these simple examples, we look at the distances between pairs of centroids: the red and blue points should be clustered together, as similar patters, and the green should be identifiable as an outlier (different pattern).; then we compare the examples to these expected results. The diagrams can be viewed as functionals to 2D space and their centroids as functionals to 1D space. The Parallel Coordinates diagram [11], see Figure 6, is given as a reference point for the other diagrams. The intersection points with the vertical lines represent the attribute’s value in the range of 0-15. The result, shown by the three dots, is the derived attribute, or functional. For its computation, any statistical function can be used. Here, the red and green centroids are approximately equidistant from the blue centroid. A1 A2 A3 A4 A5 Figure 6. Parallel Coordinates The Jewell Diagram [12], see Figure 7, developed by P. Juell of North Dakota State University, Fargo, represents a 2D view of the attributes and the resulting functional. It consists of an n-sided polygon, where n = number of attributes. Each point of data, or attribute value, is normalized on a side where one vertex is 0 and the other is 1. The derived attribute, or functional, is formed by the centroid; the distance from the center represents how close the approximations are to the defined statistical function. We view the centroids of the Jewell diagram as examples of functionals. However, if, for example, the goal is to produce a visual “outlier screen”, neither the parallel coordinates diagram nor jewel diagram separate the green outlier point from the other two points. This led to the development of another twodimensional diagram called the AUGH diagram (Always Up Grade Himalayan diagram). A2 A3 A A 5 1 A1 A4 A5 Figure 7. Jewell Diagram The Jewell diagram and the AUGH diagrams represent multidimensional views. (See Figures 7,8). A A 1 In the AUGH diagrams, the dots “climbing” the sides represent attribute values in the range 0-15. The derived attribute, a functional, is shown by the dots of the centroid. Note that the angle of inclination of each succeeding dimension is doubled in the AUGH diagram. This provides an additional measure of separation for outliers. Figure 8. Concatenated dimensions Himalayan diagram (Top) - Always Up Grade Himalayan (AUGH) diagram (Bottom). 5 In this example, there is clustering of red and blue points but the green point shows up as an outlier. We can note that the AUGH diagram is similar to parallel coordinates in that the vertical component is just the mean value (as it is in parallel coordinates), but the horizontal component is spread out by changing the angle of inclination. Also in figure 8 (Top) we show a Himalayan diagram in which the dimension axes are simple concatenated. This approach tends to exhibit the same canceling effect as the jewel diagram (for outlier visualization anyway). 5 CONCLUSIONS These diagrams provide a method for viewing n-dimensional data, and, hence, a preliminary and rough interpretation of clustering and outlier detection. This might be useful in providing a coarse filter for pruning the data set, addressing scalability issues and identifying outliers. Preliminary results, obtained with minimal datasets, support the use of band operations as functionals and of the derived tuple visualizations as a tool for quickly determining the value ranges that satisfy predefined data analysis and mining conditions; it should be noted that these are very preliminary results and further work with full datasets is necessary before the advantage of their use can be fully understood. 6 REFERENCES [1] W. Perrizo, “The P-tree Algebra,” Proceedings of ACM Symposium on Applied Computing, pp. 426-431, Madrid, Spain, March 2002. [2] M. Serazi, “A Super-Max Data Mining Benchmark by Vertically Structuring Data,” Ph.D. Thesis, North Dakota State University, Fargo, ND, 2005. [3] EROS, National Center for Earth Resources Observation and Science Annual Report 2004, USGS, Sioux Falls, SD, 2004. [4] NASA, Landsat, http://landsat.gsfc.nasa.gov/about/history.ht ml [5] N. Short, Remote Sensing Tutorial, NASA, USA, May 2006. [6] M. Serazi, A. Perera, W. Perrizo, Multilayered Framework for Distributed Data Mining, 13th International Conference on Intelligent and Adaptive Systems and Software Engineering, Niece, France, 2004. [7] J. Sanchez and M. Canton, Space Image Processing, CRC Press, Boca Raton FL, 1999. [8] J. Gray, The Next Database Revolution. Keynote address. SIGMOD 2004,San Francisco, CA, 2004. [9] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufman, San Francisco, CA, 2001. [10] W. Perrizo, “Peano Count Tree Technology Lab Notes,” Technical Report NDSU-CS-TR-01-1, North Dakota State University, Computer Science Department, http://www.cs.ndsu.nodak.edu/~perrizo/clas ses/785/pct.html, 2001. [11] A. Inselberg, Visual data mining with parallel coordinates, COMPUTATION STAT 13: (1) 47-63, 1998. [12] P. Juell and W. Jockheck, Visualizing N Dimensional Clustering, CATA 2002.