Data Mining with the Purpose of Elucidating Multiresolution Spatial Patterns Linked to Fragmentation

Sonia Rodríguez
Universidad de Costa Rica
Escuela de Matemática
San José, Costa Rica
E-mail: [email protected]

1. Introduction

Data mining is a new field, born as a consequence of modern computing technology. It can be considered a combination of science and art that borrows some of its methods and tools from statistics, database technology, machine learning, knowledge discovery, pattern recognition and other fields. It has much in common with exploratory data analysis, a term more familiar to statisticians. However, data mining is only concerned with data sets of enormous size, with storage requirements on the order of megabytes, gigabytes, or even terabytes. From dealing with data sets we progressed to databases and finally to data warehouses. It is clear that new methods were needed in order to process and analyse such amounts of information.

For scientific applications, data mining can be defined as "the process of secondary analysis of large databases aimed at finding unsuspected relationships which are of interest or value to the database owner" (Hand, 1998). Commercial applications, by contrast, consider that "prediction is arguably the strongest goal of data mining" (Weiss and Indurkhya, 1998).

With the first definition in mind, we are concerned with the exploration of categorical raster maps of 102 watersheds in Pennsylvania, obtained from satellite images, with the goal of uncovering spatial patterns that could be linked to fragmentation processes. The smallest watershed is about half a megabyte in size, the biggest a little more than nine megabytes. Two common tools in data mining, when the purpose is exploration, are clustering and visualization. In addition, we propose the use of empirical entropy profiles as a new method of data reduction and analysis.

2. Data Sets and Software

The raw data consisted of 102 raster maps that were generated from two data sets: a land-use map of Pennsylvania and the watershed boundary polygons, property of the Office of Remote Sensing of Earth Resources at Penn State University. Information on these data, including metadata for the land coverage, can be obtained at http://www.pasda.psu.edu.

The land cover classification was performed using both a supervised and an unsupervised approach. The six bands of thematic mapper imagery were first compressed into 255 clusters using the Erdas Imagine version of ISODATA. The supervised classification was done with the PHASE program (Myers, 1997), producing maps with 8 land cover categories: water, conifer forest, mixed forest, broadleaf forest, vegplex (a mixture), perennial herbaceous, annual herbaceous and terrestrial unvegetated. The resulting grids were then processed using custom programs written in C that dealt with the spatial structure of the images and calculated the empirical entropy profiles (Taillie, 1998).

3. Method

Considering the landscape as a rectangular grid and dividing it into groups of 4 cells, we record the color of every group as a 4-tuple:

    1 1
    3 4    ->   4-tuple (1134)   ->   Frequency Table

A table of the distribution of the 4096 possible 4-tuples (8^4 for eight categories) was constructed and the empirical entropy value of the image calculated. Then, a random filter was applied iteratively to the grid, selecting one pixel at random from every group of 4 cells. After each application of the random filter, the empirical entropy was calculated for the newly obtained image. Given the size of the maps, the filter was applied eight times and hence the entropy profile for each watershed consists of the eight empirical entropy values.

The entropy profiles have a characteristic form.
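The procedure of Section 3 can be illustrated with a minimal Python sketch. It is not the C code cited above (Taillie, 1998). It assumes that the groups of 4 cells are non-overlapping 2x2 blocks, that entropy is the Shannon entropy of the 4-tuple frequencies in natural-log units, and that the first profile value is taken at iter = 0; the source does not state these details. Cells outside the watershed boundary are not handled, and the function names are hypothetical.

```python
import math
import random
from collections import Counter


def tuple_entropy(grid):
    """Shannon entropy (natural log, an assumption) of the 4-tuple
    frequency table built from non-overlapping 2x2 blocks."""
    rows = len(grid) // 2 * 2      # drop a trailing odd row/column
    cols = len(grid[0]) // 2 * 2
    counts = Counter(
        (grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
        for r in range(0, rows, 2)
        for c in range(0, cols, 2)
    )
    total = sum(counts.values())
    return -sum(n / total * math.log(n / total) for n in counts.values())


def random_filter(grid, rng):
    """Keep one randomly chosen pixel from every 2x2 block,
    halving each dimension of the grid."""
    rows = len(grid) // 2 * 2
    cols = len(grid[0]) // 2 * 2
    return [
        [grid[r + rng.randrange(2)][c + rng.randrange(2)]
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]


def entropy_profile(grid, n_values=8, seed=0):
    """Entropy at iter = 0, 1, ... until n_values are collected
    or the grid is too small to form a 2x2 block."""
    rng = random.Random(seed)
    profile = []
    for _ in range(n_values):
        if len(grid) < 2 or len(grid[0]) < 2:
            break
        profile.append(tuple_entropy(grid))
        grid = random_filter(grid, rng)
    return profile
```

For a map with a single category every block yields the same 4-tuple, so each entropy in the profile is zero; a map whose pixels are spatially independent would instead give a nearly flat profile close to its maximum.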
Starting with a minimum value for iter = 0 (floor resolution), the entropies keep increasing with further applications of the random filter until they approach an asymptotic value (Figure 1).

[Figure 1. Entropy Profile. Panels: ws025, ws120, ws105; x-axis: iter; y-axis: Entropy.]

The following parametric model was fitted to the profiles:

(1)  $y = c - a \exp(-b \cdot \mathrm{iter})$

where $c$ is the asymptotic value, $c - a$ is the entropy value at floor resolution, and $b$ is the rate at which the asymptotic value is approached. Assuming $b$ fixed and writing $x = \exp(-b \cdot \mathrm{iter})$ produces a linear model: $y = c - ax$. The least squares estimators of $c$ and $a$ are given by:

(2)  $c = \left( R \sum_{i=1}^{8} w_i y_i - Q \sum_{i=1}^{8} w_i x_i y_i \right) / D$

(3)  $a = \left( Q \sum_{i=1}^{8} w_i y_i - P \sum_{i=1}^{8} w_i x_i y_i \right) / D$

where $w_i$ are weights, $P = \sum_{i=1}^{8} w_i$, $Q = \sum_{i=1}^{8} w_i x_i$, $R = \sum_{i=1}^{8} w_i x_i^2$, and $D = PR - Q^2$.

The next step is to find the value of $b$ which minimizes the sum of squares. As seen in Figure 2, the plot of SumSq vs $b$ has a minimum that can be easily approximated. The values of $c$ and $a$ are then given by (2) and (3).

[Figure 2. SumSq vs b. x-axis: Parameter b; y-axis: SumSq.]

4. Empirical Entropy Profiles and Collapsing of Categories

Even though we compressed several megabytes into eight numbers, the empirical entropy profiles still keep information about the landscape patterns. In Figure 1 we can see the plots corresponding to three different watersheds in the Appalachian region: the first is mostly forested, the second shows some degree of deforestation, and the last one is the most deforested of the three and presents a fragmentation process. In these three plots it is evident that as the process of deforestation starts in a landscape, the entropy profiles begin to increase.
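As an aside on the fit of model (1) described in Section 3, the two-stage estimation (closed-form c and a for a fixed b, then a one-dimensional search over b) can be sketched as below. This is an illustration under stated assumptions rather than the procedure actually used: the search is a plain grid over b in (0, 6] with step 0.01, a range suggested only by the axis of Figure 2; the weights default to 1; and the function names are hypothetical.

```python
import math


def fit_c_a(y, b, w=None):
    """Weighted least-squares c and a for y = c - a*x, x = exp(-b*iter),
    following estimators (2) and (3). Also returns the sum of squares."""
    n = len(y)
    w = w or [1.0] * n                      # unit weights: an assumption
    x = [math.exp(-b * i) for i in range(n)]
    P = sum(w)
    Q = sum(wi * xi for wi, xi in zip(w, x))
    R = sum(wi * xi * xi for wi, xi in zip(w, x))
    s_y = sum(wi * yi for wi, yi in zip(w, y))
    s_xy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    D = P * R - Q * Q
    c = (R * s_y - Q * s_xy) / D
    a = (Q * s_y - P * s_xy) / D
    sumsq = sum(wi * (yi - (c - a * xi)) ** 2
                for wi, xi, yi in zip(w, x, y))
    return c, a, sumsq


def fit_profile(y, b_grid=None, w=None):
    """Grid search for the b that minimizes the sum of squares.
    b = 0 is excluded because it makes D = 0."""
    b_grid = b_grid or [0.01 * k for k in range(1, 601)]  # assumed range
    best_b = min(b_grid, key=lambda b: fit_c_a(y, b, w)[2])
    c, a, sumsq = fit_c_a(y, best_b, w)
    return a, best_b, c, sumsq
```

Calling `fit_profile(profile)` on an eight-value entropy profile returns the estimates of a, b and c together with the minimized sum of squares. A finer search or a one-dimensional optimizer could replace the grid without changing the closed-form step.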
There is, however, a certain point where the trend reverses: if deforestation keeps increasing, the patches corresponding to land devoted to agriculture and construction tend to lump together, producing a decrease in entropy.

We are interested in changes that occur when categories are collapsed. The eight original categories were first regrouped into four classes (water, forest, veg-mix and human-intensive) and then into two final categories (forest and non-forest). For illustration purposes, we selected 9 watersheds from the 102 in the following way:

Table 1. Selected Watersheds

                         Physiographic Province
    Watershed Status     Appalachian   Ridge    Piedmont
    Mostly forested      ws025         ws098    ws148
    Transitional         ws120         ws073    ws138
    Highly deforested    ws105         ws125    ws126

Figure 3 shows the empirical entropy profiles of three of the nine selected watersheds corresponding to eight, four and two categories. After the first collapsing there is a pronounced drop in the entropies as a result of the loss of small patches. After the second collapsing, however, the drop is less pronounced.

[Figure 3. Entropy Profiles after Collapsing of Categories. Panels: ws148, ws138, ws126; x-axis: iter; y-axis: Entropy.]

5. Conclusion

The use of empirical entropy profiles is a promising method for data exploration and for reduction of categorical raster maps. In addition, the spatial information that is contained in the profiles still enables classification of landscapes.

REFERENCES

Hand, D. J. (1998). Data mining: statistics and more? The American Statistician 52, 112-118.

Johnson, G. D. and Patil, G. P. (1998). Quantitative multiresolution characterization of landscape patterns for assessing the status of ecosystem health in watershed management areas. Ecosystem Health 4(3), 177-187.

Myers, W. L., Patil, G. P., and Taillie, C. (1998). PHASE formulation of synoptic multivariate landscape data. Technical report. CSEES. Pennsylvania State University.

Taillie, C. (1998). Notes on the use of wsclean.c, wsfreq.c, wsgroup.c and EEP programs. Personal communication.

Weiss, S. M. and Indurkhya, N. (1998). Predictive Data Mining: A Practical Guide. Morgan Kaufmann Publishers, Inc. San Francisco, California.

SUMMARY

The original data consist of satellite images of one hundred and two watersheds in Pennsylvania. These images are transformed into land-use maps with eight categories. A random filter is applied successively to the images and their empirical entropy profiles are calculated. The same process is repeated using maps with four and two categories, after a natural grouping of the original categories. Comparison of the resulting profiles shows differences that may be linked to fragmentation processes in the watersheds.