Data Mining with the Purpose of Elucidating Multiresolution Spatial Patterns Linked to Fragmentation
Sonia Rodríguez
Universidad de Costa Rica
Escuela de Matemática
San José, Costa Rica
E-mail: [email protected]
1. Introduction
Data mining is a new field, born as a consequence of modern computing technology. It can
be considered as a combination of science and art that borrows some of its methods and tools
from statistics, database technology, machine learning, knowledge discovery, pattern recognition and other fields. It has much in common with exploratory data analysis, a term more
familiar to statisticians. However, data mining is only concerned with data sets of enormous
size, with storage requirements in the order of megabytes, gigabytes, or even terabytes. From
dealing with data sets we progressed to databases and finally to data warehouses. It is clear
that new methods were needed in order to process and analyse such amounts of information.
For scientific applications, data mining can be defined as "the process of secondary analysis of large databases aimed at finding unsuspected relationships which are of interest or value to
the database owner" (Hand, 1998). Let us just mention that commercial applications, however,
consider that "prediction is arguably the strongest goal of data mining" (Weiss and Indurkhya,
1998). With the first definition in mind, we are concerned with the exploration of categorical
raster maps of 102 watersheds in Pennsylvania obtained from satellite images, with the goal
of uncovering spatial patterns that could be linked to fragmentation processes. The smallest
watershed is about half a megabyte in size, the biggest a little more than nine megabytes. Two
common tools in data mining when the purpose is exploration are clustering and visualization.
In addition, we propose the use of empirical entropy profiles as a new method of data
reduction and analysis.
2. Data Sets and Software
The raw data consisted of 102 raster maps that were generated from two data sets: a
land-use map of Pennsylvania and the watershed boundary polygons, property of the Office
of Remote Sensing of Earth Resources at Penn State University. Information on these data,
including metadata for the land coverage, can be obtained at http://www.pasda.psu.edu.
The land cover classification was performed using both a supervised and an unsupervised
approach. The six bands of thematic mapper imagery were first compressed into 255 clusters using the Erdas Imagine version of ISODATA. The supervised classification was done
with the PHASE program (Myers, 1997), producing maps with 8 land-cover categories: water,
conifer forest, mixed forest, broadleaf forest, vegplex (a mixture), perennial herbaceous, annual
herbaceous and terrestrial unvegetated. The resulting grids were then processed using custom
programs written in C that dealt with the spatial structure of the images and made
the calculations of the empirical entropy profiles (Taillie, 1998).
3. Method
Considering the landscape as a rectangular grid and dividing it into groups of 4 cells, we
then record the colors of every group as a 4-tuple. For example, the group of cells

    1 1
    3 4

gives the 4-tuple (1134), which is then tallied in a frequency table.
A table of the distribution of the 4096 possible 4-tuples was constructed and the empirical entropy value of the image calculated. Then, a random filter was applied iteratively to
the grid, selecting one pixel at random from every group of 4 cells. After each application of
the random filter, the empirical entropy was calculated for the newly obtained image.
Given the size of the maps, the filter was applied eight times and hence the entropy profile for
each watershed consists of eight empirical entropy values.
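The procedure just described can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch, not the C programs used for the paper. It assumes that the groups of 4 cells are non-overlapping 2 x 2 blocks, that categories are coded as integers 0..n_categories-1, that the entropy is the Shannon entropy in bits, and that the profile records the entropy of the current grid before each filtering step. The function names are hypothetical.

```python
import numpy as np

def block_view(grid):
    """Crop to even dimensions and return the 4 cells of every 2 x 2 block."""
    rows, cols = grid.shape
    g = grid[: rows - rows % 2, : cols - cols % 2]
    # Cell order within a block: upper-left, upper-right, lower-left, lower-right.
    return np.stack(
        [g[0::2, 0::2], g[0::2, 1::2], g[1::2, 0::2], g[1::2, 1::2]], axis=-1
    )

def empirical_entropy(grid, n_categories):
    """Shannon entropy (bits, an assumption) of the 4-tuple frequency table."""
    blocks = block_view(grid).reshape(-1, 4).astype(np.int64)
    # Encode every 4-tuple as one integer in base n_categories.
    weights = n_categories ** np.arange(3, -1, -1)
    counts = np.bincount(blocks @ weights, minlength=n_categories ** 4)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def random_filter(grid, rng):
    """Keep one randomly chosen cell from every 2 x 2 block."""
    blocks = block_view(grid)
    pick = rng.integers(0, 4, size=blocks.shape[:2])
    return np.take_along_axis(blocks, pick[..., None], axis=-1)[..., 0]

def entropy_profile(grid, n_categories, n_values=8, seed=0):
    """Entropy of the grid at iter = 0, 1, ..., n_values - 1."""
    rng = np.random.default_rng(seed)
    profile = []
    for _ in range(n_values):
        if min(grid.shape) < 2:  # no complete 2 x 2 block is left
            break
        profile.append(empirical_entropy(grid, n_categories))
        grid = random_filter(grid, rng)
    return profile
```

With eight categories the frequency table has 8^4 = 4096 entries, as in the text; maps coded 1..8 would need to be shifted to 0..7 before calling these functions.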
The entropy profiles have a characteristic form. Starting with a minimum value for
iter = 0 (floor resolution), the entropies keep increasing after further applications of the random filter until they approach an asymptotic value (Figure 1).
Figure 1. Entropy Profile (panels: ws025, ws120 and ws105; each plots Entropy against iter)
The following parametric model was fitted to the profiles:
(1)  $y = c - a \exp(-b \cdot \mathrm{iter})$
where c is the asymptotic value, c - a is the entropy value at floor resolution, and b is the rate at
which the asymptotic value is approached. Assuming b fixed, and writing $x = \exp(-b \cdot \mathrm{iter})$,
produces a linear model: $y = c - ax$. The weighted least squares estimators of c and a are given by:
(2)  $c = \left( R \sum_{i=1}^{8} w_i y_i - Q \sum_{i=1}^{8} w_i x_i y_i \right) / D$
(3)  $a = \left( Q \sum_{i=1}^{8} w_i y_i - P \sum_{i=1}^{8} w_i x_i y_i \right) / D$
where $w_i$ are weights, $P = \sum_{i=1}^{8} w_i$, $Q = \sum_{i=1}^{8} w_i x_i$, $R = \sum_{i=1}^{8} w_i x_i^2$, and $D = PR - Q^2$.
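For completeness, the estimators can be checked with a short derivation. The sketch below assumes the weighted sum of squares is the criterion being minimized for fixed b; the symbol S is introduced here only for illustration.

```latex
S(c, a) = \sum_{i=1}^{8} w_i \,(y_i - c + a x_i)^2

\frac{\partial S}{\partial c} = 0 \;\Rightarrow\; P\,c - Q\,a = \sum_{i=1}^{8} w_i y_i
\qquad
\frac{\partial S}{\partial a} = 0 \;\Rightarrow\; Q\,c - R\,a = \sum_{i=1}^{8} w_i x_i y_i
```

Solving this 2 x 2 linear system for c and a, with D = PR - Q^2 as the determinant, gives the expressions (2) and (3).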
The next step is to find the value of b which minimizes the sum of squares. As seen in Figure
2, the plot of SumSq vs b has a minimum that can be easily approximated. The values of c and
a are then given by (2) and (3).
Figure 2. SumSq vs b (sum of squares plotted against Parameter b)
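The two-step fit can be sketched as follows: for each trial value of b the closed-form weighted least squares estimates of c and a are computed, and the b with the smallest sum of squares is kept. This is a minimal sketch, not the original implementation. The grid of b values (0.01 to 6, roughly the range of the axis in Figure 2), the default unit weights and the function names are all assumptions.

```python
import numpy as np

def fit_given_b(iters, y, b, w=None):
    """Closed-form weighted least squares for y = c - a*x, x = exp(-b*iter)."""
    iters = np.asarray(iters, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    x = np.exp(-b * iters)
    P, Q, R = w.sum(), (w * x).sum(), (w * x * x).sum()
    Sy, Sxy = (w * y).sum(), (w * x * y).sum()
    D = P * R - Q ** 2
    c = (R * Sy - Q * Sxy) / D
    a = (Q * Sy - P * Sxy) / D
    sumsq = float((w * (y - (c - a * x)) ** 2).sum())
    return float(c), float(a), sumsq

def fit_profile(iters, y, b_grid=None, w=None):
    """Grid search over b; returns (c, a, b, sumsq) at the minimum."""
    if b_grid is None:
        b_grid = np.linspace(0.01, 6.0, 600)  # assumed search range
    fits = [fit_given_b(iters, y, b, w) for b in b_grid]
    k = int(np.argmin([f[2] for f in fits]))
    c, a, sumsq = fits[k]
    return c, a, float(b_grid[k]), sumsq
```

A finer grid, or a one-dimensional minimizer started at the grid minimum, could be used if b is needed to more than two decimal places.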
4. Empirical Entropy Profiles and Collapsing of Categories
Even though we compressed several megabytes into eight numbers, the empirical entropy profiles still keep information about the landscape patterns. In Figure 1 we can see the
plots corresponding to three different watersheds in the Appalachian region: the first is mostly
forested, the second shows some degree of deforestation, and the last one is the most deforested
of the three and presents a fragmentation process. In these three plots it is evident that as the
process of deforestation starts in a landscape, the entropy profiles begin to increase. There is,
however, a certain point where the trend reverses: if deforestation keeps increasing, the patches
corresponding to land devoted to agriculture and construction tend to lump together, producing
a decrease in entropy.
We are interested in changes that occur when categories are collapsed. The eight original
categories were first re-grouped into four classes: water, forest, veg-mix and human-intensive,
and then into two final categories: forest and non-forest.
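Collapsing categories amounts to relabeling every cell through a lookup table. The sketch below shows the mechanics only. The integer coding of the eight categories and, in particular, the assignment of vegplex and the herbaceous and unvegetated categories to veg-mix or human-intensive are assumptions made for illustration; the text does not specify the grouping.

```python
import numpy as np

# Hypothetical coding 0..7, in the order the categories are listed in Section 2:
# water, conifer forest, mixed forest, broadleaf forest, vegplex,
# perennial herbaceous, annual herbaceous, terrestrial unvegetated.

# Assumed grouping into four classes:
# 0 = water, 1 = forest, 2 = veg-mix, 3 = human-intensive.
EIGHT_TO_FOUR = np.array([0, 1, 1, 1, 2, 2, 3, 3])

# Four classes into two: 0 = forest, 1 = non-forest.
FOUR_TO_TWO = np.array([1, 0, 1, 1])

def collapse(grid, lookup):
    """Relabel every cell of an integer-coded raster through a lookup table."""
    return lookup[grid]
```

For example, `collapse(collapse(grid, EIGHT_TO_FOUR), FOUR_TO_TWO)` turns an eight-category map into a forest/non-forest map, after which the entropy profile can be recomputed with a smaller frequency table (4^4 = 256 or 2^4 = 16 possible 4-tuples).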
For illustration purposes, we selected 9 watersheds from the 102 in the following way:
Table 1. Selected Watersheds

                        Physiographic Province
Watershed Status        Appalachian    Ridge    Piedmont
Mostly forested         ws025          ws098    ws148
Transitional            ws120          ws073    ws138
Highly deforested       ws105          ws125    ws126
Figure 3 shows the empirical entropy profiles of three of the nine selected watersheds corresponding to eight, four and two categories. After the first collapsing there is a pronounced drop
in the entropies as a result of the loss of small patches. However, after the second collapsing
the drop is less important.
Figure 3. Entropy Profiles after Collapsing of Categories (panels: ws148, ws138 and ws126; each plots Entropy against iter)
5. Conclusion
The use of empirical entropy profiles is a promising method for data exploration and for
reduction of categorical raster maps. In addition, the spatial information that is contained in
the profiles still enables classification of landscapes.
REFERENCES
Hand, D. J. (1998). Data mining: statistics and more? The American Statistician 52, 112-118.
Johnson, G. D. and Patil, G. P. (1998). Quantitative multiresolution characterization of landscape patterns for assessing the status of ecosystem health in watershed management areas. Ecosystem Health 4(3), 177-187.
Myers, W. L., Patil, G. P., and Taillie, C. (1998). PHASE formulation of synoptic multivariate landscape data. Technical report. CSEES. Pennsylvania State University.
Taillie, C. (1998). Notes on the use of wsclean.c, wsfreq.c, wsgroup.c and EEP programs. Personal communication.
Weiss, S. M. and Indurkhya, N. (1998). Predictive Data Mining: A Practical Guide. Morgan Kaufmann Publishers, Inc. San Francisco, California.
SUMMARY
The original data consist of satellite images of one hundred and two watersheds in
Pennsylvania. These images are transformed into land-use maps with eight
categories. A random filter is applied successively to the images and their empirical entropy profiles are calculated. The same process is repeated using maps with four and
two categories, after a natural grouping of the original categories. Comparison of the
resulting profiles shows differences that may be linked to fragmentation processes in
the watersheds.