Download Biome Classification in the United States

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Four-vector wikipedia , lookup

Covariance and contravariance of vectors wikipedia , lookup

Principal component analysis wikipedia , lookup

Transcript
1
Biome Classification in the United States
Jeffrey Blodgett, Nathan Madsen, ECEn 670
Abstract—This project will use data from two sensors,
ASCAT and AMSR-E, treated as a Gaussian random
vector, to classify biomes in the United States. This classification system will then be analyzed for accuracy and
probability of error.
I. I NTRODUCTION
The surface of the earth may be classified according to
types of terrain, vegetation, and weather that are common
in a specific area. Areas of the earth in which these characteristics are similar in nature are known as biomes (see
Fig. 1). Biomes may be recognized and classified in a
variety of ways. One method that is convenient for large
scales and frequent updating is the use of remote-sensing
instruments. Such instruments measure properties of the
earth’s surface such as electromagnetic radiation and
radar backscatter. These measurements may be treated
as random variables with unique distributions according
to the type of biome at a specific location.
The purpose of this project is to use several types
of data from two remote sensing instruments to create
and analyze a classification system for biomes within the
United States. The data types used as inputs to the system
are treated as a random vector. We use a similar process
to that used by Remund and Long to classify sea ice
[1]. The instruments used are the Advanced Scatterometer (ASCAT) and the Advanced Microwave Scanning
Radiometer - Earth Observing System (AMSR-E). From
ASCAT we will use normalized σ0 or backscatter images
and the incidence-angle dependence images. These are
divided into ascending and descending passes. The data
types used from AMSR-E are TB (brightness temperature) in both H-pol and V-pol. These are also divided
into ascending and descending passes. Ascending and
descending passes for both these sensors have different
local times of day. For all of these data types, summer
and winter will be considered separately, giving sixteen
variables from which to characterize the data. All data
used is from the year 2010.
The gathered data will initially be characterized according to the measured statistics of certain training sets
for four major biomes within the United States. These
biomes are (1) Temperate Forests, (2) Great Plains, (3)
Deserts, and (4) Northwest Forested Mountains. Training
set areas will be determined according to the map shown
in Fig. 1. As seen in the figure, there are more biomes
than these, but we are only using the four largest.
Statistics will be gathered for each biome and used
to characterize them as random vectors with specific
probability distributions. After this, an iterative process
will be used over data from the entire U.S. to find
more accurate distributions. Simulations will also be
conducted with actual random inputs to estimate the
potential accuracy of this characterization process.
Fig. 1. Major biomes in the United States [2]. Green: Temperate
Forests. Orange: Great Plains. Yellow: Deserts. Dark Green: Northwest
Forested Mountains.
II. T HEORY
A. Signal Model
The previously described data types from ASCAT
and AMSR-E describe a 16-dimensional vector space.
Within this vector space, each biome is treated as a
random vector with a unique probability distribution. We
characterize this random vector by a truth value for the
type of biome and several noise terms as given by
Xbiome = xtrue + nterrain + nweather + nsensor , (1)
where Xbiome is the random vector, xtrue is the true
vector value of the biome, nterrain is the natural
variation of the terrain within the biome, nweather is
variation caused by local weather-based phenomena, and
nsensor is sensor noise. Assuming the sum of these noise
terms are gaussian, Xbiome has a multivariate gaussian
probability distribution function (pdf) given by
T
−1
1
fXbiome (x) = p
e−(x−µ) Λ (x−µ)/2 ,
k
(2π) det(Λ)
(2)
where Λ is the covariance matrix, µ is the mean vector,
and k is the number of dimensions.
III. E XPERIMENT
A. Simulation
If different biomes have significantly different distributions, the proposed approach should be effective
for biome classification. In order to determine how
significant the differences are between distinct biomes,
we will simulate the measurement and classification of
different biome types.
In accordance with the model, the distribution for each
different biome can be represented by (2). Λ and µ are
estimated for each biome by selecting training areas and
calculating the covariance and mean for pixels in that
area. The first two dimensions of the training vectors
used to determine these distributions are shown in Fig. 3.
Each biome does appear to have a unique distribution.
B. Maximum Likelihood
After determining the specific probability distributions
of each biome, a maximum likelihood (ML) estimator is
used to determine the biome for a given measured vector
xmeas . The equation for the estimator is given by
ŶM L (x) = argmaxfXbiome |Xmeas (x|x = xmeas ), (3)
in which ŶM L (x) is the estimated biome for a vector
x, xmeas is used as the input to the pdf for each biome,
and the pdf with maximum probability is chosen as the
biome for that measurement.
C. Principal Component Analysis
Before using this estimator, the complexity of the
problem may be reduced by using principal component
analysis (PCA). Using PCA, we determine the directions
of maximum variation within the entire vector space. A
covariance matrix is found for the entire set of data.
The eigenvectors of this matrix represent the orthogonal
directions of maximum variation and their corresponding
eigenvalues represent the variance in each of these
directions. Using these, the data is projected onto this
new vector space, in which the higher dimensions can be
ignored because they only represent a very small amount
of the variance. For our data, the first eight dimensions
respresent over 95% of the variation within the data,
so these are the dimensions used in characterizing the
biomes (see Fig. 2).
Fig. 3. Training Set Distributions
These distributions for each biome type are used to
generate actual random vectors. These vectors are then
classified according to the ML estimator described in
the previous section, using the same distributions. The
results of the simulation are shown in Table I. These
results show very low error rates, and we anticipate
that the biome distributions are distinct enough to be
classified using our data.
B. Iterated Classification
The previous simulation shows that the training areas used to generate the distributions are recognizable
using our method. However, these training areas do
not represent the full diversity present within a given
biome, nor do we have a reliable dataset to determine
the full range of each biome. We now use iterative
classification in order to get a more realistic description
of the distribution across the entire range of the biome.
For iterative classification, we first use the parameters
(covariance matrix and mean vectors) from the training
Fig. 2. Eigenvalues of the principal components
2
TABLE I
F OR EACH BIOME 100,000 VECTORS WERE GENERATED
ACCORDING TO THE PDF DETERMINED BY THE TRAINING SET.
E ACH VECTOR IS THEN CLASSIFIED ACCORDING TO THE SAME PDF.
VALUES ARE PROPORTIONS OF SIMULATED VECTORS FOR A BIOME
CLASSIFIED INTO EACH OTHER BIOME .
Classified
Simulated
Tem. For.
Gr. Plains
Desert
NW Moun.
Temp. For.
0.9991
0.0000
0.0000
0.0008
Gr. Plains
0.0001
0.9999
0.0000
0.0000
Desert
0.0000
0.0000
0.9952
0.0048
NW Mount.
0.0022
0.0000
0.0044
0.9934
sets to classify each vector across the United States.
We then recalculate the parameters for the classified
areas. These parameters are then used to reclassify the
vectors. Repeating this process until it stabilizes gives
a more accurate representation of the distribution for
each biome type. Images of pre-iterative classification
and classification after eight iterations can be seen in
Fig. 4.
To investigate the behavior using real data, we create
the same dataset for 2011. It is assumed that most pixels
will have similar characterics to those from 2010, with
the same true biome mean, but different sensor and
weather noise. The 2011 data is normalized to have the
same overall mean as the 2010 data in order to account
for any variation that is correlated across the entire
United States. The 2011 data is then classified according
to the distributions from the 2010 iterated classification.
This classified 2011 data can be seen in Fig. 4.
IV. R ESULTS
Using the iterated classification process described
above, more realistic probability distributions are generated for each biome. Running the simulation from
section III-A again for these new distributions gives error
rates as seen in Table II. With the full range of each
biome it is clear that there is more overlap. This is to
be expected because the true distributions are likely to
have greater variance than those generated using only a
small training sample. The error rates are still quite low,
however, indicating that if these are the true distributions,
this is a reliable classification system.
However, when actual data from the next year is used,
rather than simulated data, the error rates are found to
be much higher (see Table III). There are many possible
reasons for this, mainly the fact that our probability
distributions were determined using data from only 2011,
so any year-to-year variation is not taken into account in
Fig. 4. Classification of different biome types using a ML classifier.
Top: classification using distribution of pixels in training areas (outlined in black). Middle: classification after eight iterations. Bottom:
classification of 2011 data using distribution from iterated classification
of 2010 data.
the pdf. It is encouraging, however, that the majority
of the error is between temperate forests and northwest
forests, and between deserts and great plains. These
biomes are intuitively similar, and it is reasonable that
differences between them would be easily affected by
yearly variation.
V. A NALYSIS
According to the simulation results shown in Tables I
and II, our classification system has the potential to be
3
As discussed in section IV, we also only used data
from one year, which leaves out the impact of year-toyear variation on the probability distributions. In future
studies, multiple years could be taken into account either
by averaging years together or by using them as separate
variables, increasing the number of dimensions. This
also, however, reduces the temporal resolution of our
classification method.
We also assumed throughout that the distribution of
the terrains was Gaussian. While the noise is likely to
be Gaussian, the other noise factors very well might
not be. The different biomes may have a more gradual
transition from one to the other that is not well modeled
with Gaussian clusters.
Another possible area of improvement would be the
use of a prior distribution. The prior would have to
be determined for each pixel. This could be done by
mapping a trusted classification source to our grid, then
determined a prior distribution by the distribution of
different biomes in the area. This would give a more
accurate classification, but would also rely heavily on
previously known data.
TABLE II
F OR EACH BIOME 100,000 VECTORS WERE GENERATED
ACCORDING TO THE PDF DETERMINED BY THE ITERATED
CLASSIFICATION OF BIOME TYPES . E ACH VECTOR IS THEN
CLASSIFIED ACCORDING TO THE SAME PDF. VALUES ARE
PROPORTIONS OF SIMULATED VECTORS FOR A BIOME CLASSIFIED
INTO EACH OTHER BIOME .
Classified
Simulated
Tem. For.
Gr. Plains
Desert
NW Moun.
Temp. For.
0.9845
0.0028
0.0055
0.0072
Gr. Plains
0.0087
0.9390
0.0359
0.0164
Desert
0.0115
0.0298
0.9212
0.0376
NW Mount.
0.0225
0.0129
0.0343
0.9303
TABLE III
DATA POINTS FOR THE U NITED S TATES ARE TAKEN FOR 2011.
E ACH VECTOR IS THEN CLASSIFIED ACCORDING TO THE PDF
DETERMINED THROUGH ITERATED CLASSIFICATION USING 2010.
VALUES ARE PROPORTIONS OF 2010 VECTORS FOR A BIOME
CLASSIFIED INTO EACH OTHER BIOME .
2011
2010
Tem. For.
Gr. Plains
Desert
NW Moun.
Temp. For.
0.4193
0.1159
0.0973
0.3675
Gr. Plains
0.0082
0.8665
0.0622
0.0631
VI. C ONCLUSIONS
Desert
0.0124
0.3043
0.5843
0.0990
NW Mount.
0.0158
0.0515
0.0843
0.8484
This project shows the potential as well as the limitations of the Gaussian model. It provides a simple way of
accounting for differences in means and linear correlations within vectors. Unfortunately, it does not perfectly
describe all distributions. There are many methods for
improving the classification of biomes. However, the
classification we had was based on real data, and does
draw distinct lines across the United States. What causes
these distinctions, if not traditional biomes, is a matter
for further investigation.
quite as long as our probability distributions are correct.
Our methods of determining these distributions, however,
can be improved in future studies.
Comparing between Fig. 1 and Fig. 4, it is apparent
there is room for improvement. From a simple
comparison it is immediately obvious that there
are more biome types that we did not attempt to
classify. These biome types were left out due to their
limited extent. However, if these types of biomes are
truly distinct to our sensors from other types, they
will skew the distribution of whichever biome they
are classified as in the iterative classification process.
R EFERENCES
[1] Q.P. Remund, and D.G. Long, “An Iterative Approach to Multisensor Sea Ice Classification,” IEEE Transactions on Geoscience
and Remote Sensing, Vol. 38, No. 4, pp. 1843-1856, July 2000.
[2] Morning Earth. (2013, Nov.). Biosphere as Place.
[Online]. Available: http://www.morning-earth.org/Graphic-E/
BIOSPHERE/Bios-PL-Intro.htm
4