Download Biome Classification in the United States

1 Biome Classification in the United States Jeffrey Blodgett, Nathan Madsen, ECEn 670 Abstract—This project will use data from two sensors, ASCAT and AMSR-E, treated as a Gaussian random vector, to classify biomes in the United States. This classification system will then be analyzed for accuracy and probability of error. I. I NTRODUCTION The surface of the earth may be classified according to types of terrain, vegetation, and weather that are common in a specific area. Areas of the earth in which these characteristics are similar in nature are known as biomes (see Fig. 1). Biomes may be recognized and classified in a variety of ways. One method that is convenient for large scales and frequent updating is the use of remote-sensing instruments. Such instruments measure properties of the earth’s surface such as electromagnetic radiation and radar backscatter. These measurements may be treated as random variables with unique distributions according to the type of biome at a specific location. The purpose of this project is to use several types of data from two remote sensing instruments to create and analyze a classification system for biomes within the United States. The data types used as inputs to the system are treated as a random vector. We use a similar process to that used by Remund and Long to classify sea ice [1]. The instruments used are the Advanced Scatterometer (ASCAT) and the Advanced Microwave Scanning Radiometer - Earth Observing System (AMSR-E). From ASCAT we will use normalized σ0 or backscatter images and the incidence-angle dependence images. These are divided into ascending and descending passes. The data types used from AMSR-E are TB (brightness temperature) in both H-pol and V-pol. These are also divided into ascending and descending passes. Ascending and descending passes for both these sensors have different local times of day. For all of these data types, summer and winter will be considered separately, giving sixteen variables from which to characterize the data. All data used is from the year 2010. The gathered data will initially be characterized according to the measured statistics of certain training sets for four major biomes within the United States. These biomes are (1) Temperate Forests, (2) Great Plains, (3) Deserts, and (4) Northwest Forested Mountains. Training set areas will be determined according to the map shown in Fig. 1. As seen in the figure, there are more biomes than these, but we are only using the four largest. Statistics will be gathered for each biome and used to characterize them as random vectors with specific probability distributions. After this, an iterative process will be used over data from the entire U.S. to find more accurate distributions. Simulations will also be conducted with actual random inputs to estimate the potential accuracy of this characterization process. Fig. 1. Major biomes in the United States [2]. Green: Temperate Forests. Orange: Great Plains. Yellow: Deserts. Dark Green: Northwest Forested Mountains. II. T HEORY A. Signal Model The previously described data types from ASCAT and AMSR-E describe a 16-dimensional vector space. Within this vector space, each biome is treated as a random vector with a unique probability distribution. We characterize this random vector by a truth value for the type of biome and several noise terms as given by Xbiome = xtrue + nterrain + nweather + nsensor , (1) where Xbiome is the random vector, xtrue is the true vector value of the biome, nterrain is the natural variation of the terrain within the biome, nweather is variation caused by local weather-based phenomena, and nsensor is sensor noise. Assuming the sum of these noise terms are gaussian, Xbiome has a multivariate gaussian probability distribution function (pdf) given by T −1 1 fXbiome (x) = p e−(x−µ) Λ (x−µ)/2 , k (2π) det(Λ) (2) where Λ is the covariance matrix, µ is the mean vector, and k is the number of dimensions. III. E XPERIMENT A. Simulation If different biomes have significantly different distributions, the proposed approach should be effective for biome classification. In order to determine how significant the differences are between distinct biomes, we will simulate the measurement and classification of different biome types. In accordance with the model, the distribution for each different biome can be represented by (2). Λ and µ are estimated for each biome by selecting training areas and calculating the covariance and mean for pixels in that area. The first two dimensions of the training vectors used to determine these distributions are shown in Fig. 3. Each biome does appear to have a unique distribution. B. Maximum Likelihood After determining the specific probability distributions of each biome, a maximum likelihood (ML) estimator is used to determine the biome for a given measured vector xmeas . The equation for the estimator is given by ŶM L (x) = argmaxfXbiome |Xmeas (x|x = xmeas ), (3) in which ŶM L (x) is the estimated biome for a vector x, xmeas is used as the input to the pdf for each biome, and the pdf with maximum probability is chosen as the biome for that measurement. C. Principal Component Analysis Before using this estimator, the complexity of the problem may be reduced by using principal component analysis (PCA). Using PCA, we determine the directions of maximum variation within the entire vector space. A covariance matrix is found for the entire set of data. The eigenvectors of this matrix represent the orthogonal directions of maximum variation and their corresponding eigenvalues represent the variance in each of these directions. Using these, the data is projected onto this new vector space, in which the higher dimensions can be ignored because they only represent a very small amount of the variance. For our data, the first eight dimensions respresent over 95% of the variation within the data, so these are the dimensions used in characterizing the biomes (see Fig. 2). Fig. 3. Training Set Distributions These distributions for each biome type are used to generate actual random vectors. These vectors are then classified according to the ML estimator described in the previous section, using the same distributions. The results of the simulation are shown in Table I. These results show very low error rates, and we anticipate that the biome distributions are distinct enough to be classified using our data. B. Iterated Classification The previous simulation shows that the training areas used to generate the distributions are recognizable using our method. However, these training areas do not represent the full diversity present within a given biome, nor do we have a reliable dataset to determine the full range of each biome. We now use iterative classification in order to get a more realistic description of the distribution across the entire range of the biome. For iterative classification, we first use the parameters (covariance matrix and mean vectors) from the training Fig. 2. Eigenvalues of the principal components 2 TABLE I F OR EACH BIOME 100,000 VECTORS WERE GENERATED ACCORDING TO THE PDF DETERMINED BY THE TRAINING SET. E ACH VECTOR IS THEN CLASSIFIED ACCORDING TO THE SAME PDF. VALUES ARE PROPORTIONS OF SIMULATED VECTORS FOR A BIOME CLASSIFIED INTO EACH OTHER BIOME . Classified Simulated Tem. For. Gr. Plains Desert NW Moun. Temp. For. 0.9991 0.0000 0.0000 0.0008 Gr. Plains 0.0001 0.9999 0.0000 0.0000 Desert 0.0000 0.0000 0.9952 0.0048 NW Mount. 0.0022 0.0000 0.0044 0.9934 sets to classify each vector across the United States. We then recalculate the parameters for the classified areas. These parameters are then used to reclassify the vectors. Repeating this process until it stabilizes gives a more accurate representation of the distribution for each biome type. Images of pre-iterative classification and classification after eight iterations can be seen in Fig. 4. To investigate the behavior using real data, we create the same dataset for 2011. It is assumed that most pixels will have similar characterics to those from 2010, with the same true biome mean, but different sensor and weather noise. The 2011 data is normalized to have the same overall mean as the 2010 data in order to account for any variation that is correlated across the entire United States. The 2011 data is then classified according to the distributions from the 2010 iterated classification. This classified 2011 data can be seen in Fig. 4. IV. R ESULTS Using the iterated classification process described above, more realistic probability distributions are generated for each biome. Running the simulation from section III-A again for these new distributions gives error rates as seen in Table II. With the full range of each biome it is clear that there is more overlap. This is to be expected because the true distributions are likely to have greater variance than those generated using only a small training sample. The error rates are still quite low, however, indicating that if these are the true distributions, this is a reliable classification system. However, when actual data from the next year is used, rather than simulated data, the error rates are found to be much higher (see Table III). There are many possible reasons for this, mainly the fact that our probability distributions were determined using data from only 2011, so any year-to-year variation is not taken into account in Fig. 4. Classification of different biome types using a ML classifier. Top: classification using distribution of pixels in training areas (outlined in black). Middle: classification after eight iterations. Bottom: classification of 2011 data using distribution from iterated classification of 2010 data. the pdf. It is encouraging, however, that the majority of the error is between temperate forests and northwest forests, and between deserts and great plains. These biomes are intuitively similar, and it is reasonable that differences between them would be easily affected by yearly variation. V. A NALYSIS According to the simulation results shown in Tables I and II, our classification system has the potential to be 3 As discussed in section IV, we also only used data from one year, which leaves out the impact of year-toyear variation on the probability distributions. In future studies, multiple years could be taken into account either by averaging years together or by using them as separate variables, increasing the number of dimensions. This also, however, reduces the temporal resolution of our classification method. We also assumed throughout that the distribution of the terrains was Gaussian. While the noise is likely to be Gaussian, the other noise factors very well might not be. The different biomes may have a more gradual transition from one to the other that is not well modeled with Gaussian clusters. Another possible area of improvement would be the use of a prior distribution. The prior would have to be determined for each pixel. This could be done by mapping a trusted classification source to our grid, then determined a prior distribution by the distribution of different biomes in the area. This would give a more accurate classification, but would also rely heavily on previously known data. TABLE II F OR EACH BIOME 100,000 VECTORS WERE GENERATED ACCORDING TO THE PDF DETERMINED BY THE ITERATED CLASSIFICATION OF BIOME TYPES . E ACH VECTOR IS THEN CLASSIFIED ACCORDING TO THE SAME PDF. VALUES ARE PROPORTIONS OF SIMULATED VECTORS FOR A BIOME CLASSIFIED INTO EACH OTHER BIOME . Classified Simulated Tem. For. Gr. Plains Desert NW Moun. Temp. For. 0.9845 0.0028 0.0055 0.0072 Gr. Plains 0.0087 0.9390 0.0359 0.0164 Desert 0.0115 0.0298 0.9212 0.0376 NW Mount. 0.0225 0.0129 0.0343 0.9303 TABLE III DATA POINTS FOR THE U NITED S TATES ARE TAKEN FOR 2011. E ACH VECTOR IS THEN CLASSIFIED ACCORDING TO THE PDF DETERMINED THROUGH ITERATED CLASSIFICATION USING 2010. VALUES ARE PROPORTIONS OF 2010 VECTORS FOR A BIOME CLASSIFIED INTO EACH OTHER BIOME . 2011 2010 Tem. For. Gr. Plains Desert NW Moun. Temp. For. 0.4193 0.1159 0.0973 0.3675 Gr. Plains 0.0082 0.8665 0.0622 0.0631 VI. C ONCLUSIONS Desert 0.0124 0.3043 0.5843 0.0990 NW Mount. 0.0158 0.0515 0.0843 0.8484 This project shows the potential as well as the limitations of the Gaussian model. It provides a simple way of accounting for differences in means and linear correlations within vectors. Unfortunately, it does not perfectly describe all distributions. There are many methods for improving the classification of biomes. However, the classification we had was based on real data, and does draw distinct lines across the United States. What causes these distinctions, if not traditional biomes, is a matter for further investigation. quite as long as our probability distributions are correct. Our methods of determining these distributions, however, can be improved in future studies. Comparing between Fig. 1 and Fig. 4, it is apparent there is room for improvement. From a simple comparison it is immediately obvious that there are more biome types that we did not attempt to classify. These biome types were left out due to their limited extent. However, if these types of biomes are truly distinct to our sensors from other types, they will skew the distribution of whichever biome they are classified as in the iterative classification process. R EFERENCES [1] Q.P. Remund, and D.G. Long, “An Iterative Approach to Multisensor Sea Ice Classification,” IEEE Transactions on Geoscience and Remote Sensing, Vol. 38, No. 4, pp. 1843-1856, July 2000. [2] Morning Earth. (2013, Nov.). Biosphere as Place. [Online]. Available: http://www.morning-earth.org/Graphic-E/ BIOSPHERE/Bios-PL-Intro.htm 4

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Biome Classification in the United States