Hawaii International Conference on Statistics and Related Fields
Submission Title Page

Title: Spatial Analysis of a Small Area Problem
Topic Area(s): Applied Statistics, Population Statistics
Keywords: Kriging, Small Area Estimation
Author: Brady West (student, University of Michigan at Ann Arbor)
Mailing Address: 2222 Fuller Ct., Apartment 708A, Ann Arbor, MI, 48105
E-Mail Address: [email protected]
Phone Numbers: 734.998.0498 (home), 734.223.9793 (cell)
Fax Number: 734.647.2440 (work)
Advisor: Edward Rothman, University of Michigan Department of Statistics

Spatial Analysis of a Small Area Problem
Brady T. West
University of Michigan Department of Statistics
April, 2001

Table of Contents
I. Introduction
II. Current Model-Based Approaches to Small Area Estimation
III. The Kriging Framework
     Calculation of the Sample and Theoretical Variograms
IV. Analysis: An Application of Kriging to Small Area Estimation
     Overview
     Modeling the Census Block Group Population in Washtenaw County
     Using Kriging for Small Area Estimation
V. Discussion
VI. Works Cited
VII. Appendix A: Census Block Group Population Model Details
VIII. Appendix B: Census Block Group Population Model Diagnostics
IX. Appendix C: Kriging Details
X. Appendix D: Kriging Plots
XI. Appendix E: Additional SAS Code

Introduction

The United States Census Bureau gathers population data for bounded geographic regions within every U.S. county, known as census tracts, and statisticians and population analysts have developed several methods to estimate the populations of smaller areas within these tracts.
In recent years, small area estimation has been an active research topic, and demand is growing for reliable estimates of small area populations of all types. Reliable estimates of the populations of small areas (e.g., city blocks) within the established census tracts are important for several reasons. They are used for, among other things, determining state funding allocations, drawing exact boundaries for school and voting districts, administrative planning, marketing guidance, and as data for detailed descriptive and analytical studies of cities (Bryan: 1999). Current small area estimation techniques attempt to derive an estimate of some attribute of a small area (e.g., population) by using available auxiliary data for the area; practitioners of these techniques now generally accept that indirect estimators should be used, based on explicit models relating small areas through supplementary data such as administrative records and recent Census data (Rao: 1999). Rao proposes that small area models of this nature can be broadly classified into two types, each relating available auxiliary data for small areas to parameters of interest for the small areas (e.g., total population). Such model-based approaches are beneficial in that they increase the precision of small area estimates. Surveys attempting to derive direct small area estimates often face extremely small sample sizes in particular regions, which can produce imprecise estimates. However, these model-based methods rely on the availability of both auxiliary data and data for the parameter of interest (Census tract data, for instance), and they ignore the spatial relationships between the small areas requiring estimates and the areas where data of interest are actually available.
A need exists for a new small area estimation technique that takes into account the spatial relationships between small areas, where the data of interest are unavailable for the construction of models, and the larger areas where the desired data are present. The geo-statistical spatial analysis technique known as kriging is frequently used by geologists to interpolate attributes at points not physically sampled over a particular region, based on knowledge about the spatial relationships of the sampled points. Tobler’s Law of Geography [1] provides the rationale behind such spatial interpolation: points close together in space are more likely to have similar attributes than points far apart. The small area analogy suggests that adjacent small areas will have characteristics very similar both to each other and to the larger regions that they are near or part of. Using this analogy, we investigate an application of kriging to the small area estimation problem described above, and determine the precision of small area estimates that take into account the spatial relationships between small areas and the larger areas where data are available for the attributes desired in the small areas.

[1] Taken from “http://reserves.library.okstate.edu/geog5343/lec12/sld001.htm”

Current Model-Based Approaches to Small Area Estimation [2]

The model-based approach to small area estimation permits validation of the models from sample data. Rao (1999) classifies small area models into two types. The first type is:

(1)  $\theta_i = x_i^T \beta + v_i$

In this model, area-specific auxiliary data $x_i$ (administrative records, census data) are available for the areas $i = 1,\dots,m$. The population small area total $Y_i$, or some function $\theta_i = g(Y_i)$, is assumed to be related to $x_i$ through the linear model (1). The $v_i$ are assumed to be normally distributed, random, uncorrelated small area effects, with mean zero and variance $\sigma_v^2$. $\beta$ represents the vector of regression parameters.
The second type of model is as follows:

(2)  $y_{ij} = x_{ij}^T \beta + v_i + e_{ij}$

This model is appropriate for continuous variables y. Here, unit-specific auxiliary data $x_{ij}$ are available for areas $i = 1,\dots,m$, where $j = 1,\dots,N_i$ and $N_i$ represents the number of population units in the i-th area. The unit y-values, $y_{ij}$, are assumed to be related to the auxiliary values $x_{ij}$ through the nested error regression model (2). The $v_i$ are normal, independent, and identically distributed, with mean zero and variance $\sigma_v^2$. The $e_{ij}$ are independent of the $v_i$, normal, independent of each other, and identically distributed, with mean zero and variance $\sigma_e^2$. $\beta$ again represents the vector of regression parameters.

Rao further asserts that in the case of type (1) models, direct survey estimators $\hat{Y}_i$ are available whenever the sample sizes $n_i \ge 1$. It can then be assumed that

(3)  $\hat{\theta}_i = \theta_i + e_i$

where $\hat{\theta}_i = g(\hat{Y}_i)$ and the sampling errors $e_i$ are independent and normally distributed with mean zero and known variance $\psi_i$. Combining model (3) with model (1) gives

(4)  $\hat{\theta}_i = x_i^T \beta + v_i + e_i$

Models of this form, with $\theta_i = \log(Y_i)$, have recently been used to produce model-based county estimates of poor school-age children in the United States (Rao: 1999). According to Rao, “The success of small area estimation…largely depends on getting good auxiliary information $\{x_i\}$ that leads to a small model variance $\sigma_v^2$ relative to $\psi_i$.”

[2] From: Rao, J.N.K. 1999. Current Trends in Sample Survey Theory and Methods. Indian Journal of Statistics 61: 16-22.

The Kriging Framework

The geo-statistical technique known as kriging was developed in the late 1950s, building on the pioneering work of a South African mining engineer named Danie Krige, who in 1951 proposed innovative new concepts for mining estimation (Guyaguler: 2000; Goovaerts: 1997).
Based on Krige’s work, Georges Matheron developed the Theory of Regionalized Variables, combining Krige’s concepts into a single framework which he named “kriging.” Formally, kriging now refers to a family of least-squares linear regression algorithms that attempt to predict values of a regionalized variable (a phenomenon that is spread out in space) at locations where data for the variable are not available, based on the spatial pattern of the available data. In general, kriging provides an optimal means of estimating the values of a continuous attribute z at any points u not physically sampled over a specific area A, using knowledge about the underlying spatial relationships in a data set. Statistical models known as variograms (see “Calculation of the Sample and Theoretical Variograms”) provide knowledge about these underlying spatial relationships, and express spatial variation in the available data. The estimated values are weighted linear combinations of the available data, where data points closer to the locations requiring estimation are given more weight in producing the estimates, and the resulting estimators are unbiased (Lang: 1997). What differentiates kriging from other linear estimation methods is its aim to minimize the variance of the estimation errors, which have mean zero because the estimators are unbiased. In most cases, kriging is applied to produce a fine grid of accurate estimates of some variable (e.g., standing water level or richness of soil) over a physical site of interest.

Goovaerts (1997) outlines the basic methodology behind the kriging procedure, and differentiates between the different types of kriging. Given n data at sampled locations $u_i$ of some study area A for some attribute $z(u_i)$, $i = 1,\dots,n$, we wish to produce values for the estimator $z^*(u)$.
This estimator is defined as follows:

(5)  $z^*(u) - m(u) = \sum_{i=1}^{n} \lambda_i(u)\,[z(u_i) - m(u_i)]$

Here, $\lambda_i(u)$ is the weight assigned to the datum $z(u_i)$ using a theoretical variogram [see (8)], $m(u)$ is the expected value of $z(u)$, and $m(u_i)$ is the expected value of $z(u_i)$. The vector of weights $\lambda(u)$ is determined so as to minimize the variance of the estimation errors, denoted as

(6)  $\sigma_e^2 = \mathrm{Var}\{z^*(u) - z(u)\}$

The variance of the estimation errors $\sigma_e^2$ is minimized under the constraint that $E\{z^*(u) - z(u)\} = 0$, making the estimator unbiased. The estimation errors can be computed using cross-validation, where estimates are derived for actual data points by removing sample values from the data set and then using the kriging procedure to re-estimate them.

There are three main kriging variants, distinguished according to the model considered for $m(u)$, the expected value of $z(u)$. Ordinary kriging, the most widely used kriging method (Guyaguler: 2000), accounts for local fluctuations of the mean of the attribute z by limiting the area where the mean of z is fixed to some local neighborhood $W(u)$ (Goovaerts: 1997). This variant of kriging considers the mean $m(u)$ of z to be constant but unknown for all locations in $W(u)$. Ordinary kriging estimates are derived in the following manner:

(7)  $z^*(u) = \sum_{i=1}^{n} \lambda_i(u)\, z(u_i)$

The weights $\lambda_i(u)$ assigned to each datum $z(u_i)$ are constrained such that $\sum_{i=1}^{n} \lambda_i(u) = 1$, to ensure the unbiasedness of the estimates (Guyaguler: 2000). The weights are again determined using the theoretical variogram so as to minimize the error variance described in (6). The vector of weights is derived by solving the following system of equations (Guyaguler: 2000):

(8)  $\sum_{i=1}^{n} \lambda_i(u)\,\gamma(u_j - u_i) + \mu = \gamma(u_j - u_0), \quad j = 1,\dots,n$

Here, $\gamma(u_j - u_i)$ represents the value of the theoretical variogram at distance $(u_j - u_i)$, where $u_j$ and $u_i$ are the locations of actual data points, and $u_0$ is the location where an estimate is desired.
$\mu$ represents the “Lagrange parameter” (Lang: 1997). Ordinary kriging is an exact interpolator: the estimated value of the desired attribute at a sampled point is equal to the observed data value at that point. This variant of kriging only uses a local neighborhood of data in deriving the estimates, where only the closest data are assigned weights and the expected value $m(u)$ of the desired attribute is considered to be constant, and this fact can be used to show that the estimates are unbiased:

$E\{z^*(u) - z(u)\} = E\{\sum_{i=1}^{n} \lambda_i(u)\, z(u_i)\} - E\{z(u)\} = m(u) \sum_{i=1}^{n} \lambda_i(u) - m(u) = 0$

Ordinary kriging is usually preferred to other types of kriging, because it requires neither knowledge nor stationarity of the mean $m(u)$ over the entire area of interest A (Goovaerts: 1997).

The other two chief variants of kriging are simple kriging, which considers the mean $m(u)$ of the attribute z to be known and constant throughout the area of interest A, and kriging with a trend, which assumes that the unknown local mean $m(u)$ varies within each local neighborhood $W(u)$, throughout A. Other “members” of the kriging family include block kriging, which involves the estimation of average z-values over some segment, surface, or volume of any size or shape, and cokriging, which allows one to better estimate values of z if the spatial distribution of some secondary variate sampled more intensely than z is known. Cokriging incorporates the additional information provided by other covariates into the above kriging methods, and can greatly improve interpolation estimates if it is difficult or expensive to sample the primary variate z. The values of the additional covariates are measured at locations where z is measured, and also at several other locations, and this additional information greatly improves the accuracy of the estimates of z.
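As an illustration of how the ordinary kriging system in (7) and (8) might be solved, the following Python sketch builds the system for a single target location and solves it by Gaussian elimination. It is not the SAS code used in this study. The function names, the four sample points, and the exponential variogram with nugget 0, sill 1, and range parameter 2 are all hypothetical choices made for the example.

```python
import math

def exp_variogram(h, nugget=0.0, sill=1.0, a0=2.0):
    """Exponential variogram with assumed parameters; gamma(0) = 0 by convention."""
    if h == 0:
        return 0.0
    return nugget + sill * (1.0 - math.exp(-h / a0))

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ordinary_kriging(points, values, target, variogram=exp_variogram):
    """Return (estimate, weights, lagrange_mu) for one target location u_0."""
    n = len(points)

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # Rows j = 1..n of (8): sum_i lambda_i * gamma(u_j - u_i) + mu = gamma(u_j - u_0)
    a = [[variogram(dist(points[j], points[i])) for i in range(n)] + [1.0]
         for j in range(n)]
    # Final row enforces the unbiasedness constraint sum_i lambda_i = 1
    a.append([1.0] * n + [0.0])
    b = [variogram(dist(points[j], target)) for j in range(n)] + [1.0]
    sol = solve(a, b)
    weights, mu = sol[:n], sol[n]
    # Equation (7): the estimate is the weighted sum of the observed values
    estimate = sum(w * z for w, z in zip(weights, values))
    return estimate, weights, mu

# Hypothetical data: four sampled locations on a unit square
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
values = [10.0, 12.0, 14.0, 16.0]
est, w, mu = ordinary_kriging(points, values, (0.5, 0.5))
est_at_sample, _, _ = ordinary_kriging(points, values, points[0])
```

Because the four points are placed symmetrically around the target, each weight should come out equal and the weights sum to one, as the constraint requires. Kriging at a sampled location returns the observed value there, which is the exact-interpolation property noted above.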
Calculation of the Sample and Theoretical Variograms

As mentioned earlier, the sample variogram (or semivariogram, representing spatial covariance), a statistical model computed from the available sample data, is the key part of the kriging procedure, in that it expresses how the data vary spatially across the area of interest. A sample variogram takes as input a distance h between two sampled points, and outputs a variogram estimate $\gamma(h)$ that describes the variance in the attribute z over the distance between the two points. The sample variogram is derived in the following manner:

1. The range of distances between sampled points (from zero to the maximum distance) is divided into a set of discrete intervals. The width of each lag, or discrete interval h, must be large enough that there are enough point pairs for estimation in all of the intervals. A rule of thumb in computing sample variograms is to have at least 30 point pairs in each discrete interval. The reliability of a variogram hinges on having sufficient pairs of observations in each lag (Schabenberger: 1997). However, for plotting and estimation purposes, it is desirable to have as many points as possible for a plot of $\gamma(h)$ against h, so there is an important tradeoff between the number of lags and the number of point pairs within each lag.
2. For every pair of sampled points, the distance between the points is computed, along with the squared difference in the z-values.
3. Each pair of sampled points is assigned to one of the distance intervals, and the total variance in each interval is accumulated.
4. After every pair of points has been used, the average variance in each distance interval is computed. This value $\gamma(h)$ is then plotted at the midpoint distance of each distance interval h. This results in a plot that has only as many points as there are distance intervals, with a variogram estimate for each distance interval.
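The four steps above can be sketched in Python as follows. This is a minimal illustration rather than the SAS VARIOGRAM procedure used in this study. The function name, the lag settings, and the sample data are hypothetical, and the averaged squared differences are halved, following the usual semivariance convention (an assumption, since the steps above do not state the factor).

```python
import math
from itertools import combinations

def sample_variogram(points, values, lag_width, n_lags):
    """Return a list of (lag midpoint, semivariance, pair count), one per non-empty lag."""
    sums = [0.0] * n_lags    # accumulated squared differences per lag (step 3)
    counts = [0] * n_lags    # number of point pairs per lag
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        # Step 2: distance between the pair and squared difference in z
        h = math.hypot(p[0] - q[0], p[1] - q[1])
        k = int(h // lag_width)          # step 3: index of the distance interval
        if k >= n_lags:
            continue                     # pair lies beyond the last lag
        sums[k] += (values[i] - values[j]) ** 2
        counts[k] += 1
    result = []
    for k in range(n_lags):
        if counts[k] == 0:
            continue
        midpoint = (k + 0.5) * lag_width                 # step 4: plot at the midpoint
        semivariance = sums[k] / (2.0 * counts[k])       # assumed 1/2 convention
        result.append((midpoint, semivariance, counts[k]))
    return result

# Hypothetical data: four points on a line with attribute values z
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
values = [1.0, 2.0, 4.0, 7.0]
variogram_points = sample_variogram(points, values, lag_width=1.0, n_lags=4)
```

Returning the pair count alongside each estimate makes it easy to check the 30-pairs-per-lag rule of thumb before fitting a theoretical model; the toy data here fall far short of that and serve only to show the mechanics.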
The next important step in computing the variogram is to fit a model to these data (a theoretical variogram), which allows estimation of the variogram at all possible distances, rather than just the midpoint of each defined distance interval. Geo-statistics literature generally suggests the choice of one of five theoretical models to fit to sample variogram data, each of which has the parameters defined in the plot [3] below:

Plot 1: Example of Theoretical Variogram

[3] Taken from “http://www.geostatistics.com/GSWin/GSWINGaussian_Isotropic_Model.html”

In a theoretical variogram, h (the separation distance axis in Plot 1) represents the distance between two points; $a_0$ represents the range parameter, or the distance at which the upper limit of the variogram (the semivariance axis in Plot 1) is reached; $c_0$ represents the nugget effect parameter, or the apparent non-zero intersection of the sample variogram with the y-axis (several factors, like sampling error and short-scale variability, make it possible for two points very close to each other to have unusual variance in the attribute z, and this is the nugget effect); and c represents the sill parameter, or the upper limit of the variogram. One of five possibilities for the theoretical variogram is chosen, generally based on visual rules of thumb. If there is no nugget effect, the parameter $c_0 = 0$.

1. The linear model describes a straight-line variogram:
   $\gamma(h) = c_0 + h(c/a_0)$
2. The linear-to-sill model describes a variogram that appears to be linear and then reaches an abrupt asymptote:
   $\gamma(h) = c_0 + h(c/a_0)$ for $h \le a_0$
   $\gamma(h) = c_0 + c$ for $h > a_0$
3. The exponential model describes a variogram that approaches the sill gradually but never converges with the sill (the $a_0$ parameter is used to provide range):
   $\gamma(h) = c_0 + c[1 - \exp(-h/a_0)]$
4. The spherical model is a modified quadratic function that describes a variogram similar to that described by the exponential model, but which actually reaches an asymptote at the sill:
   $\gamma(h) = c_0 + c[1.5(h/a_0) - 0.5(h/a_0)^3]$ for $h \le a_0$
   $\gamma(h) = c_0 + c$ for $h > a_0$
5. The Gaussian model is similar to the exponential model, and describes a variogram that begins rising slowly from the y-intercept, and then rises more rapidly toward the sill (again, the $a_0$ parameter is used to provide range):
   $\gamma(h) = c_0 + c[1 - \exp(-h^2/a_0^2)]$

Once a theoretical variogram has been fit to the sampled data, the system of equations described in (8) can be solved for the vector of weights $\lambda(u)$, and kriging estimates for the attribute z at specific locations may be obtained.

Analysis: An Application of Kriging to Small Area Estimation

Overview

We consider the following small area estimation problem: population counts, as well as various demographic, housing, and geographic data, are available at the census tract level for a given county in the United States, and population estimates are desired for smaller subdivisions of the census tracts known as census block groups. Only geographic data (land area, etc.) are available for the census block groups. Some definitions are in order. According to the United States Bureau of the Census [4], census tracts are defined as “small, relatively permanent statistical subdivisions of a county…Census tracts usually have between 2,500 and 8,000 persons and, when first delineated, are designed to be homogeneous with respect to population characteristics, economic status, and living conditions. Census tracts do not cross county boundaries.” In addition, a census block group (BG) is defined as “a cluster of blocks having the same first digit of their three-digit identifying numbers within a census tract…Geographic BG's never cross census tract boundaries…BG's generally contain between 250 and 550 housing units, with the ideal size being 400 housing units.” The U.S.
Census Bureau currently produces all of the data assumed above to be available at the census tract level at the census block group level as well, but for the purposes of investigating the effectiveness of kriging in small area estimation, we consider these data to be unavailable for the small areas of interest (the census block groups). The following methodology can then be considered analogous to a variety of situations where data are available for some larger area, but not for the smaller subdivisions of that area where estimates of some parameter are desired.

[4] United States Bureau of the Census - Geographic Area Descriptions: http://www.census.gov/geo/www/cob/tr_info.html; http://www.census.gov/geo/www/cob/bg_info.html

Modeling the Census Block Group Population in Washtenaw County

We begin our investigation of this problem by considering 1990 census data for Washtenaw County in the state of Michigan. The 1990 TIGER data sets produced by the U.S. Census Bureau after the completion of the 1990 U.S. census are publicly available on the web, at http://www.esri.com/data/online/tiger/data.html. These TIGER data sets include geographic features such as roads, railroads, rivers, and lakes, political boundaries, statistical boundaries, and demographic attributes for the entire United States, making them very useful for this setting. However, they do not contain the geographical coordinates of the various geographic areas, which are essential for the application of kriging. The SAS code used to merge the TIGER data sets with the coordinate data for the geographic areas of interest [5] in this study is available in Appendix E. The images on the following pages, produced using the spatial analysis software ArcView GIS Version 3.1, show the 1990 census tract and census block group divisions of Washtenaw County.
[5] Latitude and longitude coordinate data for the census tracts and block groups in the county of Washtenaw were obtained from http://www.census.gov/geo/www/cob/tr.html and http://www.census.gov/geo/www/cob/bg.html.

Figure 1: 1990 Census Tracts in Washtenaw County

Figure 2: 1990 Census Block Groups in Washtenaw County

The continuous variables below, all considered relevant to the population of a particular area, are available in the aforementioned TIGER data sets for both the Washtenaw census tracts and the Washtenaw census block groups, in addition to population counts [6]:

Land Area (km2), Water Area (km2)
%Units Occupied, %Units Vacant, %Units Owned, %Units Rented
Median Unit Value, Median Monthly Rent
%Single Detached Units, %Single Attached Units, %Duplex Units, %Apartment Units
%Persons Employed, %Persons Unemployed, %Persons Not In Work Force
Med. Household Income, Income Per Capita
%Children in Poverty, %In Poverty
%Units Built < 1970, %Units Built 70-79, %Units Built 80-84, %Units Built > 1984

[6] Note that “Units” refers to the total number of housing units in the given area. Hence “% Apartment Units” refers to the proportion of housing units that are apartments.

Here is the 1990 spatial distribution of the variable %Apartment Units across the county of Washtenaw, for both the census tracts and the block groups:

Figure 3: Spatial Distribution of % Apartment Units Across the 1990 Census Tracts in Washtenaw County
(Legend classes, % Apartment Units: 0 - 0.092, 0.092 - 0.298, 0.298 - 0.501, 0.501 - 0.763, 0.763 - 0.954)

One can see that the majority of the apartments in Washtenaw County are concentrated around the areas of Ann Arbor and Ypsilanti, which are college towns.
Figure 4: Spatial Distribution of % Apartment Units Across the 1990 Census Block Groups in Washtenaw County
(Legend classes, % Apartment Units: 0 - 0.08, 0.08 - 0.249, 0.249 - 0.484, 0.484 - 0.753, 0.753 - 0.991)

Because these data are all available at the census block group level, it is possible to fit a linear model to the block group data using ordinary regression analysis, with Persons as the response, and produce fitted values for each of the block groups. The accuracy of this model-based technique, using data that are actually available for the small areas, can then be compared with the accuracy of a new kriging-based technique, which will attempt to estimate the small area populations when data are only available for larger areas, or in this setting, when data are only available at the census tract level. In order to compare the accuracies of these two techniques, the following errors will be computed for each census block group in Washtenaw, and the mean and standard deviation (sd) of these errors for each technique will then be derived:

(9)  $\text{error}_{\text{model}} = \text{actual population} - \text{predicted population}$
(10)  $\text{error}_{\text{kriging}} = \text{actual population} - \text{estimated population}$

A linear model of the form described in (1) was fit to the 1990 Washtenaw census block group data, with Persons as the parameter of interest and all other variables as predictors. The technical details of this model fitting process are described in Appendix A and Appendix B. The final model that resulted was

(11)  $\sqrt{Y} = 0.0912x_1 + 31.67x_2 + 0.00002363x_3 - 0.0001586x_4 - 11.05x_5 + 22.92x_6$

where Y is the population of the census block group, $x_1$ is the area of the block group in square kilometers, $x_2$ is the % of housing units that are occupied, $x_3$ is the median value of the housing units, $x_4$ is the income per capita, $x_5$ is the % of housing units built between 1980 and 1984, and $x_6$ is the % of housing units built after 1984.
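To make the use of model (11) and the error definition in (9) concrete, here is a small Python sketch. The coefficients are those reported in (11); everything else is assumed for illustration, including the function names, the single hypothetical block group, and the toy population figures. The "%" predictors are entered as proportions between 0 and 1, as in the Appendix A code, and the standard deviation uses the n - 1 denominator (an assumption about how sd was computed).

```python
import math
import statistics

# Coefficients of model (11), in the order x1..x6 given in the text
COEFS = (0.0912, 31.67, 0.00002363, -0.0001586, -11.05, 22.92)

def predict_population(x):
    """Evaluate model (11) on the square-root scale, then square to back-transform."""
    sqrt_y = sum(b * xi for b, xi in zip(COEFS, x))
    return sqrt_y ** 2

def error_summary(actual, predicted):
    """Mean and sample standard deviation of the errors defined in (9)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return statistics.mean(errors), statistics.stdev(errors)

# Hypothetical block group: 1 km2, 95% occupied, $100,000 median unit value,
# $20,000 income per capita, 5% of units built 1980-84, 10% built after 1984
x = (1.0, 0.95, 100000.0, 20000.0, 0.05, 0.10)
pred = predict_population(x)

# Toy actual and predicted populations, used only to show the error summary
mean_err, sd_err = error_summary([100.0, 200.0, 300.0], [90.0, 210.0, 270.0])
```

The same `error_summary` function could be applied to kriging-based estimates to obtain the quantities in (10), since the two error definitions differ only in where the predicted values come from.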
After back-transforming to get the predicted population value for each block group by taking $Y_o^2$, where $Y_o$ is the fitted value for the block group in the above model, we find the following results based on (9):

sd{error_model} = 394.94771
mean{error_model} = 35.40443

Using Kriging for Small Area Estimation

We consider the situation where population data, various demographic, housing, and geographic data, and geographical coordinates (physical longitude and latitude) are all available for some large areas (in this example, these data are available for the Washtenaw census tracts). Population estimates are desired for smaller areas within these larger areas, and the only data available for the small areas are physical areas and geographical coordinates. In this example, the geographical coordinates of a particular area are assumed to be the coordinates of some point internal to the area.

1. The first step is to determine a linear model like that in (11) relating the small areas, in order to establish which variables are significant predictors of population for the small areas. This model may come from previous investigations, or from fitting a model to data that actually are available for similar small areas. Predictor variables in the model should represent percentages or summary measures describing the small areas, and not counts for the entire areas. The reason for this is as follows. After determining the appropriate small area model, we will use the available large area data to krig for those variables found to be significant predictors of the parameter of interest over the entire grid representing the collection of the large areas (in this case, Washtenaw County). Then, we will use these kriging estimates to predict the populations at the locations of the small areas, based on the small area model.
Kriging is a form of point interpolation, where the estimated value of some attribute at a location near a point where a value for the attribute is given will likely be very similar to the given value. Suppose a count variable, such as the total number of occupied homes, is a significant predictor of the desired attribute (population) in the small areas. When kriging estimates for this variable are produced over the entire area of interest (Washtenaw County), estimates of the total number of occupied homes in small areas that lie within the larger areas where data are available (and thus have geographical coordinates extremely close to those of the large areas) will be very similar to the total number of occupied homes for the large areas, which does not make sense. However, suppose a variable such as “% of homes occupied” is a significant predictor of the desired attribute, and we obtain kriging estimates for this variable over the entire area of interest. The kriging estimates for this variable, at geographical coordinates representing small area locations that are extremely close to the geographical coordinates of the large areas, will again be similar to the given values for the large areas, which now makes sense. The same would be true for summary variables, such as median income. Based on the population model derived in (11), with Persons as the parameter of interest, the following variables were found to be significant predictors of population in the small areas:

Land Area (km2)
%Units Occupied
%Units Built 1980-84
%Units Built > 1984
Median Unit Value
Income Per Capita

2. The second step is to krig for all of the variables that are significant predictors of the parameter of interest (population), over the entire area of interest (Washtenaw County), using the available large area (tract level) data.
This will result in estimated values for each of the variables listed above (except for the area variable, which is assumed to be given for the small areas) at all locations over the entire area of interest. The estimated values of these variables at the locations representing the small areas can then be used, along with the given area values for the small areas, to estimate the populations of the small areas using the pre-determined small area model.

All kriging estimates were derived in the SAS System, using the VARIOGRAM and KRIGE2D procedures. Please see Appendix C and Appendix D for the technical details behind the calculations of all sample variograms, theoretical variograms, and kriging estimates, and all of the associated SAS code. After kriging for all of the continuous variables above (except land area), and then using the kriging estimates in the small area model described in (11) to obtain population estimates for the small areas, we find the following results based on (10):

sd{error_kriging} = 437.7639
mean{error_kriging} = 26.1729

We see a smaller mean error than that found using ordinary regression to predict the small area populations, but a larger standard deviation in the errors. These errors have a distribution that is comparable to that found using the model-based technique.

Discussion

The standard deviation of the errors for the small area estimates derived using the kriging-based technique is understandably larger than that found using the model-based technique of ordinary regression. As indicated in (6), all kriging estimates of an attribute computed at locations over a specific area have their own standard deviation, because they are linear combinations of all of the available data for the attribute, with the weights for each datum derived by solving the system of equations defined in (8).
When these kriging estimates are input into a pre-determined model, they are not fixed values but rather the expected values from a series of distributions. Each individual kriging estimate for a particular attribute at a particular location thus in effect brings its own error with it, incorporating additional error into the predictions derived by using the kriging estimates as “new” values in the model relating the small areas. Despite this fact, the mean and standard deviation of the errors for the small area estimates derived using a kriging-based technique were still comparable to those found using a model-based technique with available small area data. This finding suggests that the kriging-based technique can provide reasonable estimates for small area attributes when model-based techniques cannot be used due to a lack of small area data.

The technique of kriging also produces estimates for a particular attribute at all locations over an area of interest. Kriging was originally designed for geological use, where the areas of interest are often fields or plots that do not have roads, bodies of water, or parks dividing the sampled points. When applying kriging to population estimation, kriging estimates of attributes such as “% of housing units occupied” are computed at all locations over an entire area, including points where there are roads, forests, parks, lakes, etc. Future research into the use of kriging-based or other spatial analytic techniques for small area estimation should investigate the possibility that barriers (e.g., roads and lakes) between “sampled points” such as census block groups, where population or housing attributes have meaning, might affect the spatial continuity of attributes related to population.
For example, kriging techniques may predict very similar values for the variable “% of housing units occupied” in two census block groups that are very close to each other but divided by a river, when in fact the actual values of this variable may vary significantly due to the presence of the river. Local kriging systems may need to be solved in order to derive kriging estimates of attributes at all points lying between such barriers, if the barriers are found to have a significant effect on the spatial continuity of the desired attributes.

Works Cited

1990 Block Group Descriptions: U.S. Bureau of the Census. http://www.census.gov/geo/www/cob/bg_info.html

1990 Census Tract and Block Numbering Areas Descriptions: U.S. Bureau of the Census. http://www.census.gov/geo/www/cob/tr_info.html

Bryan, Thomas. 1999. “Small-Area Population Estimation Technique Using Administrative Records, and Evaluation of Results with Loss Functions and Optimization Criteria.” Federal Committee on Statistical Methodology Research Conference, November 15-17, 1999. Washington, DC: U.S. Bureau of the Census.

Census Block Group Cartographic Boundary Files: U.S. Bureau of the Census. http://www.census.gov/geo/www/cob/bg.html

Census TIGER Data: ArcData Online. http://www.esri.com/data/online/tiger/data.html

Census Tract Cartographic Boundary Files: U.S. Bureau of the Census. http://www.census.gov/geo/www/cob/tr.html

Faraway, Julian J. 1999. Practical Regression and Anova using R.

Gamma Design Software. http://www.geostatistics.com/GSWIN/GSWINIsotropic_Variogram_Models.html

Gill, Andrew. 1996. “Kriging Example.” http://www.maths.adelaide.edu.au/Applied/UA_DAM_FLUIDS/GROUNDWATER/GEOSTATS/KrigEx/krigex.html

Goovaerts, Pierre. 1997. Geostatistics for Natural Resources Evaluation. New York: Oxford Press. Pp. 125-158, 203.

Goovaerts, Pierre. 2001. “Study of scale-dependent correlation structures using factorial kriging analysis.” http://www-personal.engin.umich.edu/~goovaert/pg-suj3.html

Guyaguler, Baris. “Ordinary Kriging.” http://pangea.stanford.edu/~baris/professional/theoryok.html

Lang, Chao-Yi. “Kriging Interpolation.” http://www.tc.cornell.edu/Visualization/contrib/cs490-94to95/clang/kriging.html

Rao, J.N.K. 1999. “Current Trends in Sample Survey Theory and Methods.” Pp. 16-22 in Indian Journal of Statistics, vol. 61.

Rao, Mahesh. 1999. “Interpolation.” http://reserves.library.okstate.edu/geog5343/lec12

SAS Institute Inc. 1999. Online documentation: the KRIGE2D and VARIOGRAM Procedures.

Schabenberger, Oliver. 1997. “Developing Variograms in SAS + Ordinary Kriging.” http://www.cas.vt.edu/schabenb/Spatial.htm

APPENDIX A
Census Block Group Population Model: Fitting Details and Associated R Code

Washtenaw Census Block Group (1990) Population Modeling

All of the following code was written for the R Software Package for interactive statistical analysis.
> wash <- read.table("c:\\wash1.txt", sep=",", h=T)

#variable selection
> wash2 <- wash[,1:181]
> wash3 <- wash2[,-c(8:58)]
> wash4 <- wash3[,-c(13:26)]
> wash5 <- wash4[,-c(14:19,25:70)]
> wash6 <- wash5[,-c(19,23:44)]
> wash7 <- wash6[,-c(23:30,34:37)]

Variables (Continuous):
State County Tract Blockgroup Land.km Water.km Persons Housing Occupied Vacant
Ownr.occ Rent.occ Val.medi Rnt.medi Detached Attached Duplex Apartmnt Employed
Unemploy Notinwrk Inc.medn Incprcap Childpov Inpovrty Bltbfr70 Blt.7079 Blt.8084
Bltaft84

#Convert specific variables to proportions:
> wash7$Vacant <- wash7$Vacant / wash7$Housing
> wash7$Ownr.occ <- wash7$Ownr.occ / wash7$Occupied
> wash7$Rent.occ <- wash7$Rent.occ / wash7$Occupied
> wash7$Occupied <- wash7$Occupied / wash7$Housing
> wash7$Detached <- wash7$Detached / wash7$Housing
> wash7$Attached <- wash7$Attached / wash7$Housing
> wash7$Duplex <- wash7$Duplex / wash7$Housing
> wash7$Apartmnt <- wash7$Apartmnt / wash7$Housing
> wash7$Employed <- wash7$Employed / wash7$Persons
> wash7$Unemploy <- wash7$Unemploy / wash7$Persons
> wash7$Notinwrk <- wash7$Notinwrk / wash7$Persons
> wash7$Bltbfr70 <- wash7$Bltbfr70 / wash7$Housing
> wash7$Blt.7079 <- wash7$Blt.7079 / wash7$Housing
> wash7$Blt.8084 <- wash7$Blt.8084 / wash7$Housing
> wash7$Bltaft84 <- wash7$Bltaft84 / wash7$Housing
> wash7$Childpov <- wash7$Childpov / wash7$Persons
> wash7$Inpovrty <- wash7$Inpovrty / wash7$Persons

> washmod <- wash7[,-c(1:4)] #for fitting purposes

#remove bad cases
> washmod <- washmod[-c(66,254),]
> wash7 <- wash7[-c(66,254),]
> washmod1 <- washmod[,-c(4,6,8)] #remove housing, %vacant, %rent.occ (%occupied + %vacant = 1)

#5-number numerical summary
> summary(washmod1)

Variable      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
Land.km     0.0120    0.3680    0.6685    6.8540    2.1520   96.8800
Water.km     0.000     0.000     0.000     0.121     0.000     3.272
Persons       28.0     720.8     999.0    1054.0    1220.0    5853.0
Occupied    0.6648    0.9304    0.9601    0.9438    0.9775    1.0000
Ownr.occ    0.0000    0.2126    0.6467    0.5623    0.8838    1.0000
Val.medi         0     68330     90200    100200    118000    500000
Rnt.medi       0.0     391.3     476.5     487.2     571.3    1001.0
Detached    0.0000    0.1695    0.6017    0.5467    0.9170    1.0000
Attached  0.000000  0.003135  0.010070  0.056940  0.033450  0.679200
Duplex     0.00000   0.00318   0.01210   0.03666   0.04498   0.31360
Apartmnt   0.00000   0.01137   0.17220   0.32210   0.62250   0.99080
Employed    0.0000    0.4919    0.5380    0.5416    0.6101    0.8768
Unemploy   0.00000   0.01022   0.02026   0.02687   0.03485   0.16410
Notinwrk    0.0000    0.1599    0.2099    0.2359    0.2607    0.9911
Inc.medn         0     24910     37510     38860     50130    111200
Incprcap      2415     12870     16800     17660     20250     66660
Childpov  0.000000  0.000000  0.006738  0.025130  0.026300  0.421100
Inpovrty   0.00000   0.02024   0.05479   0.12040   0.14900   0.74070
Bltbfr70    0.0000    0.2634    0.3877    0.4195    0.5733    1.3570
Blt.7079   0.00000   0.04358   0.18410   0.22330   0.36030   0.84270
Blt.8084   0.00000   0.00000   0.02214   0.04213   0.06329   0.33700
Bltaft84   0.00000   0.00000   0.02094   0.08072   0.10920   1.06400

Variables with data indicating that particular cases might be outliers: Persons, Val.medi, Bltbfr70, Bltaft84, Incprcap, Notinwrk, Land.km.
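Proportions greater than 1, such as the maxima reported in the summary for Bltbfr70 and Bltaft84, cannot be valid values. A minimal R sketch of one way to locate such rows follows. The data frame `toy` and the helper `flag_bad_props` are hypothetical illustrations and are not part of the original analysis.

```r
# Hypothetical toy data: two proportion columns, one with an impossible value.
toy <- data.frame(
  Bltbfr70 = c(0.26, 1.36, 0.57),
  Bltaft84 = c(0.02, 0.11, 0.08)
)

# Return the row positions where any of the named columns lies outside [0, 1].
flag_bad_props <- function(df, cols) {
  bad <- sapply(df[cols], function(x) x < 0 | x > 1)
  bad <- matrix(bad, nrow = nrow(df))  # keep a matrix even for a one-row input
  which(rowSums(bad) > 0)
}

bad_rows <- flag_bad_props(toy, c("Bltbfr70", "Bltaft84"))
print(bad_rows)
```

Rows flagged this way can then be inspected and, if they prove to be data entry errors, dropped by position from every data frame that must stay aligned.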
According to ArcView, the Land.km cases are not errors in data entry.

> wrk <- (washmod1$Bltbfr70 > 1)

Bltbfr70 value is an error in data entry.

> washmod1 <- washmod1[-108,]
> wash7 <- wash7[-108,]

Notinwrk values do not appear to be errors in data entry.
Persons values might be errors in data entry. More than most Census Tracts!
Val.medi may not be an error in data entry.

> (washmod1$Bltaft84 > 1)

Bltaft84 value is an error in data entry.

> wash7 <- wash7[-181,]
> washmod1 <- washmod1[-181,]

Incprcap value may not be an error in data entry.

Fit full model:

> mod <- lm(Persons ~ .,washmod1)

Model Diagnostics
-----------------

> x <- model.matrix(mod)
> lev <- hat(x)
> plot(lev, ylab = "Leverages")
> abline(h=2*22/266)
> sum(lev)
[1] 22
#see Plot 1, Appendix B.

> sum <- summary(mod)
> stud <- mod$res / (sum$sig*sqrt(1-lev))
> plot(stud,ylab="Studentized Residuals") #see Plot 2, App. B.
> jack <- stud*sqrt((266-22-1)/(266-22-stud^2))
> plot(jack,ylab="Jackknife Residuals") #see Plot 3, App. B.
> jack[jack > 3]
      71       85      151
8.757113 3.755840 7.153207

Check to see if these cases are outliers using a Bonferroni-based test:

> qt(0.05/(266*2),266-22-1)
[1] -3.792889

Two of the cases appear to be outliers! Further investigation shows that these are the two cases with Persons counts above 5,000.

Check for influential observations:

> cook <- stud^2*lev/(22*(1-lev))
> plot(cook,ylab = "Cook Statistics") #see Plot 4, Appendix B.
> cook[cook > 0.15]
       12        71        86       151
0.1690986 0.3085052 0.2805020 0.1715737

There appear to be four influential cases with relatively large Cook statistics, two of which were identified earlier as outliers. Temporarily exclude them from analysis:

> washmod1 <- washmod1[-c(12,70,85,149),]
> wash7 <- wash7[-c(12,70,85,149),]
> dim(washmod1)
[1] 262 22
> dim(wash7)
[1] 262 29

Variable Transformation
-----------------------

> plot(mod$fit, mod$res, xlab = "Fitted Values", ylab = "Residuals") #see Plot 5, Appendix B.
#test for non-constant variance
> summary(lm(abs(mod$res) ~ mod$fit))

Call:
lm(formula = abs(mod$res) ~ mod$fit)

Residuals:
    Min      1Q  Median      3Q     Max
-635.57 -219.80  -58.22   80.81 3381.34

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -90.06772   88.47109  -1.018    0.310
mod$fit       0.40512    0.08035   5.042 8.58e-07 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 394.2 on 264 degrees of freedom
Multiple R-Squared: 0.08783, Adjusted R-squared: 0.08438
F-statistic: 25.42 on 1 and 264 degrees of freedom, p-value: 8.577e-007

There is very strong evidence of non-constant variance in the residuals. Plot 5 suggests a square-root transformation of the count response (Persons).

> mod2 <- lm(sqrt(Persons) ~ ., washmod1)
> plot(mod2$fit, mod2$res, xlab = "Fitted Values", ylab = "Residuals", main = "Square Root Response") #see Plot 6, App. B.
> summary(lm(abs(mod2$res) ~ mod2$fit))

Call:
lm(formula = abs(mod2$res) ~ mod2$fit)

Residuals:
   Min     1Q Median     3Q    Max
-4.594 -2.669 -0.755  1.349 17.844

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.38401    1.80611   3.535 0.000483 ***
mod2$fit    -0.06380    0.05719  -1.116 0.265606
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 3.838 on 260 degrees of freedom
Multiple R-Squared: 0.004764, Adjusted R-squared: 0.0009365
F-statistic: 1.245 on 1 and 260 degrees of freedom, p-value: 0.2656

There is no longer any evidence of non-constant variance in the residuals after the square-root transformation of the response (Persons).

Variable Selection
------------------

Stepwise backward elimination with a critical alpha value of 0.15 (for prediction performance) results in the following model:

> summary(mod2)

Call:
lm(formula = sqrt(Persons) ~ . - Notinwrk - Duplex - Blt.7079 -
    Water.km - Inpovrty - Unemploy - Attached - Detached - Employed -
    Childpov - Rnt.medi - Apartmnt - Ownr.occ, data = washmod1)

Residuals:
     Min       1Q   Median       3Q      Max
-19.3348  -3.5081  -0.2181   3.6621  24.2039

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.300e+01  7.446e+00   1.746 0.081991 .
Land.km      7.794e-02  2.464e-02   3.163 0.001752 **
Occupied     1.959e+01  7.748e+00   2.529 0.012056 *
Val.medi     3.255e-05  9.736e-06   3.343 0.000954 ***
Inc.medn     8.833e-05  3.816e-05   2.315 0.021436 *
Incprcap    -3.926e-04  8.437e-05  -4.653 5.27e-06 ***
Bltbfr70    -3.385e+00  2.095e+00  -1.615 0.107473
Blt.8084    -1.320e+01  7.098e+00  -1.860 0.064105 .
Bltaft84     2.032e+01  3.510e+00   5.790 2.07e-08 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 6.029 on 253 degrees of freedom
Multiple R-Squared: 0.3133, Adjusted R-squared: 0.2916
F-statistic: 14.43 on 8 and 253 degrees of freedom, p-value: 0

Check for influential cases:

> sum <- summary(mod2)
> x <- model.matrix(mod2)
> lev <- hat(x)
> sum(lev)
[1] 9
> stud <- mod2$res / (sum$sig*sqrt(1-lev))
> cook <- stud^2*lev/(9*(1-lev))
> plot(cook,main="Cook Statistics after Variable Selection") #see Plot 7, Appendix B.
> cook[cook > 0.15]
      120
0.1631331

Is there a significant change in the model when this case is excluded?

> summary(mod2)

Call:
lm(formula = sqrt(Persons) ~ . - Notinwrk - Duplex - Blt.7079 -
    Water.km - Inpovrty - Unemploy - Attached - Detached - Employed -
    Childpov - Rnt.medi - Apartmnt - Ownr.occ, data = washmod1,
    subset = (cook < max(cook)))

Residuals:
     Min       1Q   Median       3Q      Max
-19.4777  -3.5513  -0.2323   3.5122  24.7954

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.024e+01  7.545e+00   1.358  0.17580
Land.km      8.157e-02  2.458e-02   3.318  0.00104 **
Occupied     2.204e+01  7.812e+00   2.822  0.00516 **
Val.medi     2.792e-05  9.980e-06   2.798  0.00554 **
Inc.medn     6.137e-05  4.047e-05   1.517  0.13064
Incprcap    -2.961e-04  9.781e-05  -3.027  0.00272 **
Bltbfr70    -2.778e+00  2.108e+00  -1.318  0.18873
Blt.8084    -1.361e+01  7.064e+00  -1.927  0.05515 .
Bltaft84     2.055e+01  3.494e+00   5.883 1.28e-08 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 5.997 on 252 degrees of freedom
Multiple R-Squared: 0.2871, Adjusted R-squared: 0.2644
F-statistic: 12.68 on 8 and 252 degrees of freedom, p-value: 2.665e-015

Two predictors are no longer significant! This case should also be excluded.

> wash7 <- wash7[-115,]
> washmod1 <- washmod1[-115,]
> dim(wash7)
[1] 261 29
> dim(washmod1)
[1] 261 22

Further stepwise backward elimination results in:

> summary(mod2)

Call:
lm(formula = sqrt(Persons) ~ . - Notinwrk - Duplex - Blt.7079 -
    Water.km - Inpovrty - Unemploy - Attached - Detached - Employed -
    Childpov - Rnt.medi - Apartmnt - Ownr.occ - Bltbfr70 - 1 -
    Inc.medn, data = washmod1)

Residuals:
      Min        1Q    Median        3Q       Max
-19.71474  -3.78245  -0.03491   3.49497  25.02833

Coefficients:
           Estimate Std. Error t value Pr(>|t|)
Land.km   9.120e-02  2.354e-02   3.874 0.000136 ***
Occupied  3.167e+01  9.984e-01  31.718  < 2e-16 ***
Val.medi  2.363e-05  9.639e-06   2.451 0.014913 *
Incprcap -1.586e-04  6.117e-05  -2.592 0.010084 *
Blt.8084 -1.105e+01  6.720e+00  -1.645 0.101265
Bltaft84  2.292e+01  3.092e+00   7.412 1.83e-12 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 6.02 on 255 degrees of freedom
Multiple R-Squared: 0.9658, Adjusted R-squared: 0.965
F-statistic: 1200 on 6 and 255 degrees of freedom, p-value: 0

The fit appears very good, although the R-squared of 0.9658 is for a model with no intercept, which R measures about zero rather than about the mean of the response, so it is not directly comparable with the R-squared values of the earlier models. Plot 8 in Appendix B indicates that the residuals in this model appear to follow a normal distribution.
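The rise in R-squared from about 0.29 to about 0.97 coincides with the removal of the intercept (the `- 1` term) from the model formula. The sketch below uses simulated data, not the Washtenaw data, to illustrate how the R-squared reported by `summary()` changes when an intercept is dropped. All object names and the simulated values are assumptions made for illustration.

```r
# Simulated data (not the Washtenaw data): a response with a large mean
# and only a weak dependence on the single predictor x.
set.seed(1)
n <- 200
x <- runif(n, 0.6, 1.0)
y <- 30 + 2 * x + rnorm(n, sd = 6)

with_int <- lm(y ~ x)      # intercept included
no_int   <- lm(y ~ x - 1)  # intercept dropped

# R-squared as reported by summary() for each fit.
r2_with <- summary(with_int)$r.squared
r2_no   <- summary(no_int)$r.squared

# Recompute both by hand: the total sum of squares is centred on mean(y)
# when an intercept is present, and taken about zero when it is not.
r2_centred   <- 1 - sum(resid(with_int)^2) / sum((y - mean(y))^2)
r2_uncentred <- 1 - sum(resid(no_int)^2) / sum(y^2)

print(c(with_intercept = r2_with, no_intercept = r2_no))
```

Because the two statistics use different baselines, the residual standard error (about 6.0 in both the nine-parameter and six-parameter models above) is the more direct basis for comparing these fits.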
> dim(washmod1)
[1] 261 22
> dim(wash7)
[1] 261 29
> fit <- mod2$fit
> fit <- as.matrix(fit)
> dim(fit)
[1] 260 1
> wash8$fitted <- fit
> wash8$fitted2 <- (wash8$fitted)^2 #actual fitted values
> dim(wash8)
[1] 260 28

#write out data set with fitted values, included cases
> write.table(wash8,"c:\\washdata.txt",row.names=F,sep=",")

#compute mean, standard deviation of desired errors
> data <- read.table("c:\\washdata.txt",h=T,sep=",")
> data$error <- data$Persons - data$fitted2
> var(data$error)
[1] 155983.7
> mean(data$error)
[1] 35.40443
> sqrt(var(data$error))
[1] 394.9477

APPENDIX B

Census Block Group Population Model: Diagnostic Plots

Plot 1: Leverages of all cases after fitting the full model
According to the rather arbitrary (2p / n) rule of thumb, several cases appear to have high leverage.

Plot 2: Studentized Residuals (Full Model)
Three cases appear to have unusually large studentized residuals.

Plot 3: Jackknife Residuals (Full Model)
Three cases again appear to have large jackknife residuals.

Plot 4: Cook Statistics (Full Model)
There appear to be four influential cases, each having an unusually large Cook statistic.

Plot 5: Residuals vs. Fitted Values (Full Model)
There is rather strong evidence of non-constant variance in the residuals, in a pattern that suggests a square-root transformation of the response variable Persons.

Plot 6: Residuals vs. Fitted Values AFTER square-root transformation of response
The transformation of the response appears to have cleared up the non-constant variance in the residuals.

Plot 7: Cook Statistics AFTER Variable Selection
There appears to be one rather influential case.

Plot 8: Normal Q-Q Plot, Indicating Normal Residuals in the Final Model

APPENDIX C

Kriging Analysis: Technical Details and Associated SAS Code

Summary of Kriging Analysis

1. Reconstruct the complete 1990 Washtenaw Census Tract data set, in R:

> wash <- read.table("c:\\washt.txt",h=T,sep=",")
> dim(wash)
[1] 81 193
> wash2 <- wash[,-c(173:193)]
> wash3 <- wash2[,-c(7:57)]
> wash4 <- wash3[,-c(12:25,27:32,38:76)]
> wash5 <- wash4[,-c(21:42,44:51,55:58)]
> dim(wash5)
[1] 81 28

> wash5$Vacant <- wash5$Vacant / wash5$Housing
> wash5$Ownr.occ <- wash5$Ownr.occ / wash5$Occupied
> wash5$Rent.occ <- wash5$Rent.occ / wash5$Occupied
> wash5$Occupied <- wash5$Occupied / wash5$Housing
> wash5$Detached <- wash5$Detached / wash5$Housing
> wash5$Attached <- wash5$Attached / wash5$Housing
> wash5$Duplex <- wash5$Duplex / wash5$Housing
> wash5$Apartmnt <- wash5$Apartmnt / wash5$Housing
> wash5$Employed <- wash5$Employed / wash5$Persons
> wash5$Unemploy <- wash5$Unemploy / wash5$Persons
> wash5$Notinwrk <- wash5$Notinwrk / wash5$Persons
> wash5$Bltbfr70 <- wash5$Bltbfr70 / wash5$Housing
> wash5$Blt.7079 <- wash5$Blt.7079 / wash5$Housing
> wash5$Blt.8084 <- wash5$Blt.8084 / wash5$Housing
> wash5$Bltaft84 <- wash5$Bltaft84 / wash5$Housing
> wash5$Childpov <- wash5$Childpov / wash5$Persons
> wash5$Inpovrty <- wash5$Inpovrty / wash5$Persons

> write.table(wash5,"c:\\prekrig.txt",row.names=F,sep=",")

2. Merge the Census Tract data set with the file containing the locations of all of the Census Tracts (SAS code):

data wash;
  infile "c:\prekrig.txt" dlm = ",";
  input State County Tract Land_km Water_km Persons Housing Occupied Vacant
        Ownr_occ Rent_occ Val_medi Rnt_medi Detached Attached Duplex Apartmnt
        Employed Unemploy Notinwrk Inc_medn Incprcap Childpov Inpovrty
        Bltbfr70 Blt_7079 Blt_8084 Bltaft84;
run;

data locs;
  infile "c:\tractloc2.txt" dlm = ",";
  input obs stcty tract long lat;
run;

proc sort data = locs;
  by tract;
run;

proc sort data = wash;
  by tract;
run;

data last;
  merge wash (in=ina) locs (in=inb);
  by tract;
  if ina and inb;
run;

libname brady "h:\thesis";

data brady.washkrig;
  set last;
run;

3. Use the given Census Tract data to krig for the significant predictors of population in the small area model across all of Washtenaw County, using proc variogram and proc krige2d in SAS:

Variable: % Housing Units Occupied
----------------------------------
(see all plots for % Housing Units Occupied in Appendix D)

*create individual variable subset;
data washocc;
  set brady.washkrig (keep = Occupied long lat);
run;

*plot measurement locations (see Plot 1 in Appendix D);
proc gplot data = washocc;
  title 'Scatter Plot of Measurement Locations';
  plot lat*long / frame cframe = ligr haxis = axis1 vaxis = axis2;
  symbol1 v = dot color = blue;
  axis1 minor = none;
  axis2 minor = none label = (angle = 90 rotate = 0);
  label lat = 'Latitude' long = 'Longitude';
run;
quit;

*look at 3d surface plot of variable;
proc g3d data = washocc;
  title 'Surface Plot of Variable Measurements';
  scatter lat*long=Occupied / xticknum = 5 yticknum = 5 grid zmin = 0.8 zmax = 1.0;
  label long = 'Longitude' lat = 'Latitude' Occupied = '% Housing Units Occupied';
run;
quit;

*look at the distribution of the pairwise distances between the tracts
 (see Plot 2 in Appendix D);
proc variogram data = washocc outdistance = outd;
  compute nhc = 20 novariogram;
  coordinates xc = long yc = lat;
  var Occupied;
run;

title 'Distance Intervals';
proc print data = outd;
run;

data outd;
  set outd;
  mdpt = round((lb+ub) / 2,0.001);
  label mdpt = 'Midpoint of Interval';
run;

axis1 minor = none;
axis2 minor = none label = (angle = 90 rotate = 0);
title 'Distribution of Pairwise Distances';
proc gchart data = outd;
  vbar mdpt / type = sum sumvar = count discrete frame cframe = ligr
              gaxis = axis1 raxis = axis2 nolegend;
run;
quit;

*look at the lower bound of the distance interval where there are still
 sufficient (more than 30) pairs of points, and divide this number by the
 apparent lag distance (LAGD) to yield MAXLAGS. Plot 2 in Appendix D suggests
 setting MAXLAGS = 15 and lagd = 0.03. Compute and plot standard and robust
 semivariograms;
proc variogram data = washocc outv = outv;
  compute lagd = 0.03 maxlag = 15 robust;
  coordinates xc = long yc = lat;
  var Occupied;
run;

title 'Variogram Results';
proc print data = outv label;
  var lag count distance variog rvario;
run;

data outv2;
  set outv;
  vari = variog; type = 'regular'; output;
  vari = rvario; type = 'robust'; output;
run;

title 'Standard and Robust Semivariogram for %Occupied Data';
proc gplot data = outv2;
  plot vari*distance=type / frame cframe = ligr vaxis = axis2 haxis = axis1;
  symbol1 i = join l = 1 c = blue v = star;
  symbol2 i = join l = 1 c = black v = square;
  axis1 minor = none label = (c = black 'Lag Distance');
  axis2 minor = none label = (angle = 90 rotate = 0 c = black 'Variogram');
run;
quit;

*fit an appropriate theoretical variogram to the data, and plot it against
 the standard and robust semivariograms;
data outv3;
  set outv;
  c0 = 0.0012; c = 0.0022; a0 = 0.50;
  vari = c0 + c * (1 - exp(-distance * distance / (a0 * a0)));
  type = 'Gaussian'; output;
  vari = variog; type = 'regular'; output;
  vari = rvario; type = 'robust'; output;
run;

title 'Theoretical and Sample Semivariogram for %Occupied Data';
proc gplot data = outv3;
  plot vari*distance=type / frame cframe = ligr vaxis = axis2 haxis = axis1;
  symbol1 i = join l = 1 c = blue v = star;
  symbol2 i = join l = 1 c = black v = square;
  symbol3 i = join l = 1 c = red v = diamond;
  axis1 minor = none label = (c = black 'Lag Distance');
  axis2 minor = none label = (angle = 90 rotate = 0 c = black 'Variogram');
run;
quit;

*krig using the theoretical variogram, and plot the resulting estimates;
proc sort data = washocc;
  by long;
run;

proc sort data = washocc;
  by lat;
run;

proc krige2d data = washocc outest = est;
  pred var = Occupied;
  model nugget = 0.0012 scale = 0.0022 range = 0.50 form = gauss;
  coord xc = long yc = lat;
  grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118 by 0.01;
run;

proc g3d data = est;
  title 'Surface Plot of Kriged Estimates for %Occupied';
  scatter gyc*gxc = estimate / grid;
  label gyc = 'Latitude' gxc = 'Longitude' estimate = '%Occupied';
run;

4. Merge the kriging estimates with the final Census Block Group data set:

data est2 (keep = long lat koccu);
  set est;
  long = gxc; lat = gyc;
  long = round(long, 0.01);
  lat = round(lat, 0.01);
  koccu = estimate;
run;

data block;
  set brady.krigdata; *copy of brady.washdata;
  long = round(long, 0.01);
  lat = round(lat, 0.01);
run;

proc sort data = est2;
  by long lat;
run;

proc sort data = block;
  by long lat;
run;

data final;
  merge block (in = ina) est2 (in = inb);
  by long lat;
  if ina and inb;
run;

*continue to append brady.krigdata, so that the final data set will contain
 both actual data and kriging estimates for all of the significant predictors
 of population for each of the Census Block Groups;
data brady.krigdata;
  set final;
run;

5. Continue to krig for all other significant predictors of population in the small area model, following the same methodology outlined above:

Variable: Median Housing Unit Value
-----------------------------------
(see all plots for Median Housing Unit Value in Appendix D)

*fit theoretical variogram;
data outv3;
  set outv;
  c = 3500000000; a0 = 0.06;
  if (distance <= 0.06) then do;
    vari = c * (1.5 * (distance / a0) - 0.5 * (distance / a0)**3);
    type = 'Spherical';
  end;
  else if (distance > 0.06) then do;
    vari = c; type = 'Spherical';
  end;
  output;
  vari = variog; type = 'regular'; output;
  vari = rvario; type = 'robust'; output;
run;

*krig using the theoretical variogram, and plot the resulting estimates;
proc krige2d data = wash_med outest = est;
  pred var = Val_medi;
  model scale = 3500000000 range = 0.06 form = spherical;
  coord xc = long yc = lat;
  grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118 by 0.01;
run;

proc g3d data = est;
  title 'Surface Plot of Kriged Estimates for Med.Value';
  scatter gyc*gxc = estimate / grid;
  label gyc = 'Latitude' gxc = 'Longitude' estimate = 'Median Unit Value';
run;
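The two theoretical semivariogram forms used in the SAS data steps above (Gaussian for % Housing Units Occupied, spherical for Median Housing Unit Value) can also be written as small R functions. This is a sketch that mirrors the data-step formulas. The function names `gaussian_vgm` and `spherical_vgm` are hypothetical, `c0`, `c`, and `a0` denote the nugget, scale, and range as in the code above, and the evenly spaced lag distances are an assumption made for illustration.

```r
# Gaussian model: nugget c0 plus scale c times (1 - exp(-h^2 / a0^2)).
gaussian_vgm <- function(h, c0, c, a0) {
  c0 + c * (1 - exp(-(h^2) / (a0^2)))
}

# Spherical model with no nugget: rises to the scale c at the range a0
# and stays at c for larger lag distances.
spherical_vgm <- function(h, c, a0) {
  ifelse(h <= a0, c * (1.5 * (h / a0) - 0.5 * (h / a0)^3), c)
}

# Evaluate at nominal lag distances (lagd = 0.03, lags 0 to 15) using the
# parameter values that appear in the SAS code.
h <- 0.03 * (0:15)
occ_vgm <- gaussian_vgm(h, c0 = 0.0012, c = 0.0022, a0 = 0.50)
val_vgm <- spherical_vgm(h, c = 3500000000, a0 = 0.06)
print(round(occ_vgm, 5))
```

Plotting `occ_vgm` or `val_vgm` against `h` alongside the sample semivariogram gives the same visual check of the fit that the `proc gplot` steps provide.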
Variable: % Units Built After 1984
----------------------------------
(see all plots for % Units Built After 1984 in Appendix D)

*fit theoretical variogram;
data outv3;
  set outv;
  c = 0.015; a0 = 0.10;
  if (distance <= 0.10) then do;
    vari = c * (1.5 * (distance / a0) - 0.5 * (distance / a0)**3);
    type = 'Spherical';
  end;
  else if (distance > 0.10) then do;
    vari = c; type = 'Spherical';
  end;
  output;
  vari = variog; type = 'regular'; output;
  vari = rvario; type = 'robust'; output;
run;

title 'Theoretical and Sample Semivariogram for %BltAft84 Data';
proc gplot data = outv3;
  plot vari*distance=type / frame cframe = ligr vaxis = axis2 haxis = axis1;
  symbol1 i = join l = 1 c = blue v = star;
  symbol2 i = join l = 1 c = black v = square;
  symbol3 i = join l = 1 c = red v = diamond;
  axis1 minor = none label = (c = black 'Lag Distance');
  axis2 minor = none label = (angle = 90 rotate = 0 c = black 'Variogram');
run;
quit;

*krig using the theoretical variogram, and plot the resulting estimates;
proc krige2d data = wash84 outest = est;
  pred var = Bltaft84;
  model scale = 0.015 range = 0.10 form = sph;
  coord xc = long yc = lat;
  grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118 by 0.01;
run;

proc g3d data = est;
  title 'Surface Plot of Kriged Estimates for %BltAft84';
  scatter gyc*gxc = estimate / grid;
  label gyc = 'Latitude' gxc = 'Longitude' estimate = '% Units Built After 1984';
run;

*zero out any negative kriging estimates;
data final;
  set final;
  if (kaft84 < 0) then kaft84 = 0;
run;

Variable: Income Per Capita
---------------------------
(see all plots for Income Per Capita in Appendix D)

data washinc;
  set brady.washkrig (keep = Incprcap long lat);
run;

proc g3d data = washinc;
  title 'Surface Plot of Variable Measurements';
  scatter lat*long=Incprcap / xticknum = 5 yticknum = 5 grid zmin = 4120 zmax = 59400;
  label long = 'Longitude' lat = 'Latitude' Incprcap = 'Income Per Capita';
run;
quit;

proc variogram data = washinc outv = outv;
  compute lagd = 0.03 maxlag = 15 robust;
  coordinates xc = long yc = lat;
  var Incprcap;
run;

title 'Variogram Results';
proc print data = outv label;
  var lag count distance variog rvario;
run;

data outv2;
  set outv;
  vari = variog; type = 'regular'; output;
  vari = rvario; type = 'robust'; output;
run;

title 'Standard and Robust Semivariogram for IncPrCap Data';
proc gplot data = outv2;
  plot vari*distance=type / frame cframe = ligr vaxis = axis2 haxis = axis1;
  symbol1 i = join l = 1 c = blue v = star;
  symbol2 i = join l = 1 c = black v = square;
  axis1 minor = none label = (c = black 'Lag Distance');
  axis2 minor = none label = (angle = 90 rotate = 0 c = black 'Variogram');
run;
quit;

*fit theoretical variogram;
data outv3;
  set outv;
  c = 75000000; a0 = 0.03;
  if (distance <= 0.03) then do;
    vari = c * (1.5 * (distance / a0) - 0.5 * (distance / a0)**3);
    type = 'Spherical';
  end;
  else if (distance > 0.03) then do;
    vari = c; type = 'Spherical';
  end;
  output;
  vari = variog; type = 'regular'; output;
  vari = rvario; type = 'robust'; output;
run;

title 'Theoretical and Sample Semivariogram for IncPrCap Data';
proc gplot data = outv3;
  plot vari*distance=type / frame cframe = ligr vaxis = axis2 haxis = axis1;
  symbol1 i = join l = 1 c = blue v = star;
  symbol2 i = join l = 1 c = black v = square;
  symbol3 i = join l = 1 c = red v = diamond;
  axis1 minor = none label = (c = black 'Lag Distance');
  axis2 minor = none label = (angle = 90 rotate = 0 c = black 'Variogram');
run;
quit;

*krig using the theoretical variogram, and plot the resulting estimates;
proc krige2d data = washinc outest = est;
  pred var = Incprcap;
  model scale = 75000000 range = 0.03 form = sph;
  coord xc = long yc = lat;
  grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118 by 0.01;
run;

proc g3d data = est;
  title 'Surface Plot of Kriged Estimates for IncPrCap';
  scatter gyc*gxc = estimate / grid;
  label gyc = 'Latitude' gxc = 'Longitude' estimate = 'Income Per Capita';
run;

Variable: % Units Built Between 1980-1984
-----------------------------------------
(see all plots for % Units Built Between 1980-1984 in Appendix D)

data wash8084;
  set brady.washkrig (keep = Blt_8084 long lat);
run;

proc g3d data = wash8084;
  title 'Surface Plot of Variable Measurements';
  scatter lat*long=Blt_8084 / xticknum = 5 yticknum = 5 grid zmin = 0 zmax = 0.1375;
  label long = 'Longitude' lat = 'Latitude' Blt_8084 = '% Units Built 1980-1984';
run;
quit;

proc variogram data = wash8084 outv = outv;
  compute lagd = 0.03 maxlag = 15 robust;
  coordinates xc = long yc = lat;
  var Blt_8084;
run;

title 'Variogram Results';
proc print data = outv label;
  var lag count distance variog rvario;
run;

data outv2;
  set outv;
  vari = variog; type = 'regular'; output;
  vari = rvario; type = 'robust'; output;
run;

title 'Standard and Robust Semivariogram for %Blt8084 Data';
proc gplot data = outv2;
  plot vari*distance=type / frame cframe = ligr vaxis = axis2 haxis = axis1;
  symbol1 i = join l = 1 c = blue v = star;
  symbol2 i = join l = 1 c = black v = square;
  axis1 minor = none label = (c = black 'Lag Distance');
  axis2 minor = none label = (angle = 90 rotate = 0 c = black 'Variogram');
run;
quit;

*fit theoretical variogram;
data outv3;
  set outv;
  c = 0.00155; a0 = 0.05;
  if (distance <= 0.05) then do;
    vari = c * (1.5 * (distance / a0) - 0.5 * (distance / a0)**3);
    type = 'Spherical';
  end;
  else if (distance > 0.05) then do;
    vari = c; type = 'Spherical';
  end;
  output;
  vari = variog; type = 'regular'; output;
  vari = rvario; type = 'robust'; output;
run;

title 'Theoretical and Sample Semivariogram for %Blt8084 Data';
proc gplot data = outv3;
  plot vari*distance=type / frame cframe = ligr vaxis = axis2 haxis = axis1;
  symbol1 i = join l = 1 c = blue v = star;
  symbol2 i = join l = 1 c = black v = square;
  symbol3 i = join l = 1 c = red v = diamond;
  axis1 minor = none label = (c = black 'Lag Distance');
  axis2 minor = none label = (angle = 90 rotate = 0 c = black 'Variogram');
run;
quit;

*krig using the theoretical variogram, and plot the resulting estimates;
proc krige2d data = wash8084 outest = est;
  pred var = Blt_8084;
  model scale = 0.00155 range = 0.05 form = sph;
  coord xc = long yc = lat;
  grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118 by 0.01;
run;

proc g3d data = est;
  title 'Surface Plot of Kriged Estimates for %Blt8084';
  scatter gyc*gxc = estimate / grid;
  label gyc = 'Latitude' gxc = 'Longitude' estimate = '% Units Built 1980-1984';
run;

6. Using the final data set constructed after all kriging estimates have been derived, obtain population estimates at the locations representing the small areas by entering the kriging estimates of the significant predictors for each small area into the model relating the small areas:

data test;
  set brady.krigdata;
  estimate = 0.0912 * land_km + 31.67 * koccu + 0.00002363 * kmedv
             + -0.0001586 * kinc + -11.05 * k8084 + 22.92 * kaft84;
  est2 = estimate * estimate;
  error = Persons - est2;
run;

*compute the mean and standard deviation of the desired errors;
proc means data = test;
  var error;
run;

data brady.krigdata;
  set test;
run;

APPENDIX D

Kriging Plots

1990 Washtenaw Census Tract Data: Location Summary
Plot 1: Scatter Plot of all Measurement Locations for Tracts in Washtenaw County
Plot 2: Distribution of Pairwise Distances Between Tracts

Variable: % Housing Units Occupied
Variable: Median Housing Unit Value
Variable: Income Per Capita
Variable: % Units Built Between 1980-1984
Variable: % Units Built After 1984

APPENDIX E

Additional SAS Code

SAS code used to create the data file that includes the locations of each Census Block Group in Washtenaw County:

data bgid;
  infile "c:\bg26_d90_pa.dat";
  input number : 4. county $char5. tract 4. block 3.;
run;

data bgid2;
  set bgid;
  if county = '26161';
run;

data coords;
  infile "c:\bg26_d90_p.dat" lrecl = 70;
  input number 1-10 long 17-23 lat 46-51;
run;

data coords2;
  set coords;
  if number ne -0.8 and number ne -0.9;
run;

proc sort data = bgid2;
  by number;
run;

proc sort data = coords2;
  by number;
run;

data final;
  merge bgid2 (in=ina) coords2(in=inb);
  by number;
  if ina;
run;

data temp;
  set final;
  file "c:\bgloc2.txt";
  put @1 number @5 "," @6 county @11 "," @12 tract @16 "," @17 block @18 "," @19 long @26 "," @27 lat;
run;

SAS code used to create the data file that includes the locations of each Census Tract in Washtenaw County:

data tractid;
  infile "c:\tr26_d90_pa.dat";
  input number : 4. county $char5. tract 4.;
run;

data tractid2;
  set tractid;
  if county = '26125' or county = '26161' or county = '26163';
run;

data coords;
  infile "c:\tr26_d90_p.dat" lrecl = 46;
  input number 1-10 long 11-28 lat 29-46;
run;

data coords2;
  set coords;
  if number ne -8 and number ne -9;
run;

proc sort data = tractid2;
  by number;
run;

proc sort data = coords2;
  by number;
run;

data final;
  merge tractid2 (in=ina) coords2(in=inb);
  by number;
  if ina;
run;

data temp;
  set final;
  file "c:\tractloc2.txt";
  put @1 number @4 "," @5 county @10 "," @11 tract @15 "," @16 long @26 "," @27 lat;
run;

SAS code used to create the master data set, including all variables for each Census Block Group (demographics, population and housing characteristics, latitude and longitude coordinates, etc.):

data wash;
  infile "c:\washdata.txt" dlm = ",";
  input state county tract block land_km water_km persons housing occupied vacant
        ownr_occ rent_occ val_medi rnt_medi detached attached duplex apartmnt
        employed unemploy notinwrk inc_medn incprcap childpov inpovrty
        bltbfr70 blt_7079 blt_8084 bltaft84 fitted fitted2;
run;

data locs;
  infile "c:\bgloc2.txt" dlm = ",";
  input number county tract block long lat;
run;

proc sort data = locs;
  by tract block;
run;

proc sort data = wash;
  by tract block;
run;

data last;
  merge wash (in=ina) locs (in=inb);
  by tract block;
  if ina and inb;
run;

libname brady "h:\thesis";

data brady.washdata;
  set last;
  long = long * 100;
  lat = lat * 100;
run;