Download Spatial Analysis of a Small Area Problem - Personal

Document related concepts

Bootstrapping (statistics) wikipedia , lookup

Census wikipedia , lookup

Time series wikipedia , lookup

Transcript
Hawaii International Conference on Statistics and Related Fields
Submission Title Page
Title: Spatial Analysis of a Small Area Problem
Topic Area(s): Applied Statistics, Population Statistics
Keywords: Kriging, Small Area Estimation
Author: Brady West (student, University of Michigan at Ann Arbor)
Mailing Address: 2222 Fuller Ct., Apartment 708A, Ann Arbor, MI, 48105
E-Mail Address: [email protected]
Phone Numbers: 734.998.0498 (home), 734.223.9793 (cell)
Fax Number: 734.647.2440 (work)
Advisor: Edward Rothman, University of Michigan Department of Statistics
Spatial Analysis of a Small Area Problem
Brady T. West
University of Michigan
Department of Statistics
April, 2001
1
“Spatial Analysis of a Small Area Problem”
Table of Contents
I. Introduction……………………………………………………………………. 3
II. Current Model-Based Approaches to Small Area Estimation…………….. 4
III. The Kriging Framework…………………………………………………….. 5

Calculation of the Sample and Theoretical Variograms……….. 7
IV. Analysis: An Application of Kriging to Small Area Estimation………….. 9

Overview……………………………………………………………9

Modeling the Census Block Group Population
in Washtenaw County……………..10

Using Kriging for Small Area Estimation………………………. 15
V. Discussion……………………………………………………………………… 17
VI. Works Cited………………………………………………………………….. 18
VII. Appendix A: Census Block Group Population Model Details…………... 20
VIII. Appendix B: Census Block Group Population Model Diagnostics…….. 28
IX. Appendix C: Kriging Details……………………………………………….. 37
X. Appendix D: Kriging Plots…………………………………………………... 49
XI. Appendix E: Additional SAS Code…………………………………………59
2
Introduction
The United States Census Bureau currently gathers population data for certain bounded
geographic regions in all U.S. counties known as census tracts, and several different
methods have been developed by statisticians and population analysts to estimate the
populations of smaller areas within these tracts. In recent years, the statistical technique
of small area estimation has been a very hot topic, and there is an ever-growing demand
for reliable estimates of small area populations of all types. Reliable estimates of the
populations of small areas (i.e. city blocks) within the currently established census tracts
are important for several reasons. These extremely important estimates are used for,
among other things, determination of state funding allocations, determination of exact
boundaries for school and voting districts, administrative planning, marketing guidance,
and as data for detailed descriptive and analytical studies of cities (Bryan: 1999).
Current small area estimation techniques attempt to derive an estimate of some attribute
for a small area (i.e. population) by using available auxiliary data for the area;
practitioners of these techniques now generally accept that indirect estimators should be
used, based on explicit models relating small areas through supplementary data such as
administrative records and recent Census data (Rao: 1999). Rao proposes that small area
models of this nature can be broadly classified into two types, relating available auxiliary
data for small areas to parameters of interest for the small areas (i.e. total population).
Such model-based approaches truly are beneficial, in that they increase the precision of
small area estimates. Surveys attempting to derive direct small area estimates are often
faced with the problem of extremely small sample sizes in particular regions, which can
produce estimates that are lacking in precision. However, these model-based methods
rely on the availability of both auxiliary data and data for the parameter of interest
(Census tract data, for instance), and they ignore the spatial relationships between the
small areas requiring estimates and the areas where data of interest are actually available.
A need exists for a new small area estimation technique that takes into account the spatial
relationships between small areas, where the data of interest are unavailable for the
construction of models, and those larger areas where the desired data are present.
The geo-statistical spatial analysis technique known as kriging is frequently used by
geologists for interpolation of attributes at points not physically sampled over a particular
region, based on knowledge about the spatial relationships of the sampled points.
Tobler’s Law of Geography1 provides the rationale behind such spatial interpolation:
points close together in space are more likely to have similar attributes than points far
apart. The small area analogy suggests that adjacent small areas will have characteristics
extremely similar to both each other and the larger regions that they are near or a part of.
Using this analogy, we investigate an application of kriging to the small area estimation
problem described above, and determine the precision of small area estimates that take
into account the spatial relationships between small areas and those larger areas where
data are available for those attributes desired in the small areas.
1
Taken from “http://reserves.library.okstate.edu/geog5343/lec12/sld001.htm”
3
Current Model-Based Approaches to Small Area Estimation2
The model-based approach to small area estimation permits validation of the models from
sample data. Rao (1999) classifies small area models into two types:
(1)
i = x t i  + vi
In this model, area-specific auxiliary data xi (administrative records, census data) are
available for the areas i = 1,…,m. The population small area total Yi, or some function
i = g(Yi), is assumed to be related to xi through the above linear model (1). The vi’s
are assumed to be normally distributed, random, uncorrelated small area effects, with
mean zero and variance 2v.  represents the vector of regression parameters. The
second type of model is as follows:
(2)
yij = xtij + vi + eij
This model is appropriate for continuous variables y. In this model, unit-specific
auxiliary data xij are again available for areas i = 1,…,m , where j = 1,…,Ni and Ni
represents the number of population units in the i-th area. The unit y-values, yij, are
assumed to be related to the auxiliary values xij through the nested error regression model
above (2). The vi’s are normal, independent, and identically distributed, with mean zero
and variance 2v. The eij’s are independent of the vi’s, normal, independent of each
other, and identically distributed, with mean zero and variance 2e.  again represents
the vector of regression parameters.
Rao further asserts that in the case of type (1) models, direct survey estimators Yi are
available whenever the sample sizes ni >= 1. Then, it can be assumed that
(3)
i = i + ei
where i = g(Yi) and the sampling errors ei are independent and normally distributed
with mean zero and known variance i. Then, when model (3) is combined with model
(1), we have
(4)
i = x t i  + vi + ei
Models of this form, with i = log(Yi), have recently been used to produce model-based
county estimates of poor school-age children in the United States (Rao: 1999).
According to Rao, “The success of small area estimation…largely depends on getting
good auxiliary information {xi} that leads to a small model variance 2v relative to i.”
2
From: Rao, J.N.K. 1999. Current Trends in Sample Survey Theory and Methods. Indian Journal of
Statistics 61: 16-22.
4
The Kriging Framework
The geo-statistical technique known as kriging was developed in the late 1950’s, building
on the pioneering work of a South African mining engineer named Danie Krige, who in
1951 proposed innovative new concepts for mining estimation (Guyaguler: 2000;
Goovaerts: 1997). Based on Krige’s work, Georges Matheron developed the Theory of
Regionalized Variables, combining Krige’s concepts into a single framework which he
coined “kriging.” Formally, kriging now refers to a family of least-squares linear
regression algorithms that attempt to predict values of a regionalized variable, or a
phenomenon that is spread out in space, at locations where data for the variable is not
available, based on the spatial pattern of the available data.
In general, kriging provides an optimal means of estimating the values of a certain
continuous attribute z at any points u not physically sampled over a specific area A,
using knowledge about the underlying spatial relationships in a data set. Statistical
models known as variograms (p. 7) provide knowledge about these underlying spatial
relationships, and express spatial variation in the available data. The estimated values are
weighted linear combinations of the available data, where data points that are closer to
the locations requiring estimation are given more weight in producing the estimates, and
the resulting estimators are unbiased (Lang: 1997). What differentiates kriging from
other linear estimation methods is its aim to minimize the variance of the errors of the
estimates, which because they are unbiased have mean zero. In most cases, kriging is
applied to produce a fine grid of accurate estimates of some variable (i.e. standing water
level, or richness of soil) over a physical site of interest.
Goovaerts (1997) outlines the basic methodology behind the kriging procedure, and
differentiates between the different types of kriging:
Given n data at sampled locations ui of some study area A for some attribute z(ui),
i = 1,…,n, we wish to produce values for the estimator z*(u). This estimator is defined
as follows:
(5)
z*(u) - m(u) = i i(u)[z(ui) - m(ui)]
Here, i(u) = the weight assigned to the datum z(ui) using a theoretical variogram [see
(8)], m(u) = the expected value of z(u), and m(ui) = the expected value of z(ui). The
vector of weights (u) is determined to minimize the variance of the estimation errors,
denoted as
(6)
2e = Var{z*(u) - z(u)}
5
The variance of the estimation errors 2e is minimized under the constraint that
E{z*(u) - z(u)} = 0, making the estimator unbiased. The estimation errors can be
computed using cross-validation, where estimates are derived for actual data points by
removing sample values from the data set and then using the kriging procedure to reestimate them.
There are three main kriging variants, distinguished according to the model considered
for m(u), which recall is the expected value of z(u). Ordinary kriging is the most widely
used kriging method (Guyaguler: 2000), which accounts for local fluctuations of the
mean for the attribute z by limiting the area where the mean for z is fixed to some local
neighborhood W(u) (Goovaerts: 1997). This variant of kriging considers the mean m(u)
for z to be constant but unknown for all locations in W(u) . Ordinary kriging estimates
are derived in the following manner:
(7)
z*(u) = i i(u)z(ui)

The weights i(u) assigned to each datum z(ui) are constrained such that i i(u) = 1,
to ensure the unbiasedness of the estimates (Guyaguler: 2000). The weights are again
determined using the experimental variogram so as to minimize the error variance
described in (6). The vector of weights is derived by solving the following system of
equations (Guyaguler: 2000):
(8)
i i(u)(uj - ui) +  = (uj - uo)
j = 1,…,n
Here, (uj - ui) represents the value of the theoretical variogram at distance (uj - ui),
where uj and ui are the locations of actual data points, and uo is the location where an
estimate is desired.  represents the “Lagrange parameter.” (Lang: 1997)
Ordinary kriging is an exact interpolator, since the estimated value of the desired attribute
at some point is equal to the exact data value. This variant of kriging only uses a local
neighborhood of data in deriving the estimates, where only the closest data are assigned
weights and the expected value m(u) of the desired attribute is considered to be constant,
and this fact can be used to show that the estimates are unbiased:
E{z*(u) - z(u)} = E{i i(u)z(ui)} - E{z(u)} = m(u)
i i(u) - m(u) = 0.
Ordinary kriging is usually preferred to other types of kriging, because it requires neither
knowledge nor stationarity of the mean m(u) over the entire area of interest A
(Goovaerts: 1997).
The other two chief variants of kriging include simple kriging, which considers the mean
m(u) of the attribute z to be known and constant throughout the area of interest A, and
6
kriging with a trend, which assumes that the unknown local mean m(u) varies within
each local neighborhood W(u), throughout A. Some other “members” of the kriging
family include block kriging, which involves the estimation of average z-values over
some segment, surface, or volume of any size or shape, and cokriging, which allows one
to better estimate values of z if the spatial distribution of some secondary variate sampled
more intensely than z is known. Cokriging basically incorporates the additional
information provided by other covariates into the above kriging methods, and can greatly
improve interpolation estimates if it is difficult or expensive to sample the primary
variate z. The values of the additional covariates are measured at locations where z is
measured, and also at several other locations, and this additional information greatly
improves the accuracy of the estimates of z.
Calculation of the Sample and Theoretical Variograms
As mentioned earlier, the sample variogram (or semivariogram, representing spatial
covariance), a statistical model computed by using the available sample data, is the key
part of the kriging procedure, in that it expresses how the data vary spatially across the
area of interest. A sample variogram takes as input a distance h between two sampled
points, and outputs a variogram estimate (h) that explains the variance in the attribute z
over the distance between the two points. The sample variogram is derived in the
following manner:
1. The range of distances between sampled points (from zero to the maximum
distance) is divided into a set of discrete intervals. The width of each lag, or
discrete interval h, must be large enough such that there are enough point pairs for
estimation in all of the intervals. A rule of thumb in computing sample
variograms is to have at least 30 point pairs in each discrete interval. The
reliability of a variogram hinges on having sufficient pairs of observations in each
lag (Schabenberger: 1997). However, for plotting and estimation purposes, it is
desirable to have as many points as possible for a plot of (h) against h, so there
is an important tradeoff between the number of lags and the number of point pairs
within each lag.
2. For every pair of sampled points, the distance between the points is computed,
along with the squared difference in the z-values.
3. Each pair of sampled points is assigned to one of the distance intervals, and the
total variance in each interval is accumulated.
4. After every pair of points has been used, the average variance in each distance
interval is computed. This value (h) is then plotted at the midpoint distance of
each distance interval h.
This results in a plot that only has as many points as there are distance intervals, with a
variogram estimate for each distance interval. The next important step in computing the
7
variogram is to fit a model to these data (a theoretical variogram), which will allow for
estimation of the variogram at all possible distances, rather than just the midpoint of each
defined distance interval. Geo-statistics literature generally suggests the choice of one of
five theoretical models to fit to sample variogram data, each of which has the parameters
defined in the plot3 below:
Plot 1: Example of Theoretical Variogram
In a theoretical variogram, h (above, separation distance) represents the distance between
two points; ao represents the range parameter, or the distance at which the upper limit of
the variogram (above, semivariance) is reached; co represents the nugget effect
parameter, or the apparent non-zero intersection of the sample variogram with the y-axis
(several factors, like sampling error and short-scale variability, make it is possible for two
points very close to each other to have unusual variance in the attribute z, and this is the
nugget effect); and c represents the sill parameter, or the upper limit of the variogram.
One of five possibilities for the theoretical variogram is chosen, generally based on visual
rules of thumb. If there is no nugget effect, the parameter co = 0.
1. The linear model describes a straight-line variogram:
(h) = co + [h(c/ao)]
2. The linear-to-sill model describes a variogram that appears to be linear and then
reaches an abrupt asymptote:
(h) = co + [h(c/ao)]
(h) = co + c
3
for ao  h
for h  ao
Taken from “http://www.geostatistics.com/GSWin/GSWINGaussian_Isotropic_Model.html”
8
3. The exponential model describes a variogram that approaches the sill gradually
but never converges with the sill (the ao parameter is used to provide range):
(h) = co + c[1 - exp (-h/ao)]
4. The spherical model is a modified quadratic function that describes a variogram
similar to that described by the exponential model which actually reaches an
asymptote at the sill:
(h) = co + c[1.5(h/ao) - 0.5(h/ao)3]
(h) = co + c
for ao  h
for h  ao
5. The Gaussian model is similar to the exponential model, and describes a
variogram that begins rising slowly from the y-intercept, and then rises more
rapidly toward the sill (again, the ao parameter is used to provide range):
(h) = co + c[1 - exp (-h2/ao2)]
Once a theoretical variogram has been fit to the sampled data, the system of equations
described in (8) can be solved for the vector of weights (u), and kriging estimates
for the attribute z at specific locations may be obtained.
Analysis: An Application of Kriging to Small Area Estimation
Overview
We consider the following small area estimation problem: population counts, as well as
various demographic, housing, and geographic data, are available at the census tract level
for a given county in the United States, and population estimates are desired for smaller
subdivisions of the census tracts known as census block groups. Only geographic data
(land area, etc.) are available for the census block groups.
Definitions are of course in order. According to the United States Bureau of the Census4,
census tracts are defined as “small, relatively permanent statistical subdivisions of a
county…Census tracts usually have between 2,500 and 8,000 persons and, when first
delineated, are designed to be homogeneous with respect to population characteristics,
economic status, and living conditions. Census tracts do not cross county boundaries.” In
addition, a census block group (BG) is defined as “a cluster of blocks having the same
first digit of their three-digit identifying numbers within a census tract…Geographic BG's
never cross census tract boundaries…BG's generally contain between 250 and 550
housing units, with the ideal size being 400 housing units.”
The U.S. Census Bureau currently produces all of the data proposed to be available at the
census tract level in the above setting at the census block group level as well, but for the
purposes of investigating the effectiveness of kriging in small area estimation, we
consider these data to be unavailable for the small areas of interest (the census block
4
United States Bureau of the Census - Geographic Area Descriptions:
http://www.census.gov/geo/www/cob/tr_info.html; http://www.census.gov/geo/www/cob/bg_info.html
9
groups). The following methodology can then be considered analogous to a variety of
situations where data are available for some larger area, but not for smaller subdivisions
of the larger area, where estimates for some parameter are desired.
Modeling the Census Block Group Population in Washtenaw County
We begin our investigation of this problem by considering 1990 census data for the
county of Washtenaw in the state of Michigan. The 1990 TIGER data sets produced by
the U.S. Census Bureau after the completion of the 1990 U.S. census are available
publicly over the web, at http://www.esri.com/data/online/tiger/data.html. These TIGER
data sets include geographic features such as roads, railroads, rivers, and lakes, political
boundaries, statistical boundaries, and demographic attributes for the entire United States,
making them very useful for this setting. However, they do not contain the geographical
coordinates of the various geographic areas, which are essential for the application of
kriging. The SAS code used to merge the TIGER data sets with the coordinate data for
the geographic areas of interest5 in this study is available in Appendix E.
The images on the following pages, produced using the spatial analysis software
ArcView GIS Version 3.1, provide an idea of what the 1990 census tract and census
block group divisions of the county of Washtenaw looked like.
5
Latitude and longitude coordinate data for the census tracts and block groups in the county of Washtenaw
were obtained from http://www.census.gov/geo/www/cob/tr.html and
http://www.census.gov/geo/www/cob/bg.html.
10
N
10
0
10 Miles
W
E
S
Figure 1: 1990 Census Tracts in Washtenaw County
11
N
10
0
10 Miles
W
E
S
Figure 2: 1990 Census Block Groups in Washtenaw County
12
The continuous variables below, all considered relevant to the population of a particular
area, are available in the aforementioned TIGER data sets for both the Washtenaw census
tracts and the Washtenaw census block groups, in addition to population counts6:
Land Area (km2)
Water Area (km2)
%Units Occupied
%Units Vacant
%Units Owned
%Units Rented
Median Unit Value
Median Monthly Rent
%Single Detached Units %Single Attached Units %Duplex Units
%Apartment Units
%Persons Employed
Med. Household Income
Income Per Capita
%Persons Unemployed %Persons Not In
Work Force
%Children in Poverty
%Units Built < 1970
%Units Built 70-79
%Units Built > 1984
%In Poverty
%Units Built 80-84
Here is the 1990 spatial distribution of the variable %Apartment Units across the
county of Washtenaw, for both the census tracts and the block groups:
10
0
10
20 Miles
N
% Apartment Units
0 - 0.092
0.092 - 0.298
0.298 - 0.501
0.501 - 0.763
0.763 - 0.954
W
E
S
Figure 3: Spatial Distribution of % Apartment Units
Across the 1990 Census Tracts in Washtenaw County
Note that “Units” refers to the total number of housing units in the given area. Hence “% Apartment
Units” refers to the proportion of housing units that are apartments.
6
13
One can see that the majority of the apartments in Washtenaw County are concentrated
around the areas of Ann Arbor and Ypsilanti, which are college towns.
10
0
20 Miles
10
% Apartment Units
0 - 0.08
0.08 - 0.249
0.249 - 0.484
0.484 - 0.753
0.753 - 0.991
N
E
W
S
Figure 4: Spatial Distribution of % Apartment Units
Across the 1990 Census Block Groups in Washtenaw County
Because these data are all available at the census block group level, it is possible to fit a
linear model to the block group data using ordinary regression analysis, with Persons as
the response, and produce fitted values for each of the block groups. The accuracy of this
model-based technique, using data that are actually available for the small areas, can then
be compared with the accuracy of a new kriging-based technique, which will attempt to
estimate the small area populations when data are only available for larger areas, or in
this setting, when data are only available at the census tract level. In order to compare
the accuracies of these two techniques, the following errors will be computed for each
census block group in Washtenaw, and the mean and standard deviation (sd) of these
errors for each technique will then be derived:
(9)
errormodel = actual population - predicted population
(10) errorkriging = actual population - estimated population
14
A linear model of the form described in (1) was fit to the 1990 Washtenaw census block
group data, with Persons as the parameter of interest and all other variables as predictors.
The technical details of this model fitting process are described in Appendix A and
Appendix B. The final model that resulted was
(11) sqrt(Y) = 0.0912x1 + 31.67x2 + 0.00002363x3 +
-0.0001586x4 + -11.05x5 + 22.92x6
where Y is the population of the census block group, x1 is the area of the block group in
square kilometers, x2 is the % of housing units that are occupied, x3 is the median value
of the housing units, x4 is the income per capita, x5 is the % of housing units built
between 1980 and 1984, and x6 is the % of housing units built after 1984.
After back-transforming to get the predicted population values for each block group by
taking Yo2, where Yo was the fitted value for the block group in the above model, we find
the following results based on (9):
sd{errormodel} = 394.94771
mean{errormodel} = 35.40443
Using Kriging for Small Area Estimation
The situation is considered where population data, various demographic, housing, and
geographic data, and geographical coordinates (physical longitude and latitude) are all
available for some large areas (in this example, these data are available for the
Washtenaw census tracts). Population estimates are desired for smaller areas within
these larger areas, and the only data that are available for the small areas are physical
areas and geographical coordinates. In this example, the geographical coordinates of a
particular area are assumed to be the coordinates of some point internal to the area.
1. The first step will be to determine a linear model like that in (11) relating the small
areas, in order to establish those variables that are significant predictors of population for
the small areas. This model may come from previous investigations, or by fitting a
model to data that actually are available for similar small areas. Predictor variables in the
model should represent percentages or summary measures describing the small areas, and
not counts for the entire areas. The reason for this is as follows.
After determining the appropriate small area model, we will use the available large area
data to krig for those variables found to be significant predictors of the parameter of
interest over the entire grid representing the collection of the large areas (in this case,
Washtenaw county). Then, we will use these kriging estimates to predict the populations
at the locations of the small areas, based on the small area model. Kriging is a form of
15
point interpolation, where the estimated value of some attribute at a location near a point
where a value for the attribute is given will likely be very similar to the given value.
Suppose a count variable, such as the total number of occupied homes, is a significant
predictor of the desired attribute (population) in the small areas. When kriging estimates
for this variable are produced over the entire area of interest (Washtenaw County),
estimates for the total number of occupied homes in small areas that are within the larger
areas where data are available (and thus have geographical coordinates that are extremely
close to those of the large areas) will be very similar to the total number of occupied
homes for the large areas, which will not make sense. However, suppose a variable such
as “% of homes occupied” is a significant predictor of the desired attribute, and we obtain
kriging estimates for this variable over the entire area of interest. The kriging estimates
for this variable, at geographical coordinates representing small area locations that are
extremely close to the geographical coordinates of the large areas, will again be similar to
the given values for the large areas, which now makes sense. The same would be true for
summary variables, such as median income.
Based on the population model derived in (11), with Persons as the parameter of
interest, the following variables were found to be significant predictors of population in
the small areas:
Land Area (km2)
%Units Occupied
%Units Built 1980-84
%Units Built > 1984
Median Unit Value
Income Per Capita
2. The second step will be to krig for all of the variables that are significant predictors of
the parameter of interest (population), over the entire area of interest (Washtenaw
County), using the available large area (tract level) data. This will result in estimated
values for each of the variables listed above (except for the area variable, which is
assumed to be given for the small areas) at all locations over the entire area of interest.
The estimated values of these variables at the locations representing the small areas can
then be used, along with the given area values for the small areas, to estimate the
populations of the small areas using the pre-determined small area model.
Kriging estimates were all derived using the SAS System, and the procedures variogram
and krige2d. Please see Appendix C and Appendix D for the technical details behind
the calculations of all sample variograms, theoretical variograms, and kriging estimates,
and all of the associated SAS code.
After kriging for all of the continuous variables above (except land area), and then using
the kriging estimates in the small area model described in (11) to obtain population
estimates for the small areas, we find the following results based on (9):
sd{errorkriging} = 437.7639
mean{errorkriging} = 26.1729
16
We see a smaller mean error than that found using ordinary regression to predict the
small area populations, but a larger standard deviation in the errors. These errors have a
distribution that is comparable to that found using the model-based technique.
Discussion
The standard deviation of the errors for the small area estimates derived using the
kriging-based technique is understandably larger than that found using the model-based
technique of ordinary regression. As indicated in (6), all kriging estimates of an attribute
computed at locations over a specific area have their own standard deviation, because
they are linear combinations of all of the available data for the attribute, with the weights
for each datum derived by solving the system of equations defined in (8). When these
kriging estimates are input into a pre-determined model, they are not fixed values but
rather the expected values from a series of distributions. Each individual kriging estimate
for a particular attribute at a particular location thus in effect brings its own error with it,
incorporating additional error into the predictions derived by using the kriging estimates
as “new” values in the model relating the small areas.
Despite this fact, the mean and standard deviation of the errors for the small area
estimates derived using a kriging-based technique were still comparable to those found
using a model-based technique with available small area data. This finding suggests that
the kriging-based technique can provide reasonable estimates for small area attributes
when model-based techniques cannot be used due to a lack of small area data.
The technique of kriging also produces estimates for a particular attribute at all locations
over an area of interest. Kriging was originally designed for geological use, where the
areas of interest are often fields or plots that do not have roads, bodies of water, or parks
dividing the sampled points. When applying kriging to population estimation, kriging
estimates of attributes such as “% of housing units occupied” are computed at all
locations over an entire area, including points where there are roads, forests, parks, lakes,
etc. Future research into the use of kriging-based or other spatial analytic techniques for
small area estimation should investigate the possibility that barriers (i.e. roads, lakes, etc.)
between “sampled points” such as census block groups, where population or housing
attributes have meaning, might have an effect on the spatial continuity of attributes
related to population.
For example, kriging techniques may predict very similar values for the variable “% of
housing units occupied” in two census block groups that are very close to each other but
divided by a river, when in fact the actual values of this variable may vary significantly
due to the presence of the river. Local kriging systems may need to be solved in order to
derive kriging estimates of attributes at all points lying between such barriers, if the
barriers are found to have a significant effect on the spatial continuity of the desired
attributes.
17
Works Cited
1990 Block Group Descriptions: U.S. Bureau of the Census.
http://www.census.gov/geo/www/cob/bg_info.html
1990 Census Tract and Block Numbering Areas Descriptions: U.S. Bureau of the
Census. http://www.census.gov/geo/www/cob/tr_info.html
Bryan, Thomas. 1999. “Small-Area Population Estimation Technique Using
Administrative Records, and Evaluation of Results with Loss Functions and
Optimization Criteria.” Federal Committee on Statistical Methodology Research
Conference, November 15-17, 1999. Washington, DC: U.S. Bureau of the
Census.
Census Block Group Cartographic Boundary Files: U.S. Bureau of the Census.
http://www.census.gov/geo/www/cob/bg.html
Census TIGER Data: ArcData Online. http://www.esri.com/data/online/tiger/data.html
Census Tract Cartographic Boundary Files: U.S. Bureau of the Census.
http://www.census.gov/geo/www/cob/tr.html
Faraway, Julian J. 1999. Practical Regression and Anova using R.
Gamma Design Software.
http://www.geostatistics.com/GSWIN/GSWINIsotropic_Variogram_Models.html
Gill, Andrew. 1996. “Kriging Example.”
http://www.maths.adelaide.edu.au/Applied/UA_DAM_FLUIDS/GROUNDWATER
/GEOSTATS/KrigEx/krigex.html
Goovaerts, Pierre. 1997. Geostatistics for Natural Resources Evaluation. New York:
Oxford Press. Pp. 125-158, 203.
Goovaerts, Pierre. 2001. “Study of scale-dependent correlation structures using factorial
kriging analysis.” http://www-personal.engin.umich.edu/~goovaert/pg-suj3.html
Guyaguler, Baris. “Ordinary Kriging.”
http://pangea.stanford.edu/~baris/professional/theoryok.html
Lang, Chao-Yi. “Kriging Interpolation.”
http://www.tc.cornell.edu/Visualization/contrib/cs490-94to95/clang/kriging.html
Rao, J.N.K. 1999. “Current Trends in Sample Survey Theory and Methods.” Pp. 16-22
in Indian Journal of Statistics, vol. 61.
Rao, Mahesh. 1999. “Interpolation.”
http://reserves.library.okstate.edu/geog5343/lec12
18
SAS Institute Inc. 1999. Online documentation: the KRIGE2D and VARIOGRAM
Procedures.
Schabenberger, Oliver. 1997. “Developing Variograms in SAS + Ordinary Kriging.”
http://www.cas.vt.edu/schabenb/Spatial.htm
19
APPENDIX A
Census Block Group Population Model:
Fitting Details and Associated R Code
20
Washtenaw Census Block Group (1990) Population Modeling
All of the following code was written for the R Software Package for interactive
statistical analysis.
> wash <- read.table("c:\\wash1.txt", sep=",", h=T)
#variable selection
>
>
>
>
>
>
wash2
wash3
wash4
wash5
wash6
wash7
<<<<<<-
wash[,1:181]
wash2[,-c(8:58)]
wash3[,-c(13:26)]
wash4[,-c(14:19,25:70)]
wash5[,-c(19,23:44)]
wash6[,-c(23:30,34:37)]
Variables (Continuous):
State County Tract Blockgroup Land.km Water.km Persons Housing
Occupied Vacant Ownr.occ Rent.occ Val.medi Rnt.medi Detached
Attached Duplex Apartmnt Employed Unemploy Notinwrk Inc.medn
Incprcap Childpov Inpovrty Bltbfr70 Blt.7079 Blt.8084 Bltaft84
#Convert specific variables to proportions:
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
wash7$Vacant <- wash7$Vacant / wash7$Housing
wash7$Ownr.occ <- wash7$Ownr.occ / wash7$Occupied
wash7$Rent.occ <- wash7$Rent.occ / wash7$Occupied
wash7$Occupied <- wash7$Occupied / wash7$Housing
wash7$Detached <- wash7$Detached / wash7$Housing
wash7$Attached <- wash7$Attached / wash7$Housing
wash7$Duplex <- wash7$Duplex / wash7$Housing
wash7$Apartmnt <- wash7$Apartmnt / wash7$Housing
wash7$Employed <- wash7$Employed / wash7$Persons
wash7$Unemploy <- wash7$Unemploy / wash7$Persons
wash7$Notinwrk <- wash7$Notinwrk / wash7$Persons
wash7$Bltbfr70 <- wash7$Bltbfr70 / wash7$Housing
wash7$Blt.7079 <- wash7$Blt.7079 / wash7$Housing
wash7$Blt.8084 <- wash7$Blt.8084 / wash7$Housing
wash7$Bltaft84 <- wash7$Bltaft84 / wash7$Housing
wash7$Childpov <- wash7$Childpov / wash7$Persons
wash7$Inpovrty <- wash7$Inpovrty / wash7$Persons
> washmod <- wash7[,-c(1:4)] #for fitting purposes
#remove bad cases
> washmod <- washmod[-c(66,254),]
> wash7 <- wash7[-c(66,254),]
> washmod1 <- washmod[,-c(4,6,8)]
#remove housing, %vacant, %rent.occ (%occupied + %vacant = 1)
21
#5-number numerical summary
> summary(washmod1)
Land.km
Min.
: 0.0120
1st Qu.: 0.3680
Median : 0.6685
Mean
: 6.8540
3rd Qu.: 2.1520
Max.
:96.8800
Ownr.occ
Min.
:0.0000
1st Qu.:0.2126
Median :0.6467
Mean
:0.5623
3rd Qu.:0.8838
Max.
:1.0000
Attached
Min.
:0.000000
1st Qu.:0.003135
Median :0.010070
Mean
:0.056940
3rd Qu.:0.033450
Max.
:0.679200
Unemploy
Min.
:0.00000
1st Qu.:0.01022
Median :0.02026
Mean
:0.02687
3rd Qu.:0.03485
Max.
:0.16410
Childpov
Min.
:0.000000
1st Qu.:0.000000
Median :0.006738
Mean
:0.025130
3rd Qu.:0.026300
Max.
:0.421100
Blt.8084
Min.
:0.00000
1st Qu.:0.00000
Median :0.02214
Mean
:0.04213
3rd Qu.:0.06329
Max.
:0.33700
Water.km
Min.
:0.000
1st Qu.:0.000
Median :0.000
Mean
:0.121
3rd Qu.:0.000
Max.
:3.272
Persons
Min.
: 28.0
1st Qu.: 720.8
Median : 999.0
Mean
:1054.0
3rd Qu.:1220.0
Max.
:5853.0
Occupied
Min.
:0.6648
1st Qu.:0.9304
Median :0.9601
Mean
:0.9438
3rd Qu.:0.9775
Max.
:1.0000
Val.medi
Min.
:
0
1st Qu.: 68330
Median : 90200
Mean
:100200
3rd Qu.:118000
Max.
:500000
Rnt.medi
Min.
:
0.0
1st Qu.: 391.3
Median : 476.5
Mean
: 487.2
3rd Qu.: 571.3
Max.
:1001.0
Detached
Min.
:0.0000
1st Qu.:0.1695
Median :0.6017
Mean
:0.5467
3rd Qu.:0.9170
Max.
:1.0000
Duplex
Min.
:0.00000
1st Qu.:0.00318
Median :0.01210
Mean
:0.03666
3rd Qu.:0.04498
Max.
:0.31360
Notinwrk
Min.
:0.0000
1st Qu.:0.1599
Median :0.2099
Mean
:0.2359
3rd Qu.:0.2607
Max.
:0.9911
Apartmnt
Min.
:0.00000
1st Qu.:0.01137
Median :0.17220
Mean
:0.32210
3rd Qu.:0.62250
Max.
:0.99080
Inc.medn
Min.
:
0
1st Qu.: 24910
Median : 37510
Mean
: 38860
3rd Qu.: 50130
Max.
:111200
Inpovrty
Min.
:0.00000
1st Qu.:0.02024
Median :0.05479
Mean
:0.12040
3rd Qu.:0.14900
Max.
:0.74070
Bltbfr70
Min.
:0.0000
1st Qu.:0.2634
Median :0.3877
Mean
:0.4195
3rd Qu.:0.5733
Max.
:1.3570
Bltaft84
Min.
:0.00000
1st Qu.:0.00000
Median :0.02094
Mean
:0.08072
3rd Qu.:0.10920
Max.
:1.06400
22
Employed
Min.
:0.0000
1st Qu.:0.4919
Median :0.5380
Mean
:0.5416
3rd Qu.:0.6101
Max.
:0.8768
Incprcap
Min.
: 2415
1st Qu.:12870
Median :16800
Mean
:17660
3rd Qu.:20250
Max.
:66660
Blt.7079
Min.
:0.00000
1st Qu.:0.04358
Median :0.18410
Mean
:0.22330
3rd Qu.:0.36030
Max.
:0.84270
Variables with data indicating that particular cases might be
outliers: Persons, Val.medi, Bltbfr70, Bltaft84, Incprcap,
Notinwrk, Land.km.
According to ArcView, land cases are not errors in data entry.
> wrk <- (washmod1$Bltbfr70 > 1)
error in data entry.
> washmod1 <- washmod1[-108,]
> wash7 <- wash7[-108,]
Notinwrk values do not appear to be errors in data entry.
Persons values might be errors in data entry. More than most
Census Tracts!
Val.medi may not be an error in data entry.
> (washmod1$Bltaft84 > 1)
Bltaft84 value is an error in data entry.
> wash7 <- wash7[-181,]
> washmod1 <- washmod1[-181,]
Incprcap value may not be an error in data entry.
Fit full model:
> mod <- lm(Persons ~ .,washmod1)
Model Diagnostics
----------------> x <- model.matrix(mod)
> lev <- hat(x)
> plot(lev, ylab = "Leverages")
> abline(h=2*22/266)
> sum(lev)
[1] 22
#see Plot 1, Appendix B.
> sum <- summary(mod)
> stud <- mod$res / (sum$sig*sqrt(1-lev))
> plot(stud,ylab="Studentized Residuals")
#see Plot 2, App. B.
> jack <- stud*sqrt((266-22-1)/(266-22-stud^2))
> plot(jack,ylab="Jackknife Residuals")
#see Plot 3, App. B.
> jack[jack > 3]
71
85
151
8.757113 3.755840 7.153207
23
Check to see if these cases are outliers using a Bonferonni-based
test:
> qt(0.05/(266*2),266-22-1)
[1] -3.792889
Two of the cases appear to be outliers! Further investigation
yields that these are the two cases with Persons counts above
5,000.
Check for influential observations:
> cook <- stud^2*lev/(22*(1-lev))
> plot(cook,ylab = "Cook Statistics") #see Plot 4, Appendix B.
> cook[cook > 0.15]
12
71
86
151
0.1690986 0.3085052 0.2805020 0.1715737
There appear to be four influential cases with relatively large
Cook statistics, two of which were identified as earlier as
outliers. Temporarily exclude them from analysis:
> washmod1 <- washmod1[-c(12,70,85,149),]
> wash7 <- wash7[-c(12,70,85,149),]
> dim(washmod1)
[1] 262 22
> dim(wash7)
[1] 262 29
Variable Transformation
----------------------> plot(mod$fit, mod$res, xlab = "Fitted Values", ylab =
"Residuals")
#see Plot 5, Appendix B.
#test for non-constant variance
> summary(lm(abs(mod$res) ~ mod$fit))
Call:
lm(formula = abs(mod$res) ~ mod$fit)
Residuals:
Min
1Q
-635.57 -219.80
Median
-58.22
3Q
Max
80.81 3381.34
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -90.06772
88.47109 -1.018
0.310
mod$fit
0.40512
0.08035
5.042 8.58e-07 ***
---
24
Signif. codes:
` ' 1
0
`***'
0.001
`**'
0.01
`*'
0.05
`.'
0.1
Residual standard error: 394.2 on 264 degrees of freedom
Multiple R-Squared: 0.08783,
Adjusted R-squared: 0.08438
F-statistic: 25.42 on 1 and 264 degrees of freedom,
p-value:
8.577e-007
VERY strong evidence of non-constant variance in the residuals.
Plot 5 suggests a square-root transformation of the count
response (Persons).
> mod2 <- lm(sqrt(Persons) ~ ., washmod1)
> plot(mod2$fit, mod2$res, xlab = "Fitted Values", ylab =
"Residuals", main = "Square Root Response") #see Plot 6, App. B.
> summary(lm(abs(mod2$res) ~ mod2$fit))
Call:
lm(formula = abs(mod2$res) ~ mod2$fit)
Residuals:
Min
1Q Median
-4.594 -2.669 -0.755
3Q
Max
1.349 17.844
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.38401
1.80611
3.535 0.000483 ***
mod2$fit
-0.06380
0.05719 -1.116 0.265606
--Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05
` ' 1
`.'
0.1
Residual standard error: 3.838 on 260 degrees of freedom
Multiple R-Squared: 0.004764,
Adjusted R-squared: 0.0009365
F-statistic: 1.245 on 1 and 260 degrees of freedom,
p-value:
0.2656
There is no longer any evidence of non-constant variance in the
residuals after the square-root transformation of the response
(Persons).
Variable Selection
-----------------Stepwise backward elimination with a critical alpha value of 0.15
(for prediction performance) results in the following model:
> summary(mod2)
Call:
25
lm(formula = sqrt(Persons) ~ . - Notinwrk - Duplex - Blt.7079 Water.km - Inpovrty - Unemploy - Attached - Detached - Employed Childpov - Rnt.medi - Apartmnt - Ownr.occ, data = washmod1)
Residuals:
Min
1Q
-19.3348 -3.5081
Median
-0.2181
3Q
3.6621
Max
24.2039
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.300e+01 7.446e+00
1.746 0.081991 .
Land.km
7.794e-02 2.464e-02
3.163 0.001752 **
Occupied
1.959e+01 7.748e+00
2.529 0.012056 *
Val.medi
3.255e-05 9.736e-06
3.343 0.000954 ***
Inc.medn
8.833e-05 3.816e-05
2.315 0.021436 *
Incprcap
-3.926e-04 8.437e-05 -4.653 5.27e-06 ***
Bltbfr70
-3.385e+00 2.095e+00 -1.615 0.107473
Blt.8084
-1.320e+01 7.098e+00 -1.860 0.064105 .
Bltaft84
2.032e+01 3.510e+00
5.790 2.07e-08 ***
--Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05
` ' 1
`.'
0.1
Residual standard error: 6.029 on 253 degrees of freedom
Multiple R-Squared: 0.3133,
Adjusted R-squared: 0.2916
F-statistic: 14.43 on 8 and 253 degrees of freedom,
p-value:
0
Check for influential cases:
> sum <- summary(mod2)
> x <- model.matrix(mod2)
> lev <- hat(x)
> sum(lev)
[1] 9
> stud <- mod2$res / (sum$sig*sqrt(1-lev))
> cook <- stud^2*lev/(9*(1-lev))
> plot(cook,main="Cook Statistics after Variable Selection")
#see Plot 7, Appendix B.
> cook[cook > 0.15]
120
0.1631331
Is there a significant change in the model when this case is
excluded?
> summary(mod2)
Call:
lm(formula = sqrt(Persons) ~ . - Notinwrk - Duplex - Blt.7079 Water.km - Inpovrty - Unemploy - Attached - Detached - Employed Childpov - Rnt.medi - Apartmnt - Ownr.occ, data = washmod1,
26
subset = (cook < max(cook)))
Residuals:
Min
1Q
-19.4777 -3.5513
Median
-0.2323
3Q
3.5122
Max
24.7954
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.024e+01 7.545e+00
1.358 0.17580
Land.km
8.157e-02 2.458e-02
3.318 0.00104 **
Occupied
2.204e+01 7.812e+00
2.822 0.00516 **
Val.medi
2.792e-05 9.980e-06
2.798 0.00554 **
Inc.medn
6.137e-05 4.047e-05
1.517 0.13064
Incprcap
-2.961e-04 9.781e-05 -3.027 0.00272 **
Bltbfr70
-2.778e+00 2.108e+00 -1.318 0.18873
Blt.8084
-1.361e+01 7.064e+00 -1.927 0.05515 .
Bltaft84
2.055e+01 3.494e+00
5.883 1.28e-08 ***
--Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05
` ' 1
`.'
0.1
Residual standard error: 5.997 on 252 degrees of freedom
Multiple R-Squared: 0.2871,
Adjusted R-squared: 0.2644
F-statistic: 12.68 on 8 and 252 degrees of freedom,
p-value:
2.665e-015
Two predictors are no longer significant! This case should also
be excluded.
> wash7 <- wash7[-115,]
> washmod1 <- washmod1[-115,]
> dim(wash7)
[1] 261 29
> dim(washmod1)
[1] 261 22
Further stepwise backward elimination results in:
> summary(mod2)
Call:
lm(formula = sqrt(Persons) ~ . - Notinwrk - Duplex - Blt.7079 Water.km - Inpovrty - Unemploy - Attached - Detached - Employed Childpov - Rnt.medi - Apartmnt - Ownr.occ - Bltbfr70 - 1 Inc.medn, data = washmod1)
Residuals:
Min
1Q
-19.71474 -3.78245
Median
-0.03491
3Q
3.49497
Max
25.02833
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Land.km
9.120e-02 2.354e-02
3.874 0.000136 ***
Occupied 3.167e+01 9.984e-01 31.718 < 2e-16 ***
27
Val.medi 2.363e-05 9.639e-06
2.451 0.014913 *
Incprcap -1.586e-04 6.117e-05 -2.592 0.010084 *
Blt.8084 -1.105e+01 6.720e+00 -1.645 0.101265
Bltaft84 2.292e+01 3.092e+00
7.412 1.83e-12 ***
--Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05
` ' 1
`.'
0.1
Residual standard error: 6.02 on 255 degrees of freedom
Multiple R-Squared: 0.9658,
Adjusted R-squared: 0.965
F-statistic: 1200 on 6 and 255 degrees of freedom,
p-value:
0
We have a very good fit. Plot 8 in Appendix B indicates that the
residuals in this model appear to follow a normal distribution.
> dim(washmod1)
[1] 261 22
> dim(wash7)
[1] 261 29
> fit <- mod2$fit
> fit <- as.matrix(fit)
> dim(fit)
[1] 260
1
> wash8$fitted <- fit
> wash8$fitted2 <- (wash8$fitted)^2
#actual fitted values
> dim(wash8)
[1] 260 28
#write out data set with fitted values, included cases
> write.table(wash8,"c:\\washdata.txt",row.names=F,sep=",")
#compute mean, standard deviation of desired errors
> data <- read.table("c:\\washdata.txt",h=T,sep=",")
> data$error <- data$Persons - data$fitted2
> var(data$error)
[1] 155983.7
> mean(data$error)
[1] 35.40443
> sqrt(var(data$error))
[1] 394.9477
28
APPENDIX B
Census Block Group Population Model:
Diagnostic Plots
29
Plot 1: Leverages of all cases after fitting the full model
According to the rather arbitrary (2p / n) rule of thumb, several cases appear to have high
leverage.
30
Plot 2: Studentized Residuals (Full Model)
Three cases appear to have unusually large studentized residuals.
31
Plot 3: Jackknife Residuals (Full Model)
Three cases again appear to have large jackknife residuals.
32
Plot 4: Cook Statistics (Full Model)
There appear to be four influential cases, each having an unusually large Cook statistic.
33
Plot 5: Residuals vs. Fitted Values (Full Model)
There is rather strong evidence of non-constant variance in the residuals, in a pattern that
suggests a square-root transformation of the response variable Persons.
34
Plot 6: Residuals vs. Fitted Values AFTER square-root transformation of response
The transformation of the response appears to have cleared up the non-constant variance
in the residuals.
35
Plot 7: Cook Statistics AFTER Variable Selection
There appears to be one rather influential case.
36
Plot 8: Normal Q-Q Plot, Indicating Normal Residuals in the Final Model
37
APPENDIX C
Kriging Analysis: Technical Details and Associated SAS Code
38
Summary of Kriging Analysis
1. Reconstruct the complete 1990 Washtenaw Census Tract data set,
in R:
> wash <- read.table("c:\\washt.txt",h=T,sep=",")
> dim(wash)
[1] 81 193
> wash2 <- wash[,-c(173:193)]
> wash3 <- wash2[,-c(7:57)]
> wash4 <- wash3[,-c(12:25,27:32,38:76)]
> wash5 <- wash4[,-c(21:42,44:51,55:58)]
> dim(wash5)
[1] 81 28
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
wash5$Vacant <- wash5$Vacant / wash5$Housing
wash5$Ownr.occ <- wash5$Ownr.occ / wash5$Occupied
wash5$Rent.occ <- wash5$Rent.occ / wash5$Occupied
wash5$Occupied <- wash5$Occupied / wash5$Housing
wash5$Detached <- wash5$Detached / wash5$Housing
wash5$Attached <- wash5$Attached / wash5$Housing
wash5$Duplex <- wash5$Duplex / wash5$Housing
wash5$Apartmnt <- wash5$Apartmnt / wash5$Housing
wash5$Employed <- wash5$Employed / wash5$Persons
wash5$Unemploy <- wash5$Unemploy / wash5$Persons
wash5$Notinwrk <- wash5$Notinwrk / wash5$Persons
wash5$Bltbfr70 <- wash5$Bltbfr70 / wash5$Housing
wash5$Blt.7079 <- wash5$Blt.7079 / wash5$Housing
wash5$Blt.8084 <- wash5$Blt.8084 / wash5$Housing
wash5$Bltaft84 <- wash5$Bltaft84 / wash5$Housing
wash5$Childpov <- wash5$Childpov / wash5$Persons
wash5$Inpovrty <- wash5$Inpovrty / wash5$Persons
> write.table(wash5,"c:\\prekrig.txt",row.names=F,sep=",")
2. Merge the Census Tract data set with the file containing the
locations of all of the Census Tracts (SAS code):
data wash;
infile "c:\prekrig.txt" dlm = ",";
input State County Tract Land_km Water_km Persons Housing
Occupied Vacant Ownr_occ Rent_occ Val_medi Rnt_medi Detached
Attached Duplex Apartmnt Employed Unemploy Notinwrk Inc_medn
Incprcap Childpov Inpovrty Bltbfr70 Blt_7079 Blt_8084 Bltaft84;
run;
data locs;
infile "c:\tractloc2.txt" dlm = ",";
39
input obs stcty tract long lat;
run;
proc sort data = locs;
by tract;
run;
proc sort data = wash;
by tract;
run;
data last;
merge wash (in=ina) locs (in=inb);
by tract;
if ina and inb;
run;
libname brady "h:\thesis";
data brady.washkrig;
set last;
run;
3. Use the given Census Tract data to krig for the significant
predictors of population in the small area model across all of
Washtenaw County, using proc variogram and proc krige2d in SAS:
Variable: % Housing Units Occupied
---------------------------------(see all plots for % Housing Units Occupied in Appendix D)
*create individual variable subset:
data washocc;
set brady.washkrig (keep = Occupied long lat);
run;
*plot measurement locations (see Plot 1 in Appendix D):
proc gplot data = washocc;
title 'Scatter Plot of Measurement Locations';
plot lat*long / frame cframe = ligr haxis = axis1 vaxis =
axis2;
symbol1 v = dot color = blue;
axis1 minor = none;
axis2 minor = none label = (angle = 90 rotate = 0);
label lat = 'Latitude'
long = 'Longitude';
run;
quit;
40
*look at 3d surface plot of variable:
proc g3d data = washocc;
title 'Surface Plot of Variable Measurements';
scatter lat*long=Occupied / xticknum = 5 yticknum = 5
grid zmin = 0.8 zmax = 1.0;
label long = 'Longitude'
lat = 'Latitude'
Occupied = '% Housing Units Occupied';
run;
quit;
*look at the distribution of the pairwise distances between the
tracts:
(see Plot 2 in Appendix D)
proc variogram data = washocc outdistance = outd;
compute nhc = 20 novariogram;
coordinates xc = long yc = lat;
var Occupied;
run;
title 'Distance Intervals';
proc print data = outd;
run;
data outd;
set outd;
mdpt = round((lb+ub) / 2,0.001);
label mdpt = 'Midpoint of Interval';
run;
axis1 minor = none;
axis2 minor = none label = (angle = 90 rotate = 0);
title 'Distribution of Pairwise Distances';
proc gchart data = outd;
vbar mdpt / type = sum sumvar = count discrete frame
cframe = ligr gaxis = axis1 raxis = axis2 nolegend;
run;
quit;
*look at the lower bound of distance interval where there are
still sufficient (more than 30) pairs of points, and divide this
number by the apparent lag distance (LAGD) to yield MAXLAGS. Plot
2 in Appendix D suggests setting MAXLAGS = 15, and lagd = 0.03.
Compute and plot standard and robust semivariograms.
proc variogram data = washocc outv = outv;
compute lagd = 0.03 maxlag = 15 robust;
coordinates xc = long yc = lat;
41
var Occupied;
run;
title 'Variogram Results';
proc print data = outv label;
var lag count distance variog rvario;
run;
data outv2;
set outv;
vari = variog;
type = 'regular';
output;
vari = rvario;
type = 'robust';
output;
run;
title 'Standard and Robust Semivariogram for %Occupied Data';
proc gplot data = outv2;
plot vari*distance=type / frame cframe = ligr vaxis = axis2
haxis = axis1;
symbol1 i = join l = 1 c = blue v = star;
symbol2 i = join l = 1 c = black v = square;
axis1 minor = none label = (c = black 'Lag Distance');
axis2 minor = none label = (angle = 90 rotate = 0 c = black
'Variogram');
run;
quit;
*fit an appropriate theoretical variogram to the data, and plot
it against the standard and robust semivariograms:
data outv3;
set outv;
c0 = 0.0012; c = 0.0022; a0 = 0.50;
vari = c0 + c * (1 - exp(-distance * distance / (a0 * a0)));
type = 'Gaussian';
output;
vari = variog; type = 'regular'; output;
vari = rvario; type = 'robust'; output;
run;
title 'Theoretical and Sample Semivariogram for %Occupied Data';
proc gplot data = outv3;
plot vari*distance=type / frame cframe = ligr vaxis = axis2
haxis = axis1;
symbol1 i = join l = 1 c = blue v = star;
symbol2 i = join l = 1 c = black v = square;
symbol3 i = join l = 1 c = red v = diamond;
axis1 minor = none label = (c = black 'Lag Distance');
axis2 minor = none label = (angle = 90 rotate = 0 c = black
'Variogram');
42
run;
quit;
*krig using the theoretical variogram, and plot the resulting
estimates:
proc sort data = washocc;
by long;
run;
proc sort data = washocc;
by lat;
run;
proc krige2d data = washocc outest = est;
pred var = Occupied;
model nugget = 0.0012 scale = 0.0022 range = 0.50 form =
gauss;
coord xc = long yc = lat;
grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118
by 0.01;
run;
proc g3d data = est;
title 'Surface Plot of Kriged Estimates for %Occupied';
scatter gyc*gxc = estimate / grid;
label gyc = 'Latitude'
gxc = 'Longitude'
estimate = '%Occupied';
run;
4. Merge the kriging estimates with the final Census Block Group
data set:
data est2 (keep = long lat koccu);
set est;
long = gxc;
lat = gyc;
long = round(long, 0.01);
lat = round(lat, 0.01);
koccu = estimate;
run;
data block;
set brady.krigdata; *copy of brady.washdata
long = round(long, 0.01);
lat = round(lat, 0.01);
run;
proc sort data = est2;
by long lat;
run;
43
proc sort data = block;
by long lat;
run;
data final;
merge block (in = ina) est2 (in = inb);
by long lat;
if ina and inb;
run;
*continue to append brady.krigdata, so that the final data set
will contain both actual data and kriging estimates for all of
the significant predictors of population for each of the Census
Block Groups:
data brady.krigdata;
set final;
run;
5. Continue to krig for all other significant predictors of
population in the small area model, following the same
methodology outlined above:
Variable: Median Housing Unit Value
----------------------------------(see all plots for Median Housing Unit Value in Appendix D)
*fit theoretical variogram:
data outv3;
set outv;
c = 3500000000; a0 = 0.06;
if (distance <= 0.06) then do;
vari = c * (1.5 * (distance / a0) - 0.5 * (distance /
a0)**3);
type = 'Spherical';
end;
else if (distance > 0.06) then do;
vari = c;
type = 'Spherical';
end;
output;
vari = variog; type = 'regular'; output;
vari = rvario; type = 'robust'; output;
run;
*krig using the theoretical variogram, and plot the resulting
estimates:
proc krige2d data = wash_med outest = est;
44
pred var = Val_medi;
model scale = 3500000000 range = 0.06 form = spherical;
coord xc = long yc = lat;
grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118
by
0.01;
run;
proc g3d data = est;
title 'Surface Plot of Kriged Estimates for Med.Value';
scatter gyc*gxc = estimate / grid;
label gyc = 'Latitude'
gxc = 'Longitude'
estimate = 'Median Unit Value';
run;
Variable: % Units Built After 1984
---------------------------------(see all plots for % Units Built After 1984 in Appendix D)
*fit theoretical variogram:
data outv3;
set outv;
c = 0.015; a0 = 0.10;
if (distance <= 0.10) then do;
vari = c * (1.5 * (distance / a0) - 0.5 * (distance /
a0)**3);
type = 'Spherical';
end;
else if (distance > 0.10) then do;
vari = c;
type = 'Spherical';
end;
output;
vari = variog; type = 'regular'; output;
vari = rvario; type = 'robust'; output;
run;
title 'Theoretical and Sample Semivariogram for %BltAft84 Data';
proc gplot data = outv3;
plot vari*distance=type / frame cframe = ligr vaxis = axis2
haxis = axis1;
symbol1 i = join l = 1 c = blue v = star;
symbol2 i = join l = 1 c = black v = square;
symbol3 i = join l = 1 c = red v = diamond;
axis1 minor = none label = (c = black 'Lag Distance');
axis2 minor = none label = (angle = 90 rotate = 0 c = black
'Variogram');
run;
quit;
45
*krig using the theoretical variogram, and plot the resulting
estimates:
proc krige2d data = wash84 outest = est;
pred var = Bltaft84;
model scale = 0.015 range = 0.10 form = sph;
coord xc = long yc = lat;
grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118
by
0.01;
run;
proc g3d data = est;
title 'Surface Plot of Kriged Estimates for %BltAft84';
scatter gyc*gxc = estimate / grid;
label gyc = 'Latitude'
gxc = 'Longitude'
estimate = '% Units Built After 1984';
run;
*zero out any negative kriging estimates:
data final;
set final;
if (kaft84 < 0) then kaft84 = 0;
run;
Variable: Income Per Capita
--------------------------(see all plots for Income Per Capita in Appendix D)
data washinc;
set brady.washkrig (keep = Incprcap long lat);
run;
proc g3d data = washinc;
title 'Surface Plot of Variable Measurements';
scatter lat*long=Incprcap / xticknum = 5 yticknum = 5
grid zmin = 4120 zmax = 59400;
label long = 'Longitude'
lat = 'Latitude'
Incprcap = 'Income Per Capita';
run;
quit;
proc variogram data = washinc outv = outv;
compute lagd = 0.03 maxlag = 15 robust;
coordinates xc = long yc = lat;
var Incprcap;
run;
46
title 'Variogram Results';
proc print data = outv label;
var lag count distance variog rvario;
run;
data outv2;
set outv;
vari = variog;
type = 'regular';
output;
vari = rvario;
type = 'robust';
output;
run;
title 'Standard and Robust Semivariogram for IncPrCap Data';
proc gplot data = outv2;
plot vari*distance=type / frame cframe = ligr vaxis = axis2
haxis = axis1;
symbol1 i = join l = 1 c = blue v = star;
symbol2 i = join l = 1 c = black v = square;
axis1 minor = none label = (c = black 'Lag Distance');
axis2 minor = none label = (angle = 90 rotate = 0 c = black
'Variogram');
run;
quit;
*fit theoretical variogram:
data outv3;
set outv;
c = 75000000; a0 = 0.03;
if (distance <= 0.03) then do;
vari = c * (1.5 * (distance / a0) - 0.5 * (distance /
a0)**3);
type = 'Spherical';
end;
else if (distance > 0.03) then do;
vari = c;
type = 'Spherical';
end;
output;
vari = variog; type = 'regular'; output;
vari = rvario; type = 'robust'; output;
run;
title 'Theoretical and Sample Semivariogram for IncPrCap Data';
proc gplot data = outv3;
plot vari*distance=type / frame cframe = ligr vaxis = axis2
haxis = axis1;
symbol1 i = join l = 1 c = blue v = star;
symbol2 i = join l = 1 c = black v = square;
47
symbol3 i = join l = 1 c = red v = diamond;
axis1 minor = none label = (c = black 'Lag Distance');
axis2 minor = none label = (angle = 90 rotate = 0 c = black
'Variogram');
run;
quit;
*krig using the theoretical variogram, and plot the resulting
estimates:
proc krige2d data = washinc outest = est;
pred var = Incprcap;
model scale = 75000000 range = 0.03 form = sph;
coord xc = long yc = lat;
grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118
by
0.01;
run;
proc g3d data = est;
title 'Surface Plot of Kriged Estimates for IncPrCap';
scatter gyc*gxc = estimate / grid;
label gyc = 'Latitude'
gxc = 'Longitude'
estimate = 'Income Per Capita';
run;
Variable: % Units Built Between 1980-1984
----------------------------------------(see all plots for % Units Built Between 1980-1984 in Appendix D)
data wash8084;
set brady.washkrig (keep = Blt_8084 long lat);
run;
proc g3d data = wash8084;
title 'Surface Plot of Variable Measurements';
scatter lat*long=Blt_8084 / xticknum = 5 yticknum = 5
grid zmin = 0 zmax = 0.1375;
label long = 'Longitude'
lat = 'Latitude'
Blt_8084 = '% Units Built 1980-1984';
run;
quit;
proc variogram data = wash8084 outv = outv;
compute lagd = 0.03 maxlag = 15 robust;
coordinates xc = long yc = lat;
var Blt_8084;
run;
48
title 'Variogram Results';
proc print data = outv label;
var lag count distance variog rvario;
run;
data outv2;
set outv;
vari = variog;
type = 'regular';
output;
vari = rvario;
type = 'robust';
output;
run;
title 'Standard and Robust Semivariogram for %Blt8084 Data';
proc gplot data = outv2;
plot vari*distance=type / frame cframe = ligr vaxis = axis2
haxis = axis1;
symbol1 i = join l = 1 c = blue v = star;
symbol2 i = join l = 1 c = black v = square;
axis1 minor = none label = (c = black 'Lag Distance');
axis2 minor = none label = (angle = 90 rotate = 0 c = black
'Variogram');
run;
quit;
*fit theoretical variogram:
data outv3;
set outv;
c = 0.00155; a0 = 0.05;
if (distance <= 0.05) then do;
vari = c * (1.5 * (distance / a0) - 0.5 * (distance /
a0)**3);
type = 'Spherical';
end;
else if (distance > 0.05) then do;
vari = c;
type = 'Spherical';
end;
output;
vari = variog; type = 'regular'; output;
vari = rvario; type = 'robust'; output;
run;
title 'Theoretical and Sample Semivariogram for %Blt8084 Data';
proc gplot data = outv3;
plot vari*distance=type / frame cframe = ligr vaxis = axis2
haxis = axis1;
symbol1 i = join l = 1 c = blue v = star;
symbol2 i = join l = 1 c = black v = square;
symbol3 i = join l = 1 c = red v = diamond;
49
axis1 minor = none label = (c = black 'Lag Distance');
axis2 minor = none label = (angle = 90 rotate = 0 c = black
'Variogram');
run;
quit;
*krig using the theoretical variogram, and plot the resulting
estimates:
proc krige2d data = wash8084 outest = est;
pred var = Blt_8084;
model scale = 0.00155 range = 0.05 form = sph;
coord xc = long yc = lat;
grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118
by
0.01;
run;
proc g3d data = est;
title 'Surface Plot of Kriged Estimates for %Blt8084';
scatter gyc*gxc = estimate / grid;
label gyc = 'Latitude'
gxc = 'Longitude'
estimate = '% Units Built 1980-1984';
run;
6. Using the final data set constructed after all kriging
estimates have been derived, obtain population estimates at the
locations representing the small areas by entering the kriging
estimates of the significant predictors for each small area into
the model relating the small areas:
data test;
set brady.krigdata;
estimate = 0.0912 * land_km + 31.67 * koccu + 0.00002363 *
kmedv + -0.0001586 * kinc + -11.05 * k8084 + 22.92 * kaft84;
est2 = estimate * estimate;
error = Persons - est2;
run;
*compute the mean and standard deviation of the desired errors:
proc means data = test;
var error;
run;
data brady.krigdata;
set test;
run;
50
APPENDIX D
Kriging Plots
51
Kriging Plots
1990 Washtenaw Census Tract Data: Location Summary
Plot 1: Scatter Plot of all Measurement Locations for Tracts in Washtenaw County
Plot 2: Distribution of Pairwise Distances Between Tracts
52
Variable: % Housing Units Occupied
53
Variable: Median Housing Unit Value
54
55
Variable: Income Per Capita
56
Variable: % Units Built Between 1980-1984
57
58
Variable: % Units Built After 1984
59
60
APPENDIX E
Additional SAS Code
61
SAS code used to create the data file that includes the locations of each Census Block
Group in Washtenaw County:
data bgid;
infile "c:\bg26_d90_pa.dat";
input number : 4. county $char5. tract 4. block 3.;
run;
data bgid2;
set bgid;
if county = '26161';
run;
data coords;
infile "c:\bg26_d90_p.dat" lrecl = 70;
input number 1-10 long 17-23 lat 46-51;
run;
data coords2;
set coords;
if number ne -0.8 and number ne -0.9;
run;
proc sort data = bgid2;
by number;
run;
proc sort data = coords2;
by number;
run;
data final;
merge bgid2 (in=ina) coords2(in=inb);
by number;
if ina;
run;
data temp;
set final;
file "c:\bgloc2.txt";
put @1 number @5 "," @6 county @11 "," @12 tract @16 ","
@17 block @18 "," @19 long @26 "," @27 lat;
run;
SAS code used to create the data file that includes the locations of each Census Tract in
Washtenaw County:
data tractid;
infile "c:\tr26_d90_pa.dat";
input number : 4. county $char5. tract 4.;
run;
62
data tractid2;
set tractid;
if county = '26125' or county = '26161' or county = '26163';
run;
data coords;
infile "c:\tr26_d90_p.dat" lrecl = 46;
input number 1-10 long 11-28 lat 29-46;
run;
data coords2;
set coords;
if number ne -8 and number ne -9;
run;
proc sort data = tractid2;
by number;
run;
proc sort data = coords2;
by number;
run;
data final;
merge tractid2 (in=ina) coords2(in=inb);
by number;
if ina;
run;
data temp;
set final;
file "c:\tractloc2.txt";
put @1 number @4 "," @5 county @10 "," @11 tract @15 "," @16
long @26 "," @27 lat;
run;
SAS code used to create the master data set, including all variables for each Census
Block Group (demographics, population and housing characteristics, latitude and
longitude coordinates, etc.):
data wash;
infile "c:\washdata.txt" dlm = ",";
input state county tract block land_km water_km persons
housing occupied vacant ownr_occ rent_occ val_medi rnt_medi
detached attached duplex apartmnt employed unemploy notinwrk
inc_medn incprcap childpov inpovrty bltbfr70 blt_7079 blt_8084
bltaft84 fitted fitted2;
run;
63
data locs;
infile "c:\bgloc2.txt" dlm = ",";
input number county tract block long lat;
run;
proc sort data = locs;
by tract block;
run;
proc sort data = wash;
by tract block;
run;
data last;
merge wash (in=ina) locs (in=inb);
by tract block;
if ina and inb;
run;
libname brady "h:\thesis";
data brady.washdata;
set last;
long = long * 100;
lat = lat * 100;
run;
64