Download math-stat2 - University of Vermont

Document related concepts

Interaction (statistics) wikipedia , lookup

Choice modelling wikipedia , lookup

Expectation–maximization algorithm wikipedia , lookup

Linear regression wikipedia , lookup

Data assimilation wikipedia , lookup

Regression analysis wikipedia , lookup

Coefficient of determination wikipedia , lookup

Transcript
Analyzing spatial
nonstationarity in multivariate
relationships
UVM math/stat department lecture, march 31, 2004
By Austin Troy1
1.
University of Vermont, Rubenstein School of Environment
and Natural Resources
Spatial Nonstationarity
• The notion that relationships change across space.
– That is, the relationship between y and (x1,x2....xn ) is non
constant from one location to the next.
• Can a variable be included in a model that proxies
for space?
• Sometimes this is possible, but very often the factors
that make one location different from another are
non-quantifiable, or involve extremely complex
interactions that cannot be parsimoniously
modeled. Especially in social research
Approaches to nonstationarity
•
There are several approaches to dealing
with this problem. Among them are:
1. Create zones of homogeneity and stratify
2. Allow parameters to vary constantly
First Approach
1. Parameterize a global model and look at
the residuals to detect patterns
First Approach
f ( x)   h  k  hk xk
f ( x)   i  k  ik xk
f ( x)   j  k  jk xk
•Use patterns in residuals to define patches
•Specify a separate equation for each patch, or stratum
Second Approach
2. Continuous variation: Create measures of
statistical relationships that are
continuously varying across space, such as
GWR
f ( x)   (ui , vi )  k  k (ui , vi ) xk
’s vary continuously as a function of location
(u,v) at each point i (Fotheringham 2002)
GWR
• Weighted moving window regression method developed
by Foterhingham and Brundson (2000, 2002), building
on works of Hastie and Tibshirani (1990) and Loader
(1999)
• Expanded form of simple multiple regression equation
Y ( x)   (ui , vi )  k  k (ui , vi ) xk  ei
• Coefficients are deterministic functions of location in
space
• Uses weighted least squares approach
• Fully unbiased estimate of local coefficient is impossible,
but estimates with only slight bias are possible
GWR estimation
• Separate regression is run for each
observation, using a spatial kernel that centers
on a given point and weights observations
subject to a distance decay function.
• Can used fixed size kernel or adaptive kernel
to determine number of local points that will
be included in each local regression
• Adaptive kernels used when data is not
evenly distributed
GWR kernel
GWR with fixed kernel
GWR with adaptive kernel
From Fotheringham, Brundson and Charlton. 2002. Geographically Weighted Regression
Points are weighted based on distance from center of kernel
e.g. Gaussian kernel where weighting is given by:
wi(g) = exp[-1/2(dij/b)2 where b is bandwidth
GWR kernel
Adaptive kernel width is determined through
minimization of the Akaike Information
Criterion
 n  tr( S) 
AIC c  2n log e (ˆ )  n log e (2 )  n

n

2

tr(
S
)


where tr(S) is the trace of the hat matrix and n is the number of
observations
Bias and variance tradeoff
• Tradeoff between bias and standard error
• The smaller the bandwidth, the more variance but
the lower the bias, the larger the bandwidth, the
more bias but the more variance is reduced
• This is because we assume there are many betas
over space and the more it is like a global
regression, the more biased it is.
• AIC minimization provides a way of choosing
bandwidth that makes optimal tradeoff between
bias and variance.
GWR outputs
• The result is a statistical output showing
global summary stats and parameters
estimates, local model summary stats and
non-stationarity stats for each parameter
• Also produces a map output of points with
parameter estimates, standard errors and
“pseudo t statistics” for each variable for
each point
A simple (?) question we can address
with this approach
• How is proximity to trees and other “green
assets” reflected in property values?
We can ask this with hedonic
analysis
• Econometric method for disaggregating
observed housing prices into a series of
unobserved “implicit” prices, reflective of
WTP for a given marginal change in attribute
• Price= fn(structure, lot, location)
ln( P)     S X S   N X N   L X L  e
• Has been used extensively for valuing
environmental amenities and disamenities
Hedonics and space
• Hedonic analysis is generally taken to be
spatially stationary; that is, it is assumed that
marginal WTP for an attribute is fixed within
a housing market.
• Usually these markets are assumed to be
quite big and tools for determining these
market boundaries are poorly developed (see
approach 1, earlier)
Case Study:Two Scales of
analysis
1. Block group level analysis of median housing
price as fn of tree cover, controlling for many
other factors
2. Property level analysis of housing prices as a
function of tree cover and parks, controlling
for structural, locational and environmental
attributes (hedonics)
Research is part of the Baltimore Ecosystem
Study, an NSF-LTER
Block group analysis
Initial pattern of
housing values tells us
what we know
intuitively: suburbs
are more valuable than
central city, but value
goes down again as
you get too far from
city center
Block group analysis
Average canopy cover
by block group, as
derived from 2000
USGS 30 m canopy
cover analysis
Running GWR at the BG level
• Given the spatial dependence in the data, it
is likely that the coefficients of a
multivariate relationship will be related to
the spatial processes underlying the spatial
dependence. Hence, we choose to run GWR
and compare it to global regression.
Equation
Pi   (ui , vi )  1 (ui , vi ) x1   2 (ui , vi ) x2  ....13 (ui , vi ) x13  ei
Predictor variables=
• Median age
• Percent with income greater than $100k
• Median household income
• Percent owner occupied housing
• Percent vacant buildings
• Percent single family detached homes
• Mean tree canopy percentage
• Median number of rooms per house
• Median age of housing
• Percent with mortgage
• Percent high school educated
• Percent African American
• Percent Protected Land
Data
• This was run using census block group data
from 2000 for the five counties in the
Baltimore Metro Area
• Observational problem: while population of
BGs is relatively constant, size is not, hence
may be some form of Modifiable Areal Unit
Problem; may be varying levels of
heterogeneity within block groups
Results: Global Model
• Global model:
highly
significant
• Canopy is
significant at α =
.05, while
percent
protected land is
not
• All other control
variables are
Comparison of Local and Global
• The ANOVA tests the null hypothesis that the GWR model
represents no improvement over a global model and rejects the null
• Notice also that Coefficient of Determination increases
significantly and AIC decreases
Local Parameter variabilty
• We also conduct a
Monte Carlo
Significance test
which finds that
almost all variables
are spatially nonstationary, although
protected land is not
How does this testing work?
• We expect all parameters to have slight spatial variations; is
that variation sufficient to reject the null hypothesis that it is
globally fixed?
• If so, then any permutation of regression variable against
locations is equally likely, allowing us to model a null
distribution of the variance
• A Monte Carlo approach is adopted to create this distribution
in which the geographical coordinates of the observations are
randomly permuted aginst the variables n times; results in n
values of the variance of the coefficient of interest which we
use as an experimental distribution
• We can then compare the actual value of the variance against
these values to obtain the experimental significance level
Local R
squared
shows that
model fit
varies by
locations
Notice also
how under
GWR, there
is no pattern
to the
residuals
because it
accounts for
spatial
effects
Results show
that the city
center is
where tree
canopy cover
is valued
highest on the
margin; this
might be
because trees
are scarcest
there
However, the
pseudo t
statistic
reveals that
not all areas
are significant
When we blank
out the nonsignificant
observations,
we see that trees
are only
significantly
reflected in
property values
in some areas,
and it’s negative
in Howard
County
Interpretation
• In some areas, canopy cover is valued more
highly than in others. The degree to which
canopy is valued may relate to the scarcity or
spatial distribution of trees at multiple scales.
It may be negative in Howard county because
trees are associated with some other factor,
like “ruralness,” which is not properly
quantified by the census and which exerts
downward influence on housing prices
Other variables
• We’ll ignore protected land, since it’s not
non-stationary. However, some of the other
control variables are telling. Let’s look at
median house age.
Pattern makes
sense: older
homes yield a
positive
premium in
the more
affluent
suburb, but a
negative
premium in
the poorer
central city
When we do a
cluster
analysis
(PAM) based
on 3
parameter
values
(canopy,
house age and
number of
rooms), we
get something
looking like
markets
Here’s what it
looked like
with 6
Note that the
silhouette
score for the
PAM was .55
for the 4 class
and .5 for the
6 class—
strong
structuring
Problems with this approach
• Using block group level data is not ideal for a
number of reasons
– Modifiable Areal Unit Problem
– BG obscures significant within unit heterogeneity
– Level of heterogeneity (variance) varies between
observations
– There may be spatial clustering within BGs
– Data is not very accurate at this level
– Attribution not great either
Property Level Analysis
• Property level analysis is way of dealing
with the local spatial heterogeneity and of
getting much better attribution
• Used Maryland Property View data set for
half of Baltimore City
• Regressed log price against 22 variables,
including structural, neighborhood and
locational variables, including several
environmental variables
•Simple plot of log
price shows that while
there are slight patterns,
there is still
considerable
heterogeneity over
space and no clear
patches are emergent
•Moreover, price is
only one factor and
social patches are
defined on a number of
factors
•Plotting out log
price normalized
by square footage
gives a similarly
heterogeneous
result
•Other plots of
other variables
also show similar
heterogeneity
Overall results
• Local model was a considerable
improvement over the global model
Non-stationarity of parameters
• Monte Carlo
significance test
used to determine
whether parameters
are significantly
non-stationary
• All but 4 were at
the = .05
significance level
Plotting out Parameter Estimates
and T values
Parameter on dummy
variable for trees
within 20 meters
T value on tree dummy
variable
The only real
significant relationship
is on east side of study
area
Note that 1458 out of
2350 observations had
a “1” for this value. A
better approach for
future would be tree
density index
Parameter value on
distance to nearest
park
T value on
Distance to nearest
park
Areas where parks
are highly valued in
the real estate
market
Number of Baths
Parameter values
show clear patchiness
Here an additional
bathroom adds 69% to home value
Here it adds 1.5 to
3%
T value of parameter on
number of bathrooms
shows that number of
bathrooms only
significantly related to
price in northern
neighborhoods
(small dots are observations
where parameter estimate is
insignificant)
Parameter value on
dummy for single
family home
T value on SFH
dummy
Note that even though
there are SFHs in the
south, SFH status only
appears to add value in
the North
Parameter values on
Age of Structure
Shows clear patchiness
T values on Age of
Structure
Shows age is most
significantly related to
price near center of
town. Only in
northeast is it
positively related,
suggest that older
homes have “historic”
value there but not
elsewhere
Generating fuzzy Surfaces
• With this data, surfaces can be interpolated
showing the change in parameter values
over space
• Unfortunately, this only works well where
most points have significant parameter
estimates
Surface Example:
Parameter on
Distance to park
• Interpolation
allows us to much
more clearly see
“patches” because
it smoothes out
some of the
within-group
heterogeneity
Surface Example:
Parameter on
structure age
T Value on
structure age
• Note that
interpolation of t
value looks almost
the same
Combining
interpolated layers
• In a preliminary attempt we
took four interpolated layers
for dummy variables (hence
all on the same scale),
including structure age>80,
trees w/in 20 meter, SFH and
brick and added them
together to get a test
composite layer; the more
layers we add, the more we
begin to see clear
differentiations between
neighborhoods
Patch
Classification
• This can then be used
to classify areas into
distinct patches or
zones
• These zone
boundaries are based
on a vector of
parameters from the
multivariate equation
• This is extremely
preliminary—an
example for display
Conclusion
• GWR offers an excellent way to define
meaningful socio-economic boundaries
• It allows us to look at the spatial arrangement of
the relationship between y and z, controlling for all
other variation
• In particular, it offers an excellent way to look at
the spatial variability of social phenomena, which
are often mediated by spatial processes that cannot
be quantified, like trying to stay friendly with your
neighbors and neighborhood association
• Suggests that preferences do vary over space and
that future analyses must take this into account