
Conditional Maps of Multiple Variables
Blair Christian and David Scott
Rice University
18 April 2002
We would like to thank Sue Bell for her help.
OUTLINE
Background
First Attempts at Conditional Maps
Smooth Conditional Maps
• Example (Cancer Mortality, Screening Rate)
Comments, Criticisms, Future Plans
Where We’re Going:
• We have multivariate spatial data (like cancer/screening data by county)
• Goal is to visualize correlation between variables
o Colorectal Cancer (CC) Mortality
o County screening rates like the Fecal Occult Blood Test (FOBT)
• Our result is a sequence of maps of the response variable (Mortality) conditioned on another variable (FOBT)
Important Considerations:
• Assume underlying data is continuous
• Behavior at borders, both internal and external
• Representing aggregated data (we have data for counties, not point data)
• Performance: computational efficiency (Is it fast?) and statistical efficiency (Is it a “good” estimate?)
First Attempt:
Intuitive to condition by county (“Conditional Choropleth Map”)
A 2x2 table: bin each variable by values above/below its median
Drawbacks
• No unique color ordering
• Difficult to interpret as number of levels increases
• Not continuous
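The 2x2 binning behind this first attempt can be sketched in a few lines of Python. This is only an illustration: the function name and the four county values are hypothetical, and counting a value equal to the median as "low" is an arbitrary choice, not something the slides specify.

```python
from statistics import median

def two_by_two_classes(response, conditioner):
    """Label each county by whether each variable is above its median.

    Returns one (response_level, conditioner_level) pair per county,
    i.e. the cell of the 2x2 table that the county falls in.
    Values equal to the median are counted as "low" (an assumption).
    """
    r_med = median(response)
    c_med = median(conditioner)
    classes = []
    for r, c in zip(response, conditioner):
        r_level = "high" if r > r_med else "low"
        c_level = "high" if c > c_med else "low"
        classes.append((r_level, c_level))
    return classes

# Hypothetical values for four counties (not from the slides).
mortality = [21.0, 18.5, 25.2, 19.9]
screening = [0.30, 0.45, 0.25, 0.50]
classes = two_by_two_classes(mortality, screening)
```

Each of the four possible labels needs its own map color, which is where the color-ordering drawback comes from: with k levels per variable there are k*k cells and no natural order among them.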
Smooth Conditional Maps
• Want E(Z1|x, y, z2), where Z1 is our response variable of interest (Cancer Mortality), (x,y) are lat/long and z2 is the variable we are conditioning on (Screening Rate)
• Given a point (x,y) and a level to condition on, the expected value at (x,y) is the average of the z1i from points near (x,y), using ONLY those points whose corresponding z2i is at the level we are conditioning on
\[
E[Z_1 \mid X = x_i,\, Y = y_j,\, Z_2 = z_{2,k}]
  = \int z_1 \, f(z_1 \mid x_i, y_j, z_{2,k}) \, dz_1
  = \frac{\int z_1 \, f(x_i, y_j, z_1, z_{2,k}) \, dz_1}{f(x_i, y_j, z_{2,k})}
\]
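As a worked step, it may help to see what this ratio becomes once f is replaced by an estimate. Assuming, for illustration only, a product-kernel density estimate with a symmetric kernel K that integrates to one (the bandwidths h_x, h_y, h_z, h_1 and the weights w_i are notation introduced here, not taken from the slides), the integral over z_1 collapses and the estimator is a weighted average of the observed responses, the form studied by Nadaraya and Watson:

```latex
\[
\hat f(x, y, z_1, z_2) = \frac{1}{n} \sum_{i=1}^{n}
   K_{h_x}(x - x_i)\, K_{h_y}(y - y_i)\, K_{h_1}(z_1 - z_{1i})\, K_{h_z}(z_2 - z_{2i}),
\qquad
\int z_1 \, K_{h_1}(z_1 - z_{1i}) \, dz_1 = z_{1i}
\]
\[
\hat E[Z_1 \mid x, y, z_2]
  = \frac{\sum_{i=1}^{n} w_i \, z_{1i}}{\sum_{i=1}^{n} w_i},
\qquad
w_i = K_{h_x}(x - x_i)\, K_{h_y}(y - y_i)\, K_{h_z}(z_2 - z_{2i})
\]
```

Points far from (x, y), or whose z_{2i} is far from the conditioning level, get weight near zero, which matches the verbal description of averaging only nearby points at the chosen level.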
To estimate f, we use the Averaged Shifted Histogram (ASH), a density estimator that is computationally and statistically efficient relative to other methods for large data sets.
Idea: We construct a mesh (a chosen number of bins in the vertical and horizontal directions) around the data. The value of a bin is the average of the response variable over the points that fall in that bin and whose conditioning variable lies in the range we condition on. (see examples)
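The mesh idea can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `conditional_map` and its parameters are hypothetical names, the conditioning is a hard window on z2 rather than a smoothed one, and the smoothing step uses ASH-style triangular weights 1 - |k|/m over neighboring bins. The three data values come from the simple Texas example in these slides, but the coordinates, the bounding box, and the pairing of each divorce count with a mobile-home value are assumptions made for illustration.

```python
def conditional_map(points, nx, ny, bounds, z2_range, m=1):
    """Binned estimate of E[Z1 | x, y, z2 in z2_range] on an nx-by-ny mesh.

    points   : iterable of (x, y, z1, z2) tuples
    bounds   : (xmin, xmax, ymin, ymax) of the mesh
    z2_range : (lo, hi); only points with lo <= z2 <= hi are used
    m        : smoothing parameter; m=1 means no smoothing
    Returns a list of rows; a cell is None when no data supports it.
    """
    xmin, xmax, ymin, ymax = bounds
    lo, hi = z2_range
    sums = [[0.0] * nx for _ in range(ny)]
    counts = [[0.0] * nx for _ in range(ny)]
    for x, y, z1, z2 in points:
        if not (lo <= z2 <= hi):
            continue  # condition: skip points outside the z2 window
        col = min(int((x - xmin) / (xmax - xmin) * nx), nx - 1)
        row = min(int((y - ymin) / (ymax - ymin) * ny), ny - 1)
        sums[row][col] += z1
        counts[row][col] += 1.0

    grid = [[None] * nx for _ in range(ny)]
    for r in range(ny):
        for c in range(nx):
            num = den = 0.0
            # Triangular weights over the 2m-1 nearest bins per direction.
            for dr in range(1 - m, m):
                for dc in range(1 - m, m):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < ny and 0 <= cc < nx:
                        w = (1 - abs(dr) / m) * (1 - abs(dc) / m)
                        num += w * sums[rr][cc]
                        den += w * counts[rr][cc]
            if den > 0:
                grid[r][c] = num / den
    return grid

# (lon, lat, divorced count, mobile homes); coordinates and pairing assumed.
points = [(-101.0, 34.0, 182, 100),
          (-97.0, 30.0, 153, 300),
          (-99.0, 28.0, 117, 167)]
bounds = (-107.0, -93.0, 25.0, 37.0)  # rough box around Texas (assumed)

unconditioned = conditional_map(points, 2, 2, bounds, (0, 1000))
conditioned = conditional_map(points, 2, 2, bounds, (300, 300))
```

With four bins and no smoothing, conditioning on MH = 300 leaves a single supported bin whose value is 153, and every other bin is empty. Increasing `m` spreads each bin's contribution to its neighbors, which is what turns the blocky map into a smooth one.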
Simple Example (four bins, no smoothing)
[Figure: map of Texas (TexasBoundry.shp) with three test points (Testdatthree.shp) labeled Div Count = 182, Div Count = 153 and Div Count = 117. Legend, Divorced Count: 117; 118 - 153; 154 - 182. Scale bar in miles.]
Div'd Rate Conditioned on Highest Rate of Mobile Homes (4 bins, no smoothing, condition on MH = 300)
[Figure: the same three test points (Testdatthree.shp) over TexasBoundry.shp, labeled MH = 100, MH = 300 and MH = 167. Legend (Divd mobhom 3 3 .shp): 153.]
Comments, Criticisms, Future Plans
Aggregated Data: The method needs point data at the moment. We have been taking the centroid of each polygon (problems can arise with highly non-convex polygons, i.e., “doughnut”, “U” or stick-figure shapes).
Behavior at Borders: Can be controlled by the choice of smoothing parameter and bin widths (Boundary Kernels).
Larger data sets are better (US county data: < 1 min to allocate memory; maps take 1-30 min, depending on bin width and smoothing).
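The centroid problem mentioned under Aggregated Data can be illustrated with a small Python sketch. It assumes simple (non-self-intersecting) polygons given as vertex lists, uses the standard shoelace formula for the area centroid and a ray-casting test for containment, and the "U"-shaped polygon is a made-up example rather than a real county.

```python
def polygon_centroid(vertices):
    """Area centroid of a simple polygon via the shoelace formula."""
    area2 = cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross              # twice the signed area
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3.0 * area2), cy / (3.0 * area2)

def point_in_polygon(point, vertices):
    """Ray-casting test: count edge crossings to the right of the point."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        if (y0 > y) != (y1 > y):    # edge straddles the horizontal ray
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical "U"-shaped polygon: a 3x3 square with a notch cut from the top.
u_shape = [(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)]
centroid = polygon_centroid(u_shape)
centroid_is_inside = point_in_polygon(centroid, u_shape)
```

For this shape the centroid lands in the notch, outside the polygon itself, so a value attached to it would be plotted at a location that does not belong to the region. A check like `point_in_polygon` could flag such polygons before they are reduced to points.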
References:
Scott and Whittaker (1996) “Multivariate Applications of
the ASH in Regression”, Communications in Statistics,
25:2521-2530.
Scott and Wojciechowski (2001) “Conditioning Multiple
Maps”, Computing Science, in press.
Nadaraya (1964) “On Estimating Regression”, Theory of
Probability and its Applications, 9:141-142.
Watson (1964) “Smooth Regression Analysis”, Sankhyā A,
26:359-372.