Conditional Maps of Multiple Variables

Blair Christian and David Scott
Rice University
18 April 2002

We would like to thank Sue Bell for her help.

OUTLINE
* Background
* First Attempts at Conditional Maps
* Smooth Conditional Maps
* Example (Cancer Mortality, Screening Rate)
* Comments, Criticisms, Future Plans

Where We're Going:
* We have multivariate spatial data (such as cancer and screening data by county).
* The goal is to visualize the correlation between variables:
  o Colorectal cancer (CC) mortality
  o County screening rates, such as the Fecal Occult Blood Test (FOBT)
* The result is a sequence of maps of the response variable (mortality) conditioned on another variable (FOBT screening rate).

Important Considerations:
* Assume the underlying data is continuous.
* Behavior at borders, both internal and external.
* Representing aggregated data (we have data by county, not point data).
* Performance: computational efficiency (is it fast?) and statistical efficiency (is it a "good" estimate?).

First Attempt:
* It is intuitive to condition by county (a "conditional choropleth map"): a 2x2 table, binning values above/below the median.
* Drawbacks:
  o No unique color ordering.
  o Difficult to interpret as the number of levels increases.
  o Not continuous.

Smooth Conditional Maps:
* We want E(Z1 | x, y, Z2), where Z1 is the response variable of interest (cancer mortality), (x, y) are latitude/longitude, and Z2 is the variable we are conditioning on (screening rate).
* Given a point (x, y) and a level to condition on, the expected value at (x, y) is the average of the z_{1,i} from points near (x, y), ONLY where the corresponding z_{2,i} is at the level we are conditioning on:

  E[Z_1 \mid X = x_i, Y = y_j, Z_2 = z_{2,k}]
    = \int z_1 \, f(z_1 \mid x_i, y_j, z_{2,k}) \, dz_1
    = \frac{\int z_1 \, f(x_i, y_j, z_1, z_{2,k}) \, dz_1}{f(x_i, y_j, z_{2,k})}

* To estimate f, we use the Averaged Shifted Histogram (ASH), a density estimator that is computationally and statistically efficient relative to other methods for large data sets.
* Idea: construct a mesh (a number of bins in the vertical and horizontal directions) around the data; the value of a bin is the average of the response variable over the bins within the conditioning range. (See the examples below, and the first code sketch after the references.)

Simple Example:
[Figure: Texas county map, "Four Bins, No Smoothing": divorced count by bin (Div Count = 117, 153, 182); legend: Divorced Count 117, 118-153, 154-182; layers Testdatthree.shp, TexasBoundry.shp; scale bar in miles with compass rose.]

[Figure: "Divorced Rate Conditioned on Highest Rate of Mobile Homes, 4 Bins, No Smoothing": conditioning on MH = 300; bin labels MH = 100, 167, 300; layers Testdatthree.shp, Divd mobhom 3 3.shp, TexasBoundry.shp.]

Comments, Criticisms, Future Plans:
* Aggregated data: the method needs point data at the moment, so we have been taking the centroids of polygons. Problems can arise with highly non-convex polygons, i.e., "doughnut", "U", or stick-figure shapes. (See the second code sketch after the references.)
* Behavior at borders: can be controlled by the choice of smoothing parameter and bin widths (boundary kernels).
* Larger data sets are better (US county data: under 1 minute to allocate memory; maps take 1-30 minutes, depending on bin width and smoothing).

References:
* Scott and Whittaker (1996). "Multivariate Applications of the ASH in Regression." Communications in Statistics, 25, 2521-2530.
* Scott and Wojciechowski (2001). "Conditioning Multiple Maps." Computing Science, in press.
* Nadaraya (1964). "On Estimating Regression." Theory of Probability and Its Applications, 9, 141-142.
* Watson (1964). "Smooth Regression Analysis." Sankhyā, Series A, 26, 359-372.
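A minimal sketch of the binning idea described under "Smooth Conditional Maps", assuming Python with NumPy: bin (x, y) on a regular mesh and average the response z1 over only those points whose conditioning variable z2 falls in the chosen slice. The function name and its parameters are illustrative, and the hard slice on z2 stands in for the shifted, averaged bins of the real ASH estimator; this is not the authors' implementation.

```python
import numpy as np

def conditional_mean_map(x, y, z1, z2, z2_lo, z2_hi, nbins=(50, 50)):
    """Unsmoothed binned estimate of E(Z1 | x, y, Z2 in [z2_lo, z2_hi])."""
    # Keep only the observations at the conditioning level.
    keep = (z2 >= z2_lo) & (z2 <= z2_hi)
    x, y, z1 = x[keep], y[keep], z1[keep]

    # Per-bin sums and counts over the mesh; their ratio is the bin mean.
    sums, xedges, yedges = np.histogram2d(x, y, bins=nbins, weights=z1)
    counts, _, _ = np.histogram2d(x, y, bins=(xedges, yedges))
    with np.errstate(invalid="ignore"):
        means = sums / counts  # NaN where a bin holds no conditioned data
    return means, xedges, yedges

# Example on synthetic point data: one map slice per conditioning level.
rng = np.random.default_rng(0)
x, y, z2 = rng.uniform(0, 1, (3, 5000))
z1 = x + y + z2 + rng.normal(0, 0.1, 5000)
top_slice, _, _ = conditional_mean_map(x, y, z1, z2, 0.75, 1.0, nbins=(20, 20))
```

Sweeping z2_lo and z2_hi over a sequence of overlapping slices would produce the sequence of conditioned maps the talk describes; the ASH additionally smooths each slice by averaging shifted histograms.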
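To illustrate the aggregated-data caveat, a small sketch of why centroids misbehave for highly non-convex polygons, assuming the shapely library (the talk does not say which GIS tools were used, and the "U" polygon is invented for the example):

```python
from shapely.geometry import Polygon

# A "U"-shaped county: the centroid lands in the notch, outside the shape,
# which is the failure mode noted above for doughnut/U/stick-figure polygons.
u = Polygon([(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)])

print(u.centroid.within(u))                  # False: centroid is not inside
print(u.representative_point().within(u))    # True: guaranteed interior point
```

One option in such cases is to replace the centroid with a guaranteed-interior point such as shapely's representative_point(), at the cost of a less physically meaningful location.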