Introduction to GIS Modeling
Week 9 — Spatial Data Mining
GEOG 3110 – University of Denver
Presented by
Joseph K. Berry
W. M. Keck Scholar, Department of Geography, University of Denver
Basic Descriptive Statistics and its GIS Expression:
Normalizing maps; Mapping spatial dependency
Linking Numeric and Geographic Patterns:
Map comparison; Similarity maps; Clustering mapped data;
Investigating map correlation; Developing prediction models;
Assessing prediction results
Kicking at the Finish (Waning Class Moments)
The last of the “Learning Opportunities” that remain are…
• Exercise #9 on Spatial Data Mining (or paper) for 50 points
• Exam #2 on Surface Modeling, Spatial Data Mining and Future Directions
material for 150 points
• Optional Exercises for up to 50 extra credit points (can only improve your grade)
2nd Exam Study Questions …posted Monday 2/27 by 12:00 noon. Class initiative to
“group study” and collectively address the 24 study questions (complete by 5:00 pm Thursday March 8)
Midterm Exam …you will download and take the 2-hour exam online (honor system)
sometime between 10:00 am Friday March 9 and 5:00 pm Tuesday March 13
Special, special offer: provided you fully participate in the study question “group study,” you can choose not to take the second exam.
Fine print: I will simply allocate the points for the exam according to the current percentage of all of your graded materials,
which means not taking the exam has no effect on your grade.
If you choose to take the exam and get a grade below your current percentage of all graded materials, the exam grade will
be ignored …therefore taking the exam can only improve your grade.
GIS and Map-ematical Perspectives (SS)
Spatial Statistics Operations – Numerical Context
GIS Perspective:
Map Analysis Toolbox
Surface Modeling (Density Analysis, Spatial Interpolation, Map Generalization)
Spatial Data Mining (Descriptive, Predictive, Prescriptive)
Statistical Perspective:
Grid Map Layers
Basic Descriptive Statistics (Min, Max, Median, Mean, StDev, etc.)
Basic Classification (Reclassify, Binary/Ranking/Rating Suitability)
Unique Map Descriptive Statistics (Roving Window Summaries)
Map Comparison (Joint Coincidence, Statistical Tests)
Surface Modeling (Density Analysis, Spatial Interpolation)
Advanced Classification (Map Similarity, Maximum Likelihood, Clustering)
Predictive Statistics (Map Correlation/Regression, Data Mining Engines)
Berry
Basic Concepts in Statistics (SN_Curve Shape)
Kurtosis …shape
(positive= peaked; negative= flat)
See Beyond Mapping III , Topic 7, Linking Data Space and Geographic Space
(Berry)
Basic Concepts in Statistics (SN_Curve Shape continued)
…multi-modal
…Skewness (positive= right; negative= left)
See Beyond Mapping III , Topic 7, Linking Data Space and Geographic Space
(Berry)
Linking Numeric & Geographic Distributions
…a Histogram depicts the numeric distribution (Mean/Central Tendency focus)
…a Map depicts the geographic distribution (Variance/Variability focus)
…Data Values link the two views:
Click anywhere on the Map and the Histogram interval is highlighted
Click on the Histogram interval and the Map locations are highlighted
…simply different ways to organize and analyze “mapped data” (x,y= Where and z= What)
(See Beyond Mapping III, “Topic 7” for more information)
(Berry)
An Analytic Framework for GIS Modeling
(Last week)
Surface Modeling operations involve
creating continuous spatial distributions from
point sampled data (univariate).
(This week)
Spatial Data Mining operations
involve characterizing numerical patterns and
relationships among mapped data (multivariate).
See www.innovativegis.com/basis/Download/IJRSpaper/
(Berry)
Preprocessing Mapped Data (Preprocessing Types 1-3)
Preprocessing involves conversion of raw data into consistent values
that accurately represent mapped conditions (4 types of preprocessing)
Calibration (1): “tweaking” the values… sort of like a slight turn on a bathroom scale to alter the reading to what you know is your ‘true weight’
Translation (2): converts map values into appropriate units for analysis, such as feet into meters or bushels per acre (measure of volume) into tons per hectare (measure of mass)
Adjustment/Correction (3): dramatically changes the data, such as post-processing GPS coordinates and/or Mass Flow Lag adjustment
(figure labels: Antenna Offset; GPS Fix Delay; Overlap and Multiple Passes; Mass Flow Lag and Mixing)
… “trolling” for data
(Berry)
Normalizing Mapped Data (4th type of preprocessing)
Normalization — involves standardization of a data set, usually for
comparison among different types of data…
“apples and oranges to mixed fruit scale”
Goal …Norm_GOAL = (mapValue / 250) * 100
0-100 …Norm_0-100 = (((mapValue - min) * 100) / (max - min)) + 0
SNV …Norm_SNV = ((mapValue - mean) / stdev) * 100
Norm_GOAL = (Yield_Vol / 250 ) * 100
…generates a standardized map based
on a yield goal of 250 bushels/acre.
This map can be used in analysis with
other goal-normalized maps, even
from different crops
Key Concept
Since normalization involves scalar
mathematics (constants), the
pattern of the numeric distribution
(histogram) and the spatial
distribution (map) do not change
…same relative distributions
See Beyond Mapping III , Topic 18, Understanding Grid-based Data
Note: the generalized rescaling equation for normalizing a data set to a fixed range of Rmin to Rmax is…
Rescaled value = (((X - Dmin) * (Rmax - Rmin)) / (Dmax - Dmin)) + Rmin
…where Rmin and Rmax are the minimum and maximum values of the rescaled range, Dmin and Dmax are the minimum and maximum values of the input data,
and X is any value in the data set to be rescaled.
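The three normalization equations and the generalized rescaling equation above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample yield values are made up, and the population standard deviation is assumed for the SNV form because the slide does not say which one is intended.

```python
from statistics import mean, pstdev

def norm_goal(values, goal=250.0):
    """Percent of a goal value (250 bushels/acre in the slide's example)."""
    return [v / goal * 100 for v in values]

def rescale(values, r_min=0.0, r_max=100.0):
    """Generalized rescaling of the data range Dmin..Dmax onto Rmin..Rmax."""
    d_min, d_max = min(values), max(values)
    return [(v - d_min) * (r_max - r_min) / (d_max - d_min) + r_min
            for v in values]

def norm_snv(values):
    """Standard Normal Variable times 100, as written on the slide.
    Assumption: population standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s * 100 for v in values]

# Hypothetical yield values in bushels/acre (not from the slides)
yield_vol = [100.0, 150.0, 200.0, 250.0]
print(norm_goal(yield_vol))
print(rescale(yield_vol))        # default range is the 0-100 form
print(norm_snv(yield_vol))
```

Each function only shifts and scales by constants, so the relative ordering and spacing of the values is unchanged, which is the Key Concept stated above.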
(Berry)
Proximity Stratification
…Stratification partitions the data (numeric) or the project area (spatial) into logical groups
…Proximity map identifies the distance from point, line or polygon features to all other locations
…proximity to field edge: Edge effects versus “Sweet Spot” (interior)
…Yield map > Average + 1 StDev: unusually high yield (“High Yield”)
…proximity to high yield (legend: Far to Close): “High Yield” vicinity
(Berry)
Summarizing Map Regions (template/data)
…creates a map summarizing values from a data map (Phosphorus levels) that coincide with the categories of a template map (Soil types) or stratification partitioning
…average phosphorus level for each soil type

Soil Type   Pavg
Ve          15.0
VdC         12.8
BIB         11.2
BIA         14.6
TuC         10.5
HvB         11.3

Overall BIA Pavg = 14.6
…average P-level for each soil unit (clump first before COMPOSITE)
Individual BIA clumps (figure labels): 13.6, 15.5, 8.6
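A region-wide summary of this kind can be sketched as follows. The tiny soil and phosphorus grids are invented for illustration and are not the field data behind the slide; `region_average` is a hypothetical helper name, not a command from any particular GIS package.

```python
from collections import defaultdict

def region_average(template, data):
    """Average the data-map values that fall in each template-map category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for t_row, d_row in zip(template, data):
        for zone, value in zip(t_row, d_row):
            sums[zone] += value
            counts[zone] += 1
    return {zone: sums[zone] / counts[zone] for zone in sums}

# Hypothetical 3x3 template (soil type) and data (phosphorus) grids
soil = [["BIA", "BIA", "Ve"],
        ["BIA", "TuC", "Ve"],
        ["TuC", "TuC", "Ve"]]
phosphorus = [[14.0, 15.0, 16.0],
              [15.0, 10.0, 14.0],
              [11.0, 10.5, 15.0]]

p_avg = region_average(soil, phosphorus)
for soil_type, avg in sorted(p_avg.items()):
    print(soil_type, round(avg, 1))
```

To summarize individual soil units rather than soil types, the template would first be "clumped" so that each contiguous group of cells carries its own identifier.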
(Berry)
Data Analysis (establishing relationships)
On-farm studies, such as seed hybrid performance, can be conducted using actual farm conditions…
…management action recommendations are based on local relationships instead of
Experiment Station research hundreds of miles away
…is radically changing research and management practices in agriculture and
numerous other fields from business to epidemiology and natural resources
(Berry)
Comparing Discrete Maps (Multivariate analysis)
Thematic Categorization
…we often represent continuous spatial data (map surfaces) as a set of discrete polygons (Low, Medium, High)
Which classified map is correct?
How similar are the three maps?
Spatial Precision (Where: boundaries) of Points, Lines and Areas (polygons) is a primary concern of GIS, but we are often less concerned with Thematic Accuracy (What: map values)
See Beyond Mapping III , Topic 10, Analyzing Map Similarity and Zoning
(Berry)
Comparing Discrete Maps
Two ways to compare Discrete Maps…
Coincidence Summary
Proximal Alignment
…Coincidence Summary generates a cross-tabular listing of the intersection of two maps.
Table Interpretation
Diagonal (Same)
Off-diagonal (Above/Below)
Percentages (% Same)
Overall Percentage
((631+297+693)/1950)*100= 83%
((475+297+563)/1950)*100= 68%
Raster versus Vector
See Beyond Mapping III , Topic 10, Analyzing Map Similarity and Zoning
(Berry)
Comparing Discrete Maps (Coincidence Summary)
Two ways to compare Discrete Maps…
Coincidence Summary
Proximal Alignment
…Coincidence Summary generates a cross-tabular listing of the intersection of two maps.
…helpful in answering Question 2
Table Interpretation
Diagonal (Same)
Off-diagonal (Above/Below)
Percentages (% Same)
Overall Percentage
Map1 versus Map2
Map2: Med-- 104 + 297 + 225 = 626; (297/626) *100= 47 percent matched
Overall: 631 + 297 + 693 = 1621; (1621/1950) *100= 83 percent matched
Map1 versus Map3
Map3: Med-- 260 + 297 + 335= 912; (297/912) *100= 33 percent matched
Overall: 475 + 297 + 563 = 1335; (1335/1950) *100= 68 percent matched
See Beyond Mapping III , Topic 10, Analyzing Map Similarity and Zoning
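A coincidence summary can be sketched as a simple cross-tabulation of two co-registered categorical grids. The two small maps below are hypothetical; only the diagonal counts and the 1950-cell total in the last two lines come from the slide.

```python
def coincidence_table(map_a, map_b, categories):
    """Cross-tabulate two categorical grids: table[a][b] = number of cells."""
    table = {a: {b: 0 for b in categories} for a in categories}
    for row_a, row_b in zip(map_a, map_b):
        for a, b in zip(row_a, row_b):
            table[a][b] += 1
    return table

def percent_same(table):
    """Overall percentage: diagonal (same category) cells over all cells."""
    total = sum(sum(row.values()) for row in table.values())
    same = sum(table[c][c] for c in table)
    return same / total * 100

# Hypothetical 2x3 classified maps
map1 = [["Low", "Low", "Med"], ["Med", "High", "High"]]
map2 = [["Low", "Med", "Med"], ["Med", "High", "Low"]]
table = coincidence_table(map1, map2, ["Low", "Med", "High"])
print(percent_same(table))   # 4 of the 6 cells agree

# Overall percentages from the slide's diagonal counts
print(round((631 + 297 + 693) / 1950 * 100))   # Map1 versus Map2
print(round((475 + 297 + 563) / 1950 * 100))   # Map1 versus Map3
```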
(Berry)
Comparing Discrete Maps (Proximal Alignment)
Two ways to compare Discrete Maps…
Coincidence Summary
Proximal Alignment
…Proximal Alignment isolates a category on one of the maps, generates its proximity, then identifies the proximity values that align with the same category on the other map.
Proximity_Map1_Category1 * Binary_Map3_Category1
…non-zero values identify changes and how far away
Table Interpretation
Zeros (Agreement)
Values (> Disagreement)
PA Index (average)
See Beyond Mapping III , Topic 10, Analyzing Map Similarity and Zoning
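The proximity-times-binary calculation can be sketched as below. This is a brute-force illustration on hypothetical grids: distance is measured in cell units to the nearest cell of the category, and the PA Index is assumed to be the average over the cells holding that category on the second map (the slide says only "average"). The category must occur at least once on both maps.

```python
from math import hypot

def proximity(grid, category):
    """Distance (in cells) from every cell to the nearest cell of `category`."""
    targets = [(r, c) for r, row in enumerate(grid)
               for c, v in enumerate(row) if v == category]
    return [[min(hypot(r - tr, c - tc) for tr, tc in targets)
             for c in range(len(row))] for r, row in enumerate(grid)]

def proximal_alignment(map_a, map_b, category):
    """Proximity of the category on map_a times its binary mask on map_b."""
    prox = proximity(map_a, category)
    return [[prox[r][c] * (1 if v == category else 0)
             for c, v in enumerate(row)] for r, row in enumerate(map_b)]

def pa_index(aligned, map_b, category):
    """Assumed definition: mean alignment value over map_b's category cells."""
    cells = [aligned[r][c] for r, row in enumerate(map_b)
             for c, v in enumerate(row) if v == category]
    return sum(cells) / len(cells)

# Hypothetical maps: category 1 extends one cell farther east on map_b
map_a = [[1, 1, 2, 2],
         [1, 1, 2, 2]]
map_b = [[1, 1, 1, 2],
         [1, 1, 2, 2]]
aligned = proximal_alignment(map_a, map_b, 1)
print(aligned)                      # zeros = agreement, values = how far off
print(pa_index(aligned, map_b, 1))
```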
(Berry)
Comparing Map Surfaces (Statistical Tests)
Three ways to compare Map Surfaces…
Statistical Tests
Percent Difference
Surface Configuration
…must be
quantitative
isopleth data
…Statistical Tests compare one set of cell values to another based on the
differences in the distributions of the data: 1) data sets (partition or coincidence;
continuous or sampled) and 2) statistical procedure (t-test, F-test, etc.)
Table 1
Box-and-whisker
graphs
See Beyond Mapping III , Topic 10, Analyzing Map Similarity and Zoning
(Berry)
Comparing Map Surfaces (%Difference)
Three ways to compare Map Surfaces…
Statistical Tests
Percent Difference
Surface Configuration
Question 3
…Percent Difference capitalizes on the spatial arrangement of the values by
comparing the values at each map location— %Difference Map, %Difference Table
Table 2
See Beyond Mapping III , Topic 10, Analyzing Map Similarity and Zoning
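A percent difference map can be sketched as a cell-by-cell calculation. The slide does not give the formula, so the form below (difference relative to the first map) is an assumption, the grids are hypothetical, and the first map is assumed to contain no zero values.

```python
def percent_difference(map_a, map_b):
    """Cell-by-cell percent difference of map_b relative to map_a.
    Assumed form: ((b - a) / a) * 100; map_a must contain no zeros."""
    return [[(b - a) / a * 100 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(map_a, map_b)]

# Hypothetical surfaces, e.g. the same variable mapped in two periods
surface_1 = [[100.0, 200.0], [50.0, 80.0]]
surface_2 = [[110.0, 180.0], [50.0, 100.0]]

pct_map = percent_difference(surface_1, surface_2)
for row in pct_map:
    print([round(v, 1) for v in row])
```

A %Difference Table would then count how many cells fall into chosen percent-difference intervals.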
(Berry)
Comparing Map Surfaces (Surface Configuration)
Three ways to compare Map Surfaces…
Statistical Tests
Percent Difference
Surface Configuration
…Surface Configuration capitalizes on the spatial arrangement of the values by
comparing the localized trend in the values — Slope Map, Aspect Map, Surface
Configuration Index
Table 3
See Beyond Mapping III , Topic 10, Analyzing Map Similarity and Zoning
(Berry)
Spatial Dependency
Spatial Variable Dependence: what occurs at a location in geographic space is related to:
• the conditions of that variable at nearby locations, termed Spatial Autocorrelation (intra-variable dependence) …Surface Modeling (Discrete Point Map to Continuous Map Surface)
• the conditions of other variables at that location, termed Spatial Correlation (inter-variable dependence) …Spatial Data Mining (Multivariate)
Map Stack: relationships among maps are investigated by aligning grid maps with a common configuration… #cols/rows, cell size and geo-reference.
Data Shishkebab: each map represents a variable, each grid space a case and each value a measurement with all of the rights, privileges, and responsibilities of non-spatial mathematical, numerical and statistical analysis
See Beyond Mapping III , Topic 16, Characterizing Patterns and Relationships
(Berry)
Visualizing Spatial Relationships
Interpolated Spatial Distribution
Phosphorus (P)
What spatial
relationships do you
see?
…do relatively high levels
of P often occur with high
levels of K and N?
…how often?
…where?
See Beyond Mapping III , Topic 16, Characterizing Patterns and Relationships
(Berry)
Identifying Unusually High Measurements
…isolate areas with values greater than mean + 1 StDev (upper tail of the normal curve)
See Beyond Mapping III , Topic 16, Characterizing Patterns and Relationships
(Berry)
Level Slicing
…simply multiply the two maps to identify joint coincidence
1*1=1 coincidence (any 0 results in zero)
Question 4
2-dimensional data space Box
See Beyond Mapping III , Topic 16, Characterizing Patterns and Relationships
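Level slicing of two map layers can be sketched as two steps: build a binary map of unusually high values (greater than mean + 1 StDev) for each layer, then multiply the binary maps. The P and K grids are hypothetical, and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def high_mask(grid):
    """1 where a cell exceeds its map's mean + 1 standard deviation, else 0."""
    values = [v for row in grid for v in row]
    cutoff = mean(values) + pstdev(values)
    return [[1 if v > cutoff else 0 for v in row] for row in grid]

def multiply(map_a, map_b):
    """Cell-by-cell product: 1*1 = 1 (coincidence), any 0 gives 0."""
    return [[a * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(map_a, map_b)]

# Hypothetical phosphorus (P) and potassium (K) surfaces
p = [[5, 6, 20], [5, 7, 21], [6, 5, 6]]
k = [[50, 52, 90], [51, 95, 92], [50, 49, 50]]

high_p, high_k = high_mask(p), high_mask(k)
both_high = multiply(high_p, high_k)
print(both_high)
```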
(Berry)
Multivariate Data Space
…sum of a binary progression (1, 2, 4, 8, 16, etc.) provides
level slice solutions for many map layers
3-dimensional space Cube
(Parallelepiped)
See Beyond Mapping III , Topic 16, Characterizing Patterns and Relationships
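The binary progression idea can be sketched by weighting each layer's binary map by 1, 2, 4, and so on before summing, so every combination of layers yields a unique code. The three binary maps below are hypothetical.

```python
def level_slice_code(masks):
    """Sum binary maps weighted by 1, 2, 4, 8, ... (one weight per layer)."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[sum(mask[r][c] << i for i, mask in enumerate(masks))
             for c in range(cols)] for r in range(rows)]

# Hypothetical binary maps of unusually high P, K and N (1 = high)
high_p = [[0, 1], [1, 1]]
high_k = [[0, 0], [1, 1]]
high_n = [[0, 0], [0, 1]]

codes = level_slice_code([high_p, high_k, high_n])
print(codes)
# Reading the codes: 0 = none high, 1 = P only, 3 = P and K, 7 = all three
```

Because each weight is a distinct power of two, any code can be decoded back into the exact set of layers that were high at that location.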
(Berry)
Calculating Data Distance
…an n-dimensional plot depicts the multivariate distribution; the distance
between points determines the relative similarity in data patterns
…the farthest floating ball is the least similar (largest data distance) to the comparison point
See Beyond Mapping III , Topic 16, Characterizing Patterns and Relationships
(Berry)
Identifying Map Similarity
Question 5
…the relative data distances between the comparison point’s data pattern
and those of all other map locations form a Similarity Index
The green tones indicate field locations with fairly similar P, K and N levels; red tones indicate dissimilar areas.
See Beyond Mapping III , Topic 16, Characterizing Patterns and Relationships
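Data distance and a similarity index can be sketched with the Euclidean distance in P, K, N data space. The scaling used here (100 for an identical pattern, 0 for the farthest pattern) is an assumption about how the index is expressed, and the values are hypothetical. In practice the layers would usually be normalized first so that no single variable dominates the distance.

```python
from math import sqrt

def data_distance(a, b):
    """Euclidean distance between two data patterns (one value per layer)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity_index(patterns, comparison):
    """Assumed scaling: 100 = identical to the comparison point,
    0 = the farthest (least similar) pattern in the data set.
    Requires at least one pattern that differs from the comparison point."""
    distances = [data_distance(p, comparison) for p in patterns]
    farthest = max(distances)
    return [100 * (1 - d / farthest) for d in distances]

# Hypothetical (P, K, N) patterns for four map locations
patterns = [(10, 100, 20), (10, 100, 20), (13, 104, 20), (16, 108, 20)]
similarity = similarity_index(patterns, comparison=patterns[0])
print(similarity)
```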
(Berry)
Clustering Maps for Data Zones
Question 6
…a map stack is a spatially organized set of numbers
…groups of “floating balls” in data space
identify locations in the field with similar data
patterns– data zones
…fertilization rates vary for the different
clusters “on-the-fly”
See Beyond Mapping III , Topic 16, Characterizing Patterns and Relationships
Cyber-Farmer, Circa 1992
Variable Rate Application
(Berry)
Assessing Clustering Results
…Clustering results can be roughly evaluated using basic statistics
Average, Standard Deviation, Minimum and Maximum values within each cluster are calculated. Ideally
the averages between the two clusters would be radically different and the standard deviations small—large
difference between groups and small differences within groups.
Standard Statistical Tests of two data sets
Box and Whisker Plots to visualize differences
See Beyond Mapping III , Topic 16, Characterizing Patterns and Relationships
(Berry)
How Clustering Works (IsoData algorithm)
Data Space
1) The scatter plot shows Height versus Weight data that might have been collected in your old geometry class
2) The data distance to each weight/height measurement pair is calculated and the point is assigned to the closest arbitrary cluster center
3) The average X,Y coordinates of the students assigned to each “working” cluster are calculated and used to reposition the cluster centers
4) Repeat data distances, cluster assignments and repositioning until no change in cluster membership (centers do not move)
See Beyond Mapping III , Topic 7, Linking Data Space and Geographic Space
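The four steps can be sketched as a short loop. This is a k-means-style illustration of the assign-and-reposition cycle described on the slide; a full IsoData implementation typically also splits and merges clusters, which is not shown. The height/weight pairs and the starting centers are hypothetical.

```python
from math import dist  # Python 3.8+

def cluster(points, start_centers, max_iter=100):
    """Assign points to the closest center, reposition centers, repeat."""
    centers = list(start_centers)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to the closest cluster center
        new_assignment = [
            min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points
        ]
        # Step 4: stop when cluster membership no longer changes
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each center to the average of its assigned points
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = tuple(sum(axis) / len(members)
                                   for axis in zip(*members))
    return assignment, centers

# Step 1: hypothetical (height in inches, weight in pounds) pairs
students = [(60, 110), (62, 120), (61, 115), (72, 190), (74, 200), (73, 195)]
membership, centers = cluster(students, start_centers=[(60, 110), (62, 120)])
print(membership)
print(centers)
```

For mapped data, each grid cell's values across the map stack play the role of a student's measurements, and the resulting membership is written back to the cell to form the data zones.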
(Berry)
Map Correlation (How it works)
Roving Window over Elevation (Feet) and Slope (Percent) map surfaces; example location: X elev = 2,063 feet, Y slope = 38%
Spatially Aggregated Correlation (Point-by-Point)
…one large data table with 25 rows x 25 columns = 625 map values for map wide summary
Localized Correlation (Roving Window)
…625 small data tables within 5 cell reach = 81 map values for localized summary
r = (correlation equation shown on slide) …where x = Elevation value, y = Slope value and n = number of value pairs
r = .432 map wide
r = .562 localized
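Both forms of map correlation can be sketched with the standard computational formula for the Pearson correlation coefficient, r = (nΣxy − ΣxΣy) / sqrt((nΣx² − (Σx)²)(nΣy² − (Σy)²)). Treating this as the equation pictured on the slide is an assumption. The sketch uses a square roving window clipped at the map edge, whereas the slide's 81 values within a 5-cell reach suggest a circular window; the grids are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r from paired values; None if either variable is constant."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    var_x = n * sum(x * x for x in xs) - sx ** 2
    var_y = n * sum(y * y for y in ys) - sy ** 2
    if var_x <= 0 or var_y <= 0:
        return None
    return (n * sxy - sx * sy) / sqrt(var_x * var_y)

def map_wide_r(map_x, map_y):
    """Spatially aggregated: one r from all cell pairs (one large table)."""
    return pearson_r([v for row in map_x for v in row],
                     [v for row in map_y for v in row])

def localized_r(map_x, map_y, reach=1):
    """Localized: r within a square window around each cell (many small tables)."""
    rows, cols = len(map_x), len(map_x[0])
    result = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            cells = [(i, j)
                     for i in range(max(0, r - reach), min(rows, r + reach + 1))
                     for j in range(max(0, c - reach), min(cols, c + reach + 1))]
            out_row.append(pearson_r([map_x[i][j] for i, j in cells],
                                     [map_y[i][j] for i, j in cells]))
        result.append(out_row)
    return result

# Hypothetical elevation (feet) and slope (percent) grids
elevation = [[500, 520, 560], [510, 540, 600], [505, 530, 650]]
slope = [[5, 8, 12], [6, 15, 10], [4, 9, 30]]
print(map_wide_r(elevation, slope))
print(localized_r(elevation, slope))
```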
(Berry)
Map Correlation (Aggregated and Localized results)
Spatially Aggregated Correlation
Scalar Value: one value represents the overall non-spatial relationship between the two map surfaces
r = .432 map wide
Localized Correlation
Map Variable: a continuous quantitative surface represents the localized spatial relationship between the two map surfaces
r = .562 Localized
(map legend: Strong Positive; Minimal Correlation; Strong Negative)
(Berry)
Regression (conceptual approach)
A line is “fitted” in data space that balances the data so that the squared differences from the
points to the line (residuals) are minimized over all the points
and the sum of the differences is zero…
…the equation of the regression line is used to predict the
“Dependent” variable (Y axis) using one or more “Independent” variables (X axis)
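A least-squares line fit with one independent variable can be sketched as below, along with a check that the residuals sum to zero and the R-squared value. The data pairs are hypothetical and the function names are illustrative only.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Share of the variation in y accounted for by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical independent (x) and dependent (y) values
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(slope, intercept)
print(sum(residuals))                       # effectively zero
print(r_squared(xs, ys, slope, intercept))
```

With map layers, the x and y lists are simply the cell values of the independent and dependent maps read in the same cell order.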
(Berry)
Evaluating Prediction Maps (non-spatial)
Non-spatial …R-squared value looks at the
deviations from the regression line; data
patterns about the regression line
(Berry)
Map Variables
The Dependent Map variable is the one that you want to predict…
Question 7
…derive from
customer data
…from a set of existing or easily measured Independent Map variables
See Beyond Mapping III , Topic 28, Spatial Data Mining in Geo-Business
(Berry)
Map Regression Results (Bivariate)
Scatter plots and regression equations relating Loan Density
to three candidate driving variables (Housing Density, Value and Age)
Loans = fn(Housing Density)
Loans = fn(Home Value)
Loans = fn(Home Age)
Question 7: Creates the Loan Concentration map surface
Question 8: Creates regression equation and R-squared index
The “R-squared index” provides a general measure of how good the predictions ought to be:
40% and 46% indicate moderately weak predictors; 23% indicates a very weak predictor
(R-squared index = 100% indicates a perfect predictor; 0% indicates an equation with no predictive capabilities)
See Beyond Mapping III , Topic 28, Spatial Data Mining in Geo-Business
(Berry)
Generating a Multivariate Regression
…a multiple linear regression equation using all three independent map variables
is used to generate a prediction map
Question 9
See Beyond Mapping III , Topic 28, Spatial Data Mining in Geo-Business
(Berry)
Evaluating Regression Results (multiple linear)
…a multiple linear regression equation using all three independent map variables
is used to generate a prediction map
…that is compared to the actual dependent variable data — Error Surface
Optional Question 9-1
See Beyond Mapping III , Topic 28, Spatial Data Mining in Geo-Business
(Berry)
Using the Error Map to Stratify
One way to improve the predictions, however, is to stratify the data set by breaking it
into groups of similar characteristics …and then generating separate regressions
…generate a different regression for
each of the stratified areas– red,
yellow and green
…other stratification techniques include indigenous knowledge,
level-slicing and clustering
Optional Question 9-2
See Beyond Mapping III , Topic 28, Spatial Data Mining in Geo-Business
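An error surface and a simple three-way stratification of it can be sketched as follows. The sign convention (predicted minus actual), the tolerance and the grids are all assumptions made for illustration; the three classes stand in for the slide's red, yellow and green areas without claiming which color is which.

```python
def error_surface(actual, predicted):
    """Cell-by-cell error, assumed here to be predicted minus actual."""
    return [[p - a for a, p in zip(row_a, row_p)]
            for row_a, row_p in zip(actual, predicted)]

def stratify(errors, tolerance):
    """-1 = under-predicted, 0 = within tolerance, 1 = over-predicted."""
    return [[0 if abs(e) <= tolerance else (1 if e > 0 else -1) for e in row]
            for row in errors]

# Hypothetical actual and predicted loan concentration grids
actual = [[10, 20], [30, 40]]
predicted = [[12, 19], [30, 33]]

errors = error_surface(actual, predicted)
strata = stratify(errors, tolerance=1.5)
print(errors)
print(strata)
```

A separate regression would then be generated from only the cells in each stratum.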
(Berry)
Spatial Data Mining (The Big Picture)
…making sense out of a map stack
Mapped data that exhibit high spatial dependency create strong prediction functions. As in traditional statistical analysis, spatial relationships can be used to predict outcomes
…the difference is that spatial statistics predicts where responses will be high or low
See Beyond Mapping III , Topic 16, Characterizing Patterns and Relationships
(Berry)
Prescriptive Mapping
Four primary types of applied spatial models:
• Suitability: mapping preferences (e.g., Habitat and Routing)
• Economic: mapping financial interactions (e.g., Combat Zone and Sales Propensity)
• Physical: mapping landscape interactions (e.g., Terrain Analysis and Sediment Loading)
• Mathematical/Statistical: mapping numerical relationships…
  - Descriptive math/stat models summarize existing mapped data (e.g., Standard Normal Variable Map for Unusual Conditions and Clustering for Data Zones)
  - Predictive math/stat models develop equations relating mapped data (e.g., Map Regression for Equity Loan Prediction and Probability of Product Sales)
  - Prescriptive math/stat models identify management actions based on descriptive/predictive relationships (e.g., Retail Marketing and Precision Ag)…
Phosphorus (P)
Continuous Actions: Equation defining action(s)
Negative linear equation of the form: y = aX
Negative exponential equation of the form: y = e^-x
Discrete Actions: If <condition(s)> Then <Action(s)>
If P is 0-4 ppm, then apply 50 lbs P2O5/Acre
If P is 4-8 ppm, then apply 18 lbs P2O5/Acre
If P is 8-12 ppm, then apply 7 lbs P2O5/Acre
If P is >12 ppm, then apply 0 lbs P2O5/Acre
(figure axes: P from 0 to 12 or more ppm; P2O5/Acre from 0 to 50 lbs)
(Berry)
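The discrete If/Then rules can be sketched as a lookup applied to every cell of a phosphorus map. The slide's ranges share their end points (4, 8 and 12 ppm), so treating each upper bound as inclusive is an assumption; the sample P map is hypothetical.

```python
def p2o5_rate(p_ppm):
    """P2O5 application rate (lbs/acre) for a soil phosphorus level (ppm).
    Assumption: each range includes its upper bound."""
    if p_ppm <= 4:
        return 50
    if p_ppm <= 8:
        return 18
    if p_ppm <= 12:
        return 7
    return 0

def prescription_map(p_map):
    """Apply the rule set to every cell of a phosphorus map."""
    return [[p2o5_rate(p) for p in row] for row in p_map]

# Hypothetical phosphorus levels in ppm
p_map = [[2.5, 6.0, 9.5],
         [3.0, 11.0, 14.2]]
rates = prescription_map(p_map)
print(rates)
```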
Grid-Based Map Analysis
Spatial Analysis investigates the “contextual” relationships in mapped data…
• Reclassify: reassigning map values (position; value; size, shape; contiguity)
• Overlay: map overlay (point-by-point; region-wide)
• Distance: proximity and connectivity (movement; optimal paths; visibility)
• Neighbors: “roving windows” (slope/aspect; diversity; anomaly)
Surface Modeling maps the “spatial distribution” of point data…
• Density Analysis: count/sum of points within a local window
• Spatial Interpolation: weighted average of points within a local window
• Map Generalization: fits mathematical relationship to all of the point data
Spatial Data Mining investigates the “numerical” relationships in mapped data…
• Descriptive: summary statistics, comparison, classification (e.g., clustering)
• Predictive: math/stat relationships among map layers (e.g., regression)
• Prescriptive: appropriate actions (e.g., optimization)
(Berry)