Geographical analysis
Overlay, cluster analysis, autocorrelation, trends, models, network
analysis, spatial data mining
How geographic analysis started
John Snow’s cholera map of 1854
Geographical analysis
• Combination of different geographic data sets or
themes by overlay or statistics, in particular
suitability analysis
• Discovery of patterns, dependencies
• Discovery of trends, changes (time)
• Development of models
• Interpolation, extrapolation, prediction
• Spatial decision support, planning
• Consequence analysis (What if?)
Example overlay
• Two subdivisions with labeled regions
[Figure: a soil subdivision (soil types 1, 2, 3, 4) overlaid with a vegetation subdivision (birch forest, beech forest, mixed forest); one resulting region is labeled "Birch forest on soil type 2"]
Kinds of overlay
• Two subdivisions with the same boundaries
- nominal and nominal
Religion and voting per municipality
- nominal and ratio
Voting and income per municipality
- ratio and ratio
Average income and age of employees
• Two subdivisions with different boundaries
Soil type and vegetation
• Subdivision and elevation model
Vegetation and precipitation
Kinds of overlay, cont’d
• Subdivision and point set
quarters in city, occurrences of violence on the
street
• Two elevation models
elevation and precipitation
• Elevation model and point set
elevation and epicenters of earthquakes
• Two point sets
money machines, street robbery locations
• Network and subdivision, other network,
elevation model
Result of overlay
• New subdivision or map layer, e.g. for further
processing
• Table with combined data
• Count, surface area
Soil      Vegetation   Area     #patches
Type 1    Beech        30 ha    2
Type 2    Birch        15 ha    2
Type 3    Mixed         8 ha    1
Type 4    Beech         2 ha    1
….        ….
Buffer and overlay
• Neighborhood analysis: data of a theme
within a given distance (buffer) of objects of
another theme
Sightings of nesting locations of the great blue heron (point set)
Rivers; buffer with width 500 m of a river
Overlay → nesting locations of the great blue heron near a river
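The buffer-and-overlay step for a point set and a river can be sketched as follows. This is a minimal illustration, assuming the river is given as a polyline and coordinates are in metres; the function names and the sample coordinates are hypothetical, and only the 500 m width comes from the slide.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to segment ab (all 2D tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return math.hypot(px - ax, py - ay)
    # Parameter of the projection of p onto the line, clamped to the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def points_in_buffer(points, polyline, width):
    """Return the points lying within `width` of any segment of the polyline."""
    segments = list(zip(polyline, polyline[1:]))
    return [p for p in points
            if any(point_segment_distance(p, a, b) <= width for a, b in segments)]

# Hypothetical data in metres: a river as a polyline and three nest sightings.
river = [(0, 0), (1000, 0), (2000, 500)]
nests = [(500, 300), (500, 800), (1800, 100)]
near_river = points_in_buffer(nests, river, 500)
```

Testing every point against every segment is fine for small data sets; for large ones an index structure such as an R-tree would normally be used to limit the candidate segments.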
Overlay: ways of combination
• Combination (join) of attributes
• One layer as selection for the other
– Vegetation types only for soil type 2
– Land use within 1 km of a river
Overlay in raster
• Pixel-wise operation, if the rasters have the same
coordinate (reference) system
[Figure: pixel-wise AND of two rasters, "Forest" and "Population increase above 2% per year", giving the raster "Both"]
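A pixel-wise AND can be sketched in a few lines. The example assumes both rasters are already aligned (same reference system and dimensions) and stored as lists of rows; the raster values are made up for illustration.

```python
def raster_and(a, b):
    """Pixel-wise AND of two equally sized boolean rasters (lists of rows)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("rasters must have the same dimensions")
    return [[pa and pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]

# Hypothetical 2 x 3 rasters: forest cover and yearly population increase (%).
forest = [[True, True, False],
          [False, True, False]]
growth_pct = [[1.0, 2.5, 3.0],
              [2.2, 2.1, 0.5]]

# Classify the ratio raster first, then combine the two boolean rasters.
high_growth = [[v > 2.0 for v in row] for row in growth_pct]
both = raster_and(forest, high_growth)
```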
Overlay in vector
• E.g. the plane sweep algorithm as given in
Computational Geometry (line segment
intersection), to get the overlay in a topological
structure
• Using R-trees as an indexing structure to find
intersections of boundaries
Combined (multi-way) overlays
• Site planning, new housing sites depending on
multiple criteria
– Proximity infrastructure
– Proximity facilities (hospitals, schools)
– Not in nature areas
– …
• Another example (earth sciences):
Parametric land classification: partitioning of
the land based on chosen, classified themes
[Figure: three classified themes (elevation, annual precipitation, types of rock) and their overlay: a partitioning based on the three themes]
Suitability analysis
• Selection of location of new housing, a new store,
a new factory, etc.
• Typically done by overlay using multiple map layers
Analysis point set
• Points in an attribute space: statistics, e.g.
regression, principal component analysis,
dendrograms
(area, population, #crimes)
(12, 34000, 34)
(14, 45000, 31)
(15, 41000, 14)
(17, 63000, 82)
(17, 66000, 79)
……
[Figure: scatter plot of #crimes against #population]
Analysis point set
• Points in geographical space without associated
value: clusters, patterns, regularity, spread
Actual average nearest neighbor distance versus expected average nearest neighbor distance for this number of points in the region
For example: volcanoes in a region; crimes in a city
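The comparison of actual and expected average nearest neighbor distance can be sketched as a ratio. The slide does not state the expected value; the sketch assumes the usual expectation for n uniformly random points in a region of area A, namely 1 / (2 · sqrt(n / A)), and ignores edge effects. Function names and sample points are illustrative.

```python
import math

def average_nn_distance(points):
    """Mean distance from each point to its nearest other point (O(n^2))."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)

def expected_nn_distance(n, area):
    """Assumed expectation for n uniformly random points: 1 / (2 sqrt(n / area))."""
    return 0.5 / math.sqrt(n / area)

def nn_ratio(points, area):
    """Ratio < 1 suggests clustering, ratio > 1 suggests a regular spread."""
    return average_nn_distance(points) / expected_nn_distance(len(points), area)

# Hypothetical point sets in a region of area 4.
regular = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5), (1.5, 1.5)]
clustered = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
ratio_regular = nn_ratio(regular, 4.0)
ratio_clustered = nn_ratio(clustered, 4.0)
```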
Analysis point set
• Points in geographical space with value: up to
what distance are measured values “similar” (or
correlated)?
[Figure: map of sample points with measured values ranging from 10 to 22]
Analysis point set
• Temperature at location x and 5 km away from x
is expected to be nearly the same
• Elevation (in Switzerland) at location x and 5 km
away from x is not expected to be related (even
over 1 km), but it is expected to be nearly the
same 100 meters away
• Other examples:
– depth to groundwater
– soil humidity
– nitrate concentration in the soil
Analysis point set
• Points in geographical space with value:
auto-correlation (~ up to what distance are
measured values “similar”, or correlated)
[Figure: map of sample points with measured values ranging from 10 to 22]
n points → (n choose 2) pairs; each pair has a distance and a difference in value
[Figure: scatter plot of difference² against distance for all pairs, with the expected difference² marked]
Classify distances and determine the average difference² per class
Observed variogram
[Figure: observed average difference² per distance class, with the expected difference² marked]
Model variogram (linear)
[Figure: model variogram with axes distance and difference², showing the nugget, the sill (σ²), and the range]
Smaller distances → more correlation, smaller variance
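The construction of the observed variogram can be sketched as below: enumerate all pairs, classify them by distance, and average the squared value difference per class. The sketch follows the slide in averaging difference² directly (many texts plot half of this value, the semivariance); the function name, class width, and sample data are illustrative assumptions.

```python
import math
from itertools import combinations

def observed_variogram(samples, class_width):
    """samples: list of (x, y, value).
    Returns {distance class index: average squared value difference}."""
    sums, counts = {}, {}
    for (x1, y1, v1), (x2, y2, v2) in combinations(samples, 2):
        # Class k holds the pairs with distance in [k * width, (k + 1) * width).
        k = int(math.hypot(x2 - x1, y2 - y1) // class_width)
        sums[k] = sums.get(k, 0.0) + (v1 - v2) ** 2
        counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}

# Hypothetical measurements along a line, one unit apart.
samples = [(0, 0, 10), (1, 0, 11), (2, 0, 13), (3, 0, 16)]
variogram = observed_variogram(samples, class_width=1.0)
```

In this toy data set the average squared difference grows with the distance class, which is the pattern a variogram shows below the range.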
Importance auto-correlation
• Descriptive statistic of a data set: describes the
distance-dependency of auto-correlation
• Interpolation based on data further away than
the range is nonsense
[Figure: sample points with measured values and an unmeasured location marked "??"; the range is indicated]
Importance auto-correlation
• If the range of a geographic variable is small,
more sample point measurements are needed to
obtain a good representation of the geographic
variable through spatial interpolation
→ influences cost of an analysis or decision
procedure, and quality of the outcome of the
analysis
Analysis subdivision
• Nominal subdivision: auto-correlation
(~ clustering of equivalent classes)
• Ratio subdivision: auto-correlation
[Figure: two subdivisions with regions labeled PvdA, CDA, or VVD, one showing auto-correlation and one showing no auto-correlation]
Auto-correlation nominal subdivision
Join count statistic:
[Figure: map of 12 provinces labeled PvdA, CDA, or VVD]
• 22 neighbor relations (adjacencies) among 12 provinces
• Pr(prov. A = VVD and prov. B = VVD) = 4/12 * 3/11
• E(VVD adj. VVD) = 22 * 12/132 = 2
• Reality: 4 times
• E(CDA adj. PvdA) = 4/12 * 4/11 * 2 * 22 = 5.33; reality: once
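The expected join counts above can be reproduced with exact fractions. This is a small sketch of the arithmetic only, assuming class labels are assigned to regions at random; the function name and parameters are illustrative.

```python
from fractions import Fraction

def expected_joins(n_regions, n_adjacencies, count_a, count_b=None):
    """Expected number of adjacencies joining class a with class b
    (or a with a if count_b is None) under random assignment of labels."""
    if count_b is None:
        # Both regions of the adjacency carry the same class.
        p = Fraction(count_a, n_regions) * Fraction(count_a - 1, n_regions - 1)
    else:
        # Two different classes, in either order (hence the factor 2).
        p = 2 * Fraction(count_a, n_regions) * Fraction(count_b, n_regions - 1)
    return n_adjacencies * p

# 12 provinces, 22 adjacencies, 4 provinces per party as in the slide.
e_vvd_vvd = expected_joins(12, 22, 4)
e_cda_pvda = expected_joins(12, 22, 4, 4)
```

Observed counts far above the expectation (4 versus 2 for VVD adjacent to VVD) or far below it (1 versus 5.33 for CDA adjacent to PvdA) both indicate clustering of equal classes.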
Geographical models
• Properties of (geographical) models:
– selective (simplification, more ideal)
– approximative
– analogous (resembles reality)
– structured (usable, analyzable, transformable)
– suggestive
– re-usable (usable in related situations)
Geographical models
• Functions of models:
– psychological (for understanding, visualization)
– organizational (framework for definitions)
– explanatory
– constructive (beginning of theories, laws)
– communicative (transfer scientific ideas)
– predictive
Example: forest fire
• Is the Kröller-Müller museum well enough
protected against (forest)fire?
• Data: proximity fire dept., burning properties of
land cover, wind, origin of fire
• Model for: fire spread
Time neighbor pixel on fire:
[1.41 *] b * ws * (1 - sh) * (0.2 + cos α)
b = burn factor
ws = wind speed
α = angle between the wind direction and the direction to the neighbor pixel
sh = soil humidity
Forest fire
[Figure: map of the fire spread from the origin towards the museum, with wind speed 3 and a soil humidity layer; time classes < 3 minutes, < 6 minutes, < 9 minutes, > 9 minutes]
Forest; burn factor 0.8
Heath; burn factor 0.6
Road; burn factor 0.2
Forest fire model
• Selective: only surface cover, humidity and
wind; no temperature, seasonal differences, …
• Approximative: surface cover in 4 classes; no
distinction in forest type, etc., pixel based so
direction discretized
• Structured: pixels, simple for definition relations
between pixels
• Re-usable: approach/model also applies to
other locations (and other spread processes)
Forest fire model
• The forest fire model is deterministic (every run
gives the same outcome); stochastic models
exist too and can account for randomness
• There are static and dynamic models; dynamic
models involve change over time
• Error analysis: assume an error distribution of
a parameter and run Monte Carlo simulation
(burn factor values)
• Sensitivity analysis: examine how changing a
parameter influences the outcome (wind in
forest fire example)
More models
• Population growth
• Landslides, avalanches
• Crime change over time
• Road accidents
• …
Network analysis
• When distance or travel time on a network
(graph) is considered
• Dijkstra’s shortest path algorithm
• Reachability measure for a destination: potential
value
$$\mathrm{potential}(i) = \sum_j w_j \, c_{ij}^{-\beta}$$
w_j = weight of origin j
β = distance decay parameter
c_ij = distance cost between origin j and destination i
Think of i as a potential shop location and j as
the population (potential customers)
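A minimal sketch of the potential value on a network, combining Dijkstra's algorithm with the sum over origins. It assumes an undirected graph stored as an adjacency dictionary and a potential of the form w_j · c_ij^(−β); the node names, weights, and the value of the decay parameter are made up for illustration.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path costs from source; graph maps node -> {neighbor: edge cost}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def potential(graph, destination, weights, beta):
    """Sum over origins j of w_j * c_ij ** (-beta), with c_ij the network distance."""
    # The graph is undirected, so costs from i equal costs to i.
    cost = dijkstra(graph, destination)
    return sum(w * cost[j] ** (-beta)
               for j, w in weights.items()
               if j != destination and j in cost)

# Hypothetical network: a candidate shop location and two residential nodes.
graph = {
    "shop": {"a": 2.0, "b": 4.0},
    "a": {"shop": 2.0, "b": 1.0},
    "b": {"shop": 4.0, "a": 1.0},
}
population = {"a": 100, "b": 300}
shop_potential = potential(graph, "shop", population, beta=1.0)
```

Node b is reached via a (cost 3) rather than directly (cost 4), so the potential is 100/2 + 300/3.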
Example reachability
• Law Ambulance Transport: every location
must be reachable within 15 minutes (from
origin of ambulance)
Example reachability
• Physician’s practice:
- optimal practice size: 2350 (minimum: 800)
- minimize distance to practice
- improve current situation with as few changes
as possible
Current situation: 16 practices, 30,000 people, average 1875 per practice
Computed, improved situation: 13 practices
Example in table
                                  Original   New
Number of practices               16         13
Number of practice locations      9          7
Number of practices < 800 size    2          0
Number of people > 3 km           3957       4624
Average travel distance (km)      0.9        1.2
Largest distance (km)             5.2        5.4
Analysis elevation model
• Landscape shape recognition:
- peaks and pits
- valleys and ridges
- convexity, concavity
• Water flow, erosion,
watershed regions,
landslides, avalanches
Spatial data mining
• Finding spatial patterns in large spatial data sets
– within one spatial data set
– across two or more data sets
• With time: spatio-temporal data mining
• Main operations:
– clustering
– co-location patterns
– spatial association rule mining
(spatio-temporal association rule mining)
Clustering
• Partition-based clustering: produce clusters
– k-means clustering
– DBSCAN
– ...
• Hierarchical clustering: produce a hierarchy
– agglomerative (bottom-up)
– divisive (top-down)
k-Means clustering
• Assume k (number of clusters) is known
• Start with k points as seed set S
• Repeat
– Assign every point to the nearest seed in S to make k
clusters
– For every cluster, compute the center of gravity to
form a new seed set
Until convergence
Running time: number of iterations times O(nk),
or times O(n log k) (using a Voronoi diagram of the seeds and point location)
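The loop above can be sketched directly. This version uses the straightforward O(nk) assignment step per iteration rather than the Voronoi diagram variant, works on 2D points, and keeps a seed in place if its cluster becomes empty (an assumption, since the slide does not cover that case). The sample points are illustrative.

```python
import math

def k_means(points, seeds, max_iter=100):
    """k-means on 2D points, starting from the given seed set."""
    centers = list(seeds)
    for _ in range(max_iter):
        # Assign every point to the nearest seed.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # The center of gravity of every cluster forms the new seed set.
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # convergence: the seeds no longer move
            break
        centers = new_centers
    return centers, clusters

# Hypothetical data: two well-separated groups, k = 2.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, clusters = k_means(points, seeds=[(0, 0), (10, 10)])
```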
DBSCAN clustering
• Popular method by Ester, Kriegel, Sander, and Xu
(1996)
• Assume two parameters eps and minPts are given
• A point q is core if there are ≥ minPts points within
distance eps
• A point p is core-close to a point q if q is core and
p is a point within eps of q
• A point p is density-reachable from a point q if points
p = p_0, p_1, ..., p_m = q exist such that p_i is core-close to p_{i+1}
• Two points p, p’ are density-connected if a point q
exists from which p and p’ are density-reachable
DBSCAN clustering
• DBSCAN clustering is the clustering of all density-connected points into a cluster
• Clustering of core points is unique, other points
are not necessarily uniquely clustered
[Figure: example with minPts = 4 and radius eps, showing core points, a cluster of core points, a non-core point that is density-reachable from two clusters of core points, and an outlier]
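The definitions above can be turned into a simple quadratic-time sketch. It is not the O(n log n) Voronoi-based method; it assumes that a point counts itself among the points within eps, and it assigns a non-core point to the first cluster that reaches it, which is one way of resolving the non-uniqueness mentioned above. The sample points are illustrative.

```python
import math

def dbscan(points, eps, min_pts):
    """Return a cluster label per point (0, 1, ...), or -1 for outliers. O(n^2) time."""
    n = len(points)
    # Neighborhood of each point; a point counts itself (assumption).
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for s in range(n):
        if not core[s] or labels[s] != -1:
            continue
        labels[s] = cluster
        stack = [s]
        while stack:  # depth-first search that only expands through core points
            i = stack.pop()
            for j in neighbors[i]:
                if labels[j] == -1:
                    labels[j] = cluster
                    if core[j]:
                        stack.append(j)
        cluster += 1
    return labels

# Hypothetical data: a dense square, one border point, and one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = dbscan(points, eps=1.5, min_pts=4)
```

Here (2, 1) is not core but lies within eps of a core point, so it joins the cluster; (10, 10) remains an outlier.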
DBSCAN clustering
• If minPts is constant, then DBSCAN can be
implemented to run in O(n log n) time using
higher-order Voronoi diagrams:
– a minPts-order Voronoi diagram gives for every point
the minPts closest points
– the distance to the furthest of these tells if a point is
core or not
– make a graph where every core point has a directed
edge to the minPts nearest points
– find a cluster by DFS from any core point, until all core
points are in clusters
– then assign non-core points to clusters, if possible
Agglomerative hierarchical clustering
• Start with n clusters of single points
• While #clusters > k: merge the two nearest
clusters (that have shortest minimum distance)
• Can be implemented in O(n log n) time using
Voronoi diagrams
• Maximizes the minimum distance between two points
in different clusters
• Also called single-link clustering
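Single-link clustering can be sketched by processing point pairs in order of increasing distance and merging clusters with a union-find structure. This simple version sorts all pairs and therefore takes O(n² log n) time, not the O(n log n) achievable with Voronoi diagrams; the function name and sample points are illustrative.

```python
import math
from itertools import combinations

def single_link(points, k):
    """Merge the two nearest clusters until k remain; returns a cluster id per point."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    # All pairs of points, closest pair first.
    pairs = sorted(combinations(range(len(points)), 2),
                   key=lambda ij: math.dist(points[ij[0]], points[ij[1]]))
    n_clusters = len(points)
    for i, j in pairs:
        if n_clusters <= k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:  # closest pair in different clusters: merge those clusters
            parent[ri] = rj
            n_clusters -= 1
    return [find(i) for i in range(len(points))]

# Hypothetical data on a line: a chain of four points and one distant point.
points = [(0, 0), (1, 0), (5, 0), (6, 0), (20, 0)]
labels = single_link(points, k=2)
```

The chain of four points ends up in one cluster even though its end points are 6 apart, which is the typical chaining behavior of single-link clustering.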
Clustering
• Largest cluster is of interest
• Entities not involved in clusters may be
interesting (outliers)
• Number of occurring clusters is of interest
• No established way to know how many clusters
to use
• Setting the parameters is important
Co-location
• Whenever there are two data sets, object view
• Presence of objects in one data set almost
implies the presence of the other (need not be
symmetric relation)
Egyptian Plover bird and the Nile crocodile
Co-location
• Degree of co-location of the two types may be
interesting
• Entities not involved in co-location may be
interesting
• Asymmetry of co-location may be interesting
Spatial association rules
• Association rules with a spatial aspect
• Market basket analysis:
If a shopping basket contains [item A], then it also contains [item B]
• Quality of rule:
– Support: number of transactions with [item A] & [item B] (a count)
– Confidence: fraction of transactions with [item A] that also have [item B] (a ratio of counts)
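Support and confidence as defined above can be computed with two counts. The sketch represents each transaction as a set of items; the item names are hypothetical, since the slide showed the items as pictures.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Support: number of transactions containing both items.
    Confidence: that count divided by the number containing the antecedent."""
    with_antecedent = [t for t in transactions if antecedent in t]
    support = sum(1 for t in with_antecedent if consequent in t)
    confidence = support / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Hypothetical baskets for the rule "bread -> butter".
baskets = [{"bread", "butter"}, {"bread", "butter", "milk"}, {"bread"}, {"milk"}]
support, confidence = support_and_confidence(baskets, "bread", "butter")
```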
Spatial Association Rules
• Some examples with proximity:
– If a house is close to the sea, then it is expensive
– If a hotel is near touristy sites, then it is frequented by
tourists
– If a lake is close to dump sites, then its water is polluted
Towards spatial support and
spatial confidence
• Need appropriate definitions
– Option 1: define “close” with a threshold of distance
– Option 2: convert distance to a [0:1] -score (degree of
closeness) and use fuzzy association rule ideas
Towards spatial support and
spatial confidence
[Figure: objects classified as "close" or "not close" by a threshold for distance]
Towards spatial support and
spatial confidence
[Figure: objects with a score for closeness: 0.1, 0, 0.4, 0.6, 0.8, 1]
Advantages of thresholding
• Simpler
• Can use standard support and confidence
measures
Approach taken by Koperski and Han (1995),
Gidofalvi and Pedersen (2005)
[Figure: distance classes "not close" and "close"]
Advantages of distance
conversion
• More versatile
• Correct “guess” of the thresholds not so critical
as in the one-threshold case
Approach taken by Chawla & Verhein (2006, 2008)
[Figure: distance classes "not close", "somewhat close", and "close"]
Spatial association rules
• The antecedent and/or the consequent are spatial
(often involving spatial proximity)
• Transaction ≈ occurrence of object from
antecedent
– houses [close to sea]
– hotels [close to touristy sites]
– lakes [close to dump sites]
Spatial support and spatial
confidence
• Spatial support: sum over the objects of the
degree for which the rule is true for that object
• Spatial confidence: spatial support divided by
total sum of antecedent scores
[Figure: six houses with closeness scores 0.1, 0, 0.4, 0.6, 0.8, 1]
Example: “house is close to sea → expensive”
Assume all houses are expensive
except for the left one (the house with score 0.6)
Spatial support = 2.3
Spatial confidence = 2.3 / 2.9
Spatial antecedents and spatial
consequents
• “If a village is close to a highway intersection,
then it is close to a motel”
[Figure: three village and motel configurations]
Village 1: antecedent score 1, consequent score 0.5; spatial support = 0.5, spatial confidence = 0.5
Village 2: antecedent score 0.5, consequent score 1; spatial support = 0.5, spatial confidence = 1
Village 3: antecedent score 0.5, consequent score 0.5; spatial support = 0.25, spatial confidence = 0.5
Spatial support and spatial
confidence: definition
• Rule: A → C
on e.g. houses from a set H
• Spatial support:
$$\sum_{h \in H} \mathrm{score}_A(h) \cdot \mathrm{score}_C(h)$$
• Spatial confidence:
$$\frac{\sum_{h \in H} \mathrm{score}_A(h) \cdot \mathrm{score}_C(h)}{\sum_{h \in H} \mathrm{score}_A(h)}$$
All scores are in [0:1]
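The two definitions can be checked against the house example with a short sketch. It takes the product of antecedent and consequent score per object, which matches the village examples (1 · 0.5 = 0.5 and 0.5 · 0.5 = 0.25), and it assumes that the one house that is not expensive is the house with score 0.6, which is what the stated support of 2.3 implies. Function names are illustrative.

```python
def spatial_support(score_a, score_c):
    """Sum over the objects of antecedent score times consequent score."""
    return sum(a * c for a, c in zip(score_a, score_c))

def spatial_confidence(score_a, score_c):
    """Spatial support divided by the total sum of antecedent scores."""
    total_a = sum(score_a)
    return spatial_support(score_a, score_c) / total_a if total_a else 0.0

# House example: closeness to the sea (antecedent) and expensive yes/no (consequent).
closeness = [0.1, 0.0, 0.4, 0.6, 0.8, 1.0]
expensive = [1, 1, 1, 0, 1, 1]  # assumption: the house with score 0.6 is not expensive

support = spatial_support(closeness, expensive)
confidence = spatial_confidence(closeness, expensive)
```

With 0/1 scores on both sides the two functions reduce to the standard support count and confidence ratio, so the thresholding approach is a special case.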
Spatio-temporal data
• Locations have a time stamp
• Interesting patterns involve space and time
• Examples
– earthquakes have an epicenter and a time of occurrence
– trees have a location and a day of first blooming
– traffic jams have a location and a start time
Summary
• There are many types of geographical analysis;
analysis is the main task of a GIS
• Overlay analysis is the most important type
• Auto-correlation, modeling, network analysis
are also important
• Spatial and spatio-temporal data mining gives
new types of analysis of geographic data