Agenda Today
We will discuss a few interesting spatial data mining patterns
Then come back to summarize what we have learned in this course so far
Spatial Data Management: Summary
Course Summary
1. Introduction to Spatial Databases
2. Spatial Concepts and Data Models
3. Spatial Query Languages: SQL3
4. Spatial Storage and Indexing: R-tree, Grid File
5. Query Processing and Query Optimization
   Strategies for range query, nearest neighbor query
   Spatial joins (e.g. tree matching), cost models
6. Spatial Network Model
7. Spatial Data Mining
   Spatial auto-correlation, co-location patterns, spatial outliers, classification methods
8. Trends in Spatial Databases (Moving Objects)
1. Introduction
Traditional (non-spatial) database management systems provide:
Persistence across failures
Concurrent access to data
Scalability to search queries on very large datasets that do not fit in the main memory of a computer
They are efficient for non-spatial queries, but not for spatial queries
Non-spatial queries:
List the names of all bookstores with more than ten thousand titles.
List the names of the top ten customers, in terms of sales, in the year 2001.
Use an index to narrow down the search
Spatial queries:
List the names of all bookstores within ten miles of Minneapolis.
List all customers who live in Tennessee and its adjoining states.
List all the customers who reside within fifty miles of the company headquarters.
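The contrast between the two query types can be sketched in a few lines of Python. This is only an illustration, not how an SDBMS evaluates such queries: the bookstore rows, their coordinates, and the helper names (`haversine_miles`, `stores_with_many_titles`, `stores_within`) are all hypothetical, and the distance test is a plain scan using the haversine great-circle formula rather than an index lookup.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Hypothetical table: (name, number of titles, latitude, longitude)
BOOKSTORES = [
    ("Store A", 12000, 44.98, -93.26),
    ("Store B", 8000, 44.98, -93.20),
    ("Store C", 25000, 46.79, -92.10),
]
MINNEAPOLIS = (44.9778, -93.2650)  # approximate city-center coordinates


def stores_with_many_titles(min_titles):
    """Non-spatial query: a comparison on an ordinary, totally ordered attribute."""
    return [name for name, titles, _, _ in BOOKSTORES if titles > min_titles]


def stores_within(center, radius_miles):
    """Spatial query: the predicate depends on the geometry, not on a sortable key."""
    return [name for name, _, lat, lon in BOOKSTORES
            if haversine_miles(center[0], center[1], lat, lon) <= radius_miles]


print(stores_with_many_titles(10000))
print(stores_within(MINNEAPOLIS, 10))
```

The first query could be answered with a B-tree on the number of titles; the second has no single sortable key, which is what motivates the spatial access methods discussed later.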
1. Spatial Data Examples
Examples of non-spatial data
Names, phone numbers, …
Examples of Spatial data
Census Data
NASA satellite imagery - terabytes of data per day
Weather and Climate Data
Rivers, Farms, ecological impact
Medical Imaging
2. Spatial Object Model
Object model concepts
Objects: distinct identifiable things relevant to an application
Objects have attributes and operations
Attribute: a simple (e.g. numeric, string) property of an object
Operations: functions that map object attributes to other objects
Example from a roadmap
Objects: roads, landmarks, ...
Attributes of road objects:
• spatial: location, e.g. polygon boundary of land-parcel
• non-spatial: name (e.g. Route 66), type (e.g. interstate, residential street), number of lanes, speed limit, …
Operations on road objects: determine center line, determine length, determine intersection with other roads, ...
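The road example maps naturally onto a small class. The sketch below is hypothetical: it assumes the spatial attribute is stored as a center-line polyline of planar (x, y) vertices (the slide mentions a polygon boundary instead), and the attribute names and sample values are made up for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Road:
    # Non-spatial attributes
    name: str
    road_type: str
    lanes: int
    speed_limit: int
    # Spatial attribute: assumed to be a center-line polyline of (x, y) vertices
    centerline: list = field(default_factory=list)

    def length(self):
        """Operation: sum of the segment lengths along the center line."""
        return sum(math.dist(p, q) for p, q in zip(self.centerline, self.centerline[1:]))


main_street = Road("Main Street", "residential street", 2, 25, [(0, 0), (3, 4), (3, 10)])
print(main_street.length())
```

An operation such as `length` takes the object's attributes as input and returns a new value, which is the sense in which operations map attributes to other objects.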
2. Classifying Spatial objects
• Spatial objects are spatial attributes of general objects
• Spatial objects are of many types
• Simple
• 0-dimensional (points), 1-dimensional (curves), 2-dimensional (surfaces)
• Examples are given at the bottom of this slide
• Collections
• Polygon collection (e.g. boundary of Japan or Hawaii), …
• See more complete list in Figure 2.2

Spatial Object Type   Example Object   Dimension
Point                 City             0
Curve                 River            1
Surface               Country          2
2. Spatial Object Types in OGIS Data Model
Fig 2.2: Each rectangle shows a distinct spatial object type
2. Classifying Operations on spatial objects in Object Model
• Classifying operations
• Set based: 2-dimensional spatial objects (e.g. polygons) are sets of points
• A set operation (e.g. intersection) of 2 polygons produces another polygon
• Topological operations: Boundary of USA touches boundary of Canada
• Directional: New York City is to the east of Chicago
• Metric: Chicago is about 700 miles from New York City.

Set theory based: Union, Intersection, Containment
Topological: Touches, Disjoint, Overlap, etc.
Directional: East, North-West, etc.
Metric: Distance
2. Specifying topological operations
Fig 2.3: 9-intersection matrices for a few topological operations
2. Conceptual DM: The ER Model
3 basic concepts
Entities have an independent conceptual or physical existence.
• Examples: Forest, Road, Manager, ...
Entities are characterized by Attributes
• Example: Forest has attributes of name, elevation, etc.
An Entity interacts with another Entity through relationships.
• Roads allow access to Forest interiors.
• This relationship may be named “Accesses”
2. ER Diagram for “State-Park”
Fig 2.4
Pictorial Enhanced ER Diagram for “State-Park”
2. Mapping ER to Relational
• Highlights of translation rules
• Entity becomes Relation
• Attributes become columns in the relation
• Multi-valued attributes become a new relation
• includes foreign key to link to relation for the entity
• Relationships (1:1, 1:N) become foreign keys
• M:N Relationships become a relation
• containing foreign keys to the relations of the participating entities
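The translation rules can be tried out on the Forest, Road and “Accesses” example. The sketch below uses Python's built-in sqlite3 module and assumes, for illustration only, that “Accesses” is many-to-many; the column names, key choices and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each entity becomes a relation; its attributes become columns.
    CREATE TABLE Forest (name TEXT PRIMARY KEY, elevation REAL);
    CREATE TABLE Road   (name TEXT PRIMARY KEY, num_lanes INTEGER);

    -- The M:N relationship "Accesses" becomes its own relation,
    -- holding foreign keys to both participating entities.
    CREATE TABLE Accesses (
        road_name   TEXT REFERENCES Road(name),
        forest_name TEXT REFERENCES Forest(name),
        PRIMARY KEY (road_name, forest_name)
    );

    INSERT INTO Forest VALUES ('Pine Forest', 350.0), ('Oak Forest', 120.0);
    INSERT INTO Road   VALUES ('Road 1', 2), ('Road 2', 4);
    INSERT INTO Accesses VALUES
        ('Road 1', 'Pine Forest'), ('Road 1', 'Oak Forest'), ('Road 2', 'Oak Forest');
""")

# Which roads give access to a given forest?
roads_into_oak = [row[0] for row in conn.execute(
    "SELECT road_name FROM Accesses WHERE forest_name = ? ORDER BY road_name",
    ("Oak Forest",),
)]
print(roads_into_oak)
```

Had the relationship been 1:N instead, the separate `Accesses` table would not be needed: a single foreign-key column on the “N” side would carry it.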
3. Three Components of SQL?
Data Definition Language (DDL)
Creation and modification of relational schema
Schema objects include relations, indexes, etc.
Data Manipulation Language (DML)
Insert, delete, update rows in tables
Query data in tables
Data Control Language (DCL)
Concurrency control, transactions
Administrative tasks, e.g. set up database users, security permissions
3. Creating Tables in SQL
• Table definition
• “CREATE TABLE” statement
• Specifies table name, attribute names and data types
• Creates a table with no rows.
• See an example at the bottom
• Related statements
• ALTER TABLE statement modifies table schema if needed
• DROP TABLE statement removes a table, including any rows it contains
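A minimal sketch of the three statements, run through Python's built-in sqlite3 module. The River columns follow the INSERT example used elsewhere in this section; the column types and the added `Mouth` column are assumptions made for illustration, and other database systems may differ in the details of ALTER TABLE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE TABLE: names the table, its attributes and their data types.
conn.execute("CREATE TABLE River (Name TEXT, Origin TEXT, Length INTEGER)")
row_count = conn.execute("SELECT COUNT(*) FROM River").fetchone()[0]  # new table is empty

# ALTER TABLE: change the schema of an existing table.
conn.execute("ALTER TABLE River ADD COLUMN Mouth TEXT")
columns = [row[1] for row in conn.execute("PRAGMA table_info(River)")]

# DROP TABLE: remove the table definition together with its rows.
conn.execute("DROP TABLE River")
tables = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()

print(row_count, columns, tables)
```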
3. Populating Tables in SQL
• Adding a row to an existing table
• “INSERT INTO” statement
• Specifies table name, attribute names and values
• Example:
INSERT INTO River(Name, Origin, Length) VALUES('Mississippi', 'USA', 6000)
• Related statements
• SELECT statement with INTO clause can insert multiple rows in a table
• Bulk load, import commands also add multiple rows
• DELETE statement removes rows
• UPDATE statement can change values within selected rows
3. SELECT Statement- General Information
• Clauses
• SELECT specifies desired columns
• FROM specifies relevant tables
• WHERE specifies qualifying conditions for rows
• ORDER BY specifies sorting columns for results
• GROUP BY, HAVING specify aggregation and statistics
• Operators and functions
• arithmetic operators, e.g. +, -, …
• comparison operators, e.g. =, <, >, BETWEEN, LIKE, …
• logical operators, e.g. AND, OR, NOT, EXISTS, …
• set operators, e.g. UNION, IN, ALL, ANY, …
• statistical functions, e.g. SUM, COUNT, ...
• many other operators on strings, date, currency, ...
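One query that exercises every clause listed above may make their roles concrete. The sketch uses Python's built-in sqlite3 module; apart from the Mississippi row taken from the INSERT example in this section, the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE River (Name TEXT, Origin TEXT, Length INTEGER)")
conn.executemany(
    "INSERT INTO River (Name, Origin, Length) VALUES (?, ?, ?)",
    [
        ("Mississippi", "USA", 6000),  # row from the INSERT example
        ("River B", "USA", 3000),      # hypothetical rows from here on
        ("River C", "USA", 800),
        ("River D", "Brazil", 2600),
    ],
)

query = """
    SELECT Origin, COUNT(*) AS n, SUM(Length) AS total  -- desired columns, statistics
    FROM River                                          -- relevant table
    WHERE Length > 1000                                 -- qualifying rows
    GROUP BY Origin                                     -- one group per country
    HAVING COUNT(*) >= 2                                -- qualifying groups
    ORDER BY Origin                                     -- sort order of the result
"""
rows = conn.execute(query).fetchall()
print(rows)
```

WHERE filters individual rows before grouping, while HAVING filters whole groups afterwards; here River C is dropped by WHERE and the single-river Brazil group is dropped by HAVING.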
4. Query Operation & Spatial Index
Filter Step:
Select the objects whose mbb (minimum bounding box) satisfies the spatial predicate
Traverse the index and apply the spatial test on the mbb
Output: set of oids (object identifiers)
Refinement Step:
Spatial test is done on the actual geometries of objects whose mbb satisfied the filter step
Costly operation
Executed only on a limited number of objects
Concentrate on the design of efficient spatial access methods (SAMs) for the filter step
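The two steps can be sketched for a point query ("which polygons contain point p?"). This is a simplified illustration: the filter here scans every mbb instead of traversing an index, the refinement uses the standard ray-casting point-in-polygon test, and the object ids, polygons and function names are hypothetical.

```python
def mbb(polygon):
    """Minimum bounding box (xmin, ymin, xmax, ymax) of a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def mbb_contains(box, p):
    """Cheap test used by the filter step."""
    xmin, ymin, xmax, ymax = box
    return xmin <= p[0] <= xmax and ymin <= p[1] <= ymax


def point_in_polygon(p, polygon):
    """Exact (costlier) test used by the refinement step: ray casting."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through p
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def point_query(p, objects):
    """Return (candidate oids after filtering, oids after refinement)."""
    candidates = [oid for oid, poly in objects.items() if mbb_contains(mbb(poly), p)]
    answers = [oid for oid in candidates if point_in_polygon(p, objects[oid])]
    return candidates, answers


OBJECTS = {
    "triangle": [(0, 0), (4, 0), (0, 4)],
    "square": [(5, 5), (7, 5), (7, 7), (5, 7)],
}
print(point_query((3, 3), OBJECTS))  # inside the triangle's mbb, outside the triangle
print(point_query((1, 1), OBJECTS))
```

The point (3, 3) shows why refinement is needed: it passes the filter because it lies inside the triangle's bounding box, but the exact test rejects it.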
4. Why spatial index methods?
B-tree & hash tables
Guarantee that the number of I/O operations is respectively logarithmic and constant in the collection size
Index a collection on a key
Rely on a total order on the key domain, e.g. the order of natural numbers, or the lexicographic order on strings
There is no such total order for geometric objects that preserves spatial proximity
SAMs were designed to try as much as possible to preserve spatial object proximity
4. Space-Driven vs. Data-Driven SAMs
Space-Driven structures:
Partition the embedding 2D space into rectangular cells, independently of the distribution of the objects
Objects are mapped to the cells based on some geometric criterion
Grid file, linear structures
Data-Driven structures:
Organized by partitioning the set of objects, as opposed to the embedding space
Adapt to the objects’ distribution in the embedding space
R-tree, R* tree, R+ tree
4. Grid File – point indexing
One page is associated with each cell
When a cell overflows, it is split into two cells and the points are assigned to the new cells
Two adjacent cells can reference the same page
The cells are of different sizes and the partition adapts to the point distribution
4. The Quad tree
The index is represented as a quaternary tree
Each internal node has four children, one per quadrant
NW, NE, SW, SE
Each leaf is associated with a disk page, which stores the index entries
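A small in-memory sketch of the structure described above, for points only. The "disk page" of a leaf is modeled as a Python list with a fixed capacity, the class and method names are hypothetical, and the sketch assumes distinct points (many copies of the same point would keep splitting forever).

```python
class QuadTree:
    """Point quadtree over the square [x0, x0 + size) x [y0, y0 + size)."""

    def __init__(self, x0, y0, size, capacity=2):
        self.x0, self.y0, self.size = x0, y0, size
        self.capacity = capacity  # stands in for the page size
        self.points = []          # index entries, used while this node is a leaf
        self.children = None      # quadrant name -> QuadTree, once split

    def _quadrant(self, x, y):
        half = self.size / 2
        north_south = "N" if y >= self.y0 + half else "S"
        east_west = "E" if x >= self.x0 + half else "W"
        return north_south + east_west

    def _split(self):
        half = self.size / 2
        x0, y0 = self.x0, self.y0
        self.children = {
            "SW": QuadTree(x0, y0, half, self.capacity),
            "SE": QuadTree(x0 + half, y0, half, self.capacity),
            "NW": QuadTree(x0, y0 + half, half, self.capacity),
            "NE": QuadTree(x0 + half, y0 + half, half, self.capacity),
        }
        points, self.points = self.points, []
        for p in points:  # redistribute the entries among the four quadrants
            self.children[self._quadrant(*p)].insert(p)

    def insert(self, p):
        if self.children is not None:
            self.children[self._quadrant(*p)].insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity:  # leaf "page" overflows
            self._split()

    def contains(self, p):
        """Point query: follow a single root-to-leaf path."""
        if self.children is not None:
            return self.children[self._quadrant(*p)].contains(p)
        return p in self.points


tree = QuadTree(0, 0, 8, capacity=2)
for point in [(1, 1), (6, 1), (1, 6), (6, 6), (7, 7)]:
    tree.insert(point)
print(tree.contains((7, 7)), tree.contains((2, 2)))
```

Because the split always cuts a square into four equal quadrants regardless of where the points lie, this is a space-driven structure in the sense used earlier.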
4. The original R-Tree
A leaf entry is a pair (mbb, oid)
A non-leaf node contains an array of node entries
The number of entries is between m and M
For each entry (dr, node_id) in a non-leaf node N, dr is the directory rectangle of a child node of N, whose page address is node_id
All leaves are at the same level
An object appears in one, and only one, of the tree leaves
4. The R+ Tree
The directory rectangles at a given level do not overlap
For a point query, a single path is followed from the root to a leaf
The I/O complexity is bounded by the depth of the tree
5. What is Query Processing and Optimization (QPO)?
Basic idea of QPO
In SQL, queries are expressed in high level declarative form
QPO translates a SQL query to an execution plan
• over physical data model
• using operations on file structures, indices, etc.
Ideal execution plan answers the query Q in as little time as possible
Constraints: QPO overheads are small
• Computation time for QPO steps << that for execution plan
5. QPO Challenges in SDBMS
Building Blocks for spatial queries
Rich set of spatial data types, operations
A consensus on “building blocks” is lacking
Current choices include spatial select, spatial join, nearest neighbor
Choice of strategies
Limited choice for some building blocks, e.g. nearest neighbor
Choosing best strategies
Cost models are more complex since
• Spatial Queries are both CPU and I/O intensive
• While traditional queries are I/O intensive
Cost models of spatial strategies are not mature.
5. Choice of building blocks
Choice of building blocks
Varies across software vendors and products
List of representative building blocks
Point Query: Name a highlighted city on a digital map.
• Returns one spatial object out of a table
Range Query: List all countries crossed by the river Amazon.
• Returns several objects within a spatial region from a table
Spatial Join: List all pairs of overlapping rivers and countries.
• Returns pairs from 2 tables satisfying a spatial predicate
Nearest Neighbor: Find the city closest to Mount Everest.
• Returns one spatial object from a collection
5. Strategies for Spatial Joins
Recall Spatial Join Example:
List all pairs of overlapping rivers and countries.
Return pairs from 2 tables satisfying a spatial predicate
List of strategies
Nested loop:
• Test all possible pairs for spatial predicate
• All rivers are paired with all countries
Space Partitioning:
• Test pairs of objects from common spatial regions only
• Rivers in Africa are tested with countries in Africa only!
Tree Matching
• Hierarchical pairing of object groups from each table, Section 5.1.6, p. 121
Other, e.g. spatial-join-index based, external plane-sweep, …
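The difference between the first two strategies can be sketched by counting how many pairs each one tests. The sketch below is an assumption-laden simplification: rivers and countries are reduced to hypothetical bounding rectangles (xmin, ymin, xmax, ymax), the partitioning is a uniform grid, and only the filter-level overlap test is shown.

```python
def overlaps(a, b):
    """True if two axis-aligned rectangles (xmin, ymin, xmax, ymax) intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def nested_loop_join(rivers, countries):
    """Test every river against every country."""
    tests, result = 0, set()
    for r_id, r_box in rivers.items():
        for c_id, c_box in countries.items():
            tests += 1
            if overlaps(r_box, c_box):
                result.add((r_id, c_id))
    return result, tests


def cells_of(box, cell_size):
    """Grid cells (i, j) touched by a rectangle."""
    x0, y0, x1, y1 = box
    return {(i, j)
            for i in range(int(x0 // cell_size), int(x1 // cell_size) + 1)
            for j in range(int(y0 // cell_size), int(y1 // cell_size) + 1)}


def partition_join(rivers, countries, cell_size):
    """Only test pairs of objects that share at least one grid cell."""
    buckets = {}
    for c_id, c_box in countries.items():
        for cell in cells_of(c_box, cell_size):
            buckets.setdefault(cell, []).append(c_id)
    tests, result, seen = 0, set(), set()
    for r_id, r_box in rivers.items():
        for cell in cells_of(r_box, cell_size):
            for c_id in buckets.get(cell, []):
                if (r_id, c_id) in seen:  # pair already tested via another cell
                    continue
                seen.add((r_id, c_id))
                tests += 1
                if overlaps(r_box, countries[c_id]):
                    result.add((r_id, c_id))
    return result, tests


RIVERS = {"r1": (1, 1, 3, 2), "r2": (21, 21, 23, 22)}
COUNTRIES = {"c1": (0, 0, 5, 5), "c2": (20, 20, 25, 25), "c3": (40, 0, 45, 5)}
print(nested_loop_join(RIVERS, COUNTRIES))
print(partition_join(RIVERS, COUNTRIES, cell_size=10))
```

Both functions return the same pairs, but the partitioned version tests fewer of them. An object that spans several cells is registered in each one, which is why the `seen` set is needed to avoid testing a pair twice.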
5. Query Processing and Optimizer process
• A sight-seeing trip
• Start: A SQL Query
• End: An execution plan
• Intermediate Stopovers
• query trees
• logical tree transforms
• strategy selection
• What happens after the journey?
• Execution plan is executed
• Query answer returned
Fig 5.2
5. Query Trees
• Nodes = building blocks of (spatial) queries
• See Section 3.2 (p. 55) for symbols sigma, pi and join
• Children = inputs to a building block
• Leaves = Tables
• Example SQL query and its query tree follow:
Fig 5.3
5. Logical Transformation of Query Trees
• Motivation
• Transformations do not change the answer of the query
• But can reduce computational cost by
• reducing data produced by sub-queries
• reducing computation needs of parent node
• Example Transformation
• Push down select operation below join
• Example: Fig. 5.4 (compare w/ Fig 5.3, last slide)
• Reduces size of table for join operation
• Other common transformations
• Push project down
• Reorder join operations
• ...
Fig 5.4
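The effect of pushing a select below a join can be shown by counting the pairs the join has to examine. The sketch below is illustrative only: the facility and lake rows are hypothetical, geometries are reduced to points, and the join is a simple nested loop with a distance predicate.

```python
import math

# Hypothetical tables: (id, name, location) and (id, area, location)
FACILITIES = [
    ("f1", "Campground", (0, 0)),
    ("f2", "Marina", (10, 0)),
    ("f3", "Campground", (200, 0)),
    ("f4", "Marina", (5, 5)),
]
LAKES = [
    ("l1", 30, (20, 0)),
    ("l2", 5, (10, 10)),
    ("l3", 50, (210, 0)),
]


def join_then_select():
    """Original tree: join everything, then apply the selections."""
    pairs_tested, out = 0, []
    for f_id, f_name, f_pos in FACILITIES:
        for l_id, l_area, l_pos in LAKES:
            pairs_tested += 1
            if math.dist(f_pos, l_pos) < 50 and f_name == "Campground" and l_area > 20:
                out.append((f_id, l_id))
    return out, pairs_tested


def select_then_join():
    """Transformed tree: selections pushed below the join."""
    campgrounds = [f for f in FACILITIES if f[1] == "Campground"]
    big_lakes = [l for l in LAKES if l[1] > 20]
    pairs_tested, out = 0, []
    for f_id, _, f_pos in campgrounds:
        for l_id, _, l_pos in big_lakes:
            pairs_tested += 1
            if math.dist(f_pos, l_pos) < 50:
                out.append((f_id, l_id))
    return out, pairs_tested


print(join_then_select())
print(select_then_join())
```

Both plans return the same pairs, but the second evaluates the costly distance predicate on 4 pairs instead of 12, because each input to the join was reduced first.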
5. Execution Plans
An execution plan has 3 components
A query tree
An ordering of evaluation of non-leaf nodes
A strategy selected for each non-leaf node
Example
Strategies for the query tree in Fig. 5.5
Fig 5.5
• Use scan for Area(L.Geometry) > 20
• Use index for Fa.Name = 'Campground'
• Use space-partitioning join for Distance(Fa, L) < 50
• Use on-the-fly for projection
Ordering
• As listed above
7. What is Spatial Data Mining?
Non-trivial search for interesting and unexpected spatial patterns
Non-trivial Search
Large (e.g. exponential) search space of plausible hypotheses
Ex. Asiatic cholera: causes: water, food, air, insects, …; water delivery mechanisms: numerous pumps, rivers, ponds, wells, pipes, ...
Interesting
Useful in certain application domains
Ex. Shutting off the identified water pump => saved human lives
Unexpected
Pattern is not common knowledge
May provide a new understanding of the world
Ex. Water pump - Cholera connection led to the “germ” theory
7. Choice of Methods
Two Approaches to mining Spatial Data
Pick spatial features; use classical DM methods
Use novel spatial data mining techniques
Possible Approach:
Define the problem: capture special needs
Explore data using maps, other visualization
Try reusing classical DM methods
If classical DM methods perform poorly, try new methods
Evaluate chosen methods rigorously
Performance tuning as needed
7. Location Prediction as a classification problem
Given:
1. Spatial framework: $S = \{s_1, \ldots, s_n\}$
2. Explanatory functions: $f_{X_k} : S \to \mathbb{R}$
3. A dependent class: $f_C : S \to C = \{c_1, \ldots, c_M\}$
4. A family $\mathcal{F}$ of function mappings: $\mathbb{R} \times \cdots \times \mathbb{R} \to C$
Find: Classification model: $\hat{f}_C \in \mathcal{F}$
Objective: maximize classification_accuracy($\hat{f}_C$, $f_C$)
Constraints:
Spatial autocorrelation exists
Color version of Fig. 7.3, p. 188 (panels: Nest locations, Distance to open water, Vegetation durability, Water depth)
7. Techniques for Location Prediction
Classical methods:
logistic regression, decision trees, Bayesian classifier
assume learning samples are independent of each other
Spatial auto-correlation violates this assumption!
Q? What would a map look like if the properties of a pixel were independent of the properties of other pixels? (see Fig. 7.4, p. 189)
New spatial methods
Spatial auto-regression (SAR),
Markov random field
• Bayesian classifier
7. Spatial AutoRegression (SAR)
• Spatial Autoregression Model (SAR)
• $y = \rho W y + X \beta + \varepsilon$
• $W$ models neighborhood relationships
• $\rho$ models strength of spatial dependencies
• $\varepsilon$: error vector
• Solutions
• $\rho$ and $\beta$ can be estimated using ML or Bayesian statistics
• e.g., spatial econometrics package uses Bayesian approach using sampling-based Markov Chain Monte Carlo (MCMC) method.
• Likelihood-based estimation requires $O(n^3)$ operations.
• Other alternatives: divide and conquer, sparse matrix, LU decomposition, etc.
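A short derivation may help show where the cubic cost comes from. Assuming the errors are i.i.d. Gaussian, $\varepsilon \sim N(0, \sigma^2 I)$ (an assumption not stated on the slide), and that $I - \rho W$ is invertible, the model can be solved for $y$ and its log-likelihood written down:

```latex
\begin{aligned}
y &= \rho W y + X\beta + \varepsilon \\
(I - \rho W)\, y &= X\beta + \varepsilon \\
y &= (I - \rho W)^{-1} (X\beta + \varepsilon) \\[4pt]
\ln L(\rho, \beta, \sigma^2) &= \ln\lvert I - \rho W \rvert
  - \frac{n}{2}\ln(2\pi\sigma^2)
  - \frac{1}{2\sigma^2}\,(y - \rho W y - X\beta)^{\top}(y - \rho W y - X\beta)
\end{aligned}
```

With $\rho = 0$ the model reduces to classical linear regression. The term $\ln\lvert I - \rho W \rvert$ involves the determinant of an $n \times n$ matrix, which costs $O(n^3)$ with a dense factorization such as LU; this is consistent with the cubic cost quoted for likelihood-based estimation and suggests why sparse-matrix techniques are attractive when $W$ has few non-zero entries.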
7. Associations, Spatial associations, Co-location
7. Association Rules: Formal Definitions
Consider a set of items, $I = \{i_1, \ldots, i_k\}$
Consider a set of transactions $T = \{t_1, \ldots, t_n\}$, where each $t_i$ is a subset of $I$.
Support of $C$: $\sigma(C) = \{t \mid t \in T,\ C \subseteq t\}$
Then $i_1 \Rightarrow i_2$ iff
Support: $i_1 \cup i_2$ occurs in at least $s$ percent of the transactions: $\dfrac{|\sigma(i_1 \cup i_2)|}{|T|}$
Confidence: at least $c$%: $\dfrac{|\sigma(i_1 \cup i_2)|}{|\sigma(i_1)|}$
Example: Table 7.4 (p. 202) using data in Section 7.4
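The two measures are easy to compute directly from their definitions. In the sketch below the transactions are made-up market-basket data and the function names are hypothetical; itemsets are Python sets, so "C is a subset of t" is written `c <= t`.

```python
def sigma(itemset, transactions):
    """Transactions that contain every item of the itemset."""
    return [t for t in transactions if itemset <= t]


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return len(sigma(itemset, transactions)) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return len(sigma(lhs | rhs, transactions)) / len(sigma(lhs, transactions))


TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support({"bread", "milk"}, TRANSACTIONS))       # 2 of 4 transactions
print(confidence({"bread"}, {"milk"}, TRANSACTIONS))  # 2 of the 3 bread transactions
```

A rule such as bread => milk is reported only if both numbers clear the chosen thresholds s and c.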
7. Co-location rules vs. association rules
                                 Association rules      Co-location rules
Underlying space                 discrete sets          continuous space
Item-types                       item-types             events / Boolean spatial features
Collection                       transaction (T)        neighborhood (N)
Prevalence measure               support                participation index
Conditional probability metric   Pr[A in T | B in T]    Pr[A in N(L) | B at location L]

Participation index = min{pr(fi, c)}
where pr(fi, c), the participation ratio of feature fi in co-location c = {f1, f2, …, fk},
is the fraction of instances of fi with features {f1, …, fi-1, fi+1, …, fk} nearby
N(L) = neighborhood of location L
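For a co-location of just two features the participation index can be computed directly from the definition. The sketch below makes several assumptions for illustration: "nearby" means within Euclidean distance d, the instances and feature names are made up, and only the two-feature case is handled (larger co-locations need a proper enumeration of neighboring instance groups).

```python
import math


def participation_ratio(feature, other, instances, d):
    """Fraction of `feature` instances with some `other` instance within distance d."""
    own = [p for f, p in instances if f == feature]
    others = [p for f, p in instances if f == other]
    hits = sum(1 for p in own if any(math.dist(p, q) <= d for q in others))
    return hits / len(own)


def participation_index(f1, f2, instances, d):
    """Minimum of the participation ratios of the two features."""
    return min(participation_ratio(f1, f2, instances, d),
               participation_ratio(f2, f1, instances, d))


# Hypothetical instances: (feature type, location)
INSTANCES = [
    ("A", (0, 0)), ("A", (10, 10)), ("A", (50, 50)),
    ("B", (1, 0)), ("B", (11, 10)),
]

print(participation_ratio("A", "B", INSTANCES, d=2))  # 2 of the 3 A instances
print(participation_ratio("B", "A", INSTANCES, d=2))  # both B instances
print(participation_index("A", "B", INSTANCES, d=2))
```

Taking the minimum means a co-location is only considered prevalent if every participating feature is frequently found near the others, not just one of them.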
7. Spatial Outlier Detection
• Compute $Z_{S(x)} = \left| \dfrac{S(x) - \mu_s}{\sigma_s} \right|$
where $S(x) = f(x) - E_{y \in N(x)}(f(y))$
• Select points (e.g. those with $Z_{S(x)}$ above 3)
7. Spatial Outlier Detection: Example
Color version of Fig. 7.19, p. 219
Given
A spatial graph $G = \{V, E\}$
A neighbor relationship (K neighbors)
An attribute function $f : V \to \mathbb{R}$
Find
$O = \{v_i \mid v_i \in V,\ v_i \text{ is a spatial outlier}\}$
Spatial Outlier Detection Test
1. Choice of spatial statistic: $S(x) = f(x) - E_{y \in N(x)}(f(y))$
2. Test for outlier detection: $\left| \dfrac{S(x) - \mu_s}{\sigma_s} \right| > \theta$
Rationale:
Theorem: $S(x)$ is normally distributed if $f(x)$ is normally distributed
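The test can be sketched in a few lines using only the standard library. The graph, attribute values and function name below are hypothetical; the neighborhood average stands in for the expectation over N(x), the population standard deviation is used for the spread of S, and the sketch assumes S is not constant (otherwise the division by the standard deviation fails).

```python
import statistics


def spatial_outliers(values, neighbors, theta=3.0):
    """Nodes whose difference from their neighborhood average is extreme."""
    # S(x) = f(x) - average of f over the neighbors of x
    s = {v: values[v] - statistics.mean(values[n] for n in neighbors[v])
         for v in values}
    mu = statistics.mean(s.values())
    sigma = statistics.pstdev(s.values())
    return [v for v in values if abs((s[v] - mu) / sigma) > theta]


# Hypothetical data: 20 nodes on a line, each linked to its immediate neighbors.
N = 20
NEIGHBORS = {i: [j for j in (i - 1, i + 1) if 0 <= j < N] for i in range(N)}
VALUES = {i: 10 for i in range(N)}
VALUES[10] = 100  # one node differs sharply from its surroundings

print(spatial_outliers(VALUES, NEIGHBORS))
```

Nodes 9 and 11 also get a non-zero S because their neighborhood average is pulled up by node 10, but only node 10 exceeds the threshold of 3.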
8. Spatiotemporal Data
Two types of problems:
Indexing the current positions and movements of objects and querying their anticipated future positions.
Indexing and querying the past movements of mobile objects.
On Indexing Mobile Objects
Indexing the Positions of Continuously Moving Objects
Spatiotemporal Data (cont’d)
Indexing current/future locations of mobile objects
The TPR-tree
• Like the R-tree, but the MBRs are time-parameterized to conservative bounding intervals (CBI).
• How are the CBI computed? What is the best way to group objects into a CBI?
– By minimizing an objective function (e.g., overlap) over the time the TPR-tree is valid.
• How do we answer queries using the TPR-tree?
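A one-dimensional sketch of a conservative bounding interval, assuming objects move linearly. The lower bound is taken to start at the smallest position and move with the smallest velocity, and the upper bound to start at the largest position and move with the largest velocity, so the interval never loses an object at any later time. Function names and the sample objects are hypothetical, and a real TPR-tree applies this per dimension and per tree node.

```python
def make_cbi(objects):
    """objects: list of (position at reference time, velocity)."""
    positions = [x for x, _ in objects]
    velocities = [v for _, v in objects]
    return (min(positions), max(positions), min(velocities), max(velocities))


def cbi_at(cbi, dt):
    """Bounding interval dt >= 0 time units after the reference time."""
    lo, hi, v_lo, v_hi = cbi
    return (lo + v_lo * dt, hi + v_hi * dt)


def may_intersect(cbi, dt, q_lo, q_hi):
    """Filter test for a range query [q_lo, q_hi] about time dt."""
    lo, hi = cbi_at(cbi, dt)
    return lo <= q_hi and q_lo <= hi


OBJECTS = [(0, 1), (10, -1), (5, 2)]
CBI = make_cbi(OBJECTS)

print(cbi_at(CBI, 0))
print(cbi_at(CBI, 3))
print(may_intersect(CBI, 0, 15, 20), may_intersect(CBI, 3, 15, 20))
```

The interval is conservative rather than tight: it keeps growing over time even when the objects themselves bunch together, which is one reason the grouping of objects matters for keeping overlap low.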
Conclusion
Good progress… still more work is needed:
Devising clean and complete semantics for data models and operators for spatial data, spatial-temporal data
Efficient implementation
Indexing, query processing, query optimization, cost models
Develop efficient algorithms to mine spatial data
Alternative architectures
• spatial-temporal data, moving objects
• mobile, wireless applications
• web GIS