CS206 --- Electronic Commerce

What have we covered?
Indexing and Hashing
Data warehouse and OLAP
Data Mining
Information Retrieval and Web Mining
XML and XQuery
Spatial Databases
Transaction Management
Lecture 6:
Spatial Data Management
Types of Spatial Data
 Point Data
 Points in a multidimensional space
 E.g., Raster data such as satellite imagery, where
each pixel stores a measured value
 E.g., Feature vectors extracted from text
 Region Data
 Objects have spatial extent with location and
boundary
 DB typically uses geometric approximations
constructed using line segments, polygons, etc.,
called vector data.
Applications of Spatial Data
 Geographic Information Systems (GIS)
 E.g., ESRI’s ArcInfo; OpenGIS Consortium
 Geospatial information
 All classes of spatial queries and data are common
 Computer-Aided Design/Manufacturing
 Store spatial objects such as surface of airplane fuselage
 Range queries and spatial join queries are common
 Multimedia Databases
 Images, video, text, etc. stored and retrieved by content
 First converted to feature vector form; high dimensionality
 Nearest-neighbor queries are the most common
Types of Spatial Queries
 Spatial Range Queries
 Find all cities within 50 miles of Madison
 Query has associated region (location, boundary)
 Answer includes overlapping or contained data
regions
 Nearest-Neighbor Queries
 Find the 10 cities nearest to Madison
 Results must be ordered by proximity
 Spatial Join Queries
 Find all cities near a lake
 Expensive, join condition involves regions and
proximity
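The first two query types can be illustrated with a brute-force sketch in Python. The coordinates below are invented planar positions in miles (an assumption for illustration, not real geography), and range_query and nearest_neighbors are hypothetical helper names. A real system would answer these queries through a spatial index rather than a full scan.

```python
import heapq
import math

# Hypothetical planar coordinates in miles; not real geography.
CITIES = {
    "Madison": (0.0, 0.0),
    "Verona": (-8.0, -6.0),
    "Sun Prairie": (10.0, 7.0),
    "Janesville": (5.0, -38.0),
    "Milwaukee": (75.0, 0.0),
}

def distance(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def range_query(cities, center, radius):
    """Names of all cities within `radius` of `center` (alphabetical)."""
    return sorted(name for name, pos in cities.items()
                  if distance(pos, center) <= radius)

def nearest_neighbors(cities, center, k):
    """The k cities closest to `center`, ordered by proximity."""
    return heapq.nsmallest(k, cities,
                           key=lambda name: distance(cities[name], center))

madison = CITIES["Madison"]
others = {n: p for n, p in CITIES.items() if n != "Madison"}
within_50 = range_query(others, madison, 50.0)
nearest_2 = nearest_neighbors(others, madison, 2)
```

The range query returns an unordered set of matches (sorted here only for readability), while the nearest-neighbor result is ordered by distance, which is the distinction the slide draws.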
Spatial Indexing
 Point Access Methods (PAMs) vs Spatial
Access Methods (SAMs)
 PAM: index only point data
 Hierarchical (tree-based) structures
 Multidimensional Hashing
 Space filling curve
 SAM: index both points and regions
 Transformations
 Overlapping regions
 Clipping methods (non-overlapping)
 Data partitioning vs Space partitioning
Single-Dimensional Indexes
 B+ trees are fundamentally single-dimensional
indexes.
 When we create a composite search key B+ tree,
e.g., an index on <age, sal>,
we effectively linearize the 2-dimensional space
since we sort entries first by age and then by sal.
Consider entries:
<11, 80>, <12, 10>
<12, 20>, <13, 75>
[Figure: the entries plotted with AGE (11 to 13) on the x-axis and SAL (10 to 80) on the y-axis, connected in B+ tree order]
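The linearization can be reproduced with a short Python sketch. It uses the four entries from the slide; the helper name positions is hypothetical, and a sorted list stands in for the leaf level of a B+ tree.

```python
# Entries are (age, sal) pairs, as on the slide.
entries = [(13, 75), (12, 20), (11, 80), (12, 10)]

# Python compares tuples lexicographically, which matches a composite
# <age, sal> key: order by age first, then by sal within equal ages.
btree_order = sorted(entries)

def positions(order, predicate):
    """Positions in the sorted order of the entries satisfying predicate."""
    return [i for i, entry in enumerate(order) if predicate(entry)]

# A condition on the leading attribute maps to one contiguous run.
age_hits = positions(btree_order, lambda e: e[0] == 12)

# A range on sal alone is scattered across the order.
sal_hits = positions(btree_order, lambda e: 70 <= e[1] <= 90)
```

The two entries with sal between 70 and 90 sit at opposite ends of the sorted order, so a scan on sal alone cannot be confined to a contiguous part of the index.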
Multidimensional Indexes
 A multidimensional index clusters entries so as to
exploit “nearness” in multidimensional space.
 Keeping track of entries and maintaining a balanced
index structure presents a challenge!
Consider entries:
<11, 80>, <12, 10>
<12, 20>, <13, 75>
[Figure: the same entries plotted in two dimensions, contrasting spatial clusters with B+ tree order]
Motivation for Multidimensional
Indexes
 Spatial queries (GIS, CAD).
 Find all hotels within a radius of 5 miles from the
conference venue.
 Find the city with population 500,000 or more that is
nearest to Kalamazoo, MI.
 Find all cities that lie on the Nile in Egypt.
 Find all parts that touch the fuselage (in a plane design).
 Similarity queries (content-based retrieval).
 Given a face, find the five most similar faces.
 Multidimensional range queries.
 50 < age < 55 AND 80K < sal < 90K
What’s the difficulty?
 An index based on spatial location needed.
 One-dimensional indexes don’t support
multidimensional searching efficiently. (Why?)
 Hash indexes only support point queries; want to
support range queries as well.
 Must support inserts and deletes gracefully.
 Ideally, want to support non-point data as
well (e.g., lines, shapes).
PAMs
Point Access Methods
Hierarchical methods: kd-tree based
Space Filling Curves: Z-ordering
Multidimensional Hashing: Grid File
Exponential growth of the directory
The problem
 Given a point set and a rectangular query, find the
points enclosed in the query
 We allow insertions/deletions online
[Figure: a point set with a rectangular query region]
Tree-based PAMs
Most tree-based PAMs are based on the kd-tree
kd-tree is a main memory binary tree
for indexing k-dimensional points
Needs to be adapted for the disk model
Levels rotate among the dimensions,
partitioning the space based on a value
for that dimension
kd-tree is not necessarily balanced
kd-tree
 At each level we use a different dimension
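[Figure: kd-tree over points A to E. The root splits on x=5 (x<5 on one side, x>=5 on the other); the other splits shown are y=3, y=6, and x=6.]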
Kd-tree properties
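Height of the tree: O(log2 n) when the tree is balanced
Search time for exact match: O(log2 n) when the tree is balanced
Search time for range query in 2-D: O(n^(1/2) + k), where k is the number of points reported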
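A minimal main-memory kd-tree can be sketched in Python as follows. The class and function names are illustrative, points are inserted one at a time without rebalancing (so the height bound holds only for favorable insertion orders), and the sample points are made up.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class Node:
    point: Point
    axis: int                      # dimension this node splits on
    left: Optional["Node"] = None  # points with coordinate < split value
    right: Optional["Node"] = None # points with coordinate >= split value

def insert(root, point, depth=0):
    """Insert a point; the split dimension rotates with the depth."""
    if root is None:
        return Node(point, depth % len(point))
    if point[root.axis] < root.point[root.axis]:
        root.left = insert(root.left, point, depth + 1)
    else:
        root.right = insert(root.right, point, depth + 1)
    return root

def range_search(node, lo, hi, out):
    """Collect points p with lo[i] <= p[i] <= hi[i] in every dimension."""
    if node is None:
        return out
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        out.append(node.point)
    a = node.axis
    if lo[a] < node.point[a]:      # query reaches into the left half-space
        range_search(node.left, lo, hi, out)
    if hi[a] >= node.point[a]:     # query reaches into the right half-space
        range_search(node.right, lo, hi, out)
    return out

points = [(5, 4), (2, 6), (8, 3), (3, 1), (7, 7), (6, 2)]
root = None
for p in points:
    root = insert(root, p)

found = sorted(range_search(root, (2, 1), (6, 4), []))
```

The search prunes a subtree whenever the query rectangle lies entirely on the other side of the node's split value, which is where the saving over a full scan comes from.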
kd-tree example
[Figure: kd-tree example with split lines x=3, x=5, x=7, x=8, y=2, y=5, and y=6]
External memory kd-trees
 Similar to B-tree, tree nodes split many ways
instead of two ways
 insertion becomes quite complex and expensive.
 No storage utilization guarantee since when a higher
level node splits, the split has to be propagated all the
way to leaf level resulting in many empty blocks.
 Pack many interior nodes (forming a subtree)
into a block.
 it may not be feasible to group nodes at lower level into
a block productively.
 Many interesting papers on how to optimally pack nodes
into blocks recently published.
Z-Curve
What is a Z-curve?
 A space filling curve
 Generated by interleaving the bits of the
x and y coordinates
See Fig. 4.6
Alternative generation method
see Fig. 4.5
Connecting points by z-order
see Fig. 4.4
looks like Ns or Zs
 Implementing file operations
[Fig. 4.6: generating z-values by bit interleaving. Fig. 4.4: points connected in z-order.]
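Bit interleaving can be sketched in a few lines of Python. The function name z_value is illustrative, and taking the x bit before the y bit at each position is an assumed convention; the figures referenced above may use the opposite order.

```python
def z_value(x, y, bits):
    """Interleave the bits of x and y (x bit first) into one integer."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)  # next most significant bit of x
        z = (z << 1) | ((y >> i) & 1)  # then the matching bit of y
    return z

# Visit a 4 x 4 grid in z-order.
cells = [(x, y) for x in range(4) for y in range(4)]
z_order = sorted(cells, key=lambda c: z_value(c[0], c[1], 2))
```

With this convention the neighboring cells (1, 1) and (2, 1) receive z-values 3 and 9, a small instance of the effect noted for object C: cells that are close in space are not always close on the curve.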
Example of Z-values
Figure 4.7
 The left part shows a map with spatial objects A, B, and C
 The right part and the bottom-left part show the z-values within A, B, and C
Note that C gets z-values 2 and 8, which are not close
Exercise: Compute the z-values for B.
Hilbert Curve
 A space filling curve
 Example: Fig. 4.5
 More complex to generate due to rotations
 Illustration on next slide!
 Implementing file operations
Calculating Hilbert Values
(Optional Topic)
See Fig. 4.8
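One widely used way to compute Hilbert values on an n-by-n grid (n a power of two) handles one bit of each coordinate per step and rotates or reflects the quadrant as it goes. The Python sketch below follows that iterative formulation; it is not necessarily the construction shown in Fig. 4.8, and the function names are illustrative.

```python
def rotate(n, x, y, rx, ry):
    """Rotate/reflect (x, y) so the current quadrant has canonical orientation."""
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y

def hilbert_value(n, x, y):
    """Position of cell (x, y) along the Hilbert curve on an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)   # offset of the quadrant at this level
        x, y = rotate(n, x, y, rx, ry)
        s //= 2
    return d

n = 4
cells_in_order = sorted(((x, y) for x in range(n) for y in range(n)),
                        key=lambda c: hilbert_value(n, c[0], c[1]))
```

Unlike the z-curve, consecutive cells along this curve are always edge neighbors in the grid, which is the property bought by the extra rotation step.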
Grid File
 Hashing methods for multidimensional points
(extension of Extensible hashing)
 Idea: Use a grid to partition the space; each
cell is associated with one page
 Two-disk-access principle (exact match)
Grid File
 Start with one bucket for the whole
space.
 Select dividers along each
dimension. Partition space into cells
 Dividers cut all the way.
 Each cell corresponds to 1 disk
page.
 Many cells can point to the same
page.
 Cell directory potentially
exponential in the number of
dimensions
Grid File Implementation
 Dynamic structure using a grid directory
 Grid array: a 2-dimensional array with pointers to
buckets (this array can be large, disk resident)
G(0, …, nx-1, 0, …, ny-1)
 Linear scales: two 1-dimensional arrays that are used to
access the grid array (main memory) X(0, …, nx-1),
Y(0, …, ny-1)
Example
[Figure: linear scales X and Y index into the grid directory, whose cells point to buckets (disk blocks)]
Grid File Search
 Exact Match Search: at most 2 I/Os assuming linear scales fit in
memory.
 First use linear scales to determine the index into the cell
directory
 access the cell directory to retrieve the bucket address (may
cause 1 I/O if cell directory does not fit in memory)
 access the appropriate bucket (1 I/O)
 Range Queries:
 use linear scales to determine the index into the cell directory.
 Access the cell directory to retrieve the bucket addresses of
buckets to visit.
 Access the buckets.
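The two search procedures can be sketched in Python with an in-memory stand-in for the grid file. The class GridFile, its attribute names, and the sample data are all hypothetical; in a real grid file the directory lookup and the bucket fetch are the two potential disk accesses.

```python
from bisect import bisect_right

class GridFile:
    def __init__(self, x_scale, y_scale, directory, buckets):
        self.x_scale = x_scale      # sorted divider positions along x
        self.y_scale = y_scale      # sorted divider positions along y
        self.directory = directory  # directory[i][j] -> bucket id
        self.buckets = buckets      # bucket id -> list of points

    def cell(self, x, y):
        """Use the linear scales to find the directory index of (x, y)."""
        return bisect_right(self.x_scale, x), bisect_right(self.y_scale, y)

    def exact_match(self, x, y):
        i, j = self.cell(x, y)
        bucket_id = self.directory[i][j]           # directory access
        return (x, y) in self.buckets[bucket_id]   # bucket access

    def range_query(self, x_lo, x_hi, y_lo, y_hi):
        i_lo, j_lo = self.cell(x_lo, y_lo)
        i_hi, j_hi = self.cell(x_hi, y_hi)
        # Several cells may share a bucket, so collect distinct bucket ids.
        bucket_ids = {self.directory[i][j]
                      for i in range(i_lo, i_hi + 1)
                      for j in range(j_lo, j_hi + 1)}
        return sorted(p for b in bucket_ids for p in self.buckets[b]
                      if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi)

# One divider per dimension gives a 2 x 2 directory; the two cells with
# x >= 5 share bucket "C".
grid = GridFile(
    x_scale=[5], y_scale=[5],
    directory=[["A", "B"], ["C", "C"]],
    buckets={"A": [(1, 1), (2, 4)], "B": [(3, 7)], "C": [(6, 2), (8, 9)]},
)
hit = grid.exact_match(3, 7)
miss = grid.exact_match(6, 6)
in_range = grid.range_query(2, 7, 0, 5)
```

The range query deduplicates bucket ids before reading, so a bucket shared by several cells is fetched only once.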
Grid File Insertions
 Determine the bucket into which insertion must
occur.
 If space in bucket, insert.
 Else, split bucket
 how to choose a good dimension to split?
 If the bucket split causes the cell directory to split, do so
and adjust the linear scales.
 Inserting these new entries potentially requires a
complete reorganization of the cell directory, which is expensive.
Grid File Deletions
 Deletions may decrease the space utilization.
Merge buckets
 We need to decide which cells to merge and
a merging threshold
 Buddy system and neighbor system
 A bucket can merge with only one buddy in each
dimension
 Merge adjacent regions if the result is a rectangle
Grid File Example
(N=6)
[Figure sequence, five snapshots. Snapshot 1: points 1 to 6 in a single bucket A. Snapshot 2: points up to 12, buckets A and B. Snapshot 3: points up to 15, buckets A, B, and C. Snapshot 4: points up to 16, buckets A to D. Snapshot 5: buckets A to I over a grid with dividers x1 to x4 and y1 to y4, where several cells share a bucket.]
Grid directory in the final snapshot (rows as listed on the slide):
A H D F B
A I D F B
A I G F B
E E G F B
C C C C B
The R-Tree
 The R-tree is a tree-structured index that
remains balanced on inserts and deletes.
 Each key stored in a leaf entry is intuitively a
box, or collection of intervals, with one
interval per dimension.
 Example in 2-D:
[Figure: boxes in the X-Y plane, with the root of the R-tree and the leaf level labeled]
R-Tree Properties
 Leaf entry = < n-dimensional box, rid >
 key value being a box.
 Box is the tightest bounding box for a data object.
 Non-leaf entry = < n-dim box, ptr to child node >
 Box covers all boxes in child node (in fact, subtree).
 All leaves at same distance from root.
 Nodes can be kept 50% full (except root).
 Can choose a parameter m that is <= 50%, and ensure
that every node is at least m% full.
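The two kinds of boxes can be computed as in the following Python sketch. Boxes are represented as (xmin, ymin, xmax, ymax) tuples, which is an assumed encoding, and the function names and sample shapes are illustrative.

```python
def bounding_box(points):
    """Tightest axis-aligned box (xmin, ymin, xmax, ymax) around 2-D points."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def covering_box(boxes):
    """Smallest box that covers every box in `boxes`."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

# Leaf level: one tight box per data object.
triangle = [(1, 1), (4, 2), (2, 5)]
square = [(6, 0), (8, 0), (8, 2), (6, 2)]
leaf_keys = [bounding_box(triangle), bounding_box(square)]

# Non-leaf level: one box covering all boxes in the child node.
parent_key = covering_box(leaf_keys)
```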
Example of an R-Tree
[Figure: spatial objects approximated by bounding boxes (e.g., R8). Leaf entries R8 to R19 are grouped under index entries R3 to R7, which are in turn grouped under R1 and R2.]
Example R-Tree (Contd.)
Root: R1 R2
Index level: R3 R4 R5 | R6 R7
Leaf level: R8 R9 R10 | R11 R12 | R13 R14 | R15 R16 | R17 R18 R19
Search for Objects Overlapping
Box Q
Start at root.
1. If current node is non-leaf, for each
entry <E, ptr>, if box E overlaps Q,
search subtree identified by ptr.
2. If current node is leaf, for each entry
<E, rid>, if E overlaps Q, rid identifies
an object that might overlap Q.
Note: May have to search several subtrees at each node!
(In contrast, a B-tree equality search goes to just one leaf.)
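The search procedure translates almost directly into Python. In this sketch the node class RNode, the box encoding (xmin, ymin, xmax, ymax), and the tiny hand-built tree are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class RNode:
    is_leaf: bool
    # Leaf entries are (box, rid); non-leaf entries are (box, child RNode).
    entries: list = field(default_factory=list)

def overlaps(a, b):
    """True if boxes a and b intersect (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q, result=None):
    """Return the rids of all leaf entries whose box overlaps query box q."""
    if result is None:
        result = []
    for box, ref in node.entries:
        if overlaps(box, q):
            if node.is_leaf:
                result.append(ref)       # candidate: object might overlap q
            else:
                search(ref, q, result)   # descend into every overlapping child
    return result

leaf1 = RNode(True, [((0, 0, 2, 2), "a"), ((1, 1, 3, 3), "b")])
leaf2 = RNode(True, [((6, 6, 8, 8), "c"), ((7, 2, 9, 4), "d")])
root = RNode(False, [((0, 0, 3, 3), leaf1), ((6, 2, 9, 8), leaf2)])

both_subtrees = search(root, (2, 2, 6.5, 6.5))
no_subtree = search(root, (4, 4, 5, 5))
```

The first query overlaps both root entries, so both leaves are visited; the second overlaps neither, so the search stops at the root.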
Improving Search Using Constraints
 It is convenient to store boxes in the R-tree as
approximations of arbitrary regions, because
boxes can be represented compactly.
 But why not use convex polygons to
approximate query regions more accurately?
 Will reduce overlap with nodes in tree, and reduce
the number of nodes fetched by avoiding some
branches altogether.
 Cost of overlap test is higher than bounding box
intersection, but it is a main-memory cost, and can
actually be done quite efficiently. Generally a win.
Insert Entry <B, ptr>
 Start at root and go down to “best-fit” leaf L.
 Go to child whose box needs least enlargement to
cover B; resolve ties by going to smallest area child.
 If best-fit leaf L has space, insert entry and
stop. Otherwise, split L into L1 and L2.
 Adjust entry for L in its parent so that the box now
covers (only) L1.
 Add an entry (in the parent node of L) for L2. (This
could cause the parent node to recursively split.)
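The descent rule (least enlargement, ties broken by smallest area) can be written as a small Python helper. The box encoding (xmin, ymin, xmax, ymax) and the function names are illustrative assumptions.

```python
def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def union(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(box, new_box):
    """Extra area `box` needs in order to cover `new_box`."""
    return area(union(box, new_box)) - area(box)

def choose_child(child_boxes, new_box):
    """Index of the child needing least enlargement; ties go to smallest area."""
    return min(range(len(child_boxes)),
               key=lambda i: (enlargement(child_boxes[i], new_box),
                              area(child_boxes[i])))

# The second child needs far less enlargement.
pick = choose_child([(0, 0, 4, 4), (5, 5, 7, 7)], (6, 6, 8, 8))

# Both children already cover the new box; the smaller one wins the tie.
tie_pick = choose_child([(0, 0, 4, 4), (1, 1, 3, 3)], (2, 2, 2.5, 2.5))
```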
Splitting a Node During Insertion
 The entries in node L plus the newly inserted
entry must be distributed between L1 and L2.
 Goal is to reduce likelihood of both L1 and L2
being searched on subsequent queries.
 Idea: Redistribute so as to minimize area of L1
plus area of L2.
[Figure: two ways of splitting the same entries, one labeled GOOD SPLIT and one labeled BAD]
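The area-minimizing idea can be made concrete with an exhaustive Python sketch that tries every two-way distribution of the entries. This brute-force search is exponential in the node size and is shown only to illustrate the objective; practical R-tree implementations rely on cheaper heuristics. The function names and the min_fill parameter are illustrative.

```python
from itertools import combinations

def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def mbr(boxes):
    """Smallest box covering all boxes in the list."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def best_split(boxes, min_fill=1):
    """Return (cost, group1, group2) minimizing area(L1) + area(L2)."""
    n = len(boxes)
    best = None
    for size in range(min_fill, n - min_fill + 1):
        for group in combinations(range(n), size):
            if 0 not in group:
                continue  # entry 0 stays in group 1; skips mirrored splits
            rest = [i for i in range(n) if i not in group]
            cost = (area(mbr([boxes[i] for i in group])) +
                    area(mbr([boxes[i] for i in rest])))
            if best is None or cost < best[0]:
                best = (cost, list(group), rest)
    return best

# Two boxes near the origin and two far away.
boxes = [(0, 0, 1, 1), (1, 0, 2, 1), (8, 8, 9, 9), (9, 8, 10, 9)]
cost, l1, l2 = best_split(boxes)
```

Grouping the nearby boxes together gives a total area of 4, whereas pairing a near box with a far one produces two boxes of area 81 each that cover mostly empty space.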
Spatial Data Warehousing
 Spatial data warehouse: Integrated, subject-oriented,
time-variant, and nonvolatile spatial data repository for
data analysis and decision making
 Spatial data integration: a big issue
 Structure-specific formats (raster- vs. vector-based, OO vs.
relational models, different storage and indexing, etc.)
 Vendor-specific formats (ESRI, MapInfo, Intergraph, etc.)
 Spatial data cube: multidimensional spatial database
 Both dimensions and measures may contain spatial components
Dimensions and Measures in
Spatial Data Warehouse
 Dimension modeling
 nonspatial
 e.g. temperature: 25-30
degrees generalizes to hot
 spatial-to-nonspatial
 e.g. region “B.C.”
generalizes to description
“western provinces”
 spatial-to-spatial
 e.g. region “Burnaby”
generalizes to region
“Lower Mainland”
 Measures
 numerical
 distributive (e.g. count, sum)
 algebraic (e.g. average)
 holistic (e.g. median, rank)
 spatial
 collection of spatial pointers
(e.g. pointers to all regions
with 25-30 degrees in July)
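The difference between the three kinds of numerical measures shows up when a cube cell is assembled from partial aggregates. The Python sketch below uses made-up readings split across three partitions; the variable names are illustrative.

```python
from statistics import median

# Made-up readings, split across three partitions (e.g., sub-regions).
partitions = [[3, 9, 4], [10, 2], [7]]

# Distributive: the total is obtained by combining per-partition results.
total_count = sum(len(p) for p in partitions)
total_sum = sum(sum(p) for p in partitions)

# Algebraic: derived from a fixed number of distributive aggregates.
average = total_sum / total_count

# Holistic: no fixed-size summary per partition suffices in general.
true_median = median(x for p in partitions for x in p)
median_of_medians = median(median(p) for p in partitions)
```

Here the true median is 5.5 while the median of the per-partition medians is 6.0, so a holistic measure cannot simply be rolled up from its sub-aggregates the way count, sum, and average can.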
Example: BC weather pattern
analysis
 Input
 A map with about 3,000 weather probes scattered in B.C.
 Daily data for temperature, precipitation, wind velocity, etc.
 Concept hierarchies for all attributes
 Output
 A map that reveals patterns: merged (similar) regions
 Goals
 Interactive analysis (drill-down, slice, dice, pivot, roll-up)
 Fast response time
 Minimizing storage space used
 Challenge
 A merged region may contain hundreds of “primitive” regions (polygons)
Star Schema of the BC
Weather Warehouse
 Spatial data warehouse
 Dimensions
 region_name
 time
 temperature
 precipitation
 Measurements
 region_map
 area
 count
[Figure: star schema with dimension tables and a fact table]
Spatial Merge
 Precomputing all: too much storage space
 On-line merge: very expensive
Methods for Computation of
Spatial Data Cube
 On-line aggregation: collect and store pointers to spatial
objects in a spatial data cube
 expensive and slow, need efficient aggregation techniques
 Precompute and store all the possible combinations
 huge space overhead
 Precompute and store rough approximations in a spatial
data cube
 accuracy trade-off
 Selective computation: only materialize those which will be
accessed frequently
 a reasonable choice
Spatial Association Analysis
 Spatial association rule: A ⇒ B [s%, c%]
 A and B are sets of spatial or nonspatial predicates
 Topological relations: intersects, overlaps, disjoint, etc.
 Spatial orientations: left_of, west_of, under, etc.
 Distance information: close_to, within_distance, etc.
 s% is the support and c% is the confidence of the rule
 Examples
 is_a(x, large_town) ^ intersect(x, highway) ⇒ adjacent_to(x, water) [7%, 85%]
 is_a(x, large_town) ^ adjacent_to(x, georgia_strait) ⇒ close_to(x, u.s.a.) [1%, 78%]
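Support and confidence for a rule of this form can be computed as in the Python sketch below. It assumes the spatial predicates have already been evaluated into boolean attributes per object; the five objects and the attribute names are made up, and rule_stats is a hypothetical helper.

```python
# Each object records which (precomputed) predicates hold for it.
objects = [
    {"large_town": True,  "intersects_highway": True,  "adjacent_to_water": True},
    {"large_town": True,  "intersects_highway": True,  "adjacent_to_water": False},
    {"large_town": True,  "intersects_highway": True,  "adjacent_to_water": True},
    {"large_town": True,  "intersects_highway": False, "adjacent_to_water": True},
    {"large_town": False, "intersects_highway": True,  "adjacent_to_water": True},
]

def rule_stats(objs, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    matches_a = [o for o in objs if all(o[p] for p in antecedent)]
    matches_both = [o for o in matches_a if all(o[p] for p in consequent)]
    support = len(matches_both) / len(objs)
    confidence = len(matches_both) / len(matches_a) if matches_a else 0.0
    return support, confidence

support, confidence = rule_stats(
    objects,
    antecedent=["large_town", "intersects_highway"],
    consequent=["adjacent_to_water"],
)
```

Support is the fraction of all objects that satisfy both sides of the rule; confidence is the fraction of objects satisfying the left side that also satisfy the right side.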
Progressive Refinement Mining of
Spatial Association Rules
 Hierarchy of spatial relationship:
 g_close_to: near_by, touch, intersect, contain, etc.
 First search for rough relationship and then refine it
 Two-step mining of spatial association:
 Step 1: Rough spatial computation (as a filter)
 Using MBR or R-tree for rough estimation
 Step 2: Detailed spatial algorithm (as refinement)
 Apply only to those objects which have passed the rough spatial
association test (no less than min_support)
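The filter-and-refine pattern behind the two steps can be sketched in Python. For simplicity the spatial objects here are circles given as (cx, cy, r), which is an assumption made so that the exact test stays short; the object names and coordinates are made up, and the sketch covers only the spatial computation, not the support counting.

```python
import math

def mbr(circle):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a circle."""
    cx, cy, r = circle
    return (cx - r, cy - r, cx + r, cy + r)

def mbr_overlap(a, b):
    """Step 1 (rough filter): cheap rectangle intersection test."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def circles_intersect(c1, c2):
    """Step 2 (refinement): exact geometric test."""
    return math.hypot(c1[0] - c2[0], c1[1] - c2[1]) <= c1[2] + c2[2]

def intersecting_pairs(objs):
    names = sorted(objs)
    candidates = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
                  if mbr_overlap(mbr(objs[a]), mbr(objs[b]))]
    refined = [(a, b) for a, b in candidates
               if circles_intersect(objs[a], objs[b])]
    return candidates, refined

objs = {
    "lake": (0.0, 0.0, 1.0),
    "park": (1.5, 0.0, 1.0),
    "town": (1.8, 1.8, 1.0),
    "farm": (10.0, 10.0, 1.0),
}
candidates, refined = intersecting_pairs(objs)
```

The pair (lake, town) passes the rectangle filter but fails the exact test, so the filter may admit false positives but never drops a true match; the expensive test then runs only on the surviving candidates.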
Spatial Classification and
Spatial Trend Analysis
 Spatial classification
 Analyze spatial objects to derive classification schemes, such as
decision trees in relevance to certain spatial properties (district,
highway, river, etc.)
 Example: Classify regions in a province into rich vs. poor
according to the average family income
 Spatial trend analysis
 Detect changes and trends along a spatial dimension
 Study the trend of nonspatial or spatial data changing with space
 Example: Observe the trend of changes of the climate or
vegetation with the increasing distance from an ocean
LSD-tree
Local Split Decision tree
Use kd-tree to partition the space. Each
partition contains up to B points. The
kd-tree is stored in main-memory.
If the kd-tree (directory) is large, we
store a sub-tree on disk
Goal: the structure must remain
balanced: external balancing property
Example: LSD-tree
[Figure: an LSD-tree directory, with an internal part and an external part, whose split nodes (x:x1, x:x2, x:x3, y:y1, y:y2, y:y3, y:y4) lead to buckets N1 to N8, shown next to the corresponding partition of the space]
LSD-tree: main points
Split strategies:
Data dependent
Distribution dependent
Paging algorithm
Two types of splits: bucket splits and
internal node splits