Download ppt - DBS

Document related concepts
no text concepts found
Transcript
Christian Böhm
Ludwig Maximilians Universität München
The Similarity Join:
A Powerful Database Primitive for
High Performance Data Mining
Tutorial, 17th Int. Conf. on Data Engineering, 2001-04-02
Christian Böhm
2
150
Motivation
High Performance Data Mining
Marketing
 Fraud Detection
 CRM
 Online Scoring
 OLAP
Christian Böhm

3
150
Fast decisions require knowledge just in time
Previous Approaches to Fast Data Mining



Christian Böhm

4
150
Sampling
Approximations (grid) Loss of quality
Dimensionality reduct.
Expensive & complex
Parallelism
All approaches combinable with join
KDD appl. get parallelism for free
Christian Böhm
Feature Based Similarity
5
150
Simple Similarity Queries
•
Specify query object and
-
Christian Böhm
-
6
150
Find similar objects – range query
Find the k most similar objects – nearest neighbor q.
Similarity – Range Queries
•
Christian Böhm
•
7
150
•
Given: Query point q
Maximum distance e
Formal definition:
Cardinality of the result set is
difficult to control:
e too small  no results
e too large  complete DB
Christian Böhm
Index Based Processing of Range Queries
8
150
Christian Böhm
Similarity – Nearest Neighbor Queries
9
150
•
Given:
Query point q
•
Formal definition:
•
Ties must be handled:
-
Result set enlargement
Non-determinism (don’t care)
Christian Böhm
Index Based Processing of NN Queries
10
150
k-Nearest Neighbor Search and Ranking
•
k-nearest neighbor query:
-
Do not only search only for one nearest neighbor but k
Stop distance is the distance of the kth (last) candidate point
Christian Böhm
-
11
150
•
Ranking-query:
-
Incremental version of k-nearest neighbor search
First call of FetchNext() returns first neighbor
Second call of FetchNext() returns second neighbor...
Typically only few results are fetched  Don‘t generate all!
Advanced Applications: Duplicates
•
Duplicate detection
-
E.g. Astronomic catalogue matching
Christian Böhm
C1
12
150
C2
•
Similarity queries for large number of query obj
Advanced Applications: Data Mining
Christian Böhm
•
13
150
Density based clustering (DBSCAN)
What is a Similarity Join?
•
•
Given two sets R, S of points
Find all pairs of points according to similarity
Christian Böhm
R
14
150
S
•
Various exact definitions for the similarity join
What is a Similarity Join?
•
Christian Böhm
•
15
150
•
•
Similarity join corresponds to set of identical
similarity queries, evaluated for a large number
of query points
Sequential evaluation of similarity queries with
index is the easiest similarity join algorithm
Many more sophisticated approaches exist
Powerful database primitive to support modern
applications of data analysis and data mining
Curse of Dimensionality
•
Christian Böhm
•
16
150
Index structures fail (outperformed by the
sequential scan) if the data space dimension
becomes too high
Many effects usually called
Curse of Dimensionality
Curse of Dimensionality
[Berchtold, Böhm, Keim, Kriegel: A Cost Model for High-Dim. Nearest Neighb. Search, PODS 1997]

With increasing dimension also increases...

Christian Böhm

17
150

Typical radius of range queries
Distance of a point to its nearest neighbor
Edge length of regions of index structures
0.51=0.5
0.720.5
0.830.5
Curse of Dimensionality
Christian Böhm

18
150
A cost model for the access probability of index
pages using the concept of Minkowski Sum
Curse of Dimensionality
Christian Böhm

19
150
Binomial formula:
Christian Böhm
Curse of Dimensionality
20
150
•
Asymptotic behavior of similarity search
•
Suppose number points   VMink  2d VSphere
Access probability = O(2d), but limited by 100%
Saturation area with near linear I/O cost O(n)
•
•
Curse of Dimensionality
•
Christian Böhm
•
21
150
•
•
•
For high dimension:
Each similarity query accesses considerable
fraction of all index pages.
Index does not pay off, anyway  sequ. scan
Strategies needed for efficient evaluation
Join: Base applications on powerful database
primitive that exploits high number of queries
Efficient algorithms for Similarity Join
Organization of the Tutorial
1.
2.
Christian Böhm
3.
22
150
4.
5.
Motivation
Defining the Similarity Join
Applications of the Similarity Join
Similarity Join Algorithms
Conclusion & Future Potential
Christian Böhm
Defining the Similarity Join
23
150
What Is a Similarity Join?
Intuitive notion: 3 properties of the similarity join
1.
The similarity join is a join in the relational sense
Two sets R and S are combined into one such that
the new set contains pairs of points that fulfill a
Christian Böhm
join condition
24
150
2.
3.
Vector or metric objects
rather than ordinary tuples of any type
The join condition involves similarity
What Is a Similarity Join?
Christian Böhm
Similarity Join
25
150
Distance Range Join
NN-based Approaches
Closest Pair Query
k-NN Join
Christian Böhm
Distance Range Join (e-Join)
26
150
•
Intuitition: Given parameter e
All pairs of points where distance  e
•
Formal Definition:
•
In SQL-like notation:
SELECT * FROM R, S WHERE ||R.obj - S.obj||  e
Distance Range Join (e-Join)
•
Christian Böhm
•
27
150
Most widespread and best evaluated join
Often also called the similarity join
Christian Böhm
Distance Range Join (e-Join)
28
150
•
The distance range self join
•
is of particular importance for data mining
(clustering) and robust similarity search
Change definition to exclude trivial results
•
Distance Range Join (e-Join)
•
Disadvantage for the user:
Result cardinality difficult to control:
-
Christian Böhm
-
29
150
•
•
e too small
e too large
 no result pairs are produced
 all pairs from R  S are produced
Worst case complexity is at least o(|R||S|)
For reasonable result set size, advanced join
algorithms yield asymptotic behavior which is
better than O(|R||S|)
k-Closest Pair Query
•
Christian Böhm
•
•
•
Intuition:
Find those k pairs that yield least distance
The principle of nearest neighbor search is
applied on a basis per pair
Classical problem of Computational Geometry
In the database context introduced by
[Hjaltason & Samet, Incremental Distance Join Algorithms, SIGMOD Conf. 1998]
30
150
•
There called distance join
Christian Böhm
k-Closest Pair Query
31
150
•
Formal Definition:
•
Ties solved by result set enlargement
Other possibility: Non-determinism
(don’t care which of the tie tuples are reported)
•
k-Closest Pair Query
Christian Böhm
In SQL notation:
32
150
SELECT * FROM R, S
ORDER BY ||R.obj - S.obj||
STOP AFTER k
k-Closest Pair Query
•
Self-join:
-
Applications:
Christian Böhm
•
Exclude |R| trivial pairs (ri,ri) with distance 0
Result is symmetric
-
33
150
-
-
Find all pairs of stock quota in a database that are
most similar to each other
Find music scores which are similar to each other
Noise robust duplicate elimination
k-Closest Pair Query
•
Christian Böhm
•
34
150
Incremental ranking instead of exact
specification of k
No STOP AFTER clause:
SELECT * FROM R, S
ORDER BY ||R.obj - S.obj||
•
•
Open cursor and fetch results one-by-one
Important: Only few results typically fetched
 Don’t determine the complete ranking
k-Nearest Neighbor Join
•
Christian Böhm
•
35
150
•
Intuition:
Combine each point with its k nearest neighbors
The principle of nearest neighbor search is
applied for each point of R
In the database context introduced by
[Hjaltason & Samet, Incremental Distance Join Algorithms, SIGMOD Conf. 1998]
•
There called distance semijoin
Christian Böhm
k-Nearest Neighbor Join
36
150
•
Formal Definition:
•
Ties solved by result set enlargement
Other possibility: Non-determinism
(don’t care which of the tie tuples are reported)
•
k-Nearest Neighbor Join
Christian Böhm
In SQL notation:
(limited to k = 1)
37
150
SELECT * FROM R, S
GROUP BY R.obj
ORDER BY ||R.obj - S.obj||
STOP AFTER K (*  k *)
k-Nearest Neighbor Join
Christian Böhm
•
38
150
The k-NN-join is inherently asymmetric:
k-Nearest Neighbor Join
•
Applications of the k-NN-join:
-
Christian Böhm
-
k-means and k-medoid clustering
Simultaneous nearest neighbor classification:
A large set of new objects without class label are
assigned according to the majority of k nearest
neighbors of each of the new objects
•
•
39
150
•
Astronomic observation
Online customer scoring
Ranking on the k-NN-join is difficult to define
Further possible definitions
•
Christian Böhm
•
40
150
Inverse nearest neighbor join:
Combine each point ri of R with every point of
S which considers ri to be its nearest neighbor
Metric data sets:
Instead of vectors use arbitrary objects with a
distance metric
-
E.g. Text sequences with edit distance
Text mining using the similarity join applies A*
Christian Böhm
Applications
41
150
Christian Böhm
Density Based Data Mining
42
150
Christian Böhm
Schema for Data Mining Algorithms
43
150
Algorithmic Schema A1
foreach Point p  D
PointSet S := SimilarityQuery (p, e);
foreach Point q  S
DoSomething (p,q) ;
Iterative similarity queries and cache

Due to curse of dimensionality:
No sufficient inter-query locality of the pages
Average cache hit ratio
Christian Böhm
0,08
10-nn query
sim. range query
0,07
0,06
0,05
0,04
0,03
0,02
0,01
0,00
44
150
0
10
20
30
Dimension (d )
40
Christian Böhm
Iterative similarity queries and cache
45
150
Idea: Query Order Transformation
[Böhm, Braunmüller, Breunig, Kriegel: High Perf. Clustering based on the Sim. Join, CIKM 2000]

Christian Böhm

46
150

Transform order of similarity queries such that
packing of points into pages is considered
If one pair of index pages is in the cache:
 process all sim. queries regarding this pair
Each pair of pages is considered at most once
Christian Böhm
Idea: Query Order Transformation
47
150
Christian Böhm
Transform the Original Schema A1…
48
150
Algorithmic Schema A1
foreach Point p  D
PointSet S := SimilarityQuery (p, e);
foreach Point q  S
DoSomething (p,q) ;
Christian Böhm
…Into a New Algorithmic Schema A2
49
150
foreach DataPage P
LoadAndPinPage (P) ;
foreach DataPage Q
if (mindist (P,Q)  e)
CachedAccess (Q) ;
foreach Point p  P
foreach Point q  Q
if (distance (p,q)  e)
DoSomething’ (p,q) ;
UnFixPage (P) ;
Similarity Join
A2 is a Similarity-Join-Algorithm:
Christian Böhm
foreach PointPair (p,q) 
DoSomething’ (p,q) ;
50
150
Where
denotes the Similarity-Join:
SELECT * FROM R r1, R r2
WHERE distance (r1.object, r2.object)  e
Implementation Variants
•
Change of the order in which points are
combined must partially be considered
Christian Böhm
Implementation
51
150
Semantic
Change algorithm to take
unknown order into account
Materialization
Materialize join result j and
answer original queries by j
Example Clustering Algorithms

DBSCAN

[Ester, Kriegel, Sander, Xu: A Density Based
Algorithm for Discovering Clusters in Large
Spatial Databases with Noise´, KDD 1996]
Flat clustering
(non hierarchical)
Christian Böhm

OPTICS
[Ankerst, Breunig, Kriegel, Sander: OPTICS:
Ordering Points To Identify the Clustering
Structure, SIGMOD Conf. 1999]

Hierachical
cluster-structure
1
3
2
52
150
Semantic Rewriting
Materialization
Transformation by Semantic Rewriting
•
Christian Böhm
•
53
150
Rewrite the algorithm to take the changed order
of pairs into account
Don´t assume any specific order in which pairs
are generated
 Arbitrary similarity join algorithm possible
Example: DBSCAN

Christian Böhm

p core object in D wrt. e, MinPts: | Ne (p) |  MinPts
p directly density-reachable from q in D wrt. e, MinPts:
1) p  Ne(q) and
2) q is a core object wrt. e, MinPts

density-reachable: transitive closure.

cluster:
-
54
150
-
maximal wrt. density reachability
any two points are density-reachable from
a third object
Christian Böhm
Implementation of DBSCAN on Join
55
150

Core point property:
DoSomething() increments a counter attribute

Determination of maximal density-reachable clusters:
DoSomething():
-
Assign ID of known cluster point to unknown cluster points
Unify two known clusters
Christian Böhm
Implementation of DBSCAN on Join
56
150
Christian Böhm
Implementation of DBSCAN on Join
57
150
Implementing OPTICS (Materialization)
•
Christian Böhm
•
58
150
•
•
•
The join result is predetermined before starting
the actual OPTICS algorithm
The result is materialized in some table with
GROUP-BY on the first point of the pair
The OPTICS algorithm runs unchanged
Similarity queries are answered from the join
materialization table (much faster)
Disadvantage: High memory requirements
Experimental Results: Page Capacity
Color image data
64-dimensional
1000000
1000000
100000
100000
runtime [sec]
runtime [sec]
Christian Böhm
Meteorology data
9-dimensional
10000
59
150
Q-DBSCAN (R*-tree)
10000
1000
100
100
2000
4000 6000 8000 10000
page capacity
Q-DBSCAN (X-tree)
J-DBSCAN (R*-tree)
1000
0
Q-DBSCAN (Seq. Scan)
J-DBSCAN (X-tree)
0
100
200
page capacity
300
Experimental Results: Scalability
Christian Böhm
60
150
Meteorology data
150000
150000
120000
120000
runtime [sec]
runtime [sec]
Color image data
90000
60000
90000
60000
30000
30000
0
0
30000
60000
90000
0
50000
150000
250000
size of database [points]
size of database [points]
Q-DBSCAN (Seq. Scan)
Q-OPTICS (Seq. Scan)
Q-DBSCAN (X-tree)
J-DBSCAN (X-tree)
Q-OPTICS (X-tree)
J-OPTICS (X-tree)
Experimental Results: Query Range
Color image data
Color image data
140000
70000
120000
Christian Böhm
61
150
runtime [sec]
runtime [sec]
60000
50000
40000
30000
100000
80000
60000
20000
40000
10000
20000
0
0,00
0
0,05
0,10
0,15
0,20
epsilon
Querybased
Q-DBSCAN(X-tree)
(X-tree)
J-DBSCAN
(X-tree)
Joinbased
(X-tree)
0,25
0,1
0,15
0,2
0,25
epsilon
Q-OPTICS (Seq. Scan)
Q-OPTICS (X-tree)
J-OPTICS (X-tree)
0,3
Robust Similarity Search
[Agrawal, Lin, Sawhney, Shim: Fast Similariy Search in the Presence of Noise,...., VLDB 1995]
•
Usual similarity search with feature vectors:
Not robust with respect to
-
Noise:
Christian Böhm
Euclidean distance sensitive to mismatch in single dimension
62
150
-
Partial similarity:
Not complete objects are similar, but parts thereof
•
Concept to achieve robustness:
Decompose each data object and query object into sub-objects
and search for a maximum number of similar subobjects
Robust Similarity Search
•
Christian Böhm
•
63
150
•
Prominent concept borrowed from IR research:
String decomposition: Search for similar words
by indexing of character triplets (n-lets)
Query transformed to set of similarity queries
 similarity join between query set and data set
Robustness achieved in result recombination:
-
Noise robustness: Ignore missing matches
Partial search: Dont enforce complete recombination
Robust Similarity Search
Applications:
• Robust search for sequences:
[Agrawal, Lin, Sawhney, Shim: Fast Similariy Search in the Presence of Noise,...., VLDB 1995]
Christian Böhm
•
64
150
Principle can be generalized for objects like
-
-
Raster images
CAD objects
3D molecules
etc.
Astronomic Catalogue Matching
•
Relative position of catalogues approx. known:
-
Position and intensity parameters in different bands
Christian Böhm
C1
65
150
C2
•
•
C1
C2
Determine e according to device tolerance
Astronomic Catalogue Matching
•
Relative position unknown:
-
Match according to triangles and intensity
Christian Böhm
C1
66
150
C2
•
•
Search triangles and store parameters (height,...)
triangles (C1)
triangles (C2)
k-Nearest Neighbor Classification
•
Simultaneous classification of many objects
[Braunmüller, Ester, Kriegel, Sander: Efficiently Supporting Multiple Similarity
Queries for Mining in Metric Databases, ICDE 2000]
-
Astronomy
Christian Böhm
•
•
-
Online customer scoring
•
•
67
150
Some 10,000 new objects collected per night
Classify according to some millions of known objects
Some 1,000 customers online
Rate them according to some millions of known patterns
k-Nearest Neighbor Classification
•
Example:
k=3
Christian Böhm
Objects with known class
68
150
New objects
•
New objects
Known objects
k-Means and k-Medoid Clustering
•
•
•
k Points initially randomly selected („centers“)
Each database point assigned to nearest center
Centers are re-determined
Christian Böhm
-
69
150
-
•
k-means: Means of all assigned points (artificial p.)
k-medoid: One central database point of the cluster
Assignment and center determination are
repeated until convergence
k-Means and k-Medoid Clustering
Christian Böhm
•
70
150
Example: (k-means with k = 3)
Convergence!
•
Each assignment phase: DB-Points
Centers
Christian Böhm
Similarity Join Algorithms
71
150
Algorithms´ Overview
Similarity join
Range dist. join
on-the-fly index
Christian Böhm
Index based
72
150
Hashing based
Sorting based
Closest pair qu.
k-NN join
Optimization
Cost modeling
CPU optimizing
Algorithms´ Overview

Distance range join (e-join)

Index joins with depth-first and breadth-first search
[Brinkhoff, Kriegel, Seeger: Efficient Proc. of Spatial Joins Using R-trees, SIGMOD Conf. 1993]
[Brinkhoff, Kriegel, Seeger: Parallel Processing of Spatial Joins Using R-trees, ICDE 1996]
[Huang, Jing, Rundensteiner: Spatial Joins Usg. R-trees: Breadth-First Traversal..., VLDB 1997]
Christian Böhm

73
150
Index construction on-the-fly
[Lo, Ravishankar: Spatial Joins Using Seeded Trees, SIGMOD Conf. 1994]
[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]
[Shafer, Agrawal: Parallel Algorithms for High-dimensional Similarity Joins, VLDB 1997]
[van den Bercken, Schneider, Seeger: Plug&Join, EDBT 2000]

Join-algorithms based on hashing
[Lo, Ravishankar: Spatial Hash Joins, SIGMOD Conf. 1996]
[Patel, DeWitt: Partition Based Spatial-Merge Join, SIGMOD Conf. 1997]
Algorithms´ Overview

Join-algorithms based on sorting
[Orenstein: An Algorithm for Computing the Overlay of k-Dim. Spaces, SSD 1991]
[Koudas, Sevcik: High-Dimensional Similarity Joins, ICDE 1997]
[Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order, SIGMOD Conf. 2001]
Christian Böhm

74
150
Closest pair query and nearest neighbor join
[Hjaltason, Samet: Incremental Distance Join Algorithms for Spatial DB, SIGMOD Conf. 1998]
[Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]
[Corral, Manolopoulos, Theodoridis, Vassilakopoulos: Closest Pair Queries in Spatial
Databases, SIGMOD Conf. 2000]

Optimization approaches
[Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, Wednesday 1630]
[Böhm, Krebs, Kriegel: Optimal Dimension Sweeping: A Generic Technique, submitted]
Nested Loop Join
Simple nested loop join:
-
75
150
S
Nested block loop join:
-
First iterate over blocks
Nested iterate over tuples
 S scanned |R|/|B| times
R-tuples
•
S-blocks
Christian Böhm
-
R
Iterate over R-points
Nested iteration over S-points
 S is scanned |R| times, high I/O cost
R-blocks
•
S-tuples
Indexed Nested Loop Join
•
Christian Böhm
•
76
150
•
Iterate over every point of R
Determine matches in S by
similarity queries on the index
R
S
Due to the curse of dimensionality:
 Performance deterioration of the similarity q.
 Then not competitive with nested loop join
(Depends on dimensionality and selectivity determined by e)
Spatial Join
•
•
Christian Böhm
•
77
150
•

2D polygon databases
Join-predicate: Overlap
Conserv. approximation:
MBR (ax-par. rectangle)
Similarity Join
•
•
•
High-D point databases
Join-predicate: Distance
Map e-join to spatial join
Cube with edge-length e
e
Some strategies can be borrowed from the spatial join
R-tree Spatial Join (RSJ)
[Brinkhoff, Kriegel, Seeger: Efficient Process. of Spatial Joins Using R-trees, SIGMOD Conf. 1993]
•
•
Christian Böhm
•
78
150
•
Originally: Spatial join for 2D rect. intersection
Depth-first search in R-trees and similar indexes
Assumption: Index preconstructed on R and S
Simple recursion scheme (equal tree height):
procedure r_tree_join (R, S: page)
foreach r  R.children do
foreach s  S.children do
if intersect (r,s) then r_tree_join (r,s) ;
R-tree Spatial Join (RSJ)
•
Christian Böhm
•
79
150
Adaptation for the similarity join:
Distance predicate rather than intersection
For pair (R,S) of pages: mindist (R,S)
 Least possible distance of two points in (R,S)
Christian Böhm
R-tree Spatial Join (RSJ)
80
150
procedure r_tree_sim_join (R, S, e)
if IsDirpg (R)  IsDirpg (S) then
foreach r  R.children do
foreach s  S.children do
if mindist (r,s)  e then
CacheLoad(r); CacheLoad(s);
r_tree_sim_join (r,s,e) ;
else (* assume R,S both DataPg *)
foreach p  R.points do
foreach q  S.points do
if |p - q| e then report (p,q);
R
e
S
R-tree Spatial Join (RSJ)
•
•
•
Extension to different tree heights straightforw.
Several additional optimizations possible
CPU-bound
Christian Böhm
-
•
Disadvantages
-
81
150
Cost dominated by point-distance calculations
No clear strategies for page access priorization
Single page accesses
 Can be outperformed by nested block loop join
Parallel RSJ
[Brinkhoff, Kriegel, Seeger: Parallel Processing of Spatial Joins Using R-trees, ICDE 1996]
•
•
Again spatial join for 2D rectangle intersection
Three phases of parallel execution:
Christian Böhm
-
82
150
-
•
Task creation (non-parallel)
Task assignment (non-parallel)
Task execution (completely parallel)
A task corresponds to a pair of subtrees
-
At high tree level (e.g. root or second level)
Parallel RSJ
Christian Böhm
•
83
150
Example for the task definition
Parallel RSJ
Christian Böhm
•
84
150
Strategy 1: Static Range Assignment
Parallel RSJ
Christian Böhm
•
85
150
Strategy 2: Static Round-Robin Assignment
Parallel RSJ
•
Strategy 3: Dynamic task assignment
-
Christian Böhm
-
86
150
Processor requests a task when idle
Best load balancing
Breadth-First R-tree Join (BFRJ)
[Huang, Jing, Rundensteiner: Spatial Joins Using R-trees: Breadth-First Traversal..., VLDB 1997]
•
•
Again spatial join for 2D rectangle intersection
Shortcoming of RSJ:
Christian Böhm
-
87
150
-
No strategy in outer loop improving locality in inner
Depth-first traversal not flexible, because a pair of
tree branches must be ended before next pair started
 unnecessary page accesses
Breadth-First R-tree Join (BFRJ)
•
Solution:
-
Christian Böhm
-
88
150
-
Proceed level by level (breadth-first traversal)
Determine all relevant pairs for the next level
 intermediate join index (IJI)
Sort the IJI according to suitable order before
accessing the next level
 global optimization strategy
Christian Böhm
Breadth-First R-tree Join (BFRJ)
89
150
Breadth-First R-tree Join (BFRJ)
Options for ordering:
1.
2.
Christian Böhm
3.
90
150
4.
5.
No particular order
Consider the lower x-coordinate of R´s nodes
Sum of the centers of x-coordinates of R and S
x-coordinate of center of common MBR
Hilbert-value of center of common MBR
Higher locality (better cache hit rates) for better
ordering strategies.
Christian Böhm
Breadth-First R-tree Join (BFRJ)
91
150
Approaches without Preconstructed Index
•
•
Indexes can be constructed temporarily for join
R-tree construction by INSERT too expensive
 Use cheap bottom-up-construction
-
Hilbert R-trees: O (n log n)
Christian Böhm
[Kamel, Faloutsos: Hilbert R-trees: An Improved R-tree using Fractals, VLDB 1994]
92
150
-
Sort points by SFC and pack adjacent points to page
Buffer trees
[van den Bercken, Seeger, Widmayer: A Generic Approach to Bulk Loading.., VLDB 1997]
-
Repeated partitioning
[Berchtold, Böhm, Kriegel: Improving the Query Performance ..., EDBT 1998]
•
Index construction can amortize during join
Seeded Trees
[Lo, Ravishankar: Spatial Joins Using Seeded Trees, SIGMOD Conf. 1994]
•
Christian Böhm
•
93
150
•
•
Again spatial join for 2D rectangle intersection
Assumption:
Only one data set (R) is supported by index
Typical application:
Set S is subquery result
Idea:
Use partitioning of R as a template for S
Seeded Trees
•
Motivation
-
Christian Böhm
-
94
150
Early inserts to R-trees decide initial organization
We know that S will be matched with R
Start with small template tree instead of empty root
 seed levels
Seeded Trees
•
Tree consist of
-
Christian Böhm
•
•
Tree unbalanced
Phases of tree
construction:
-
95
150
Seed levels
Grown levels
-
Seeding phase
Growing phase
Cleanup phase
Seeded Trees
•
Seeding phase:
-
Christian Böhm
-
-
Copy k levels of the R-tree of set R
Last level: defined MBRs, but empty child pointers
 called slot
Three strategies for (slot and other) MBRs:
•
•
•
96
150
Copy complete MBR
Use only center point rather than complete MBR
Center point at slot level, otherwise complete MBR
Seeded Trees
•
Growing phase
-
Insert of points: Choose subtree like in R*-tree
Seed level is not affected during growth phase:
•
Christian Böhm
•
97
150
-
No insertions to seed level nodes
No split of seed level nodes
If point is inserted into empty slot (NULL pointer):
•
•
•
A new empty data node is allocated
Further, this node is treated like a root in R-trees:
on overflow, no split is propagated upward (new root)
The R-trees in the slots are called grown subtree.
Seeded Trees
•
Growing phase (cont...)
-
Various strategies for update of the MBRs in the
seed levels during insert operations:
•
Christian Böhm
•
•
•
-
In seed levels: In general, the page regions are ...
•
98
150
No updates
Enlarge bounding box after insert of a not contained point
Determine minimum bounding rectangle after insert
...
•
Not bounding rectangles, i.e. no conservative appx. of set
Not minimal
Seeded Trees
•
Cleanup Phase
-
The MBR property of page regions is needed ...
•
•
Christian Böhm
-
99
150
-
-
•
... not for tree construction
... but for join processing
Therefore, actual MBRs are determined in cleanup
Empty slots (without grown subtrees) are deleted
No attempt to make the tree balanced
Join the two indexed sets R and S like in RSJ
Seeded Trees
Christian Böhm
•
100
150
Experimental results (spatial data)
The e-kdB-tree
[Shim, Srikant, Agrawal:
High-dimensional Similarity Joins, ICDE 1997]
•
Algorithm for the
Christian Böhm
range distance self join
101
150
•
General idea:
Grid approximation where
grid line distance = e
•
Not all dimensions used for decomposition:
As many dimensions as needed defined node capacity
Christian Böhm
The e-kdB-tree
102
150
The e-kdB-tree
•
•
Christian Böhm
•
103
150
Node fanout: 1/e(assuming data space [0..1]d)
Tree structure is specific to given parameter e
 must be constructed for each join
The e-kdB-trees of two adjacent stripes are
assumed to fit into main memory
Christian Böhm
The e-kdB-tree
104
150
procedure t_match (R, S: node)
if is_leaf (R)  is_leaf (S) then
...
else
for i:=1 to 1/e- 1 do
t_match(R.child[i], S.child [i]) ;
t_match (R.child[i], S.child [i+1]) ;
t_match (R.child[i+1], S.child[i]) ;
t_match (R.child[1/e], S.child[1/e]) ;
The e-kdB-tree
•
Christian Böhm
•
105
150
•
•
Limitation:
For large e values not really scalable
In high-dimensional cases, e=0.3 can be typical
 60% of data must be held in main memory
As long as data fit into main memory:
e-kdB-tree is one of the best similarity join alg.
Unfortunately:
IBM does not provide any code for comparison
Christian Böhm
The e-kdB-tree
106
150
The Parallel e-kdB-tree
[Shafer, Agrawal: Parallel Algorithms for High-dimensional Similarity Joins, VLDB 1997]
•
Parallel construction of the e-kdB-tree:
-
Christian Böhm
-
107
150
-
Each processor has random subset of the data (1/N)
Each processor constructs e-kdB-tree of its own set
Identical structure is enforced e.g. by split broadcast
CPU1
CPU2
The Parallel e-kdB-tree
•
Workload distribution:
-
Christian Böhm
-
-
Global determination of the cumulated node sizes
A unit workload is a pair (r,s) of leaf nodes
The cost of a workload is
|r||s| for different leaves
and |r|(|r|+1)/2 for a single leaf (self join)
Data is redistributed: Each processor gets 1/N work
•
108
150
•
join units are clustered to preserve locality
minimize redistribution (communication) and replication
The Parallel e-kdB-tree
•
Workload execution:
-
Christian Böhm
-
109
150
delete internal structure
cum. node size too large
 second growth phase
data redistribution performed asynchronously:
Data sent in depth-first
order of tree traversal to
avoid network flooding
Christian Böhm
The Parallel e-kdB-tree
110
150
Plug & Join
[van den Bercken, Schneider, Seeger: Plug&Join: An Easy-to-Use Generic Algorithm, EDBT 2000]
Generic technique for several kinds of join
-
Christian Böhm
-
111
150
Main-memory R-tree constructed from R-sample
Partition R and S acc. to R-tree (buffers at leaves)
R
main
memory
1 2 3 4
flush
S
main
memory
1 2 3 4
Spatial Hash Join
[Lo, Ravishankar: Spatial Hash Joins, SIGMOD Conf. 1996]
•
Method for the spatial join using replication
-
Christian Böhm
-
112
150
-
Set R is partitioned without replication
Set S is partitioned according to R‘s buckets;
replication if intersection with more than 1 R-bucket
Join only corresponding buckets
Spatial Hash Join
•
Partitioning of R:
-
Christian Böhm
-
113
150
-
-
Using bootstrap-seeding, generates a seeded tree
A suitable number # of slots is determined
The set R is sampled (sample size c  #)
Using some clustering method, # cluster centers are
determined in the set
The cluster centers are the slots in the seeded tree
Assign each R-obj. to slot with least enlargement
Spatial Hash Join
•
Partitioning of S and join phase:
-
Christian Böhm
-
114
150
-
-
-
Bucket extents of R are copied to S-buckets
For spatial join: Each object s of S is assigned ...
... to all buckets b which are intersected by s
For similarity join:
... to all buckets b with mindist (s,b)  e
All corresponding bucket pairs (r,s) are joined by
constructing a quadratic split-R-tree on r.
Each obj in s is probed to the R-tree on r.
Christian Böhm
Spatial Hash Join
115
150
figure 6
Partition Based Spatial Merge Join
[Patel, DeWitt: Partition Based Spatial-Merge Join, SIGMOD Conf. 1997]
•
Again spatial join method using replication
-
Christian Böhm
-
116
150
-
Both sets R and S are partitioned with replication
Space is regularly decomposed into tiles
Partitions either correspond to tiles or are
determined from them
using hashing
Partition Based Spatial Merge Join
•
Christian Böhm
•
117
150
Duplicate pairs can be generated
 duplicate elimination by sorting according
to (OIDR, OIDS)
Initial number of partitions determined:
(|R| + |S|)  size_pt / memsize
This formula does not take into account:
-
replication
data skew
Christian Böhm
Partition Based Spatial Merge Join
118
150
Christian Böhm
Approaches Using Space Filling Curves
119
150
•
Space filling curves recursively decompose the data
space in uniform pieces
•
Various different orders:
Christian Böhm
Approaches Using Space Filling Curves
•
Efficient filter for the join:
Objects in different cells cannot
intersect each other
 Sort-merge-join e.g. on Z-order
•
Problem:
Object may cross grid lines
-
120
150
-
either decompose object (redundant)
or assign to containing cell
Christian Böhm
Approaches Using Space Filling Curves
121
150
•
If all cells have uniform size:
 Equi-join on grid cell numbers (bit strings)
•
If cells have varying size:
 Bit strings of varying length
•
Objects may intersect ...
-
if bitstr (r) is prefix of bitstr (s)
or bitstr (s) is prefix of bitstr (r)
Orenstein‘s Spatial Join
[Orenstein: An Algorithm for Computing the Overlay of k-Dim. Spaces, SSD 1991]
•
•
Allows (limited) redundancy, object decompos.
Algorithm:
-
Christian Böhm
-
122
150
-
Objects are decomposed
Partial objects are ordered according to the
lexicographical order of the bit strings
Objects are accessed in sort-merge like fashion
Two stacks are maintained to keep track of the
prefix objects of R and S.
Orenstein‘s Spatial Join
Christian Böhm
•
123
150
Stacks for prefix objects:
Orenstein‘s Spatial Join
•
Christian Böhm
•
124
150
•
Mergesort principle:
From the two files, read the next element which
is smaller according to the lexicographical order
The stacks are updated:
Discard anything thats not a prefix of new string
The new object is compared to every object on
the other stack
Orenstein‘s Spatial Join
•
Controlling redundancy:
-
Christian Böhm
-
125
150
•
Allowing no redundancy:
 Many objects approximated by empty string
Decomposing every object until basis resolution
 No manageable set of objects
2 Methods for controlling redundancy:
-
Size-bound: Given a max. number of partial objects
Error-bound: Given a max. error volume of appx.
Multidimensional Spatial Join
[Koudas, Sevcik: High-Dimensional Similarity Joins, ICDE 1997, Best Paper Award]
•
Christian Böhm
•
126
150
•
No redundancy allowed at all
Instead of stacks:
Separate level files for different bitstring length
Problems with no redundancy:
-
With increasing dimension: increasing e
Increasing chance that object intersects one of the
primary decomposition lines  approx. by < >
Christian Böhm
Multidimensional Spatial Join
127
150
Epsilon Grid Order
Christian Böhm
[Böhm, Braunmüller, Krebs, Kriegel:
Epsilon Grid Order, SIGMOD Conf. 2001]
128
150
•
Motivation like e-kdB-tree:
Based on grid with grid
line distance e
•
Possible join mates
restricted to 3d cells
•
Here no tree structure but sort order of points based on
lexicographical order of the grid cells
Epsilon Grid Order
Christian Böhm
•
129
150
Christian Böhm
Epsilon Grid Order
130
150
•
A simple exclusion test (used for I/O):
A point q with
or
cannot be join mate of point p or any point
beyond p (with respect to epsilon grid order)
•
The interval between p-[e,...,e]T and p+[e,...,e]T
is called e-interval
Epsilon Grid Order
Christian Böhm
•
131
150
Sort file and decompose it into I/O units
Christian Böhm
Epsilon Grid Order
132
150
Christian Böhm
Epsilon Grid Order
133
150
Closest Pair Queries
[Hjaltason, Samet: Incremental Distance Join Algorithms for Spatial DB, SIGMOD Conf. 1998]
•
Christian Böhm
•
134
150
•
For both point objects and spatial objects
Find k objects with least distance
Basis algorithm* for nearest neighbor search
extended to take point pairs into account
* [Hjaltason, Samet: Ranking in Spatial Databases, SSD 1995]
Basis Algorithm for NN Search
Active Page List:
Christian Böhm
proot
| |p|p3p33| |pp2312| |pp2123| |pp2213 | p21 | p22
12 | |pp414| |pp24
14
424
135
150
1
2
3
4
11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44
Hjaltason/Samet: Closest Pair Queries
•
•
•
Christian Böhm
•
136
150
•
•
Nearest Neighbor  Closest Pair Query
k result points
 k point pairs
active page list
 list of active page pairs
initialization root  pair (rootR, rootS)
distance point/query distance of point pair
mindist page/query  mindist betw. page pair
Hjaltason/Samet: Closest Pair Queries
Active Page List:
Christian Böhm
(root,p1)|(root,p2)|(root,p3)|(root,p4)
(root,root)
137
150
1
2
3
4
Hjaltason/Samet: Closest Pair Queries
•
Christian Böhm
•
138
150
•
Unidirectional node expansion:
Given a pair (ri,sj) only one node is expanded
Closest pair ranking:
Incremental version of k-closest pair queries
 stopping criterion is validation of next pair
k-nearest neighbor join:
Runs a closest pair ranking and filters out the
(k+1)st occurrence (and more) of each point of R
Hjaltason/Samet: Closest Pair Queries
•
Two strategies for tie breaks (same distance):
-
Christian Böhm
•
139
150
Depth-first
Breadth first
Three policies for tree traversal
-
Basic (one tree determines priority)
Even (priority to node with shallower depth)
Simultaneous (all possible pairs are candidates for
traversal)
Alternative Approaches
[Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]
•
Various improvements and optimizations
-
Bidirectional node expansion
Christian Böhm
(root,root)
140
150
-
-
(p1,p3) | (p2, p3) | (p2, p4) | (p1, p2) | (p3, p4) | (p1, p4)
Plane sweep technique for bidirectional node exp.
Adaptive multi-stage algorithm
•
Aggressive pruning using estimated distances
Alternative Approaches
[Corral, Manolopoulos, Theodoridis,
Vassilakopoulos: Closest Pair Queries in
Spatial Databases, SIGMOD Conf. 2000]
mindist
Christian Böhm
•
141
150
5 different algorithms for closest point queries
-
-
-
Naive: Depth-first traversal of the two R-trees
 recursive call for each child pair (ri,sj) of (r,s)
Exhaustive: like naive but prune page pairs the
mindist of which exceeds the current k-CP-dist
Simple recursive: addit. prune using minmaxdist
Alternative Approaches
•
5 different algorithms (...)
-
Before descending sort child
mindist
pairs acc. to their mindist
 fast get good distance for pruning. Analogous to
Christian Böhm
142
150
Sorted distances recursive:
[Roussopoulos, Kelley, Vincent: Nearest Neighbor Queries. SIGMOD Conf. 1995]
-
Heap algorithm:
Similar to the algorithm by Hjaltason & Samet
with some minor differences
•
New strategies for ties and different tree height
Modeling and Optimization
[Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, Wednesday, 1630]

Mating probability of index pages:

Christian Böhm

143
150
Probability that distance between two pages e
Two-fold application of Minkowski sum
Modeling and Optimization
•
I/O cost:
•
•
Christian Böhm
•
144
150
High const. cost per page
Large capacity optimum
CPU cost:
•
•
Low const. cost per page
Low capacity optimum
 CPU-performance like CPU optimized index
 I/O- performance like I/O optimized index
Plane Sweep Optimization
[Brinkhoff, Kriegel, Seeger: Efficient Process. of Spatial Joins Using R-trees, SIGMOD Conf. 1993]
For the directory in the R-tree spatial join (RSJ):
-
Christian Böhm
-
145
150
-
-
Avoid computation of all C2 box overlaps/distances
Sort boxes according to lower x-coordinates
Plane sweep to
Sweep plane
determine the box pairs:
Hold all rectangles intersected by sweep plane
in the status structure
Plane Sweep Optimization
[Arge, Procopiuc, Ramaswamy, Suel, Vitter: Scalable Sweeping Based Spatial Join, VLDB 1998]
•
A plane sweep algorithm for the spatial join
-
Christian Böhm
-
146
150
-
-
•
Partition space into k stripes
 at most 2N/k objects start/end in each stripe
Rectangle contained in a single strip is called small
Other rectangles decomposed: start, end, centerpiece
Recursive determination of intersections for startand endpieces and small rectangles
Optimum complexity O(n log n + |R
S|)
Plane Sweep Optimization
[Böhm, Krebs, Kriegel: Optimal Dimension Sweeping: A Generic Technique, submitted for pub.]
•
Reduction of the computational cost of point-distances
•
•
Plane-sweep or also sort-merge method:
Christian Böhm
•
147
150
•
•
Most important cost factor for all similairty join algorithms
Sort points on both pages according to a selected dimension
Many point pairs can be excluded beforehand
Crucial: Dimension
•
•
•
Distance or overlap
Extent of the pages
Probability model
Christian Böhm
Conclusions
148
150
Summary
•
•
Similarity join is a powerful database primitive
Supports many new applications of
-
Christian Böhm
-
149
150
•
Data mining
Data analysis
Considerable performance improvements
Summary
•
Many different algorithms for the similarity join
-
Christian Böhm
-
150
150
•
•
Most for the distance range join (e join)
Some approaches for closest pair queries
Important operation of nearest neighbor join has
almost not been considered yet
All 3 types of join have different applications
Comparison of different e join algorithms:
-
Mostly a competition for speed
Summary
•
Only few other advantages/disadvantages:
-
Scalability:
•
-
Existence of an index:
•
Christian Böhm
151
150
•
-
MSJ and e-kdB-tree have high main memory
requirements in high-dimensional spaces
Actually no matter because R-trees can be fast
constructed bottom-up. Construction time often
much less than join time
Even if preconstructed indexes exist:
Approaches based on sorting often better
No good criteria known for algorithm selection
Future Research Directions
•
Applications:
-
Many standard data mining methods accelerable:
•
•
Christian Böhm
•
•
-
New data mining methods will become feasable:
•
•
152
150
Outlier detection
Various clustering algorithms (e.g. obstacle clustering)
Hough transformation and similar analysis methods
...
•
Subspace clustering & correlation detection
Methods may become interactive
...
Future Research Directions
•
Algorithms
-
Christian Böhm
-
153
150
-
Sufficient research for e join and closest pair query
Almost no convincing approaches for the k-NN-join
Important database primitive for many applications
Parallel Algorithms
Non-vector metric data (e.g. text mining)
Approximative join algorithms
•
•
-
...
Similarity search: Approximative search often sufficient
Join performance could be considerably improved
Future Research Directions
•
Optimization of various critical parameters
-
Christian Böhm
-
154
150
-
Dimension
Replication
Index scan strategies
...
Christian Böhm
155
150
Questions
Comparison with Multiple Queries
70000
156
150
runtime [sec]
Christian Böhm
60000
SQ-DBSCAN (X-tree)
MQ-DBSCAN (Scan)
MQ-DBSCAN
J-DBSCAN (X-tree)
50000
40000
30000
20000
10000
0
0,00
0,05
0,10
epsilon
0,15
0,20
Experimente: Seitenkapazität
Color image data
64-dimensional
1000000
1000000
100000
100000
runtime [sec]
runtime [sec]
Christian Böhm
Meteorology data
9-dimensional
10000
157
150
Q-DBSCAN (R*-tree)
10000
1000
100
100
2000
4000 6000 8000 10000
page capacity
Q-DBSCAN (X-tree)
J-DBSCAN (R*-tree)
1000
0
Q-DBSCAN (Seq. Scan)
J-DBSCAN (X-tree)
0
100
200
page capacity
300
Experimente: Anfrageregion
Color image data
Color image data
140000
70000
120000
Christian Böhm
158
150
runtime [sec]
runtime [sec]
60000
50000
40000
30000
100000
80000
60000
20000
40000
10000
20000
0
0,00
0
0,05
0,10
0,15
0,20
epsilon
Querybased
Q-DBSCAN(X-tree)
(X-tree)
J-DBSCAN
(X-tree)
Joinbased
(X-tree)
0,25
0,1
0,15
0,2
0,25
epsilon
Q-OPTICS (Seq. Scan)
Q-OPTICS (X-tree)
J-OPTICS (X-tree)
0,3
Experimente: Künstliche Daten
Christian Böhm
4d-UNIFORM
159
150
8d-UNIFORM
8d-UNIFORM
Future Work

Weitere KDD-Algorithmen auf Join abstützen


Christian Böhm


Neue Algorithmen für den Similarity Join



160
150
Z.B. Outlier Detection
Subspace Clustering, Ermittlung von Korrelationen
Interaktivität

Nutzung des Optimierungspotentials (Dimension,...)
Parallelisierung
Approximative Join-Bearbeitung
„k-nearest-neighbor Joins“ und „k-best-pair Joins“
161
150
Christian Böhm
e
162
150
Christian Böhm
e
KDD Algorithms Based on Similarity Queries
DBSCAN
Christian Böhm
OPTICS
163
150
....
LOF
Simultan. Spatial
Nearest
Trend
Dist.
Neighbor Detect.
Based Classific.
Spatial
Outliers
Assoc.
....
....
Rules
Curse of Dimensionality

Cost model opens optimization potential

Optimization of the page capacity (# points)
[Böhm, Kriegel: Dynamically Optimizing High-Dimensional Index, EDBT 2000]
Christian Böhm

164
150
Optimized index compression
[Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization: An
Index Compression Technique for High-Dimensional Spaces, ICDE 2000]

Optimized dimension assignment
[Berchtold, Böhm, Keim, Kriegel, Xu: Optimal Multidimensional Query
Processing Using Tree Striping, DaWaK 2000]