The Similarity Join: A Powerful Database Primitive for High Performance Data Mining
Christian Böhm, Ludwig Maximilians Universität München
Tutorial, 17th Int. Conf. on Data Engineering, 2001-04-02

Motivation

High Performance Data Mining
• Marketing
• Fraud Detection
• CRM
• Online Scoring
• OLAP
• Fast decisions require knowledge just in time

Previous Approaches to Fast Data Mining
• Sampling, approximations (grid), dimensionality reduction, parallelism
• Drawbacks: loss of quality; expensive & complex
• All approaches combinable with join
• KDD applications get parallelism for free

Feature Based Similarity

Simple Similarity Queries
• Specify query object and
  - find similar objects: range query
  - find the k most similar objects: nearest neighbor query

Similarity – Range Queries
• Given: query point q, maximum distance ε
• Formal definition:
• Cardinality of the result set is difficult to control:
  - ε too small: no results
  - ε too large: complete DB

Index Based Processing of Range Queries

Similarity – Nearest Neighbor Queries
• Given: query point q
• Formal definition:
• Ties must be handled:
  - Result set enlargement
  - Non-determinism (don't care)

Index Based Processing of NN Queries

k-Nearest Neighbor Search and Ranking
• k-nearest neighbor query:
  - Search not only for one nearest neighbor but for k
  - Stop distance is the distance of the k-th (last) candidate point
• Ranking query:
  - Incremental version of k-nearest neighbor search
  - First call of FetchNext() returns first neighbor
  - Second call of FetchNext() returns second neighbor ...
  - Typically only few results are fetched: don't generate all!

Advanced Applications: Duplicates
• Duplicate detection
  - E.g.
Astronomic catalogue matching (catalogues C1, C2)
• Similarity queries for large number of query objects

Advanced Applications: Data Mining
• Density based clustering (DBSCAN)

What is a Similarity Join?
• Given two sets R, S of points
• Find all pairs of points according to similarity
• Various exact definitions for the similarity join

What is a Similarity Join?
• Similarity join corresponds to a set of identical similarity queries, evaluated for a large number of query points
• Sequential evaluation of similarity queries with an index is the easiest similarity join algorithm
• Many more sophisticated approaches exist
• Powerful database primitive to support modern applications of data analysis and data mining

Curse of Dimensionality
• Index structures fail (outperformed by the sequential scan) if the data space dimension becomes too high
• Many effects, usually called Curse of Dimensionality

Curse of Dimensionality
[Berchtold, Böhm, Keim, Kriegel: A Cost Model for High-Dim. Nearest Neighb. Search, PODS 1997]
With increasing dimension also increases ...
• Typical radius of range queries
• Distance of a point to its nearest neighbor
• Edge length of regions of index structures: 0.5^1 = 0.5, 0.7^2 ≈ 0.5, 0.8^3 ≈ 0.5

Curse of Dimensionality
• A cost model for the access probability of index pages using the concept of Minkowski Sum

Curse of Dimensionality
• Binomial formula:

Curse of Dimensionality
• Asymptotic behavior of similarity search
• Suppose number points ... V_Mink ... 2^d · V_Sphere
• Access probability = O(2^d), but limited by 100%
• Saturation area with near linear I/O cost O(n)

Curse of Dimensionality
• For high dimension: each similarity query accesses a considerable fraction of all index pages.
• Index does not pay off, anyway sequ.
scan
• Strategies needed for efficient evaluation
• Join: base applications on a powerful database primitive that exploits the high number of queries
• Efficient algorithms for Similarity Join

Organization of the Tutorial
1. Motivation
2. Defining the Similarity Join
3. Applications of the Similarity Join
4. Similarity Join Algorithms
5. Conclusion & Future Potential

Defining the Similarity Join

What Is a Similarity Join?
Intuitive notion: 3 properties of the similarity join
1. The similarity join is a join in the relational sense:
   two sets R and S are combined into one such that the new set contains pairs of points that fulfill a join condition
2. Vector or metric objects rather than ordinary tuples of any type
3. The join condition involves similarity

What Is a Similarity Join?
• Similarity Join
  - Distance Range Join
  - NN-based Approaches
    • Closest Pair Query
    • k-NN Join

Distance Range Join (ε-Join)
• Intuition: given parameter ε, all pairs of points where distance ≤ ε
• Formal definition:
• In SQL-like notation:
    SELECT * FROM R, S
    WHERE ||R.obj - S.obj|| ≤ ε

Distance Range Join (ε-Join)
• Most widespread and best evaluated join
• Often also called the similarity join

Distance Range Join (ε-Join)
• The distance range self join is of particular importance for data mining (clustering) and robust similarity search
• Change definition to exclude trivial results

Distance Range Join (ε-Join)
• Disadvantage for the user: result cardinality difficult to control:
  - ε too small: no result pairs are produced
  - ε too large: all pairs from R × S are produced
• Worst case complexity is Ω(|R|·|S|), since all pairs may have to be reported
• For reasonable result set size, advanced join algorithms yield asymptotic behavior which is better than O(|R|·|S|)

k-Closest Pair Query
• Intuition: find those k pairs that yield least distance
• The principle
of nearest neighbor search is applied on a per-pair basis
• Classical problem of Computational Geometry
• In the database context introduced by [Hjaltason & Samet, Incremental Distance Join Algorithms, SIGMOD Conf. 1998]
• There called distance join

k-Closest Pair Query
• Formal definition:
• Ties solved by result set enlargement
• Other possibility: non-determinism (don't care which of the tie tuples are reported)

k-Closest Pair Query
• In SQL notation:
    SELECT * FROM R, S
    ORDER BY ||R.obj - S.obj||
    STOP AFTER k

k-Closest Pair Query
• Self-join:
  - Exclude |R| trivial pairs (r_i, r_i) with distance 0
  - Result is symmetric
• Applications:
  - Find all pairs of stock quotes in a database that are most similar to each other
  - Find music scores which are similar to each other
  - Noise robust duplicate elimination

k-Closest Pair Query
• Incremental ranking instead of exact specification of k
• No STOP AFTER clause:
    SELECT * FROM R, S
    ORDER BY ||R.obj - S.obj||
• Open cursor and fetch results one-by-one
• Important: only few results typically fetched; don't determine the complete ranking

k-Nearest Neighbor Join
• Intuition: combine each point with its k nearest neighbors
• The principle of nearest neighbor search is applied for each point of R
• In the database context introduced by [Hjaltason & Samet, Incremental Distance Join Algorithms, SIGMOD Conf.
1998]
• There called distance semijoin

k-Nearest Neighbor Join
• Formal definition:
• Ties solved by result set enlargement
• Other possibility: non-determinism (don't care which of the tie tuples are reported)

k-Nearest Neighbor Join
• In SQL notation (limited to k = 1):
    SELECT * FROM R, S
    GROUP BY R.obj
    ORDER BY ||R.obj - S.obj||
    STOP AFTER K (* k *)

k-Nearest Neighbor Join
• The k-NN-join is inherently asymmetric:

k-Nearest Neighbor Join
• Applications of the k-NN-join:
  - k-means and k-medoid clustering
  - Simultaneous nearest neighbor classification: a large set of new objects without class label is assigned according to the majority of the k nearest neighbors of each of the new objects
    • Astronomic observation
    • Online customer scoring
• Ranking on the k-NN-join is difficult to define

Further possible definitions
• Inverse nearest neighbor join: combine each point r_i of R with every point of S which considers r_i to be its nearest neighbor
• Metric data sets: instead of vectors use arbitrary objects with a distance metric
  - E.g. text sequences with edit distance
• Text mining using the similarity join applies A*

Applications

Density Based Data Mining

Schema for Data Mining Algorithms
Algorithmic Schema A1
    foreach Point p ∈ D
        PointSet S := SimilarityQuery (p, ε);
        foreach Point q ∈ S
            DoSomething (p,q) ;

Iterative similarity queries and cache
• Due to curse of dimensionality: no sufficient inter-query locality of the pages
Figure: average cache hit ratio (0.00 to 0.08) vs. dimension d (0 to 40), for 10-nn query and sim. range query

Idea: Query Order Transformation
[Böhm, Braunmüller, Breunig, Kriegel: High Perf. Clustering based on the Sim.
Join, CIKM 2000]
• Transform order of similarity queries such that packing of points into pages is considered
• If one pair of index pages is in the cache: process all sim. queries regarding this pair
• Each pair of pages is considered at most once

Transform the Original Schema A1 ...
Algorithmic Schema A1
    foreach Point p ∈ D
        PointSet S := SimilarityQuery (p, ε);
        foreach Point q ∈ S
            DoSomething (p,q) ;

... Into a New Algorithmic Schema A2
    foreach DataPage P
        LoadAndPinPage (P) ;
        foreach DataPage Q
            if (mindist (P,Q) ≤ ε)
                CachedAccess (Q) ;
                foreach Point p ∈ P
                    foreach Point q ∈ Q
                        if (distance (p,q) ≤ ε)
                            DoSomething' (p,q) ;
        UnFixPage (P) ;

Similarity Join
A2 is a similarity join algorithm:
    foreach PointPair (p,q) ∈ (R ⋈ R)
        DoSomething' (p,q) ;
where ⋈ denotes the similarity join:
    SELECT * FROM R r1, R r2
    WHERE distance (r1.object, r2.object) ≤ ε

Implementation Variants
• Change of the order in which points are combined must partially be considered
• Semantic rewriting: change algorithm to take unknown order into account
• Materialization: materialize join result j and answer original queries by j

Example Clustering Algorithms
• DBSCAN [Ester, Kriegel, Sander, Xu: A Density Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, KDD 1996]
  - Flat clustering (non hierarchical)
  - Semantic rewriting
• OPTICS [Ankerst, Breunig, Kriegel, Sander: OPTICS: Ordering Points To Identify the Clustering Structure, SIGMOD Conf. 1999]
  - Hierarchical cluster structure
  - Materialization

Transformation by Semantic Rewriting
• Rewrite the algorithm to take the changed order of pairs into account
• Don't assume any specific order in which pairs are generated
• Arbitrary similarity join algorithm possible

Example: DBSCAN
• p core object in D wrt.
ε, MinPts: |N_ε(p)| ≥ MinPts
• p directly density-reachable from q in D wrt. ε, MinPts:
  1) p ∈ N_ε(q) and
  2) q is a core object wrt. ε, MinPts
• density-reachable: transitive closure
• cluster:
  - maximal wrt. density reachability
  - any two points are density-reachable from a third object

Implementation of DBSCAN on Join
• Core point property: DoSomething() increments a counter attribute
• Determination of maximal density-reachable clusters: DoSomething():
  - Assign ID of known cluster point to unknown cluster points
  - Unify two known clusters

Implementing OPTICS (Materialization)
• The join result is predetermined before starting the actual OPTICS algorithm
• The result is materialized in some table with GROUP-BY on the first point of the pair
• The OPTICS algorithm runs unchanged
• Similarity queries are answered from the join materialization table (much faster)
• Disadvantage: high memory requirements

Experimental Results: Page Capacity
Figure: runtime [sec] vs. page capacity; panels: color image data (64-dimensional), meteorology data (9-dimensional); curves: Q-DBSCAN (Seq. Scan), Q-DBSCAN (R*-tree), Q-DBSCAN (X-tree), J-DBSCAN (R*-tree), J-DBSCAN (X-tree)

Experimental Results: Scalability
Figure: runtime [sec] vs. size of database [points]; panels: meteorology data, color image data; curves: Q-DBSCAN (Seq. Scan), Q-OPTICS (Seq.
Scan), Q-DBSCAN (X-tree), J-DBSCAN (X-tree), Q-OPTICS (X-tree), J-OPTICS (X-tree)

Experimental Results: Query Range
Figure: runtime [sec] vs. epsilon, two panels on color image data; curves: query based Q-DBSCAN (X-tree), join based J-DBSCAN (X-tree); Q-OPTICS (Seq. Scan), Q-OPTICS (X-tree), J-OPTICS (X-tree)

Robust Similarity Search
[Agrawal, Lin, Sawhney, Shim: Fast Similarity Search in the Presence of Noise,...., VLDB 1995]
• Usual similarity search with feature vectors: not robust with respect to
  - Noise: Euclidean distance sensitive to mismatch in single dimension
  - Partial similarity: not complete objects are similar, but parts thereof
• Concept to achieve robustness: decompose each data object and query object into sub-objects and search for a maximum number of similar sub-objects

Robust Similarity Search
• Prominent concept borrowed from IR research: string decomposition. Search for similar words by indexing of character triplets (n-lets)
• Query transformed to set of similarity queries: similarity join between query set and data set
• Robustness achieved in result recombination:
  - Noise robustness: ignore missing matches
  - Partial search: don't enforce complete recombination

Robust Similarity Search
Applications:
• Robust search for sequences: [Agrawal, Lin, Sawhney, Shim: Fast Similarity Search in the Presence of Noise,...., VLDB 1995]
• Principle can be generalized for objects like
  - Raster images
  - CAD objects
  - 3D molecules
  - etc.

Astronomic Catalogue Matching
• Relative position of catalogues approx.
known:
  - Position and intensity parameters in different bands
• C1 ⋈ C2
• Determine ε according to device tolerance

Astronomic Catalogue Matching
• Relative position unknown:
  - Match according to triangles and intensity
• Search triangles and store parameters (height,...)
• triangles (C1) ⋈ triangles (C2)

k-Nearest Neighbor Classification
• Simultaneous classification of many objects
  [Braunmüller, Ester, Kriegel, Sander: Efficiently Supporting Multiple Similarity Queries for Mining in Metric Databases, ICDE 2000]
  - Astronomy
    • Some 10,000 new objects collected per night
    • Classify according to some millions of known objects
  - Online customer scoring
    • Some 1,000 customers online
    • Rate them according to some millions of known patterns

k-Nearest Neighbor Classification
• Example: k=3 (figure: objects with known class, new objects)
• New objects ⋈ Known objects

k-Means and k-Medoid Clustering
• k points initially randomly selected ("centers")
• Each database point assigned to nearest center
• Centers are re-determined
  - k-means: means of all assigned points (artificial p.)
  - k-medoid: one central database point of the cluster
• Assignment and center determination are repeated until convergence

k-Means and k-Medoid Clustering
• Example (k-means with k = 3): convergence!
• Each assignment phase: DB-Points ⋈ Centers

Similarity Join Algorithms

Algorithms' Overview
• Similarity join
  - Range dist. join
    • Index based
    • on-the-fly index
    • Hashing based
    • Sorting based
  - Closest pair qu.
  - k-NN join
  - Optimization
    • Cost modeling
    • CPU optimizing

Algorithms' Overview
Distance range join (ε-join)
• Index joins with depth-first and breadth-first search
  [Brinkhoff, Kriegel, Seeger: Efficient Proc. of Spatial Joins Using R-trees, SIGMOD Conf.
1993]
  [Brinkhoff, Kriegel, Seeger: Parallel Processing of Spatial Joins Using R-trees, ICDE 1996]
  [Huang, Jing, Rundensteiner: Spatial Joins Usg. R-trees: Breadth-First Traversal..., VLDB 1997]
• Index construction on-the-fly
  [Lo, Ravishankar: Spatial Joins Using Seeded Trees, SIGMOD Conf. 1994]
  [Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]
  [Shafer, Agrawal: Parallel Algorithms for High-dimensional Similarity Joins, VLDB 1997]
  [van den Bercken, Schneider, Seeger: Plug&Join, EDBT 2000]
• Join-algorithms based on hashing
  [Lo, Ravishankar: Spatial Hash Joins, SIGMOD Conf. 1996]
  [Patel, DeWitt: Partition Based Spatial-Merge Join, SIGMOD Conf. 1997]

Algorithms' Overview
• Join-algorithms based on sorting
  [Orenstein: An Algorithm for Computing the Overlay of k-Dim. Spaces, SSD 1991]
  [Koudas, Sevcik: High-Dimensional Similarity Joins, ICDE 1997]
  [Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order, SIGMOD Conf. 2001]
• Closest pair query and nearest neighbor join
  [Hjaltason, Samet: Incremental Distance Join Algorithms for Spatial DB, SIGMOD Conf. 1998]
  [Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]
  [Corral, Manolopoulos, Theodoridis, Vassilakopoulos: Closest Pair Queries in Spatial Databases, SIGMOD Conf.
2000]
• Optimization approaches
  [Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, Wednesday 1630]
  [Böhm, Krebs, Kriegel: Optimal Dimension Sweeping: A Generic Technique, submitted]

Nested Loop Join
• Simple nested loop join:
  - Iterate over R-points
  - Nested iteration over S-points
  - S is scanned |R| times, high I/O cost
• Nested block loop join:
  - First iterate over blocks
  - Nested iterate over tuples
  - S scanned |R|/|B| times

Indexed Nested Loop Join
• Iterate over every point of R
• Determine matches in S by similarity queries on the index
• Due to the curse of dimensionality: performance deterioration of the similarity queries
• Then not competitive with nested loop join (depends on dimensionality and selectivity determined by ε)

Spatial Join
• 2D polygon databases
• Join predicate: overlap
• Conservative approximation: MBR (axis-parallel rectangle)

Similarity Join
• High-D point databases
• Join predicate: distance
• Map ε-join to spatial join: cube with edge length ε
• Some strategies can be borrowed from the spatial join

R-tree Spatial Join (RSJ)
[Brinkhoff, Kriegel, Seeger: Efficient Process. of Spatial Joins Using R-trees, SIGMOD Conf. 1993]
• Originally: spatial join for 2D rect.
intersection
• Depth-first search in R-trees and similar indexes
• Assumption: index preconstructed on R and S
• Simple recursion scheme (equal tree height):
    procedure r_tree_join (R, S: page)
        foreach r ∈ R.children do
            foreach s ∈ S.children do
                if intersect (r,s) then
                    r_tree_join (r,s) ;

R-tree Spatial Join (RSJ)
• Adaptation for the similarity join: distance predicate rather than intersection
• For pair (R,S) of pages: mindist (R,S), the least possible distance of two points in (R,S)

R-tree Spatial Join (RSJ)
    procedure r_tree_sim_join (R, S, ε)
        if IsDirpg (R) ∧ IsDirpg (S) then
            foreach r ∈ R.children do
                foreach s ∈ S.children do
                    if mindist (r,s) ≤ ε then
                        CacheLoad(r); CacheLoad(s);
                        r_tree_sim_join (r,s,ε) ;
        else (* assume R,S both DataPg *)
            foreach p ∈ R.points do
                foreach q ∈ S.points do
                    if |p - q| ≤ ε then
                        report (p,q);

R-tree Spatial Join (RSJ)
• Extension to different tree heights straightforward
• Several additional optimizations possible
• CPU-bound
  - Cost dominated by point-distance calculations
• Disadvantages
  - No clear strategies for page access prioritization
  - Single page accesses
  - Can be outperformed by nested block loop join

Parallel RSJ
[Brinkhoff, Kriegel, Seeger: Parallel Processing of Spatial Joins Using R-trees, ICDE 1996]
• Again spatial join for 2D rectangle intersection
• Three phases of parallel execution:
  - Task creation (non-parallel)
  - Task assignment (non-parallel)
  - Task execution (completely parallel)
• A task corresponds to a pair of subtrees
  - At high tree level (e.g.
root or second level)

Parallel RSJ
• Example for the task definition

Parallel RSJ
• Strategy 1: Static Range Assignment

Parallel RSJ
• Strategy 2: Static Round-Robin Assignment

Parallel RSJ
• Strategy 3: Dynamic task assignment
  - Processor requests a task when idle
  - Best load balancing

Breadth-First R-tree Join (BFRJ)
[Huang, Jing, Rundensteiner: Spatial Joins Using R-trees: Breadth-First Traversal..., VLDB 1997]
• Again spatial join for 2D rectangle intersection
• Shortcoming of RSJ:
  - No strategy in outer loop improving locality in inner
  - Depth-first traversal not flexible, because a pair of tree branches must be ended before next pair started: unnecessary page accesses

Breadth-First R-tree Join (BFRJ)
• Solution:
  - Proceed level by level (breadth-first traversal)
  - Determine all relevant pairs for the next level: intermediate join index (IJI)
  - Sort the IJI according to suitable order before accessing the next level: global optimization strategy

Breadth-First R-tree Join (BFRJ)
Options for ordering:
1. No particular order
2. Consider the lower x-coordinate of R's nodes
3. Sum of the centers of x-coordinates of R and S
4. x-coordinate of center of common MBR
5. Hilbert-value of center of common MBR
Higher locality (better cache hit rates) for better ordering strategies.
Approaches without Preconstructed Index
• Indexes can be constructed temporarily for join
• R-tree construction by INSERT too expensive
• Use cheap bottom-up construction
  - Hilbert R-trees: O(n log n)
    [Kamel, Faloutsos: Hilbert R-trees: An Improved R-tree using Fractals, VLDB 1994]
    Sort points by space filling curve (SFC) and pack adjacent points into a page
  - Buffer trees
    [van den Bercken, Seeger, Widmayer: A Generic Approach to Bulk Loading.., VLDB 1997]
  - Repeated partitioning
    [Berchtold, Böhm, Kriegel: Improving the Query Performance ..., EDBT 1998]
• Index construction can amortize during join

Seeded Trees
[Lo, Ravishankar: Spatial Joins Using Seeded Trees, SIGMOD Conf. 1994]
• Again spatial join for 2D rectangle intersection
• Assumption: only one data set (R) is supported by index
• Typical application: set S is subquery result
• Idea: use partitioning of R as a template for S

Seeded Trees
• Motivation
  - Early inserts to R-trees decide initial organization
  - We know that S will be matched with R
  - Start with small template tree instead of empty root: seed levels

Seeded Trees
• Tree consists of
  - Seed levels
  - Grown levels
• Tree unbalanced
• Phases of tree construction:
  - Seeding phase
  - Growing phase
  - Cleanup phase

Seeded Trees
• Seeding phase:
  - Copy k levels of the R-tree of set R
  - Last level: defined MBRs, but empty child pointers, called slots
  - Three strategies for (slot and other) MBRs:
    • Copy complete MBR
    • Use only center point rather than complete MBR
    • Center point at slot level, otherwise complete MBR

Seeded Trees
• Growing phase
  - Insert of points: choose subtree like in R*-tree
  - Seed level is not affected during growth phase:
    • No insertions to seed level nodes
    • No split of seed level nodes
  - If point is inserted into empty slot (NULL pointer):
    • A new empty data node is allocated
    • Further,
this node is treated like a root in R-trees: on overflow, no split is propagated upward (new root)
    • The R-trees in the slots are called grown subtrees.

Seeded Trees
• Growing phase (cont...)
  - Various strategies for update of the MBRs in the seed levels during insert operations:
    • No updates
    • Enlarge bounding box after insert of a not contained point
    • Determine minimum bounding rectangle after insert ...
  - In seed levels: in general, the page regions are ...
    • Not bounding rectangles, i.e. no conservative appx. of set
    • Not minimal

Seeded Trees
• Cleanup phase
  - The MBR property of page regions is needed ...
    • ... not for tree construction
    • ... but for join processing
  - Therefore, actual MBRs are determined in cleanup
  - Empty slots (without grown subtrees) are deleted
  - No attempt to make the tree balanced
• Join the two indexed sets R and S like in RSJ

Seeded Trees
• Experimental results (spatial data)

The ε-kdB-tree
[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]
• Algorithm for the range distance self join
• General idea: grid approximation where grid line distance = ε
• Not all dimensions used for decomposition: as many dimensions as needed to reach a defined node capacity

The ε-kdB-tree
• Node fanout: 1/ε (assuming data space [0..1]^d)
• Tree structure is specific to given parameter ε: must be constructed for each join
• The ε-kdB-trees of two adjacent stripes are assumed to fit into main memory

The ε-kdB-tree
    procedure t_match (R, S: node)
        if is_leaf (R) ∨ is_leaf (S) then ...
        else for i:=1 to 1/ε - 1 do
                t_match (R.child[i], S.child[i]) ;
                t_match (R.child[i], S.child[i+1]) ;
                t_match (R.child[i+1], S.child[i]) ;
             t_match (R.child[1/ε], S.child[1/ε]) ;

The ε-kdB-tree
• Limitation: for large ε values not really scalable
• In high-dimensional cases, ε = 0.3 can be typical: 60% of data must be held in main memory
• As long as data fit into main memory: ε-kdB-tree is one of the best similarity join algorithms
• Unfortunately: IBM does not provide any code for comparison

The Parallel ε-kdB-tree
[Shafer, Agrawal: Parallel Algorithms for High-dimensional Similarity Joins, VLDB 1997]
• Parallel construction of the ε-kdB-tree:
  - Each processor has random subset of the data (1/N)
  - Each processor constructs ε-kdB-tree of its own set
  - Identical structure is enforced e.g. by split broadcast

The Parallel ε-kdB-tree
• Workload distribution:
  - Global determination of the cumulated node sizes
  - A unit workload is a pair (r,s) of leaf nodes
  - The cost of a workload is |r|·|s| for different leaves and |r|·(|r|+1)/2 for a single leaf (self join)
  - Data is redistributed: each processor gets 1/N work
    • join units are clustered to preserve locality
    • minimize redistribution (communication) and replication

The Parallel ε-kdB-tree
• Workload execution:
  - delete internal structure
  - cum. node size too large: second growth phase
  - data redistribution performed asynchronously: data sent in depth-first order of tree traversal to avoid network flooding

Plug & Join
[van den Bercken, Schneider, Seeger: Plug&Join: An Easy-to-Use Generic Algorithm, EDBT 2000]
• Generic technique for several kinds of join
  - Main-memory R-tree constructed from R-sample
  - Partition R and S acc.
to R-tree (buffers at leaves)
Figure: R and S partitioned in main memory into buffers 1 to 4, which are flushed

Spatial Hash Join
[Lo, Ravishankar: Spatial Hash Joins, SIGMOD Conf. 1996]
• Method for the spatial join using replication
  - Set R is partitioned without replication
  - Set S is partitioned according to R's buckets; replication if intersection with more than 1 R-bucket
  - Join only corresponding buckets

Spatial Hash Join
• Partitioning of R:
  - Using bootstrap-seeding, generates a seeded tree
  - A suitable number # of slots is determined
  - The set R is sampled (sample size c · #)
  - Using some clustering method, # cluster centers are determined in the set
  - The cluster centers are the slots in the seeded tree
  - Assign each R-obj. to slot with least enlargement

Spatial Hash Join
• Partitioning of S and join phase:
  - Bucket extents of R are copied to S-buckets
  - For spatial join: each object s of S is assigned to all buckets b which are intersected by s
  - For similarity join: ... to all buckets b with mindist (s,b) ≤ ε
  - All corresponding bucket pairs (r,s) are joined by constructing a quadratic split-R-tree on r. Each obj in s is probed to the R-tree on r.

Partition Based Spatial Merge Join
[Patel, DeWitt: Partition Based Spatial-Merge Join, SIGMOD Conf.
1997]
• Again spatial join method using replication
  - Both sets R and S are partitioned with replication
  - Space is regularly decomposed into tiles
  - Partitions either correspond to tiles or are determined from them using hashing

Partition Based Spatial Merge Join
• Duplicate pairs can be generated: duplicate elimination by sorting according to (OID_R, OID_S)
• Initial number of partitions determined: (|R| + |S|) · size_pt / memsize
• This formula does not take into account:
  - replication
  - data skew

Approaches Using Space Filling Curves
• Space filling curves recursively decompose the data space in uniform pieces
• Various different orders:

Approaches Using Space Filling Curves
• Efficient filter for the join: objects in different cells cannot intersect each other; sort-merge-join e.g. on Z-order
• Problem: object may cross grid lines
  - either decompose object (redundant)
  - or assign to containing cell

Approaches Using Space Filling Curves
• If all cells have uniform size: equi-join on grid cell numbers (bit strings)
• If cells have varying size: bit strings of varying length
• Objects may intersect ...
  - if bitstr (r) is prefix of bitstr (s) or
  - bitstr (s) is prefix of bitstr (r)

Orenstein's Spatial Join
[Orenstein: An Algorithm for Computing the Overlay of k-Dim. Spaces, SSD 1991]
• Allows (limited) redundancy, object decomposition
• Algorithm:
  - Objects are decomposed
  - Partial objects are ordered according to the lexicographical order of the bit strings
  - Objects are accessed in sort-merge like fashion
  - Two stacks are maintained to keep track of the prefix objects of R and S.
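
The sort-merge over bit strings with two prefix stacks can be sketched as follows. This is a minimal illustration, not Orenstein's original code: the bit strings are assumed to be given as Python strings of '0' and '1' (one per partial object, e.g. from a Z-order decomposition), and the function name `prefix_merge_join` and the `(bitstring, object_id)` input format are hypothetical. The function reports every candidate pair in which one bit string is a prefix of the other; an exact intersection test on the candidates would follow as a refinement step.

```python
def prefix_merge_join(r_items, s_items):
    """Candidate pairs (r_id, s_id) where one bit string is a prefix of the other.

    r_items, s_items: lists of (bitstring, object_id) tuples (assumed format).
    """
    r_sorted, s_sorted = sorted(r_items), sorted(s_items)  # lexicographical order
    r_stack, s_stack = [], []   # prefix objects of R and S read so far
    result = []
    i = j = 0
    while i < len(r_sorted) or j < len(s_sorted):
        # merge step: read the lexicographically smaller head of the two files
        take_r = j == len(s_sorted) or (
            i < len(r_sorted) and r_sorted[i][0] <= s_sorted[j][0])
        if take_r:
            bits, oid = r_sorted[i]
            i += 1
        else:
            bits, oid = s_sorted[j]
            j += 1
        # stack update: discard everything that is not a prefix of the new string
        r_stack = [e for e in r_stack if bits.startswith(e[0])]
        s_stack = [e for e in s_stack if bits.startswith(e[0])]
        # compare the new object to every object on the other stack
        if take_r:
            result.extend((oid, s_oid) for _, s_oid in s_stack)
            r_stack.append((bits, oid))
        else:
            result.extend((r_oid, oid) for _, r_oid in r_stack)
            s_stack.append((bits, oid))
    return result


# usage: "0" is a prefix of "01" and "00"; "01" is a prefix of "010"; "1" of "11"
pairs = prefix_merge_join([("0", "r1"), ("010", "r2"), ("11", "r3")],
                          [("01", "s1"), ("1", "s2"), ("00", "s3")])
```

The stacks are written as filtered lists for brevity. Since the entries on a stack always form a chain of prefixes, the filter removes exactly the entries that a pop loop from the top of the stack would remove.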
Orenstein's Spatial Join
• Stacks for prefix objects:

Orenstein's Spatial Join
• Mergesort principle: from the two files, read the next element which is smaller according to the lexicographical order
• The stacks are updated: discard anything that's not a prefix of the new string
• The new object is compared to every object on the other stack

Orenstein's Spatial Join
• Controlling redundancy:
  - Allowing no redundancy: many objects approximated by empty string
  - Decomposing every object until basis resolution: no manageable set of objects
• 2 methods for controlling redundancy:
  - Size-bound: given a max. number of partial objects
  - Error-bound: given a max. error volume of appx.

Multidimensional Spatial Join
[Koudas, Sevcik: High-Dimensional Similarity Joins, ICDE 1997, Best Paper Award]
• No redundancy allowed at all
• Instead of stacks: separate level files for different bitstring lengths
• Problems with no redundancy:
  - With increasing dimension: increasing ε
  - Increasing chance that object intersects one of the primary decomposition lines: approx. by < >

Epsilon Grid Order
[Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order, SIGMOD Conf.
2001] 128 150 • Motivation like e-kdB-tree: Based on grid with grid line distance e • Possible join mates restricted to 3d cells • Here no tree structure but sort order of points based on lexicographical order of the grid cells Epsilon Grid Order Christian Böhm • 129 150 Christian Böhm Epsilon Grid Order 130 150 • A simple exclusion test (used for I/O): A point q with or cannot be join mate of point p or any point beyond p (with respect to epsilon grid order) • The interval between p-[e,...,e]T and p+[e,...,e]T is called e-interval Epsilon Grid Order Christian Böhm • 131 150 Sort file and decompose it into I/O units Christian Böhm Epsilon Grid Order 132 150 Christian Böhm Epsilon Grid Order 133 150 Closest Pair Queries [Hjaltason, Samet: Incremental Distance Join Algorithms for Spatial DB, SIGMOD Conf. 1998] • Christian Böhm • 134 150 • For both point objects and spatial objects Find k objects with least distance Basis algorithm* for nearest neighbor search extended to take point pairs into account * [Hjaltason, Samet: Ranking in Spatial Databases, SSD 1995] Basis Algorithm for NN Search Active Page List: Christian Böhm proot | |p|p3p33| |pp2312| |pp2123| |pp2213 | p21 | p22 12 | |pp414| |pp24 14 424 135 150 1 2 3 4 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44 Hjaltason/Samet: Closest Pair Queries • • • Christian Böhm • 136 150 • • Nearest Neighbor Closest Pair Query k result points k point pairs active page list list of active page pairs initialization root pair (rootR, rootS) distance point/query distance of point pair mindist page/query mindist betw. 
page pair Hjaltason/Samet: Closest Pair Queries Active Page List: Christian Böhm (root,p1)|(root,p2)|(root,p3)|(root,p4) (root,root) 137 150 1 2 3 4 Hjaltason/Samet: Closest Pair Queries • Christian Böhm • 138 150 • Unidirectional node expansion: Given a pair (ri,sj) only one node is expanded Closest pair ranking: Incremental version of k-closest pair queries stopping criterion is validation of next pair k-nearest neighbor join: Runs a closest pair ranking and filters out the (k+1)st occurrence (and more) of each point of R Hjaltason/Samet: Closest Pair Queries • Two strategies for tie breaks (same distance): - Christian Böhm • 139 150 Depth-first Breadth first Three policies for tree traversal - Basic (one tree determines priority) Even (priority to node with shallower depth) Simultaneous (all possible pairs are candidates for traversal) Alternative Approaches [Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000] • Various improvements and optimizations - Bidirectional node expansion Christian Böhm (root,root) 140 150 - - (p1,p3) | (p2, p3) | (p2, p4) | (p1, p2) | (p3, p4) | (p1, p4) Plane sweep technique for bidirectional node exp. Adaptive multi-stage algorithm • Aggressive pruning using estimated distances Alternative Approaches [Corral, Manolopoulos, Theodoridis, Vassilakopoulos: Closest Pair Queries in Spatial Databases, SIGMOD Conf. 2000] mindist Christian Böhm • 141 150 5 different algorithms for closest point queries - - - Naive: Depth-first traversal of the two R-trees recursive call for each child pair (ri,sj) of (r,s) Exhaustive: like naive but prune page pairs the mindist of which exceeds the current k-CP-dist Simple recursive: addit. prune using minmaxdist Alternative Approaches • 5 different algorithms (...) - Before descending sort child mindist pairs acc. to their mindist fast get good distance for pruning. 
Analogous to Christian Böhm 142 150 Sorted distances recursive: [Roussopoulos, Kelley, Vincent: Nearest Neighbor Queries. SIGMOD Conf. 1995] - Heap algorithm: Similar to the algorithm by Hjaltason & Samet with some minor differences • New strategies for ties and different tree height Modeling and Optimization [Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, Wednesday, 1630] Mating probability of index pages: Christian Böhm 143 150 Probability that distance between two pages e Two-fold application of Minkowski sum Modeling and Optimization • I/O cost: • • Christian Böhm • 144 150 High const. cost per page Large capacity optimum CPU cost: • • Low const. cost per page Low capacity optimum CPU-performance like CPU optimized index I/O- performance like I/O optimized index Plane Sweep Optimization [Brinkhoff, Kriegel, Seeger: Efficient Process. of Spatial Joins Using R-trees, SIGMOD Conf. 1993] For the directory in the R-tree spatial join (RSJ): - Christian Böhm - 145 150 - - Avoid computation of all C2 box overlaps/distances Sort boxes according to lower x-coordinates Plane sweep to Sweep plane determine the box pairs: Hold all rectangles intersected by sweep plane in the status structure Plane Sweep Optimization [Arge, Procopiuc, Ramaswamy, Suel, Vitter: Scalable Sweeping Based Spatial Join, VLDB 1998] • A plane sweep algorithm for the spatial join - Christian Böhm - 146 150 - - • Partition space into k stripes at most 2N/k objects start/end in each stripe Rectangle contained in a single strip is called small Other rectangles decomposed: start, end, centerpiece Recursive determination of intersections for startand endpieces and small rectangles Optimum complexity O(n log n + |R S|) Plane Sweep Optimization [Böhm, Krebs, Kriegel: Optimal Dimension Sweeping: A Generic Technique, submitted for pub.] 
• Reduction of the computational cost of point distances
  - Most important cost factor for all similarity join algorithms
• Plane-sweep or also sort-merge method:
  - Sort points on both pages according to a selected dimension
  - Many point pairs can be excluded beforehand
• Crucial: the choice of the dimension
  - Distance or overlap
  - Extent of the pages
  - Probability model

Conclusions

Summary
• Similarity join is a powerful database primitive
• Supports many new applications of
  - Data mining
  - Data analysis
• Considerable performance improvements
• Many different algorithms for the similarity join
  - Most for the distance range join (ε-join)
  - Some approaches for closest pair queries
• The important operation of the nearest neighbor join has hardly been considered yet
• All 3 types of join have different applications
• Comparison of different ε-join algorithms:
  - Mostly a competition for speed
  - Only few other advantages/disadvantages:
    - Scalability: MSJ and ε-kdB-tree have high main memory requirements in high-dimensional spaces
    - Existence of an index: actually does not matter, because R-trees can be constructed fast bottom-up. Construction time is often much less than join time. Even if preconstructed indexes exist, approaches based on sorting are often better
• No good criteria known for algorithm selection

Future Research Directions
• Applications:
  - Many standard data mining methods can be accelerated:
    • Outlier detection
    • Various clustering algorithms (e.g. obstacle clustering)
    • Hough transformation and similar analysis methods
    • ...
  - New data mining methods will become feasible:
    • Subspace clustering & correlation detection
    • Methods may become interactive
    • ...
Future Research Directions
• Algorithms
  - Sufficient research for ε-join and closest pair query
  - Almost no convincing approaches for the k-NN join, although it is an important database primitive for many applications
  - Parallel algorithms
  - Non-vector metric data (e.g. text mining)
  - Approximative join algorithms
    • Similarity search: approximative search is often sufficient
    • Join performance could be considerably improved
  - ...
• Optimization of various critical parameters
  - Dimension
  - Replication
  - Index scan strategies
  - ...

Questions

Comparison with Multiple Queries
(Figure: runtime [sec] over epsilon for SQ-DBSCAN (X-tree), MQ-DBSCAN (Scan), MQ-DBSCAN and J-DBSCAN (X-tree))

Experiments: Page Capacity
(Figure: runtime [sec] over page capacity; panels: color image data, 64-dimensional, and meteorology data, 9-dimensional; curves: Q-DBSCAN (Seq. Scan), Q-DBSCAN (R*-tree), Q-DBSCAN (X-tree), J-DBSCAN (R*-tree), J-DBSCAN (X-tree))

Experiments: Query Region
(Figure: runtime [sec] over epsilon on color image data; left panel: Q-DBSCAN (X-tree) vs. J-DBSCAN (X-tree); right panel: Q-OPTICS (Seq. Scan), Q-OPTICS (X-tree), J-OPTICS (X-tree))

Experiments: Synthetic Data
(Figure: panels 4d-UNIFORM and 8d-UNIFORM)

Future Work
• Base further KDD algorithms on the join
  - E.g. outlier detection
  - Subspace clustering, detection of correlations
  - Interactivity
• New algorithms for the similarity join
  - Exploiting the optimization potential (dimension, ...)
  - Parallelization
  - Approximate join processing
  - "k-nearest-neighbor joins" and "k-best-pair joins"

KDD Algorithms Based on Similarity Queries
• DBSCAN
• OPTICS
• LOF
• Simultaneous nearest neighbor classification
• Spatial trend detection
• Distance based outliers
• Spatial association rules
• ...

Curse of Dimensionality
• Cost model opens optimization potential
  - Optimization of the page capacity (# points) [Böhm, Kriegel: Dynamically Optimizing High-Dimensional Index, EDBT 2000]
  - Optimized index compression [Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization: An Index Compression Technique for High-Dimensional Spaces, ICDE 2000]
  - Optimized dimension assignment [Berchtold, Böhm, Keim, Kriegel, Xu: Optimal Multidimensional Query Processing Using Tree Striping, DaWaK 2000]