* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download FROM R
Survey
Document related concepts
Transcript
1 39 Christian Böhm University for Health Informatics and Technology, Innsbruck Similarity Search and Data Mining: Database Techniques Supporting Next Decade's Applications Keynote at iiWAS 2002 2 39 Similarity Search 3 39 Feature Based Similarity 4 39 Simple Similarity Queries Specify query object and • Find similar objects – range query • Find the k most similar objects – nearest neighbor q. 5 39 Multidimensional Index Structure (R-tree) Directory Data Page: Page: rectangle 1, address1 point 1: x11, x12, x13, ... rectangle 2,xaddress 2 point : x , 2 21 22, x23, ... rectangle 3,xaddress 3 point : x , 3 31 32, x33, ... rectangle4, address4 6 39 Range Query with Depth-First Traversal 7 39 Nearest Neighbor: Priority Algorithm [Hjaltason, Samet: Ranking in Spatial Databases, SSD 1995] 4 page accesses 8 39 Problems of High-Dim. Index Structures „Curse of dimensionality“: • Search performance of index deteriorates in high dim. • Outperformed by sequential scan Solution • Optimize various parameters of index structures Needed: Cost model for queries How many pages are expected to be accessed for • Range queries (with given e) • Nearest neighbor queries (with given k) 9 39 Cost Estimation (Uniformity/Independence) Minkowski sum: Estimation of the access probability of a page [Böhm: A Cost Model for Query Processing in High-Dimensional Data Spaces, TODS 25(2), 2000] Nearest neighbor: Estimate distance by point density 10 39 Cost Estimation Boundary and saturation effects in high dim. space (considered by our model extension) Correlation between attributes (considered by the concept of fractal dimension) Cluster structure has also impact on performance • Currently neglected by our model • Histograms and similar data descriptions difficult in high-dimensional space (number of histo-bins exponential in dimensionality) • Other descriptions of cluster structure (dendrograms) Subject to future work 11 39 Optimization of Index Structures To avoid the possibility to outperform index based query processing by the sequential scan: Optimize various parameters such as • • • • Logical block size of the index pages Indexed dimension I/O schedule optimization (fast index scan) Data quantization Observe the balance! (Master Confucius) 12 39 Page Size Optimization 13 39 Page Size Optimization 14 39 Optimized Dimension Assignment Hi-dim. Index R-tree Problem in hi-dim: Too few splits in each dimension Inverted List B-tree Problem in hi-dim: Too many results in each dimension Matching 15 39 Optimized Dimension Assignment Hi-dim. Index R-tree Inverted List B-tree Compromise: A moderate number of R-trees each indexing a few dimensions Matching OPTIMIZE! 16 39 Schedule Optimization (Fast Index Scan) Range Query: Required Pages are known from the directory 17 39 Schedule Optimization (NN Queries) Current expenses are traded for possible later savings Start at 100% page and extend forward and backward Optimize the cumulated cost balance (CCB): 18 39 Quantization Approximate the points by quantization grid based on quantiles Benefit:fewer bits for representation Cost: Grid cell partially intersected access the original point data How to choose grid resolution ??? [Weber, Schek, Blott: A Quantitative Analysis and Performance Study..., VLDB 1998] 19 39 Independent Quantization (IQ tree) Combines index, scan, and quantization [Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization..., ICDE 2000] Grid resolution optimized by cost model 20 39 Open Research Problems in Optimization Multi-Parameter Optimization: • How can parameters be optimized simultaneously? • Are there conflicts between optimization goals? Example: Uniform data: Quantization Correlated data: Tree Striping 21 39 Open Research Problems in Optimization Consider Insert/Delete/Update: If the data set faces heavy update, the constructed index should look differently compared with more static data sets • Update-bound: Construct index rather simple • Query-bound: Spend more effort to organize data Can be considered as an optimization problem 22 39 Data Mining 23 39 KDD Algorithms Based on Similarity Queries LOF Simultan. Nearest Dist. Neighbor OPTICS Based Classific. Outliers DBSCAN .... .... .... Spatial Trend Detect. Spatial Assoc. Rules 24 39 Join Applikationen Katalogkonversion (Catalogue Matching) • z.B. Astronomie-Kataloge R S 25 39 Clustering Clustering (e.g. DBSCAN) [Ester, Kriegel, Sander, Xu: A Density Based Algorithm for Discovering Clusters, KDD 1996] 26 39 Cache Behavior 27 39 Clustering and Similarity Join DBSCAN uses similarity join as basic operations [Böhm, Braunmüller, Breunig, Kriegel: High Perf. Clustering based on the Sim. Join, CIKM 2000] 28 39 k-Nearest Neighbor Classification Example: k=3 Objects with known class New objects • New objects Known objects 29 39 Distance Range Join (e-Join) • • Most widespread and best evaluated join Often also called the similarity join 30 39 k-Closest Pair Query In SQL notation: SELECT * FROM R, S ORDER BY ||R.obj - S.obj|| STOP AFTER k 31 39 k-Nearest Neighbor Join In SQL notation: SELECT * FROM R, S (limited to k = 1) GROUP BY R.obj ORDER BY ||R.obj - S.obj|| STOP AFTER K (* k *) 32 39 R-tree Spatial Join (RSJ) procedure r_tree_sim_join (R, S, e) if IsDirpg (R) IsDirpg (S) then foreach r R.children do foreach s S.children do if mindist (r,s) e then CacheLoad(r); CacheLoad(s); r_tree_sim_join (r,s,e) ; else (* assume R,S both DataPg *) foreach p R.points do foreach q S.points do if |p - q| e then report (p,q); R e S 33 39 Modeling and Optimization [Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, Wednesday, 1630] Mating probability of index pages: Probability that distance between two pages e Two-fold application of Minkowski sum 34 39 Modeling and Optimization I/O cost: • High const. cost per page • Large capacity optimum CPU cost: • Low const. cost per page • Low capacity optimum CPU-performance like CPU optimized index I/O- performance like I/O optimized index 35 39 Open Problems for Research (Sim. Join) Modeling and Optimization: • • • • Dimension Quantization Page scheduling Caching strategies Nearest Neighbor Join • Applications • Algorithms General • Integration into object-relational DBMS 36 39 New Challenges New Challenges Incertain Features: Application: • Biometric Identification Particularities: Relative probability 37 39 • Features individually associated with incertainty (e.g. as Gaussian distributions) Queries: • Probability of match • Find objects with highes probability of match • Find objects with probability of match >= e Feature a1 38 39 New Challenges Support of e-commerce in all phases • Marketing • Sales and booking • Add-on products Advanced Similarity • • • • Adaptable Multimodal models Relevance-feedback Convex hull customer segmentation advanced similarity search Sales transaction analysis 39 39 New Challenges Stock quota: Technical chart analysis Known: Database techniques for similarity search in time sequences (DFT, etc.) 40 39 New Challenges Professional analyst tools use: • Trading signals generated by indicators (etc. MACD) • Formations indicating trends in charts • Relationships to the market and to derivatives 41 39 Conclusion Database primitives: abstraction from application: Similarity Search Range Queries Nearest Neighbor Queries Clustering Classification Similarity Join Outlier Detection Advantages • General solution, reuse • Separately optimizable