High-Dimensional Similarity Search using Data-Sensitive Space Partitioning

Sachin Kulkarni (Illinois Institute of Technology, Chicago) and Ratko Orlandic (University of Illinois at Springfield)
Database and Expert Systems Applications (DEXA) 2006
Work supported by the NSF under grant no. IIS-0312266.

Outline
• Problem Definition
• Existing Solutions
• Our Goal
• Design Principle
• GardenHD Clustering and Γ Partitioning
• System Architecture and Processes
• Results
• Conclusions

Problem Definition
• Consider a database of addresses of clubs.
• Typical queries are:
– Find all the clubs within 35 miles of 10 West 31st Street, Chicago.
– Find the 5 nearest clubs.
[Figure: club locations plotted in a two-dimensional space with axes d1 and d2.]

Problem Definition
• K-Nearest Neighbor (k-NN) Search:
– Given a database with N points and a query point q in some metric space, find the k ≥ 1 points closest to q [1].
• Applications:
– Computational geometry
– Geographic information systems (GIS)
– Multimedia databases
– Data mining
– Etc.

Challenge of k-NN Search
• In high-dimensional feature spaces, indexing structures face the problem of dead space (KDB-trees) or overlaps (R-trees).
• Volume and area grow exponentially with the number of dimensions.
• Finding the k nearest points is costly.
• Traditional access methods perform no better than a sequential scan: the "curse of dimensionality".

Existing Solutions
• Approximation and dimensionality reduction.
• Exact nearest neighbor solutions:
– R-tree
– SS-tree
– SR-tree
– VA-File
– A-tree
– iDistance
• Significant effort in finding the exact nearest neighbors has yielded limited success.

Goal
• Our goal:
– Scalability with respect to dimensionality.
– Acceptable pre-processing (data-loading) time.
– Ability to work on incremental loads of data.

Our Solution
• Clustering
• Space partitioning
• Indexing
[Figure: clustered points in the unit square [0,1]^2.]

Design Principle
• "Multi-dimensional data must be grouped on storage in a way that minimizes the extensions of storage clusters along all relevant dimensions and achieves high storage utilization."

What Does It Imply?
• Storage organization must maximize the densities of storage clusters.
• Reduce their internal empty space.
• Improve search performance even before the retrieval process hits persistent storage.
• For best results, employ a genuine clustering algorithm.

Achieving the Principles
• Data space reduction:
– Detecting dense areas (dense cells) in the space with minimum amounts of empty space.
• Data clustering:
– Detecting the largest areas with the above property, called data clusters.

GardenHD Clustering
• Motivated by the stated principle.
• Efficiently and effectively separates disjoint areas with points.
• A hybrid of cell- and density-based clustering that operates in two phases.
• Recursive space partitioning (Γ partitioning).
• Merging of dense cells.

Γ Partitioning
• G = number of generators; D = number of dimensions.
• Number of regions = 1 + (G − 1)·D.
• The space partition is compactly represented by a filter (in memory).
[Figure: Γ partition of the unit square into region0 through region4, with a subspace that is partitioned further.]

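The region count above is the deck's one explicit formula, so a short sketch may help. In the code below, gamma_region_count encodes the formula directly; assign_region is only an illustrative guess at a Γ-style geometry (generators on the main diagonal, each shell between consecutive generators holding D Γ-shaped regions), chosen because it reproduces the 1 + (G − 1)·D count. The actual construction is defined in [2]; everything beyond the formula itself is an assumption here.

```python
from typing import Sequence

def gamma_region_count(G: int, D: int) -> int:
    """Regions produced by Gamma partitioning with G generators
    in D dimensions (formula from the slide): 1 + (G - 1) * D."""
    return 1 + (G - 1) * D

def assign_region(x: Sequence[float], gens: Sequence[float]) -> int:
    """ASSUMED geometry, for illustration only: generators
    gens[0] < ... < gens[-1] lie on the main diagonal of the unit
    space, and points satisfy max(x) < gens[-1].  Region 0 is the
    innermost cube [0, gens[0])^D; each shell between consecutive
    generators holds D Gamma-shaped regions, indexed by the first
    dimension that reaches the shell."""
    D = len(x)
    m = max(x)
    if m < gens[0]:
        return 0                                   # innermost cube
    shell = next(i for i in range(len(gens) - 1)   # shell containing max(x)
                 if gens[i] <= m < gens[i + 1])
    j = next(d for d in range(D) if x[d] >= gens[shell])
    return 1 + shell * D + j                       # ids 1 .. (G-1)*D

# In 2-D, three generators yield 1 + (3 - 1) * 2 = 5 regions,
# matching region0..region4 in the slide's figure.
print(gamma_region_count(3, 2))                       # -> 5
print(assign_region((0.1, 0.1), (0.25, 0.5, 1.0)))    # -> 0
print(assign_region((0.1, 0.6), (0.25, 0.5, 1.0)))    # -> 4
```
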
Data-Sensitive Γ Partition
• DSGP: Data-Sensitive Gamma Partition.
[Figure: DSGP splits the space into regions 1 through 4, the KDB-trees that index them, and the effective boundaries of the data.]

System Architecture
[Figure: architecture diagram connecting Data Clustering, "Data-Sensitive" Space Partitioning, Data Loading (initial and incremental), and Data Retrieval (Region Search and Similarity Search).]

Basic Processes
• Each region in space is represented by a separate KDB-tree.
– KDB-trees perform implicit slicing.
• Initial and incremental loading of data.
– Dynamic assignment of multi-dimensional data to index pages.
• Retrieval:
– Region and k-nearest-neighbor search.
– Several stages of refinement.

Similarity Search: GammaNN
• Nearest neighbor search using GammaNN.
[Figure: a query point with its query hypersphere, the region representatives, and the clipped portions of the regions that must be queried.]

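The figure hints at how the search can prune work: each region has a representative, and only regions that the query hypersphere still clips need to be visited. The sketch below is my reading of that idea under stated assumptions, not the published GammaNN algorithm: regions are summarized here by axis-aligned bounding boxes, the standard MINDIST bound orders and prunes them, and a linear scan stands in for each region's KDB-tree probe.

```python
import heapq
import math
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
Box = Tuple[Point, Point]          # (lower corner, upper corner)

def mindist(q: Point, box: Box) -> float:
    """Lower bound on the distance from q to any point in the box."""
    lo, hi = box
    s = 0.0
    for qd, l, h in zip(q, lo, hi):
        d = max(l - qd, 0.0, qd - h)
        s += d * d
    return math.sqrt(s)

def knn_over_regions(q: Point, regions: Dict[int, List[Point]],
                     boxes: Dict[int, Box], k: int) -> List[Point]:
    """Visit regions in increasing MINDIST order; stop as soon as the
    next region's lower bound exceeds the current k-th neighbor
    distance, i.e., the query hypersphere no longer clips it."""
    best: List[Tuple[float, Point]] = []   # max-heap via negated distances
    for r in sorted(regions, key=lambda r: mindist(q, boxes[r])):
        if len(best) == k and mindist(q, boxes[r]) > -best[0][0]:
            break                          # all remaining regions pruned
        for p in regions[r]:               # stand-in for a KDB-tree probe
            d = math.dist(q, p)
            if len(best) < k:
                heapq.heappush(best, (-d, p))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, p))
    return [p for _, p in sorted(best, key=lambda t: -t[0])]

# Hypothetical two-region layout in [0,1]^2:
regions = {1: [(0.10, 0.20), (0.15, 0.10)], 2: [(0.80, 0.90)]}
boxes = {1: ((0.0, 0.0), (0.3, 0.3)), 2: ((0.7, 0.7), (1.0, 1.0))}
print(knn_over_regions((0.12, 0.18), regions, boxes, k=2))
```

Because the regions of a Γ partition are disjoint, the region containing the query is examined first (its lower bound is zero), and distant regions are typically eliminated without touching their indices.
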
Region Search
[Figure: a region query evaluated against regions 1 through 4.]

Experimental Setup
• PC with a 3.6 GHz CPU, 3 GB RAM, and a 280 GB disk.
• Page size: 8 KB.
• Normalized D-dimensional space [0,1]^D.
• The GammaNN implementations with and without explicit clustering are referred to here as the "data-aware" and "data-blind" algorithms, respectively.
• Comparison with sequential scan and the VA-File.

Datasets
• Synthetic data:
– Up to 100 dimensions, 100,000 points.
– Distributed across 11 clusters: one in the center and 10 in random corners of the space.
• Real data:
– 54 dimensions, 580,900 points, forest cover type ("covtype").
– Distributed across 11 different classes.
– From the UCI Machine Learning Repository.

Metrics
• Pre-processing time:
– Time of space partitioning, I/O, and data loading (i.e., construction of the indices plus insertion of the data).
– For the VA-File, only the time to generate the vector approximation file.
• Performance:
– Average page accesses for k-NN queries.
– Time to process k-NN queries.

Experimental Results
[Figure: pre-processing time in seconds for the three algorithms (data-aware, data-blind, VA-File) on the covtype data (54 dimensions, 580,900 points).]

Performance: Synthetic Data
[Figures: cumulative time and average page accesses for 100 queries (10-NN, synthetic data) as the number of dimensions grows from 10 to 100, comparing sequential scan, VA-File, data-blind, and data-aware.]

Performance: Real Data
[Figures: cumulative time and average page accesses for 100 queries (10-NN, real data), comparing sequential scan, data-blind, VA-File, and data-aware.]

Progress with k in k-NN
[Figures: time and average page accesses as k grows from 1 to 100 nearest neighbors (real data), for sequential scan, VA-File, data-blind, and data-aware.]

Incremental Load of Data
[Figures: cumulative time and average page accesses versus number of points (200k to 500k), comparing the data-aware incremental load against the data-aware full load.]

Conclusions
• Comparison of the data-sensitive and data-blind approaches clearly highlights the importance of clustering data on storage for efficient similarity search.
• Our approach can support exact similarity search while accessing only a small fraction of the data.
• The algorithm is very efficient at high dimensionalities and performs better than sequential scan and the VA-File technique.
• The performance remains good even after incremental loads of data without re-clustering.

Current and Future Work
• Incorporate R-trees or A-trees in place of KDB-trees.
• Provide a facility for handling data with missing values.

References
1. Fagin, R., Kumar, R., Sivakumar, D.: Efficient similarity search and classification via rank aggregation. Proc. ACM SIGMOD Conf. (2003) 301-312
2. Orlandic, R., Lukaszuk, J.: Efficient high-dimensional indexing by superimposing space-partitioning schemes. Proc. 8th International Database Engineering & Applications Symposium (IDEAS'04) (2004) 257-264
3. Orlandic, R., Lai, Y., Yee, W.G.: Clustering high-dimensional data using an efficient and effective data space reduction. Proc. ACM Conference on Information and Knowledge Management (CIKM'05) (2005) 201-208
4. Jagadish, H.V., Ooi, B.C., Tan, K.L., Yu, C., Zhang, R.: iDistance: an adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems 30(2) (2005) 364-395
5. Weber, R., Schek, H.J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. Proc. 24th VLDB Conf. (1998) 194-205
6. Sakurai, Y., Yoshikawa, M., Uemura, S., Kojima, H.: The A-tree: an index structure for high-dimensional spaces using relative approximation. Proc. 26th VLDB Conf. (2000) 516-526

Questions?
[email protected]
http://cs.iit.edu/~egalite