NII International Internship Project: Effective Oracles for Fast Approximate Similarity Search
Supervisor: Michael HOULE, Visiting Professor

The effectiveness of indices for search and retrieval has traditionally been evaluated using such measures as precision, recall, and the F-measure, averaged over large numbers of queries on the data set. When the query is based at an object of the data set (query-by-example), the measured performance often depends on how closely related the object is to those most similar to it. If the object is a member of a large, well-formed data cluster, the performance of the query tends to be better than if the object is a noise element or outlier.

For some applications of similarity search, it may be more important to achieve better performance when the query object belongs to a small cluster; an important example is nearest-neighbor clustering for data mining applications, where one seeks to discover small but important "nuggets" (clusters). Failure to generate accurate neighbor lists based at nugget members can inhibit the formation of small clusters while promoting the formation of only large (uninteresting) clusters.

This project will investigate the application of the relevant-set correlation (RSC) clustering model [1,2,5] to the evaluation of the effectiveness of indices for similarity search within small clusters. Developed at NII, RSC is a generic model for clustering that requires no direct knowledge of the nature or representation of the data. In lieu of such knowledge, the model relies solely on the existence of an oracle for queries-by-example that accepts a reference to a data item and returns a ranked set of items relevant to the query. In principle, the role of the oracle could be played by any similarity search structure, or even a search engine whose internal ranking function and relevancy scores are secret.
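The oracle abstraction and the per-query measures described above can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen only for illustration: the `Oracle` type, the brute-force Euclidean oracle (a stand-in for an index such as the SASH, not its implementation), the toy points, and the use of class labels as ground truth for relevance.

```python
import math
from typing import Callable, Dict, List, Sequence, Set, Tuple

# Assumption: an oracle takes (item id, k) and returns a ranked list of item ids.
Oracle = Callable[[int, int], List[int]]


def make_brute_force_oracle(points: Sequence[Tuple[float, ...]]) -> Oracle:
    """Build an exact query-by-example oracle using Euclidean distance.

    This is only a stand-in: any search structure with the same
    call signature could play the role of the oracle.
    """
    def oracle(query_id: int, k: int) -> List[int]:
        q = points[query_id]
        others = [i for i in range(len(points)) if i != query_id]
        # Rank by distance; ties are broken by item id for determinism.
        others.sort(key=lambda i: (math.dist(q, points[i]), i))
        return others[:k]
    return oracle


def precision_recall_f1(returned: Sequence[int],
                        relevant: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F-measure of one query result."""
    hits = len(set(returned) & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def mean_f1_by_class(oracle: Oracle, labels: Sequence[int], k: int) -> Dict[int, float]:
    """Average the F-measure separately over the queries of each class.

    Assumption: items sharing the query's class label (excluding the
    query itself) are treated as the relevant set.
    """
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for q, lab in enumerate(labels):
        relevant = {i for i, l in enumerate(labels) if l == lab and i != q}
        _, _, f1 = precision_recall_f1(oracle(q, k), relevant)
        sums[lab] = sums.get(lab, 0.0) + f1
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: sums[lab] / counts[lab] for lab in sums}


# Toy data: a five-item cluster (class 0) and a two-item "nugget" (class 1).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10), (10, 11)]
labels = [0, 0, 0, 0, 0, 1, 1]
oracle = make_brute_force_oracle(points)
per_class = mean_f1_by_class(oracle, labels, k=2)
```

Reporting the averages per class rather than over all queries keeps the results for small classes from being swamped by those of large ones, which is the concern raised above. Replacing `make_brute_force_oracle` with an approximate index would leave the evaluation code unchanged, since it only relies on the oracle's call signature.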
The quality of cluster candidates, the degree of association between pairs of cluster candidates, and the degree of association between clusters and data items are all assessed according to the statistical significance of a form of correlation among pairs of relevant sets and/or candidate cluster sets.

Based on the RSC model, a general-purpose scalable clustering heuristic, GreedyRSC, has already been developed and demonstrated for very large, high-dimensional data sets, using a fast approximate similarity search structure (the SASH [6]) as the oracle [3,4]. An RSC-based quality measure for the guidance of unsupervised feature selection, RSCF ("RSC for features"), has also been developed [5]. For a static choice of feature set and index, RSCF can evaluate the effectiveness of a similarity search index for RSC-style clustering, at a computational cost of roughly the same order as that required for an explicit clustering. However, for large data sets and/or large feature candidate sets, the computational cost of repeated RSCF evaluations can be prohibitively high, even for the simplest feature set generation strategies.

The specific goals of this project are:
- To develop faster heuristics based on RSCF for the guidance of unsupervised feature selection, suitable for the detection of nugget-sized groupings of data sets.
- To implement and test the new heuristics by performing feature selection tasks on large classified data sets. Suitable document data sets are available; other sets may be included in the evaluation as they become available.
- To conduct an experimental evaluation of feature sets generated by the new heuristics. The experiments should contrast search performance (both exact, and approximate using the SASH and other indices) for queries-by-example based at members of data classes of varying sizes.

3. References
[1] M. E. Houle, "The relevant-set correlation model for data clustering", in Proc. 8th SIAM International Conference on Data Mining (SDM 2008), Atlanta, USA, April 2008, to appear.
[2] M. E. Houle, "Clustering without data: the relevant-set correlation model", in Proc. International Workshop on Data-Mining and Statistical Science (DMSS 2006), pp. 54-61, Sapporo, Japan, 2006.
[3] M. E. Houle, "Clustering without data: the GreedyRSC heuristic", in Proc. International Workshop on Data-Mining and Statistical Science (DMSS 2006), pp. 62-69, Sapporo, Japan, 2006.
[4] M. E. Houle, "Navigating massive data sets via local clustering", in Proc. 9th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2003), pp. 547-552, Washington DC, USA, 2003.
[5] M. E. Houle and N. Grira, "A correlation-based model for unsupervised feature selection", in Proc. 16th ACM Conference on Information and Knowledge Management (CIKM 2007), pp. 897-900, Lisboa, Portugal, 2007.
[6] M. E. Houle and J. Sakuma, "Fast approximate similarity search in extremely high-dimensional data sets", in Proc. 21st IEEE International Conference on Data Engineering (ICDE 2005), pp. 619-630, Tokyo, Japan, 2005.