NII International Internship Project:
Effective Oracles for Fast Approximate Similarity Search
Supervisor: Michael HOULE, Visiting Professor
The effectiveness of indices for search and retrieval has traditionally been evaluated
using such measures as precision, recall, and the F-measure, averaged over large
numbers of queries of the data set. When the query is based at an object of the data set
(query-by-example), the measured performance often depends on how closely related the
object is to those most similar to it. If the object is a member of a large, well-formed
data cluster, the performance of the query tends to be better than if the object
is a noise element or outlier. For some applications of similarity search, it may be
more important to achieve better performance when the query object belongs to a
small cluster; an important example is nearest-neighbor clustering for data mining
applications, where one seeks to discover small but important “nuggets” (clusters).
Failure to generate accurate neighbor lists based at nugget members can inhibit the
formation of small clusters while promoting the formation of only large
(uninteresting) clusters.
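As a minimal sketch of the measures named above, the following computes precision, recall, and the F-measure for a single query-by-example. The function name, the item ids, and the choice of the balanced F-measure (F1) are assumptions made for illustration only; the project description does not prescribe them.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for one query result.

    retrieved: item ids returned by the index for the query
    relevant:  item ids that truly belong with the query object
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical query based at item 0, whose true cluster is {0, 1, 2, 3}.
p, r, f = precision_recall_f(retrieved=[0, 1, 2, 7, 9], relevant=[0, 1, 2, 3])
print(p, r, f)
```

Averaging these values over all queries hides the effect described above: queries based at members of small clusters contribute few terms to the average, so poor performance on them can go unnoticed.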
This project will investigate the application of the relevant-set correlation (RSC)
clustering model [1,2,5] to the evaluation of the effectiveness of indices for similarity
search within small clusters. Developed at NII, RSC is a generic model for clustering
that requires no direct knowledge of the nature or representation of the data. In lieu of
such knowledge, the model relies solely on the existence of an oracle for queries-by-example that accepts a reference to a data item and returns a ranked set of items
relevant to the query. In principle, the role of the oracle could be played by any
similarity search structure, or even a search engine whose internal ranking function
and relevancy scores are secret. The quality of cluster candidates, the degree of
association between pairs of cluster candidates, and the degree of association between
clusters and data items are all assessed according to the statistical significance of a
form of correlation among pairs of relevant sets and/or candidate cluster sets. Based
on the RSC model, a general-purpose scalable clustering heuristic, GreedyRSC, has
already been developed and demonstrated for very large, high-dimensional datasets,
using a fast approximate similarity search structure (the SASH [6]) as the oracle [3,4].
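The oracle abstraction described above can be sketched as a function from an item reference and a list size to a ranked list of item references. In the sketch below, the brute-force Euclidean oracle is a hypothetical stand-in for any search structure (it is not the SASH), and the correlation is assumed to be the Pearson correlation between the 0/1 membership vectors of two sets; the text above does not state the exact form used, so the cited RSC papers [1,2] should be consulted for the precise definition.

```python
import math
from typing import Callable, List, Sequence

# An oracle maps (item id, list size k) to a ranked list of item ids.
Oracle = Callable[[int, int], List[int]]


def make_brute_force_oracle(points: Sequence[Sequence[float]]) -> Oracle:
    """Exact Euclidean k-NN oracle; a hypothetical stand-in for any index."""
    def oracle(item: int, k: int) -> List[int]:
        q = points[item]
        # Rank all items by distance to the query; ties are broken by item id.
        order = sorted(range(len(points)),
                       key=lambda j: (math.dist(q, points[j]), j))
        return order[:k]
    return oracle


def set_correlation(a: set, b: set, n: int) -> float:
    """Pearson correlation of the membership vectors of a and b over n items."""
    denom = math.sqrt(len(a) * len(b) * (n - len(a)) * (n - len(b)))
    if denom == 0:
        return 0.0  # undefined when a set is empty or covers all n items
    return (n * len(a & b) - len(a) * len(b)) / denom


# Toy data (assumed): two tight groups and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
oracle = make_brute_force_oracle(points)

rel0 = set(oracle(0, 3))  # relevant set based at item 0
rel1 = set(oracle(1, 3))  # relevant set based at item 1
rel3 = set(oracle(3, 3))  # relevant set based at item 3

print(set_correlation(rel0, rel1, len(points)))  # same group
print(set_correlation(rel0, rel3, len(points)))  # different groups
```

Because the clustering logic only ever calls `oracle(item, k)`, the data representation and the distance function stay hidden behind it, which is the sense in which the model requires no direct knowledge of the data.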
An RSC-based quality measure for the guidance of unsupervised feature selection,
RSCF (“RSC for features”), has also been developed [5]. For a static choice of feature
set and index, RSCF can evaluate the effectiveness of a similarity search index for
RSC-style clustering, with a computational cost of roughly the same order as that
required for an explicit clustering. However, for large data sets and/or large feature
candidate sets, the computational cost of repeated RSCF evaluations can be
prohibitively high even for the simplest feature set generation strategies.
The specific goals of this project are:
• To develop faster heuristics based on RSCF for the guidance of unsupervised
feature selection, suitable for the detection of nugget-sized groupings within data sets.
• To implement and test the new heuristics by performing feature selection tasks
on large classified data sets. Suitable document data sets are available; other sets
may be included in the evaluation as they become available.
• To conduct an experimental evaluation of feature sets generated by the new
heuristics. The experiments should contrast search performance (both exact and
approximate, using the SASH and other indices) for queries-by-example based at
members of data classes of varying sizes.
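One possible shape for the experimental comparison in the last goal is sketched below: for each oracle, the fraction of returned neighbors that share the query's class label is averaged separately for each class size. The function name, the toy labels, and the precomputed neighbor lists are all hypothetical; a real experiment would plug in an exact search and an approximate index such as the SASH as the two oracles.

```python
from collections import Counter, defaultdict


def precision_by_class_size(labels, oracle, k):
    """Average precision at k of queries-by-example, grouped by class size."""
    class_size = Counter(labels)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for item, label in enumerate(labels):
        neighbors = oracle(item, k)
        # Count returned neighbors that share the query's class label.
        same = sum(1 for j in neighbors if labels[j] == label)
        size = class_size[label]
        totals[size] += same / k
        counts[size] += 1
    return {size: totals[size] / counts[size] for size in sorted(totals)}


# Toy setup (assumed): one class of four items and one "nugget" of two items.
labels = ["big", "big", "big", "big", "nugget", "nugget"]

# Hypothetical precomputed neighbor lists (query item excluded).
exact_lists = {0: [1, 2], 1: [0, 2], 2: [3, 1], 3: [2, 1], 4: [5, 0], 5: [4, 0]}
# The approximate lists differ only for item 4, which misses its nugget partner.
approx_lists = dict(exact_lists)
approx_lists[4] = [0, 1]


def exact_oracle(item, k):
    return exact_lists[item][:k]


def approx_oracle(item, k):
    return approx_lists[item][:k]


exact_result = precision_by_class_size(labels, exact_oracle, k=2)
approx_result = precision_by_class_size(labels, approx_oracle, k=2)
print(exact_result)
print(approx_result)
```

In this toy case the two oracles are indistinguishable on the large class and differ only on the nugget, which is the kind of gap that a single average over all queries would largely conceal.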
3. References
[1] M. E. Houle, "The relevant-set correlation model for data clustering", in Proc. 8th
SIAM International Conference on Data Mining (SDM 2008), Atlanta, USA, April
2008, to appear.
[2] M. E. Houle, "Clustering without data: the relevant-set correlation model", in Proc.
International Workshop on Data-Mining and Statistical Science (DMSS 2006), pp.
54-61, Sapporo, Japan, 2006.
[3] M. E. Houle, "Clustering without data: the GreedyRSC heuristic", in Proc.
International Workshop on Data-Mining and Statistical Science (DMSS 2006), pp.
62-69, Sapporo, Japan, 2006.
[4] M. E. Houle, "Navigating massive data sets via local clustering", in Proc. 9th ACM
SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2003), pp.
547-552, Washington DC, USA, 2003.
[5] M. E. Houle and N. Grira, "A correlation-based model for unsupervised feature
selection", in Proc. 16th ACM Conference on Information and Knowledge
Management (CIKM 2007), pp. 897-900, Lisboa, Portugal, 2007.
[6] M. E. Houle and J. Sakuma, "Fast approximate similarity search in extremely
high-dimensional data sets", in Proc. 21st IEEE International Conference on Data
Engineering (ICDE 2005), pp. 619-630, Tokyo, Japan, 2005.