NII International Internship Project:
Cache-based Query Result Estimation
Supervisor: Michael HOULE, Visiting Professor
In several novel applications (such as search engines, recommender systems and
multimedia databases), the result of a query is a ranked list obtained by applying a
similarity measure to features of database objects. Generating ranked lists is typically
an expensive operation that often results in access latency. Caching of frequently-accessed data has been shown to have many useful applications for reducing stress on
limited resources and improving response time. However, traditional caching
techniques defined for exact match queries cannot be applied to ranked list queries. In
this paper, we propose an "active caching" technique for ranked list queries that not
only returns cached results, but also actively processes queries whose results are not
present in the cache, by aggregating those ranked list results stored in the cache for
related queries.
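
The idea of answering a cache miss by aggregating cached ranked lists can be illustrated with a small sketch. The text does not specify how related queries are found or how their lists are combined, so everything here is an assumption made for illustration: the class name ActiveRankedCache, the rule that a cached query is "related" when the uncached item appears in its ranked list, and the reciprocal-rank weighting used to merge lists.

```python
from collections import defaultdict


class ActiveRankedCache:
    """Toy cache of ranked lists keyed by query item (hypothetical design)."""

    def __init__(self):
        self._lists = {}  # query item -> ranked list of result items

    def put(self, query, ranked):
        self._lists[query] = list(ranked)

    def get(self, query, k):
        """Return (results, exact); exact is False when results are estimated."""
        if query in self._lists:
            return self._lists[query][:k], True
        return self._estimate(query, k), False

    def _estimate(self, query, k):
        # Assumption: a cached query is "related" to the uncached query item
        # if that item appears somewhere in the cached ranked list.
        scores = defaultdict(float)
        for cached_query, ranked in self._lists.items():
            if query not in ranked:
                continue
            # The higher the uncached item ranks, the more this list counts.
            weight = 1.0 / (1 + ranked.index(query))
            # The cached query itself is a candidate neighbour.
            scores[cached_query] += weight
            for pos, item in enumerate(ranked):
                if item != query:
                    scores[item] += weight / (1 + pos)
        # Highest score first; ties broken by item value for determinism
        # (this requires items that can be compared, e.g. strings or ints).
        best = sorted(scores, key=lambda item: (-scores[item], item))
        return best[:k]


cache = ActiveRankedCache()
cache.put("a", ["b", "c", "d"])
cache.put("e", ["c", "b", "f"])
hit = cache.get("a", 2)        # served directly from the cache
estimate = cache.get("c", 3)   # not cached: aggregated from the lists of "a" and "e"
```

An estimated result is flagged as inexact so that a caller could decide whether to fall back to the expensive full query.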
This project investigates the application of concepts from the relevant set correlation
(RSC) clustering model [1] to the problem of cache-based query result estimation.
Developed at NII, RSC is a generic model for clustering that requires no direct
knowledge of the nature or representation of the data. In lieu of such knowledge, the
model relies solely on the existence of an oracle for queries-by-example that accepts
a reference to a data item and returns a ranked set of items relevant to the query. In
principle, the role of the oracle could be played by any similarity search structure, or
even a search engine whose internal ranking function and relevancy scores are secret.
The quality of cluster candidates, the degree of association between pairs of cluster
candidates, and the degree of association between clusters and data items are all
assessed according to the statistical significance of a form of correlation among pairs
of relevant sets and/or candidate cluster sets. A scalable, efficient clustering heuristic,
GreedyRSC, has been developed based on RSC [1], using the SASH approximate
similarity search structure for fast generation of relevant sets [2].
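
To make the notion of correlation between relevant sets concrete, the following sketch computes the standard Pearson correlation between the 0/1 membership vectors of two sets drawn from a universe of n items. This is only meant to convey the general idea; the exact correlation and significance measures used by RSC are defined in [1], and the function name set_correlation and its parameters are illustrative assumptions.

```python
import math


def set_correlation(a, b, n):
    """Pearson correlation of the 0/1 membership vectors of sets a and b,
    both drawn from a universe of n items (illustrative helper)."""
    a, b = set(a), set(b)
    if not 0 < len(a) < n or not 0 < len(b) < n:
        # The correlation is undefined for empty sets or the full universe.
        raise ValueError("each set must be a non-empty proper subset")
    num = n * len(a & b) - len(a) * len(b)
    den = math.sqrt(len(a) * (n - len(a)) * len(b) * (n - len(b)))
    return num / den


# Relevant sets as an oracle might return them for two query items
# (made-up item identifiers, universe of 10 items).
identical = set_correlation({1, 2, 3}, {1, 2, 3}, 10)
disjoint = set_correlation({1, 2}, {3, 4}, 10)
```

Identical sets give a correlation of 1, while disjoint sets give a negative value, so a larger overlap than expected by chance is what signals association between two relevant sets.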
The ideal duration of this project is 6 months, although visits as short as 4 months
will still be considered. Although it is possible to reduce the length of the internship
after being accepted, it may be difficult to extend the duration beyond that which is
stated in the candidate’s application. Therefore, candidates are strongly recommended
to state in their application only the longest possible duration for their intended stay at
NII.
3. References
[1] M. E. Houle, "The relevant-set correlation model for data clustering", in Proc. 8th
SIAM International Conference on Data Mining (SDM 2008), pp. 775-786, Atlanta,
GA, USA, 2008.
[2] M. E. Houle and J. Sakuma, "Fast approximate similarity search in extremely
high-dimensional data sets", in Proc. 21st IEEE International Conference on Data
Engineering (ICDE 2005), pp. 619-630, Tokyo, Japan, 2005.