Cache-based Query Result Estimation

NII International Internship Project: Cache-based Query Result Estimation

Supervisor: Michael HOULE, Visiting Professor

In several novel applications (such as search engines, recommender systems and multimedia databases), the result of a query is a ranked list obtained by applying a similarity measure to features of database objects. Generating ranked lists is typically an expensive operation that often results in access latency. Caching of frequently accessed data has been shown to have many useful applications for reducing stress on limited resources and improving response time. However, traditional caching techniques defined for exact match queries cannot be applied to ranked list queries. In this paper, we propose an "active caching" technique for ranked list queries that not only returns cached results, but also actively processes queries whose results are not present in the cache, by aggregating those ranked list results stored in the cache for related queries.

This project investigates the application of concepts from the relevant set correlation (RSC) clustering model [1] to the problem of cache-based query result estimation. Developed at NII, RSC is a generic model for clustering that requires no direct knowledge of the nature or representation of the data. In lieu of such knowledge, the model relies solely on the existence of an oracle for queries-by-example that accepts a reference to a data item and returns a ranked set of items relevant to the query. In principle, the role of the oracle could be played by any similarity search structure, or even a search engine whose internal ranking function and relevancy scores are secret. The quality of cluster candidates, the degree of association between pairs of cluster candidates, and the degree of association between clusters and data items are all assessed according to the statistical significance of a form of correlation among pairs of relevant sets and/or candidate cluster sets.
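The active caching idea described above (answering an uncached query by aggregating the cached ranked lists of related queries) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's method: the function name `estimate_ranked_list`, the dictionary-based cache layout, the rule that a cached query counts as "related" when the new query appears in its ranked list, and the reciprocal-rank weighting are all hypothetical choices made for the example.

```python
from collections import defaultdict


def estimate_ranked_list(query, cache, k):
    """Return (ranked_items, from_cache) for a query-by-example.

    cache maps a query item id to its stored ranked list of item ids.
    All scoring rules below are illustrative assumptions.
    """
    # Cache hit: behave like a traditional cache and return the stored list.
    if query in cache:
        return cache[query][:k], True

    # Cache miss: aggregate the lists of "related" cached queries, here
    # assumed to be those whose ranked list contains the new query.
    scores = defaultdict(float)
    for cached_query, ranked in cache.items():
        if query not in ranked:
            continue
        # The higher the new query ranks in this list, the more the list counts.
        weight = 1.0 / (1 + ranked.index(query))
        # The cached query itself is a candidate neighbour of the new query.
        scores[cached_query] += weight
        # Every other item in the list votes with a reciprocal-rank score.
        for rank, item in enumerate(ranked):
            if item != query:
                scores[item] += weight / (1 + rank)

    # Highest score first; ties are broken by item id for determinism.
    ordered = sorted(scores, key=lambda item: (-scores[item], item))
    return ordered[:k], False


if __name__ == "__main__":
    cache = {"a": ["b", "q", "c"], "d": ["q", "b", "e"]}
    print(estimate_ranked_list("a", cache, 2))  # served directly from the cache
    print(estimate_ranked_list("q", cache, 3))  # estimated from related lists
```

An estimate produced this way is only as good as the coverage of the cache: a query that appears in no cached list yields an empty result, and in that case a real system would presumably fall back to the underlying search structure.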
A scalable, efficient clustering heuristic, GreedyRSC, has been developed based on RSC [1], using the SASH approximate similarity search structure for fast generation of relevant sets [2].

The ideal duration of this project is 6 months, although visits of as short as 4 months will still be considered. Although it is possible to reduce the length of the internship after being accepted, it may be difficult to extend the duration beyond that which is stated in the candidate’s application. Therefore, candidates are strongly recommended to state in their application only the longest possible duration for their intended stay at NII.

3. References

[1] M. E. Houle, "The relevant-set correlation model for data clustering", in Proc. 8th SIAM International Conference on Data Mining (SDM 2008), pp. 775-786, Atlanta, GA, USA, 2008.

[2] M. E. Houle and J. Sakuma, "Fast approximate similarity search in extremely high-dimensional data sets", in Proc. 21st IEEE International Conference on Data Engineering (ICDE 2005), pp. 619-630, Tokyo, Japan, 2005.
