NII International Internship Project: Multimodal Data Clustering

Supervisor: Michael HOULE, Visiting Professor

With multimedia data such as image and video data, individual objects can be associated with several modes of information, each represented by a collection of features appropriate to that mode. For example, an image could be represented simultaneously by a collection of visual features that describe global color and texture characteristics, another collection of features restricted to points of interest, and a third collection pertaining to textual annotations of the image. With more than one mode of information available, it can be difficult to determine the appropriate weighting to give to each mode when using it for retrieval, data mining, or other analysis.

The project will investigate the application of the relevant-set correlation (RSC) clustering model [1] to the clustering of data with multiple feature representations. Developed at NII, RSC is a generic model for clustering that requires no direct knowledge of the nature or representation of the data. In lieu of such knowledge, the model relies solely on the existence of an oracle for queries-by-example that accepts a reference to a data item and returns a ranked set of items relevant to the query. In principle, the role of the oracle could be played by any similarity search structure, or even by a search engine whose internal ranking function and relevancy scores are secret.

The quality of cluster candidates, the degree of association between pairs of cluster candidates, and the degree of association between clusters and data items are all assessed according to the statistical significance of a form of correlation among pairs of relevant sets and/or candidate cluster sets. For this project, we will assume that several relevant-set oracles are available for the data set, each oracle serving one of the associated modes of information.

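
To make the oracle setting concrete, the following minimal Python sketch shows one query-by-example oracle per mode and a correlation score between two relevant sets. Everything named here is an assumption made for illustration: the `Oracle` type, the brute-force `make_brute_force_oracle` helper (standing in for a real similarity search structure), the toy `color` and `text` feature lists, and the choice of the Pearson correlation of set-membership vectors as the "form of correlation". The exact measure and significance test used by RSC are defined in [1] and are not reproduced here.

```python
from math import sqrt
from typing import Callable, Dict, List, Sequence, Set

# Hypothetical oracle type: (query item id, k) -> ranked list of k item ids.
Oracle = Callable[[int, int], List[int]]


def make_brute_force_oracle(vectors: Sequence[Sequence[float]]) -> Oracle:
    """Toy oracle that ranks every item by squared Euclidean distance
    to the query item. A real system would use a search index instead."""
    def oracle(query: int, k: int) -> List[int]:
        q = vectors[query]

        def dist(i: int) -> float:
            return sum((a - b) ** 2 for a, b in zip(vectors[i], q))

        # Ties are broken by item id so the ranking is deterministic.
        ranked = sorted(range(len(vectors)), key=lambda i: (dist(i), i))
        return ranked[:k]
    return oracle


def set_correlation(a: Set[int], b: Set[int], n: int) -> float:
    """Pearson correlation of the 0/1 membership vectors of sets a and b,
    taken over a data set of n items (an illustrative choice of measure)."""
    if not a or not b or len(a) == n or len(b) == n:
        raise ValueError("correlation is undefined for empty or full sets")
    numerator = n * len(a & b) - len(a) * len(b)
    denominator = sqrt(len(a) * len(b) * (n - len(a)) * (n - len(b)))
    return numerator / denominator


# Toy data: six items, each described in two modes (made-up values).
color = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
text = [[0.0], [0.1], [9.0], [9.1], [4.0], [4.1]]

# One oracle per mode of information.
oracles: Dict[str, Oracle] = {
    "color": make_brute_force_oracle(color),
    "text": make_brute_force_oracle(text),
}

# Relevant sets of item 0 under each mode, and how strongly they agree.
n_items = len(color)
relevant_color = set(oracles["color"](0, 3))
relevant_text = set(oracles["text"](0, 3))
agreement = set_correlation(relevant_color, relevant_text, n_items)
print(relevant_color, relevant_text, agreement)
```

In this toy data the two modes agree on items 0 and 1 but disagree on the third neighbour, so the score falls between 0 and 1. How such per-mode evidence should be weighted or combined is exactly the open question the project addresses; the sketch takes no position on it.
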
Based on the RSC model, a general-purpose scalable clustering heuristic, GreedyRSC, has already been developed and demonstrated for very large, high-dimensional data sets, using a fast approximate similarity search structure (the SASH [2]) as the oracle [1]. The features of GreedyRSC include:

- The ability to scale to large data sets, both in terms of the number of items and the size of the attribute sets.
- Genericity, in its ability to deal with different types of attributes (categorical, ordinal, spatial).
- Automatic determination of an appropriate number of clusters, with the user specifying as input parameters only the minimum desired cluster size and the maximum allowable correlation (proportion of overlap) between pairs of clusters.
- Robustness with respect to noisy data.
- The ability to identify clusters of any size (as small as three items).

The specific goals of this project are:

- To extend the RSC clustering model to account for multiple relevant-set oracles.
- To adapt the GreedyRSC heuristic for the efficient clustering of data under the multi-oracular extension of RSC.
- To make the clustering tool freely available under the GNU public licence.

The ideal duration of this project is 6 months, although visits as short as 5 months will still be considered. Although it is possible to reduce the length of the internship after being accepted, it may be difficult to extend the duration beyond that stated in the candidate's application. Candidates are therefore strongly recommended to state in their application the longest possible duration of their intended stay at NII.

3. References

[1] M. E. Houle, "The relevant-set correlation model for data clustering", in Proc. 8th SIAM International Conference on Data Mining (SDM 2008), pp. 775-786, Atlanta, GA, USA, 2008.

[2] M. E. Houle and J. Sakuma, "Fast approximate similarity search in extremely high-dimensional data sets", in Proc. 21st IEEE International Conference on Data Engineering (ICDE 2005), pp. 619-630, Tokyo, Japan, 2005.