NII International Internship Project:
Multimodal Data Clustering
Supervisor: Michael HOULE, Visiting Professor
With multimedia data such as image and video data, individual objects can be
associated with several modes of information, each represented by collections of
features appropriate to that mode. For example, an image could be represented
simultaneously by a collection of visual features that describe global color and texture
characteristics, another collection of features restricted to points of interest, and a
third collection pertaining to textual annotations of the image. With more than one
mode of information available, it can be difficult to determine the appropriate
weighting to give to each mode when using it for retrieval, data mining, or other
analysis.
The project will investigate the application of the relevant-set correlation (RSC)
clustering model [1] to the clustering of data with multiple feature representations.
Developed at NII, RSC is a generic model for clustering that requires no direct
knowledge of the nature or representation of the data. In lieu of such knowledge, the
model relies solely on the existence of an oracle for queries-by-example that accepts
a reference to a data item and returns a ranked set of items relevant to the query. In
principle, the role of the oracle could be played by any similarity search structure, or
even a search engine whose internal ranking function and relevancy scores are secret.
The quality of cluster candidates, the degree of association between pairs of cluster
candidates, and the degree of association between clusters and data items are all
assessed according to the statistical significance of a form of correlation among pairs
of relevant sets and/or candidate cluster sets. For this project, we will assume that
several relevant-set oracles are available for the data set, each oracle serving one of
the associated modes of information.
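As a rough illustration of the oracle abstraction described above, the following Python sketch builds one toy query-by-example oracle per mode and measures how strongly two relevant sets agree. The names `Oracle`, `brute_force_oracle` and `set_correlation` are hypothetical. The exhaustive distance ranking merely stands in for a real similarity search structure, and the Pearson correlation of set-membership vectors is assumed here as one plausible form of correlation; the precise RSC definitions are those given in [1].

```python
import math
from typing import Callable, Dict, List, Sequence, Set

# Hypothetical oracle type: takes an item index and a result size k,
# and returns the indices of the k most relevant items, best first.
Oracle = Callable[[int, int], List[int]]


def brute_force_oracle(features: Sequence[Sequence[float]]) -> Oracle:
    """Build a toy oracle that ranks items by Euclidean distance in one mode."""
    def query(item: int, k: int) -> List[int]:
        ranked = sorted(
            range(len(features)),
            # Ties are broken by item index so the ranking is deterministic.
            key=lambda j: (math.dist(features[item], features[j]), j),
        )
        return ranked[:k]
    return query


def set_correlation(a: Set[int], b: Set[int], n: int) -> float:
    """Pearson correlation between the 0/1 membership vectors of a and b
    over a universe of n items (an assumed stand-in for the RSC measure)."""
    denom = math.sqrt(len(a) * (n - len(a)) * len(b) * (n - len(b)))
    if denom == 0:
        return 0.0  # undefined when a set is empty or covers every item
    return (n * len(a & b) - len(a) * len(b)) / denom


# Two made-up modes for six items: a "color" mode and a "text" mode.
color = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
text = [[0.0], [0.1], [9.0], [9.1], [0.2], [9.2]]
oracles: Dict[str, Oracle] = {
    "color": brute_force_oracle(color),
    "text": brute_force_oracle(text),
}

# Relevant sets of item 0 under each mode, and how strongly they agree.
relevant = {mode: set(oracle(0, 3)) for mode, oracle in oracles.items()}
agreement = set_correlation(relevant["color"], relevant["text"], n=len(color))
```

In this toy data the two modes agree on items 0 and 1 but disagree on the third neighbor, so the correlation between the two relevant sets is 1/3. Note that the clustering side only ever sees the ranked lists, never the feature vectors themselves.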
Based on the RSC model, a general-purpose scalable clustering heuristic, GreedyRSC,
has already been developed and demonstrated for very large, high-dimensional
datasets, using a fast approximate similarity search structure (the SASH [2]) as the
oracle [1]. The features of GreedyRSC include:
• The ability to scale to large data sets, both in terms of the number of items and
  the size of the attribute sets.
• Genericity, in its ability to deal with different types of attributes (categorical,
  ordinal, spatial).
• Automatic determination of an appropriate number of clusters, with the user
  specifying as input parameters only the minimum desired cluster size and the
  maximum allowable correlation (proportion of overlap) between pairs of clusters.
• Robustness with respect to noisy data.
• The ability to identify clusters of any size (as small as three items).
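To make the role of the two user parameters in the list above concrete, here is a minimal Python sketch of a greedy filter over candidate clusters. It is not the GreedyRSC heuristic itself: the function names, the assumption that candidates arrive sorted from best to worst, and the overlap measure |A ∩ B| / min(|A|, |B|) are all hypothetical choices made for illustration.

```python
from typing import List, Set


def overlap(a: Set[int], b: Set[int]) -> float:
    """Proportion of overlap, taken here as |a & b| / min(|a|, |b|).
    This particular normalization is an assumption of the sketch."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def select_clusters(
    candidates: List[Set[int]], min_size: int, max_overlap: float
) -> List[Set[int]]:
    """Greedily keep candidates that are large enough and that do not
    overlap too heavily with any cluster already kept.
    Candidates are assumed to be sorted from best to worst."""
    chosen: List[Set[int]] = []
    for cand in candidates:
        if len(cand) < min_size:
            continue  # below the minimum desired cluster size
        if all(overlap(cand, kept) <= max_overlap for kept in chosen):
            chosen.append(cand)
    return chosen


# Made-up candidate clusters over ten items, best first.
candidates = [{0, 1, 2, 3}, {2, 3, 4}, {5, 6, 7}, {8, 9}]
chosen = select_clusters(candidates, min_size=3, max_overlap=0.5)
```

With these made-up candidates, the second set is rejected for overlapping too heavily with the first and the last for being too small, leaving two clusters without the number of clusters having been specified in advance.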
The specific goals of this project are:
• To extend the RSC clustering model to account for multiple relevant-set oracles.
• To adapt the GreedyRSC heuristic for the efficient clustering of data under the
  multi-oracular extension of RSC.
• To make the clustering tool freely available under the GNU public licence.
The ideal duration of this project is 6 months, although visits as short as 5 months
will still be considered. Although it is possible to reduce the length of the internship
after being accepted, it may be difficult to extend the duration beyond that which is
stated in the candidate’s application. Therefore, candidates are strongly recommended
to state in their application only the longest possible duration for their intended stay at
NII.
References
[1] M. E. Houle, "The relevant-set correlation model for data clustering", in Proc. 8th
SIAM International Conference on Data Mining (SDM 2008), pp. 775-786, Atlanta,
GA, USA, 2008.
[2] M. E. Houle and J. Sakuma, "Fast approximate similarity search in extremely
high-dimensional data sets", in Proc. 21st IEEE International Conference on Data
Engineering (ICDE 2005), pp. 619-630, Tokyo, Japan, 2005.