NII International Internship Project:
Classification Under the Relevant Set Correlation Model
Supervisor: Michael HOULE, Visiting Professor
One of the best-known classification methods in machine learning is k-nearest-neighbor (k-NN) classification, a voting strategy in which each object is
assigned to the class most common among its k closest neighbors within a training set
of examples. Despite its simplicity and effectiveness, the k-NN method has had the
reputation of being impractically slow for large data sets, due to the perceived cost of
generating neighbor sets. However, in recent years very efficient data structures have
been proposed for approximate similarity search [3], allowing useful approximations
of the k-NN lists to be generated even for large, high-dimensional training sets.
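As a point of reference, the voting strategy described above can be sketched in a few lines. This is a minimal illustration only: it assumes Euclidean distance and an exact brute-force search rather than an approximate search structure, and the function name knn_classify and the toy data are hypothetical.

```python
from collections import Counter
import math

def knn_classify(query, train_points, train_labels, k):
    """Return the most common label among the k training points closest to query."""
    # Rank training indices by Euclidean distance to the query (brute force).
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(query, train_points[i]))
    # Each of the k nearest neighbors casts one vote for its own class.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated classes in the plane.
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_labels = ["a", "a", "a", "b", "b", "b"]

prediction = knn_classify((0.15, 0.15), train_points, train_labels, k=3)
```

The full sort costs O(n log n) per query, which is the cost that approximate similarity search structures are meant to avoid.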
Despite its successes, k-NN classification suffers from a number of drawbacks. The
choice of parameter k greatly affects the performance of k-NN. Larger values of k
tend to reduce the influence of noise, at the risk of degrading the boundaries of the
classes produced. Fixing the value of k also favors the prediction of classes having
many instances among the training examples over classes with fewer instances,
particularly when a class has fewer than k training instances.
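The bias toward large classes can be seen in a small experiment. The sketch below uses hypothetical toy data in which a query lies right beside the only two examples of a rare class; the helper knn_vote and the data layout are assumptions made for illustration.

```python
from collections import Counter
import math

def knn_vote(query, points, labels, k):
    """Majority label among the k points closest to query (Euclidean distance)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

# Two "rare" examples sit beside the query; ten "common" examples lie far away.
points = [(0.0, 0.1), (0.1, 0.0)] + [(2.0 + 0.1 * i, 2.0) for i in range(10)]
labels = ["rare"] * 2 + ["common"] * 10
query = (0.0, 0.0)

# Predicted class for several fixed choices of k.
predictions = {k: knn_vote(query, points, labels, k) for k in (1, 2, 3, 5, 7)}
```

Here the rare class can contribute at most two votes, so once k reaches 5 the distant common class outvotes it even though both rare examples are the nearest neighbors of the query.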
The goal of this project is to devise new “adaptive” variants of k-NN in which the
number of neighbors k contributing to the class prediction depends on the distribution
of the data in the vicinity of the object to be classified. In particular, we will consider
the relevant-set correlation (RSC) clustering model [1] as a means of determining the
most appropriate set of voting neighbors. Developed at NII, RSC is a generic model
for clustering that requires no direct knowledge of the nature or representation of the
data, but instead relies solely on the ranking of the data (the “relevant sets”) induced
by the neighborhood relationship. The quality of cluster candidates, the degree of
association between pairs of cluster candidates, and the degree of association between
clusters and data items are all assessed according to the statistical significance of a
form of correlation among pairs of relevant sets and/or candidate cluster sets. Based
on the RSC model, a general-purpose scalable clustering heuristic, GreedyRSC, has
already been developed and demonstrated for very large, high-dimensional datasets,
using a fast approximate similarity search structure (the SASH [3]) as the oracle [1,2].
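To make the idea of correlation among relevant sets concrete, the sketch below assumes one natural form: the Pearson correlation between the 0/1 membership vectors of two sets. The exact measures and significance tests are those defined in [1] and may differ in detail; the function names, the brute-force neighbor ranking, and the toy data here are assumptions for illustration only.

```python
import math

def set_correlation(a, b, n):
    """Pearson correlation between the 0/1 membership vectors of sets a and b,
    taken over a universe of n items."""
    a, b = set(a), set(b)
    if not (0 < len(a) < n and 0 < len(b) < n):
        raise ValueError("each set must be a non-empty proper subset of the universe")
    num = n * len(a & b) - len(a) * len(b)
    den = math.sqrt(len(a) * (n - len(a)) * len(b) * (n - len(b)))
    return num / den

def relevant_set(item, points, size):
    """Indices of the `size` points ranked nearest to points[item] (itself included)."""
    order = sorted(range(len(points)),
                   key=lambda j: math.dist(points[item], points[j]))
    return set(order[:size])

def self_correlation(candidate, points):
    """Mean correlation between a candidate cluster and the equally sized
    relevant sets of its own members (an assumed quality score)."""
    candidate = set(candidate)
    n, size = len(points), len(candidate)
    return sum(set_correlation(candidate, relevant_set(v, points, size), n)
               for v in candidate) / size

# Hypothetical toy data: two tight groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
coherent = self_correlation({0, 1, 2}, points)  # a natural group
mixed = self_correlation({0, 1, 3}, points)     # straddles both groups
```

Only the neighbor rankings enter the score: the distances are used to order the points and are never compared across sets, in keeping with the rank-based character of the model.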
The specific goals of this project are:
• To adapt the RSC cluster candidate quality measures to decide, for any object, the
size k of the most consistent neighborhood from within the training set. Several
variants should be proposed, some possibly making use of distance information
as well as neighbor rankings.
• To implement, evaluate and benchmark the new classification methods against
other prominent techniques, such as basic k-NN and SVM classification. The
experimentation will target (but not necessarily be limited to) large document and
image data sets.
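One conceivable shape for such an adaptive variant is sketched below. It is not the method this project will produce: the consistency score (mean Pearson correlation between the query's k-neighborhood and the k-neighborhoods of its members), the candidate range k_min to k_max, the preference for larger k on ties, and all function names are hypothetical choices made only to illustrate the idea.

```python
from collections import Counter
import math

def nearest(point, train_points, size):
    """Indices of the `size` training points closest to `point`."""
    order = sorted(range(len(train_points)),
                   key=lambda j: math.dist(point, train_points[j]))
    return set(order[:size])

def set_correlation(a, b, n):
    """Pearson correlation of the 0/1 membership vectors of a and b (n items in all)."""
    num = n * len(a & b) - len(a) * len(b)
    den = math.sqrt(len(a) * (n - len(a)) * len(b) * (n - len(b)))
    return num / den

def consistency(neighborhood, train_points):
    """Hypothetical score: mean correlation between a neighborhood and the
    equally sized neighbor sets of its members."""
    n, k = len(train_points), len(neighborhood)
    return sum(set_correlation(neighborhood,
                               nearest(train_points[i], train_points, k), n)
               for i in neighborhood) / k

def adaptive_knn(query, train_points, train_labels, k_min=2, k_max=7):
    """Pick the most consistent k in [k_min, k_max], then vote among those neighbors.
    Requires k_max < len(train_points)."""
    best_k = max(range(k_min, k_max + 1),
                 key=lambda k: (consistency(nearest(query, train_points, k),
                                            train_points), k))
    votes = Counter(train_labels[i] for i in nearest(query, train_points, best_k))
    return best_k, votes.most_common(1)[0][0]

# Hypothetical toy data: three "rare" examples near the query, ten "common" ones far away.
train_points = [(0.0, 0.1), (0.12, 0.0), (0.2, 0.0)] + \
               [(2.0 + 0.1 * i, 2.0) for i in range(10)]
train_labels = ["rare"] * 3 + ["common"] * 10

chosen_k, label = adaptive_knn((0.0, 0.0), train_points, train_labels)
```

On this data the neighborhood of size 3 is the most consistent, so the vote is restricted to the three rare examples, whereas a fixed k of 7 would let the four nearest common examples decide the outcome. A variant using distance information as well as rankings would need a different score.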
The ideal duration of this project is 6 months, although visits as short as 4 months
will still be considered. Although it is possible to reduce the length of the internship
after being accepted, it may be difficult to extend the duration beyond that which is
stated in the candidate’s application. Therefore, candidates are strongly recommended
to state in their application the longest possible duration for their intended stay at NII.
[1] M. E. Houle, "The relevant-set correlation model for data clustering", in Proc. 8th
SIAM International Conference on Data Mining (SDM 2008), pp. 775-786, Atlanta,
GA, USA, 2008.
[2] M. E. Houle, "Navigating massive data sets via local clustering", in Proc. 9th ACM
SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2003), pp.
547-552, Washington DC, USA, 2003.
[3] M. E. Houle and J. Sakuma, "Fast approximate similarity search in extremely
high-dimensional data sets", in Proc. 21st IEEE International Conference on Data
Engineering (ICDE 2005), pp. 619-630, Tokyo, Japan, 2005.