NII International Internship Project:
Visualizing Relevant-Set Correlation Clustering Results
(VisRSC)
Supervisor: Michael HOULE, Visiting Professor
1. Background
Clustering is a powerful technique often applied in the analysis of large high-dimensional data sets. For such data types as text documents, protein sequences, and
images, an individual data item can often contribute in a natural way to the formation
of several well-associated groups. Despite the popularity of partition-based or
hierarchical (agglomerative) clustering methods, such data types are often better
analyzed under a model that permits cluster overlap. Traditional “hard” clustering
models account only for the relationships between groups and member items, and (in
the case of hierarchical clustering) inclusion relationships between groups themselves.
Under “soft” clustering models that recognize cluster overlap, the more general
relationships among groups may provide useful insights into the underlying structure
of the data.
The aim of this project is to develop a means for visualizing the results of a “soft”
clustering of large, high-dimensional datasets, as appropriate for display on
conventional two-dimensional screens. The visualization must account for the contents of
individual clusters in the form of attribute labeling, as well as the interrelationships
among clusters in the form of membership overlap.
2. Goals and Methods
The purpose of the project is to develop a visualization tool for the relevant-set
correlation (RSC) clustering model [1] and its greedy relevant-set correlation
(GreedyRSC) heuristic [2]. Developed at NII, RSC is a generic model for clustering
that requires no direct knowledge of the nature or representation of the data. In lieu of
such knowledge, the model relies solely on the existence of an oracle for queries-by-example that accepts a reference to a data item and returns a ranked set of items
relevant to the query. In principle, the role of the oracle could be played by any
similarity search structure, or even a search engine whose internal ranking function
and relevancy scores are secret. The quality of cluster candidates, the degree of
association between pairs of cluster candidates, and the degree of association between
clusters and data items are all assessed according to the statistical significance of a
form of correlation among pairs of relevant sets and/or candidate cluster sets.
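As a minimal illustration of the oracle-based setup described above, the following Python sketch builds a toy query-by-example oracle and scores the agreement between two relevant sets. The names Oracle, make_toy_oracle, and set_correlation are hypothetical, and the scoring function (the Pearson correlation between the 0/1 membership vectors of two sets) is an assumed stand-in; the precise correlation measure and significance test used by RSC are defined in [1] and are not reproduced here.

```python
import math
from typing import Callable, Dict, Hashable, List, Set

# Hypothetical oracle type: takes an item reference and a result size k,
# and returns a ranked list of relevant items (most relevant first).
Oracle = Callable[[Hashable, int], List[Hashable]]


def make_toy_oracle(points: Dict[Hashable, float]) -> Oracle:
    """Toy oracle over 1-D points; stands in for any similarity search structure."""
    def oracle(query: Hashable, k: int) -> List[Hashable]:
        ranked = sorted(points, key=lambda item: (abs(points[item] - points[query]), item))
        return ranked[:k]
    return oracle


def set_correlation(a: Set[Hashable], b: Set[Hashable], n: int) -> float:
    """Pearson correlation between the 0/1 membership vectors of a and b
    over a universe of n items (an assumed form, not taken from the source)."""
    if not 0 < len(a) < n or not 0 < len(b) < n:
        raise ValueError("sets must be non-empty proper subsets of the universe")
    numerator = n * len(a & b) - len(a) * len(b)
    denominator = math.sqrt(len(a) * len(b) * (n - len(a)) * (n - len(b)))
    return numerator / denominator


# Illustrative data: two well-separated groups of three points each.
points = {"a": 0.0, "b": 0.1, "c": 0.2, "d": 5.0, "e": 5.1, "f": 5.2}
oracle = make_toy_oracle(points)
rel_a = set(oracle("a", 3))
rel_b = set(oracle("b", 3))
rel_d = set(oracle("d", 3))
same_group = set_correlation(rel_a, rel_b, len(points))   # identical relevant sets
other_group = set_correlation(rel_a, rel_d, len(points))  # complementary relevant sets
```

Note that the clustering side of the sketch only ever sees the ranked lists returned by the oracle, never the point coordinates themselves, which mirrors the "no direct knowledge of the data" property stated above.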
Based on the RSC model, a general-purpose scalable clustering heuristic, GreedyRSC,
has already been developed and demonstrated for very large, high-dimensional
datasets, using a fast approximate similarity search structure (the SASH [4]) as the
oracle [2,3]. The features of GreedyRSC include:
• The ability to scale to large data sets, both in terms of the number of items and
the size of the attribute sets.
• Genericity, in its ability to deal with different types of attributes (categorical,
ordinal, spatial).
• Automatic determination of an appropriate number of clusters, with the user
specifying as input parameters only the minimum desired cluster size and the
maximum allowable correlation (proportion of overlap) between pairs of clusters.
• Robustness with respect to noisy data.
• The ability to identify clusters of any size (as small as three items).
• The ability to produce a graph of the interrelationships among clusters with
significant degrees of correlation (overlap).
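The roles of the two user parameters and of the cluster graph can be sketched in Python as follows. This is not the GreedyRSC heuristic itself (see [2] for that); it is a simplified greedy filter under stated assumptions: candidates are ranked by size as a stand-in for a quality score, and overlap is measured as the shared-member count divided by the geometric mean of the two set sizes. The function names and the min_edge parameter are hypothetical.

```python
import math
from typing import FrozenSet, List, Sequence, Tuple


def overlap(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Assumed overlap measure: |a & b| / sqrt(|a| * |b|), a value in [0, 1]."""
    return len(a & b) / math.sqrt(len(a) * len(b))


def select_clusters(candidates: Sequence[FrozenSet[int]],
                    min_size: int,
                    max_overlap: float) -> List[FrozenSet[int]]:
    """Greedily keep candidates, largest first, skipping any that are too small
    or that overlap an already kept cluster by more than max_overlap."""
    kept: List[FrozenSet[int]] = []
    for cand in sorted(candidates, key=lambda s: (-len(s), sorted(s))):
        if len(cand) < min_size:
            continue
        if all(overlap(cand, other) <= max_overlap for other in kept):
            kept.append(cand)
    return kept


def cluster_graph(clusters: Sequence[FrozenSet[int]],
                  min_edge: float) -> List[Tuple[int, int, float]]:
    """Edges (i, j, weight) between clusters whose overlap is at least min_edge."""
    edges = []
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            weight = overlap(clusters[i], clusters[j])
            if weight >= min_edge:
                edges.append((i, j, weight))
    return edges


# Illustrative candidates: the third is nearly contained in the first,
# and the fourth is below the minimum size.
candidates = [frozenset({1, 2, 3, 4}), frozenset({3, 4, 5, 6}),
              frozenset({1, 2, 3}), frozenset({7, 8})]
kept = select_clusters(candidates, min_size=3, max_overlap=0.6)
edges = cluster_graph(kept, min_edge=0.25)
```

In this example the two four-item sets survive and are joined by a single edge, since they share half of their members; the near-duplicate and the undersized candidate are discarded.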
The clustering results will then be processed using the Geodesic Self-Organizing Map
(GeoSOM) technique to generate both global and local views of the high-dimensional
data sets. GeoSOM is a spherical Self-Organizing Map (SOM) developed at ViSLab –
The University of Sydney to visualize high-dimensional data. Data points that have
similar attributes are placed in close proximity in the visualization space. For datasets
that have additional relationship information between the data points, GeoSOM can
position the data points by considering both the underlying graph structure and
attribute similarity information [5, 6].
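To make the idea of a SOM laid out on a sphere concrete, here is a minimal Python sketch of a single training step in which the neighborhood function depends on great-circle distance between node positions. It is a generic textbook-style SOM update under assumed choices (Gaussian neighborhood, squared Euclidean matching, an octahedral node layout); it does not reproduce GeoSOM's indexed geodesic data structure or its use of graph information, for which see [5, 6]. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def geodesic(u: Sequence[float], v: Sequence[float]) -> float:
    """Great-circle distance in radians between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding error


def best_matching_unit(weights: List[List[float]], x: Sequence[float]) -> int:
    """Index of the node whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - xi) ** 2 for w, xi in zip(weights[i], x)))


def som_step(positions: Sequence[Sequence[float]], weights: List[List[float]],
             x: Sequence[float], rate: float, radius: float) -> int:
    """One update: pull every node's weights toward x, scaled by a Gaussian
    of the node's geodesic distance to the best-matching unit."""
    bmu = best_matching_unit(weights, x)
    for i, pos in enumerate(positions):
        h = math.exp(-geodesic(pos, positions[bmu]) ** 2 / (2 * radius ** 2))
        weights[i] = [w + rate * h * (xi - w) for w, xi in zip(weights[i], x)]
    return bmu


# Six nodes at the vertices of an octahedron inscribed in the unit sphere.
positions = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
weights = [[0.0, 0.0] for _ in positions]
weights[2] = [1.0, 1.0]
winner = som_step(positions, weights, x=[1.0, 1.0], rate=0.5, radius=1.0)
```

After the step, nodes adjacent to the winner on the sphere have moved further toward the input than the antipodal node, which is what places similar data points close together in the visualization space.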
We plan to extend the GeoSOM technique to:
• Determine the positions of clusters within the map, based on relational
correlation (overlap) information.
• Generate a sliding scale of visualizations for large, high-dimensional data. At
each chosen scale, clusters having size within certain minimum and maximum
thresholds relative to the scale are displayed. Cues such as position, area, color,
labeling, and edges will be used to convey the sizes, contents, and
interrelationships among clusters, as well as the relative significance of
association within and between clusters.
• Develop other interaction techniques such as selection, zooming, and panning
to support drill-down navigation through different levels of abstraction.
• Allow views of clustered query results on the dataset.
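The sliding-scale selection described in the list above might be sketched as follows. The proposal does not fix how the thresholds relate to the scale, so this sketch assumes they are simple multiples of it; the function visible_clusters and the factors lo and hi are hypothetical.

```python
from typing import Dict, List


def visible_clusters(sizes: Dict[str, int], scale: float,
                     lo: float = 0.5, hi: float = 2.0) -> List[str]:
    """Names of clusters whose size lies in [lo * scale, hi * scale],
    ordered largest first (ties broken by name)."""
    chosen = [name for name, size in sizes.items()
              if lo * scale <= size <= hi * scale]
    return sorted(chosen, key=lambda name: (-sizes[name], name))


# Illustrative cluster sizes; moving the scale changes which clusters are shown.
sizes = {"A": 1200, "B": 300, "C": 40, "D": 8}
coarse_view = visible_clusters(sizes, scale=1000)  # only the largest cluster
medium_view = visible_clusters(sizes, scale=500)
fine_view = visible_clusters(sizes, scale=20)
```

Widening the gap between lo and hi lets adjacent scales share clusters, which may help a viewer keep their bearings while zooming.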
3. References
[1] M. E. Houle, "Clustering without data: the relevant-set correlation model", in
Proc. International Workshop on Data-Mining and Statistical Science (DMSS 2006),
pp. 54-61, Sapporo, Japan, 2006.
[2] M. E. Houle, "Clustering without data: the GreedyRSC heuristic", in Proc.
International Workshop on Data-Mining and Statistical Science (DMSS 2006), pp.
62-69, Sapporo, Japan, 2006.
[3] M. E. Houle, "Navigating massive data sets via local clustering", in Proc. 9th ACM
SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2003), pp.
547-552, Washington DC, USA, 2003.
[4] M. E. Houle and J. Sakuma, "Fast approximate similarity search in extremely
high-dimensional data sets", in Proc. 21st IEEE International Conference on Data
Engineering (ICDE 2005), pp. 619-630, Tokyo, Japan, 2005.
[5] Y. Wu and M. Takatsuka, "Spherical self-organizing map using efficient
indexed geodesic data structure", Neural Networks, vol. 19, no. 6-7, pp. 900-910,
2006.
[6] Y. Wu and M. Takatsuka, "Visualizing multivariate network on the
surface of a sphere", in Proc. Asia-Pacific Symposium on Information Visualization
(APVIS 2006), Tokyo, Japan, 2006.