NII International Internship Project: Visualizing Relevant-Set Correlation Clustering Results (VisRSC)

Supervisor: Michael HOULE, Visiting Professor

1. Background

Clustering is a powerful technique often applied in the analysis of large, high-dimensional data sets. For such data types as text documents, protein sequences, and images, an individual data item can often contribute in a natural way to the formation of several well-associated groups. Despite the popularity of partition-based and hierarchical (agglomerative) clustering methods, such data types are often better analyzed under a model that permits cluster overlap.

Traditional "hard" clustering models account only for the relationships between groups and their member items and, in the case of hierarchical clustering, inclusion relationships between the groups themselves. Under "soft" clustering models that recognize cluster overlap, the more general relationships among groups may provide useful insights into the underlying structure of the data.

The aim of this project is to develop a means of visualizing the results of a "soft" clustering of large, high-dimensional data sets in a form suitable for display on conventional two-dimensional screens. The visualization must account for the contents of individual clusters, in the form of attribute labelling, as well as the interrelationships among clusters, in the form of membership overlap.

2. Goals and Methods

The purpose of the project is to develop a visualization tool for the relevant-set correlation (RSC) clustering model [1] and its greedy relevant-set correlation (GreedyRSC) heuristic [2]. Developed at NII, RSC is a generic model for clustering that requires no direct knowledge of the nature or representation of the data. In lieu of such knowledge, the model relies solely on the existence of an oracle for queries-by-example: one that accepts a reference to a data item and returns a ranked set of items relevant to the query.
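The query-by-example oracle described above can be illustrated with a minimal Python sketch. It is not taken from the RSC papers: the names `QueryByExampleOracle`, `BruteForceOracle`, and `relevant_sets` are hypothetical, and an exhaustive Euclidean scan stands in for whatever similarity search structure would actually be used. The point is only that clustering code needs item references and ranked result lists, never the data representation itself.

```python
import math
from typing import Dict, List, Protocol, Sequence, Set


class QueryByExampleOracle(Protocol):
    """Hypothetical interface: item reference in, ranked relevant items out."""

    def query(self, item_id: int, k: int) -> List[int]:
        ...


class BruteForceOracle:
    """Illustrative oracle that ranks items by Euclidean distance.

    Assumption: items are numeric vectors. A real deployment could hide any
    ranking function behind the same query() method.
    """

    def __init__(self, vectors: Sequence[Sequence[float]]) -> None:
        self._vectors = [tuple(v) for v in vectors]

    def query(self, item_id: int, k: int) -> List[int]:
        q = self._vectors[item_id]
        # Sort all items by distance to the query; ties broken by item id.
        ranked = sorted(
            range(len(self._vectors)),
            key=lambda j: (math.dist(q, self._vectors[j]), j),
        )
        return ranked[:k]


def relevant_sets(oracle: QueryByExampleOracle, n_items: int, k: int) -> Dict[int, Set[int]]:
    """Collect the k-item relevant set of every item using only oracle calls."""
    return {i: set(oracle.query(i, k)) for i in range(n_items)}


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
    oracle = BruteForceOracle(points)
    print(relevant_sets(oracle, len(points), k=2))
```

Because `relevant_sets` depends only on the `query` method, the brute-force oracle could be swapped for an approximate index without changing any downstream code.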
In principle, the role of the oracle could be played by any similarity search structure, or even by a search engine whose internal ranking function and relevancy scores are secret. The quality of cluster candidates, the degree of association between pairs of cluster candidates, and the degree of association between clusters and data items are all assessed according to the statistical significance of a form of correlation among pairs of relevant sets and/or candidate cluster sets.

Based on the RSC model, a general-purpose scalable clustering heuristic, GreedyRSC, has already been developed and demonstrated for very large, high-dimensional data sets, using a fast approximate similarity search structure (the SASH [4]) as the oracle [2,3]. The features of GreedyRSC include:

- The ability to scale to large data sets, both in terms of the number of items and the size of the attribute sets.
- Genericity, in its ability to deal with different types of attributes (categorical, ordinal, spatial).
- Automatic determination of an appropriate number of clusters, with the user specifying as input parameters only the minimum desired cluster size and the maximum allowable correlation (proportion of overlap) between pairs of clusters.
- Robustness with respect to noisy data.
- The ability to identify clusters of any size (as small as three items).
- The ability to produce a graph of the interrelationships among clusters with significant degrees of correlation (overlap).

The clustering results will then be processed using the Geodesic Self-Organizing Map (GeoSOM) technique to generate both global and local views of the high-dimensional data sets. GeoSOM is a spherical Self-Organizing Map (SOM) developed at ViSLab, The University of Sydney, to visualize high-dimensional data. Data points that have similar attributes are placed in close proximity in the visualization space.
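The idea of a self-organizing map laid out on a sphere can be sketched in a few lines of Python. This is a generic illustration and not the GeoSOM implementation: it assumes a Fibonacci lattice for the map nodes rather than the indexed geodesic data structure of [5], uses a Gaussian neighborhood over great-circle distance, and picks the learning-rate and radius schedules arbitrarily. All function names are hypothetical.

```python
import math
import random


def fibonacci_sphere(n):
    """Return n roughly evenly spaced unit vectors (assumed node layout)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        theta = golden_angle * i
        points.append((r * math.cos(theta), r * math.sin(theta), z))
    return points


def geodesic(p, q):
    """Great-circle distance (in radians) between two unit vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return math.acos(max(-1.0, min(1.0, dot)))


def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to item x."""
    return min(
        range(len(weights)),
        key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)),
    )


def train_som(data, n_nodes=64, epochs=20, seed=0):
    """Train a spherical SOM; returns (node positions, node weight vectors)."""
    rng = random.Random(seed)
    nodes = fibonacci_sphere(n_nodes)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = 0.5 * (1.0 - frac) + 0.01                    # decaying learning rate
            radius = (math.pi / 2.0) * (1.0 - frac) + 0.1     # shrinking neighborhood
            bmu = best_matching_unit(weights, x)
            for i, node in enumerate(nodes):
                d = geodesic(nodes[bmu], node)
                h = math.exp(-(d * d) / (2.0 * radius * radius))
                w = weights[i]
                for k in range(dim):
                    # Pull nodes near the winner toward the item.
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return nodes, weights


def project(data, nodes, weights):
    """Place each item at the sphere position of its best-matching node."""
    return [nodes[best_matching_unit(weights, x)] for x in data]


if __name__ == "__main__":
    items = [(0.0, 0.0, 0.1), (0.1, 0.0, 0.0), (0.9, 1.0, 1.0), (1.0, 0.9, 1.0)]
    nodes, weights = train_som(items, n_nodes=32, epochs=30)
    for item, pos in zip(items, project(items, nodes, weights)):
        print(item, "->", tuple(round(c, 2) for c in pos))
```

Because neighboring nodes on the sphere are updated together, items with similar attributes tend to be assigned to nearby positions, which is the behavior the text describes. A sphere also has no boundary, so no node sits on an edge with fewer neighbors than the rest.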
For data sets that have additional relationship information between the data points, GeoSOM can position the data points by considering both the underlying graph structure and attribute similarity information [5, 6].

We plan to extend the GeoSOM technique to:

- Determine the positions of clusters within the map, based on relational correlation (overlap) information.
- Generate a sliding scale of visualizations for large, high-dimensional data. At each chosen scale, clusters whose size falls within certain minimum and maximum thresholds relative to the scale are displayed. Cues such as position, area, color, labeling, and edges will be used to convey the sizes, contents, and interrelationships among clusters, as well as the relative significance of association within and between clusters.
- Develop other interaction techniques, such as selection, zooming, and panning, to support drill-down navigation through different levels of abstraction.
- Allow views of clustered query results on the data set.

3. References

[1] M. E. Houle, "Clustering Without Data: the Relevant-Set Correlation Model", in Proc. International Workshop on Data-Mining and Statistical Science (DMSS 2006), pp. 54-61, Sapporo, Japan, 2006.
[2] M. E. Houle, "Clustering without data: the GreedyRSC heuristic", in Proc. International Workshop on Data-Mining and Statistical Science (DMSS 2006), pp. 62-69, Sapporo, Japan, 2006.
[3] M. E. Houle, "Navigating massive data sets via local clustering", in Proc. 9th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2003), pp. 547-552, Washington DC, USA, 2003.
[4] M. E. Houle and J. Sakuma, "Fast approximate similarity search in extremely high-dimensional data sets", in Proc. 21st IEEE International Conference on Data Engineering (ICDE 2005), pp. 619-630, Tokyo, Japan, 2005.
[5] Y. Wu and M. Takatsuka, "Spherical self-organizing map using efficient indexed geodesic data structure", Journal of Neural Networks, Vol. 19, Issue 6-7, pp. 900-910, July-August 2006.
[6] Y. Wu and M. Takatsuka, "Visualizing multivariate network on the surface of a sphere", in Proc. Asia-Pacific Symposium on Information Visualization (APVIS 2006), Tokyo, Japan, 2006.