International Conference on Intelligent Computational Systems (ICICS'2012) Jan. 7-8, 2012 Dubai
Finding and Visualizing Subspace Clusters of
High Dimensional Dataset
Using Advanced Star Coordinates
Rajashri Kulkarni, A.J. Patankar, Sunita Jahirabadkar
Abstract— Analysis of high dimensional data has been an active research area for many years. Analysts can detect the similarity of data points within a cluster. Subspace clustering detects the dimensions that are useful for clustering a high dimensional dataset, and visualization gives better insight into the resulting subspace clusters. However, displaying clusters of such a high dimensional database on a 2-dimensional display is a challenging task. We propose the ISC-ASC approach, which first identifies subspace clusters in a high dimensional dataset and then displays these clusters on a 2-dimensional display device. The ISC algorithm detects the subspace clusters using a density notion of clustering, and the ASC algorithm visualizes them. Instead of considering all dimensions, ASC uses only the dimensions that take part in subspace clustering to find the projection points. ISC-ASC helps users identify subspace clusters, and visualizing these clusters with ASC supports efficient knowledge discovery and decisions about the quality of the subspace clusters.

Keywords—Subspace clustering, high dimensional data subspace clustering, visualization

I. INTRODUCTION

As a data mining function, clustering is the process of grouping physical or abstract objects into classes of similar objects. Analyzing these clusters helps in understanding the distribution of the data, identifying the characteristics of each cluster, and focusing on a particular set of clusters for further analysis. Visualization helps the user by representing information visually. Subspace clustering identifies the subsets of attributes relevant for clustering. Visualization of such high dimensional clusters is an important subfield of scientific visualization; it allows the user to explore the data in different ways and at different levels of abstraction to find the right level of detail [5].

By visualizing these subspace clusters, the user can find how well defined the clusters are, which dimensions are relevant, and
what further experiments to conduct [2]. Visualization of such subspace clusters, in environments where large amounts of data are available for analysis, can help in various application areas such as web text mining, DNA analysis, and financial analysis. For example, DNA microarrays are a new field in biomedicine. A microarray records thousands of genes under different conditions. If there are 50 cancer profiles with 1000 features, a user cannot analyze all the cancer subtypes at the genetic level. Moreover, a particular cancer may be divided into more than one set of characteristics: identifying the specimens requires one set of genes, while distinguishing subtypes based on cell division requires a different set. To analyze such complex cellular mechanisms, subspace clustering visualization is very helpful. It extends the power of traditional clustering to identify meaningful subspaces and subspace clusters [3].

In this paper, we propose the new ISC-ASC approach, in which the clusters found by ISC (Intelligent Subspace Clustering) [8] are visualized through ASC (Advanced Star Coordinates) [10], which helps users detect and analyze clusters at different dimensionality levels. In ISC, the Rank algorithm gives the list of dimensions in descending order of interestingness. Starting with two dimensions, ISC detects the clusters in those dimensions using DBSCAN [4], a robust density based clustering algorithm. These clusters are then visualized using Advanced Star Coordinates. ASC is based on Star Coordinates [1], a traditional data visualization technique. In the Star Coordinates approach a radius is used to represent each dimension axis, whereas ASC uses a diameter. ASC projects the high dimensional data items onto the dimension axes; the resulting projection point is the advanced star coordinate of the data point, set up using Cartesian coordinates. To improve the efficiency of the algorithm, the projection point can also be represented in polar coordinates. Every high dimensional data object found in the subspace clusters by ISC is displayed on the screen using ASC. The ISC-ASC approach helps to identify clusters of different size, shape, and density, and visualizing them in two dimensions supports in-depth analysis. It helps the user decide what further experiments to carry out to improve the quality of the clusters.
Rajashri Kulkarni, M.E. Student, Computer Engineering Department, D.Y. Patil College of Engineering, Akurdi, Pune University, Pune, India (e-mail: [email protected]).
A.J. Patankar is an Assistant Professor in the Computer Engineering Department, D.Y. Patil College of Engineering, Akurdi, Pune University, Pune, India (e-mail: [email protected]).
Sunita Jahirabadkar is an Assistant Professor in the Computer Engineering Department, Cummins College of Engineering, Pune University, Pune, India (e-mail: [email protected]).
II. RELATED WORK
The major challenge addressed in this paper is to combine the ISC (Intelligent Subspace Clustering) algorithm and the ASC (Advanced Star Coordinates) algorithm for high dimensional databases. Hence, related work is reviewed at two levels: subspace clustering approaches and visualization techniques.

A. Subspace Clustering Approaches

Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces of a high dimensional database. When clustering a high dimensional dataset, the data points become sparser. A cluster is a dense region of data points, and we can recognize clusters because each cluster has a typical density of points. There are several subspace clustering algorithms. The first well known algorithm, CLIQUE [3], combines density based and grid based clustering and uses an APRIORI-style search technique to find dense subspaces. ENCLUS [3] is based on the entropy of a discrete random variable. MAFIA [3] uses adaptive, variable-sized grids in each dimension. All of these are grid based approaches, and their results depend on the positioning of the grids.

Density based approaches to finding clusters generally give better results. The first such algorithm, DOC [3], is a hybrid method combining grid based clustering with the iterative improvement methods of top down approaches; it is based on the mathematical notion of an 'optimal projective cluster'. SUBCLU [3] can detect arbitrarily shaped and positioned clusters in subspaces.

Another efficient algorithm for subspace clustering on high dimensional datasets is ISC (Intelligent Subspace Clustering) [8]. ISC determines meaningful clustering parameters dynamically and adaptively using a hierarchical filtering approach. It detects subspace clusters at intermediate levels and allows the parameters to be modified adaptively. It is based on the density notion of clustering, which helps to identify clusters of different shapes and sizes.
B. Visualization Techniques

A number of visualization techniques are available for visual data exploration. Geometric projection techniques are used to find informative projections; the methods in this category help to find correlations among dimensions, detect outliers, and work with high dimensional datasets [9]. The first well known technique is Parallel Coordinates [6], where attributes are represented by parallel vertical axes linearly scaled within their data range. Circular Parallel Coordinates [6] is similar, with n lines emanating radially from the centre of a circle and terminating at its perimeter. However, when visualizing a dataset with many data items using Circular Parallel Coordinates, the number of polygon lines increases, which has a serious impact on the arrangement of dimensions.

RadViz [6] is based on Hooke's law: n lines emanate from the centre of a circle and terminate on its perimeter, and each dimension axis is associated with an attraction factor, as in an imaginary spring system with special end points. The equally spaced end points are called dimensional anchors; one end of each spring is attached to a dimensional anchor, the other end to the data point, and each line is associated with one attribute value. RadViz has very good time complexity, and similar records in the n-dimensional space are projected close together in the 2D space, favoring the identification of clusters. However, very different records may also be projected close together. Another popular visualization approach is Star Coordinates [1], which projects a high dimensional information object into two dimensions on the screen. In this approach, each dimension is represented as a vector radiating from the centre of a unit circle in a two-dimensional plane. However, the angle between the axes is equal and all axes have the same length; data points are scaled so that records are projected close together, and arranging the dimensions is more complicated.

Advanced Star Coordinates (ASC) is an extension of Star Coordinates [10]. ASC uses a diameter to represent each dimension axis. Its dimension configuration strategy arranges the dimensions automatically, avoiding manual arrangement on the 2-dimensional display, by optimizing the order of and angles between the dimension axes. The visualization results of ASC are easy to understand and express dimension distribution information effectively, which helps the user view high dimensional data and discover implicit information in the knowledge discovery process.
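As background for the projection ideas used in the remainder of the paper, the following Python sketch shows the classic Star Coordinates mapping [1] described above: every dimension gets a unit axis vector, attribute values are scaled to [0, 1], and a record is mapped to the sum of the scaled values times the axis vectors. This is not the authors' ASC implementation (which was written in MATLAB); the even axis spacing and the function name are assumptions for illustration.

# Minimal sketch of the classic Star Coordinates mapping [1].
# Axis angles are spread evenly here; ASC instead derives the arrangement
# from dimension correlations (see Section III).
import numpy as np

def star_coordinates(data):
    """Project an (m x k) data matrix onto 2-D star coordinates."""
    m, k = data.shape
    angles = 2 * np.pi * np.arange(k) / k                        # evenly spaced axes
    axes = np.stack([np.cos(angles), np.sin(angles)], axis=1)    # (k, 2) unit vectors
    mins, maxs = data.min(axis=0), data.max(axis=0)
    scaled = (data - mins) / np.where(maxs > mins, maxs - mins, 1.0)  # normalize to [0, 1]
    return scaled @ axes                                          # (m, 2) screen positions

# Example: project 100 random 6-dimensional records.
points = star_coordinates(np.random.rand(100, 6))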
III. ISC-ASC APPROACH ALGORITHM

The ISC-ASC approach consists of the following two major algorithms.

A. ISC (Intelligent Subspace Clustering)

The ISC algorithm [8] is based on the density notion of hierarchical subspace clustering. The concept of hierarchy is used in ISC at the dimension level: ISC finds low dimensional subspace clusters and then tries to extend them into meaningful higher dimensional clusters. Objects are assigned to subspace clusters using the density notion of clustering. Because the number of possible subspaces is exponential in the number of dimensions, this is a challenging task, both with respect to the runtime of the algorithm and with respect to the typically enormous number of output clusters. To cope with this, ISC computes the relevance of each dimension and ranks the dimensions according to their interestingness. ISC starts with the highest ranked dimension, builds 1-dimensional clusters, and then continues in descending order of rank.

As the dimensionality increases, the subspace clusters become sparser, and using a global density threshold to find dense areas does not give meaningful results. To eliminate this problem, the cluster quality is checked at each dimension level.
The density threshold (µ) is changed adaptively: the algorithm is repeated with new values of the density threshold at each dimensionality level until clusters of the required quality are found.

ISC has the following important improvements over state of the art subspace clustering approaches:
• The use of a density based approach at each dimension level allows clusters of any size, shape, and density to be found.
• It detects subspace clusters at different dimensionality levels.
• It builds a hierarchy of nested subspace clusters.
• Users can interact with the parameter settings at each dimension level to find meaningful clusters.

To find clusters that are hidden in different subspaces, parameters such as the density threshold have to be set depending on the number of dimensions considered for clustering. In addition, the intermediate level clustering results make it easier to change the parameters for in-depth analysis and interaction. A sketch of this ranking and adaptive clustering loop is given below.
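The following Python sketch illustrates the behaviour described above: dimensions are ranked by a simple interestingness score and DBSCAN [4] is run on the top-ranked 2-dimensional subspace, retrying with a relaxed density threshold until clusters appear. The variance-based score, the retry schedule, and the function names are illustrative assumptions; the actual Rank algorithm and parameter adaptation of ISC [8] are more elaborate, and the authors' implementation was in MATLAB.

# Hedged sketch of the ISC loop: rank dimensions, cluster the top-ranked
# 2-D subspace with DBSCAN, and relax the density parameter until at least
# one cluster of acceptable quality appears.
import numpy as np
from sklearn.cluster import DBSCAN

def rank_dimensions(data):
    """Return dimension indices in descending order of a simple interestingness score."""
    return np.argsort(data.var(axis=0))[::-1]     # variance as a stand-in score

def cluster_subspace(data, dims, eps=0.3, min_pts=5, max_tries=5):
    """Run DBSCAN on the selected subspace, relaxing eps until clusters are found."""
    subspace = data[:, dims]
    for _ in range(max_tries):
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(subspace)
        if labels.max() >= 0:                     # at least one cluster (label -1 = noise)
            return labels, eps
        eps *= 1.5                                # adaptive relaxation of the density threshold
    return labels, eps

data = np.random.rand(200, 8)
dims = rank_dimensions(data)[:2]                  # start with the two most interesting dimensions
labels, used_eps = cluster_subspace(data, dims)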
B. ASC (Advanced Star Coordinates)

When a multidimensional object is displayed on a 2-dimensional display device, it can be shown simply as a point derived from its multidimensional coordinates, as in Fig. 1. Let A(F) = {F1, F2, ..., Fm} be a k-dimensional dataset, where m is the total number of records and each Fi represents a multidimensional information object. ASC represents a multidimensional data item as a point in Cartesian coordinates. The projection lines from this point are perpendicular to every dimension axis, and the coordinates where a projection line intersects a dimension axis are called the visual coordinates. In ASC, all dimensions are taken into consideration when mapping the actual attribute values to visual coordinates. The projection lines from the visual coordinates converge to a single point, which represents the high dimensional data item in two-dimensional coordinates on the screen.

Advanced Star Coordinates uses the diameter instead of the radius to define each dimension axis. The ASC coordinates are first set up in the Cartesian coordinate system: ASC finds the visual coordinates on every dimension axis, and the projection lines from the visual coordinates converge to a point, the advanced star coordinate of the record. An object function is constructed for each record in the dataset, and a pattern search method is used to solve it. However, when polar coordinates are set up within the Cartesian system, the high dimensional data point is represented as a point (r, θ), as shown in Fig. 1, and the object function can be solved with the Quasi-Newton method, in which differentials are calculated. Representing the point in polar coordinates therefore increases the efficiency of the algorithm.
Fig.1 Representation of a data point using polar coordinates
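To make the projection step concrete, the following Python sketch minimizes a plausible object function for one record: the sum of squared distances between a candidate screen point and the perpendicular projection lines through the visual coordinates, solved with a quasi-Newton method (BFGS) as the text suggests. The exact object function of ASC [10] is not reproduced in this paper, so this quadratic formulation, the axis layout, and all names are assumptions for illustration only.

# Hedged sketch of the ASC projection step: find the screen point whose
# perpendicular distances to the projection lines through the visual
# coordinates are jointly minimal, using a quasi-Newton (BFGS) solver.
import numpy as np
from scipy.optimize import minimize

def asc_point(record, axis_angles):
    """Project one normalized record (values in [-1, 1]) to a 2-D point."""
    axes = np.stack([np.cos(axis_angles), np.sin(axis_angles)], axis=1)  # unit axis vectors

    def objective(p):
        # distance from p to the line through visual coordinate i, perpendicular
        # to axis i, is |p . u_i - a_i|; sum the squared distances over all axes
        return np.sum((axes @ p - record) ** 2)

    result = minimize(objective, x0=np.zeros(2), method="BFGS")
    return result.x                                   # (x, y) screen coordinates

# Example: a 4-dimensional record with axes arranged at given angles.
angles = np.array([0.0, 0.9, 1.8, 2.7])
print(asc_point(np.array([0.2, -0.5, 0.7, 0.1]), angles))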
ASC arranges the dimensions automatically, avoiding manual arrangement. The dimension configuration of ASC is based on the correlation coefficients among all the dimensions: dimensions showing high correlation are positioned next to each other. Each dimension is represented as a diameter of the circle, the direction of the arrow indicates the positive direction of the dimension, and the angle between two dimensions is measured between their positive directions. ASC computes the correlation matrix and drops dimensions that are highly correlated with others. It then takes the largest values of the correlation matrix, determines the angles between neighbouring dimensions, and arranges the dimensions around the circle accordingly.
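The sketch below shows one way such a correlation-driven arrangement could look in Python: it computes the correlation matrix, greedily chains together the most correlated dimensions, and spreads the resulting order evenly around the circle. The greedy ordering and even angular spacing are assumptions; the exact configuration strategy of ASC [10] may differ.

# Hedged sketch of a correlation-driven axis arrangement: order dimensions so
# that highly correlated ones sit next to each other, then spread the order
# evenly around the circle.
import numpy as np

def arrange_dimensions(data):
    """Return (order, angles) for the dimension axes of an (m x k) data matrix."""
    corr = np.abs(np.corrcoef(data, rowvar=False))    # |correlation| between dimensions
    np.fill_diagonal(corr, -1.0)                      # ignore self-correlation
    k = corr.shape[0]
    order = [int(np.unravel_index(corr.argmax(), corr.shape)[0])]
    while len(order) < k:
        remaining = [d for d in range(k) if d not in order]
        # append the remaining dimension most correlated with the current last axis
        order.append(max(remaining, key=lambda d: corr[order[-1], d]))
    angles = 2 * np.pi * np.arange(k) / k             # even angular spacing (assumption)
    return order, angles

order, angles = arrange_dimensions(np.random.rand(300, 5))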
C. ISC-ASC Algorithm

The algorithm starts with two dimensions and iterates up to k dimensions. ISC-ASC first runs the Rank algorithm, which gives the list of dimensions in descending order of interestingness, and the dimension configuration strategy of ASC then gives the arrangement of the dimension axes. The advanced star coordinates are set up in polar coordinates. In ASC, all dimensions are taken into consideration when finding the projection point; in the ISC-ASC approach, only the dimensions that take part in subspace clustering are considered. The first two dimensions with the highest interestingness are taken from the Rank algorithm, the projection points are calculated starting with these two dimensions, and the calculation continues up to k dimensions. Different combinations of dimensions are also tried, and the user can decide which dimensions give meaningful clusters. Starting with two dimensions, the two dimension axes are drawn. At each dimensionality level, ISC selects the density threshold, the ε-distance is calculated, and DBSCAN is applied with these parameters. The clusters found by ISC are visualized on the screen using the ASC coordinates. The user can change the density threshold, which is helpful for in-depth analysis.
D. Algorithm ISC-ASC
1. Apply the Rank algorithm to select the most interesting dimensions.
2. Apply DBSCAN, with the density threshold as input parameter, to the two dimensional dataset.
3. Set up the advanced star coordinates in polar coordinates. Starting with two dimensions, draw the two dimension axes.
4. Find the unit vector of every dimension axis, where k is the total number of dimensions.
5. Compute the equation of the line that passes through the starting point of each dimension and is perpendicular to the dimension axis.
6. Construct the object function min f(x, y) for the record (Eq. 1), using the polar form x = r cos θ and y = r sin θ.
7. Find the visual coordinate point (xi, yi) on every dimension axis and solve the object function (1) using the Quasi-Newton method.
8. The user can change the density threshold and inspect the quality of the clusters.
9. Repeat steps 2 to 6 so that the algorithm iterates up to k dimensions.
The same algorithm can be repeated for various density thresholds to obtain the clearest, most visible clusters on the screen. A hedged end-to-end sketch of this pipeline is given after the list.
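Putting the pieces together, the sketch below chains the illustrative helpers from the previous sections into one ISC-ASC style pipeline: rank the dimensions, cluster the selected subspace with DBSCAN, arrange the participating axes, and project each record to a screen point. It reuses the hypothetical functions rank_dimensions, cluster_subspace, arrange_dimensions, and asc_point defined above; it is only a sketch of the described workflow, not the authors' MATLAB implementation.

# Hedged end-to-end sketch of the ISC-ASC pipeline described in steps 1-9,
# reusing the illustrative helpers defined in the earlier sketches.
import numpy as np

def isc_asc(data, n_dims=2, eps=0.3, min_pts=5):
    """Cluster a ranked subspace and project its members to 2-D screen points."""
    dims = rank_dimensions(data)[:n_dims]             # step 1: most interesting dimensions
    labels, used_eps = cluster_subspace(data, dims, eps=eps, min_pts=min_pts)  # step 2
    subspace = data[:, dims]
    # normalize the participating dimensions to [-1, 1] for diameter axes (steps 3-4)
    mins, maxs = subspace.min(axis=0), subspace.max(axis=0)
    scaled = 2 * (subspace - mins) / np.where(maxs > mins, maxs - mins, 1.0) - 1
    order, angles = arrange_dimensions(subspace)      # ASC dimension configuration
    points = np.array([asc_point(row[order], angles) for row in scaled])  # steps 5-7
    return points, labels, used_eps

# points can then be plotted, colored by label, with label -1 drawn as outliers.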
IV. EXPERIMENTAL EVALUATION

We implemented the ISC-ASC algorithm in MATLAB. All experiments were run on the Microsoft Windows XP platform with a 2.0 GHz CPU and at least 2.0 GB of RAM. We evaluated the algorithm on several synthetic datasets, and the well known Iris flower dataset with four dimensions was used to test the clustering results. We started with two dimensions; the clusters found in two dimensions are displayed on the screen as shown in Fig. 3. With a density threshold of three, clusters are shown in blue, green, and violet, and outliers are shown in red. At first only two dimension axes are drawn. Different combinations of dimensions are possible, and the user can adaptively change the density threshold to observe the quality of the clusters and decide which dimension is most useful for clustering. We continued the algorithm with three dimensions and later with four dimensions; the clustering results with the same density threshold of three are shown in Fig. 4.

Fig.3 Visualization of clusters with two dimensions of the Iris dataset
Fig.4 Visualization of clusters with three dimensions of the Iris dataset
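For readers who want to reproduce a comparable setup, the snippet below loads the Iris data and runs the illustrative pipeline from the previous section. The dataset loader comes from scikit-learn, and the plotting call is only an assumption about how the resulting points might be displayed; it is not the authors' MATLAB visualization.

# Hedged usage example: run the illustrative isc_asc() sketch on the Iris data
# and scatter-plot the projected points, coloring DBSCAN noise (-1) as outliers.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
points, labels, used_eps = isc_asc(iris.data, n_dims=2, eps=0.3, min_pts=5)

plt.scatter(points[:, 0], points[:, 1], c=labels, s=15)
plt.title("ISC-ASC sketch on Iris (eps=%.2f)" % used_eps)
plt.show()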
V. CONCLUSION

Subspace clustering visualization has many applications in science, engineering, and business decision making. In this paper, we first motivated the need for visualizing subspace clusters and then proposed the ISC-ASC approach,
which visualizes the subspace clusters formed by ISC using ASC coordinates. Visualizing the clusters on the screen with the ISC-ASC algorithm leads to better cluster formation for high dimensional data. The experimental evaluation showed that the ISC-ASC approach helps to identify clusters under different density thresholds. It will benefit large application domains such as web information systems, where huge amounts of data are available, and DNA microarray analysis, where the analyst has to deal with a huge number of genes; ISC-ASC helps to analyze and understand the complex cellular mechanisms captured in DNA microarrays. Currently the approach has been tested with four dimensions; in future work we will try to extend it up to 10 dimensions.
REFERENCES
[1] E. Kandogan, "Star Coordinates: A high-dimensional visualization technique with uniform treatment of dimensions," in Proc. of the IEEE Information Visualization Symposium, 2000, pp. 4-8.
[2] I. Davidson, "Visualizing clustering results," in Proc. of the Second SIAM International Conference on Data Mining, Arlington, VA, USA, April 11-13, 2002. SIAM, 2002, ISBN 0-89871-517-2.
[3] L. Parsons, E. Haque, and H. Liu, "Subspace clustering for high dimensional data: A review," SIGKDD Explorations, vol. 6, issue 1, 2004, Department of Computer Science Engineering, Arizona State University, Tempe.
[4] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proc. of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD), Portland, OR, 1996.
[5] P.C. Wong and R.D. Bergeron, "30 years of multidimensional multivariate visualization," in Scientific Visualization: Overviews, Methodologies, and Techniques, Washington, IEEE Computer Society, 1997.
[6] P.E. Hoffman, "Table Visualizations: A Formal Model and its Applications," Doctoral dissertation, Computer Science Department, University of Massachusetts Lowell, 1999.
[7] M. Steinbach, L. Ertoz, and V. Kumar, "The challenges of clustering high dimensional data," in L.T. Wille (Ed.), New Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition, Springer-Verlag.
[8] S. Jahirabadkar and P. Kulkarni, "ISC – Intelligent Subspace Clustering, a density based clustering approach for high dimensional dataset," in World Congress on Science, Engineering & Technology (WCSET-09), July 29-31, 2009, Oslo, Norway, pp. 69-73.
[9] W. W.-Y. Chan, "A survey on multivariate data visualization," June 2006.
[10] Y. Sun, J. Tang, D. Tang, and W. Xiao, "Advanced star coordinates," in Proc. of WAIM '08, the Ninth International Conference on Web-Age Information Management, IEEE Computer Society, Washington, DC, USA, 2008.