Vertical Set Squared Distance based Clustering without prior knowledge of K
Amal Perera, Taufik Abidin, Masum Serazi, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105 USA
{amal.perera, taufik.abidin, md.serazi, william.perrizo}@ndsu.edu
Abstract
Clustering is the automated identification of groups of objects based on similarity. Two major
research issues in clustering are scalability and the requirement of domain knowledge to
determine input parameters. Most approaches suggest the use of sampling to address the
issue of scalability, but sampling does not guarantee the best solution. Most approaches also
require the use of domain knowledge, trial-and-error techniques, or exhaustive searching to
figure out the required input parameters.
In this paper we introduce a new clustering technique based on the set squared distance.
Cluster membership is determined based on the set squared distance to the respective cluster.
As with the mean for k-means and the medoid for k-medoids, the cluster is represented by
the entire cluster of points for each evaluation of membership. The set squared distance for
all n items can be computed efficiently in O(n) using a vertical data structure and a few
pre-computed values. A special ordering of the set squared distances is used to break the data
into the "natural" clusters, in contrast to the known k required by k-means / k-medoids
partition clustering.
Superior results are observed when the new clustering technique is compared with
classical k-means clustering. To demonstrate the cluster quality and the resolution of the unknown
k, data sets with known classes such as the iris data, the UCI KDD network intrusion detection
data, and synthetic data are used. Scalability of the proposed technique is demonstrated using a
large RSI data set.
1.0 Introduction
Clustering is a very important human activity. Built-in, trainable clustering models are
continuously trained from early childhood, allowing us, for example, to separate cats from dogs.
Clustering allows us to distinguish between different objects. Given a set of points in a
multidimensional space, the goal of clustering is to compute a partition of these points into sets,
called clusters, such that points in the same cluster are more similar to each other than to points
in different clusters. Clustering allows us to identify dense and sparse regions and, therefore, to
discover the overall distribution of interesting patterns and correlations in the data. Automated
clustering is very valuable in analyzing large data sets, and thus has found applications in many
areas such as data mining, search engine indexing, pattern recognition, image processing, and
trend analysis 00.
A large number of clustering algorithms exist. In the KDD literature these algorithms are grouped
into four categories: partitioning methods, hierarchical methods, density-based (connectivity)
methods, and grid-based methods 00. In partitioning methods, the n objects in the original
data set are iteratively broken into k partitions so as to achieve a certain optimality criterion. The
most classical and popular partitioning methods are k-means 0 and k-medoids 0. Each of the k
clusters is represented by its center of gravity in k-means or by a representative object of the
cluster in k-medoids, and each object in the space is assigned to the closest cluster in each
iteration. All partitioning methods suffer from the requirement of providing k (the number of
partitions) prior to clustering, are only able to identify spherical clusters, and may split large
genuine clusters in order to optimize cluster quality 0.
A hierarchical clustering algorithm produces a representation of the nested grouping
relationship among objects. If the clustering hierarchy is formed from the bottom up, each data
object starts as a cluster by itself; small clusters are then merged into bigger clusters at each
level of the hierarchy, based on similarity, until all the data objects are in one cluster at the top of
the hierarchy. The major difference between hierarchical algorithms is how they measure the
similarity between each pair of clusters. Hierarchical clustering algorithms require a termination
condition to be set using some prior domain knowledge, and they typically have high
computational complexity 0.
Density-based clustering methods attempt to separate the dense and sparse regions of objects
in the data space 0. For each point of a cluster, the density of data points in its neighborhood
has to exceed some threshold 0. Density-based clustering techniques allow the discovery of
arbitrarily shaped clusters, but they suffer from the requirement of setting prior parameters
based on domain knowledge to arrive at the best possible clustering.
A grid-based approach divides the data space into a finite set of multidimensional grid cells,
performs clustering in each cell, and then groups neighboring dense cells into clusters. The
determination of the cell size and other parameters affects the final quality of the clustering.
Two of the most demanding challenges in clustering are scalability and a minimal requirement
of domain knowledge to determine the input parameters 0. In this work we describe a new
clustering mechanism that is scalable and operates without an initial parameter specifying the
expected number of clusters in the data set. We describe an efficient vertical technique to
compute the influence, based on the set squared distance, of each data point with respect to all
other data points in the space. Natural partitions in the influence values are used to initially
partition the data set into clusters. Subsequently, the cluster membership of each data point is
confirmed or reassigned by efficiently recalculating the set squared distance with respect to
each cluster.
2.0 Related Work
Many clustering algorithms work well only on small datasets containing fewer than 200 data
objects 0. Real data sets are far larger: for example, the NASA Earth Observing System will
deliver close to a terabyte of remote sensing data per day, and it is estimated that this
coordinated series of satellites will generate petabytes of archived data in the next few years
000. For real-world applications, the requirement is to cluster millions of records using scalable
techniques 0.
A general strategy to scale up clustering algorithms is to draw a sample or to apply a kind of
data compression before applying the clustering algorithm to the resulting representative
objects. This may lead to biased results 00. CLARA 0 addresses the scalability issue by
choosing a representative sample of the data set and then continuing with the classical
k-medoid method; its effectiveness depends on the size of the sample. CLARANS 0 is an
example of a partition-based clustering technique that uses a randomized and bounded
search strategy to achieve the optimality criterion. This is achieved by not fixing the sample to a
specific subset of the data set for the entire clustering process; an exhaustive traversal of the
search space is not attempted in the final clustering. BIRCH 0 uses a tree structure that records
sufficient statistics (summaries) for subsets of data that can be compressed and represented
by the summary. Initial threshold parameters are required to obtain the best clustering and
computational optimality in BIRCH. Most clustering algorithms require the user to input certain
parameters 0, and clustering results are sensitive to these input parameters. For example,
DENCLUE 0 requires the user to input the cell size used to compute the influence function.
DBSCAN 0 needs the neighborhood radius and the minimum number of points required to mark
a neighborhood as a core object with respect to density. To address the problem of required
parameters, OPTICS 0 computes an augmented cluster ordering for automatic and interactive
cluster analysis. OPTICS stores sufficient additional information to enable the user to extract any
density-based clustering without having to re-scan the data set. Parameter-lessness comes at a
cost: OPTICS has a time complexity of O(n log n) even when used with a spatial index that
allows it to walk through the search space easily.
Less expensive partition-based techniques suffer from the requirement of specifying the number
of expected partitions (k) prior to clustering 000. X-means 0 attempts to find k by repeatedly
searching through different values of k and testing them against a model based on the Bayesian
Information Criterion (BIC). G-means 0 is another attempt to learn k, using repeated division of
the data set until each individual cluster exhibits a Gaussian data distribution within a
user-specified significance level. ACE 0 maps the search space to a grid using a suitable
weighting function, similar to the particle-mesh method used in physics, and then uses a few
agents to heuristically search through the mesh to identify the natural clusters in the data. The
initial weighting costs only O(n), but the success of the technique depends on the agent-based
heuristic search and the size of the grid cell. The authors suggest a linear weighting scheme
based on neighboring grid cells and a variable grid cell size to avoid over-dependence on the
cell size for quality results; the linear weighting scheme adds more compute time to the process.
3.0 Our Approach
Our approach attempts to address the problem of scalability in clustering using a partition-based
algorithm and a vertical data structure that allows fast computation of counts. Three major
inherent issues with partition-based algorithms are the need to input k, the need to initialize the
clusters in a way that leads to an optimal solution, and the representation (prototype) of each
cluster and the computation of membership against it. We solve the first two problems using the
concept of formally modeling the influence of each data point with a function first proposed for
DENCLUE 0, together with an efficient technique to rapidly compute the total influence over the
entire search space. Significantly large differences in the total influence are used to identify the
natural clusters in the data set. Data points with similar total influence are initially put together as
initial clusters, providing a better initialization for the search for an optimal clustering in the
subsequent iterative process. Each cluster is represented by the entire cluster. Using a vertical
data structure, we show an efficient technique that can compute the membership of each data
item by comparing its total influence against each cluster.
The influence function can be interpreted as a function that describes the impact of a data
point within its neighborhood 0. Examples of influence functions are parabolic functions, the
square wave function, and the Gaussian function. The influence function can be applied to
each data point, and an indication of the overall density of the data space can be calculated as
the sum of the influence functions of all data points 0. The density function that results from a
Gaussian influence function, for a point $a$ and data points $x_i$, is

$$f^{D}_{Gaussian}(x, a) = \sum_{i=1}^{n} e^{-\frac{d(a, x_i)^2}{2\sigma^2}}$$
The Gaussian influence function is used in DENCLUE 0; since computing it for all n data points
is O(n^2), DENCLUE uses a grid to compute the density locally. The influence function should be
radially symmetric about either point (either variable), continuous, and differentiable. Some
other influence functions are:
$$f^{D}_{Power2m}(x, a) = \sum_{i=1}^{n} \sum_{j=1}^{m} (-1)^{j}\, W_j\, d(x_i, a)^{2j}$$

$$f^{D}_{Parabolic}(x, a) = \sum_{i=1}^{n} \left( \sigma^{2} - d(x_i, a)^{2} \right)$$
We note that the power 2m function is a truncation of the Maclaurin series of the Gaussian. The
following figure shows the distribution of the density for the Gaussian and the parabolic
influence functions.
[Figure: (a) a sample 2-D data set, (b) density under the Gaussian influence function, (c) density under the parabolic influence function]
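For concreteness, the two densities can be computed directly as written above. The following is a minimal NumPy sketch (illustrative only, with an assumed example bandwidth sigma); the point of the paper is to avoid this per-point O(n) scan by using the vertical structure described next.

import numpy as np

def gaussian_density(a, X, sigma=2.0):
    # sum over all x_i in X of exp(-d(a, x_i)^2 / (2 sigma^2))
    d2 = np.sum((X - a) ** 2, axis=1)
    return np.sum(np.exp(-d2 / (2.0 * sigma ** 2)))

def parabolic_density(a, X, sigma=2.0):
    # sum over all x_i in X of (sigma^2 - d(a, x_i)^2)
    d2 = np.sum((X - a) ** 2, axis=1)
    return np.sum(sigma ** 2 - d2)

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
a = np.array([1.5, 1.5])
print(gaussian_density(a, X), parabolic_density(a, X))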
Next we show how the density based on the parabolic influence function, denoted the Set
Squared Distance, can be computed efficiently using a vertical data structure.
A vertical data representation consists of set structures representing the data column-by-column
rather than row-by-row (as in relational data). P-trees are one choice of vertical data
representation that can be used for data mining instead of the more common sets of relational
records. P-trees 0 are a lossless, compressed, and data-mining-ready data structure. This data
structure has been successfully applied in data mining applications ranging from classification
and clustering with k-nearest neighbors, to classification with decision tree induction, to
association rule mining 000. A basic P-tree represents one attribute bit, reorganized into a tree
structure by recursive sub-division, recording the predicate truth value regarding purity for each
division. Each level of the tree contains truth bits that represent pure sub-trees and can therefore
be used for fast computation of counts. The construction continues recursively down each tree
path until a sub-division is reached that is entirely pure (which may or may not be at the leaf
level). The basic and complement P-trees are combined using Boolean algebra operations to
produce P-trees for values, entire tuples, value intervals, or any other attribute pattern. The root
count of any pattern tree indicates the occurrence count of that pattern. The P-tree data structure
thus provides a structure for counting patterns in an efficient manner.
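The following toy sketch illustrates the idea of purity-recording trees and root counts. It is an assumed simplification (a one-dimensional binary halving rather than the quadrant-based, compressed Peano trees of the paper), intended only to show how counts fall out of pure sub-trees.

class PTree:
    # Toy P-tree over one bit column: a node is pure-1, pure-0, or mixed.
    def __init__(self, bits):
        self.n = len(bits)
        if all(bits):
            self.pure, self.kids = 1, None          # pure-1 sub-tree
        elif not any(bits):
            self.pure, self.kids = 0, None          # pure-0 sub-tree
        else:
            mid = self.n // 2
            self.pure = None
            self.kids = (PTree(bits[:mid]), PTree(bits[mid:]))

    def root_count(self):
        # rc(P): number of 1-bits, computed without visiting pure sub-trees bit by bit
        if self.pure is not None:
            return self.pure * self.n
        return sum(k.root_count() for k in self.kids)

    def to_bits(self):
        if self.pure is not None:
            return [self.pure] * self.n
        return self.kids[0].to_bits() + self.kids[1].to_bits()

    def AND(self, other):
        # Boolean AND of two P-trees (decompressing here for simplicity)
        return PTree([p & q for p, q in zip(self.to_bits(), other.to_bits())])

p1 = PTree([1, 1, 1, 1, 0, 0, 1, 0])
p2 = PTree([1, 0, 1, 1, 1, 1, 1, 1])
print(p1.root_count(), p1.AND(p2).root_count())     # 5 4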
Binary representation is intrinsically a fundamental concept in vertical data structures.
Let $x$ be a numeric value of attribute $A_1$. Then the representation of $x$ in $b$ bits is written as

$$x = x_{1,b-1}\, x_{1,b-2} \cdots x_{1,0} = \sum_{j=b-1}^{0} 2^{j} x_{1j},$$

where $x_{1,b-1}$ and $x_{1,0}$ are the highest and lowest order bits, respectively.
The Vertical Set Squared Distance (VSSD) is defined as follows:

$$f_{VertSetSqrDist}(a, X) = \sum_{x \in X} (x - a) \circ (x - a) = \sum_{x \in X} \sum_{i=1}^{n} (x_i - a_i)^2 = \sum_{x \in X} \sum_{i=1}^{n} \left( x_i^2 - 2 x_i a_i + a_i^2 \right) = T_1 - T_2 + T_3$$

where

$$T_1 = \sum_{x \in X} \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} \left[ \sum_{j=b-1}^{0} 2^{2j}\, rc(P_X \wedge P_{ij}) + \sum_{j=b-1}^{1} \sum_{l=j-1}^{0} 2^{j+l+1}\, rc(P_X \wedge P_{ij} \wedge P_{il}) \right]$$

$$T_2 = 2 \sum_{x \in X} \sum_{i=1}^{n} x_i a_i = 2 \sum_{i=1}^{n} a_i \sum_{j=b-1}^{0} 2^{j}\, rc(P_X \wedge P_{ij})$$

$$T_3 = \sum_{x \in X} \sum_{i=1}^{n} a_i^2 = rc(P_X) \sum_{i=1}^{n} \left( \sum_{j=b-1}^{0} 2^{j} a_{ij} \right)^2 = rc(P_X) \sum_{i=1}^{n} a_i^2$$
Here $P_{i,j}$ denotes the P-tree for the j-th bit of the i-th attribute, $rc(P)$ denotes the root count
of a P-tree (i.e., the number of '1' bits present in the tree), and $P_X$ denotes the P-tree (mask)
for the set X.
The advantage of VSSD is that the root counts can be pre-computed and reused repeatedly: the
root-count operations are independent of a, so they can be computed in advance and used in
the computation of the VSSD for any number of data points as long as the corresponding data
set (the set X) does not change. This observation gives us a technique to compute the VSSD
influence-based density for all data points in O(n). Furthermore, for a given cluster, the VSSD
influence-based density of each data point can be computed efficiently to determine its cluster
membership.
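A minimal sketch of this pre-computation idea follows (illustrative only; ordinary column sums stand in here for the P-tree root-count terms rc(P_X ∧ P_ij), but the algebra is the same). T1 and the column sums depend only on the set X, so they are computed once and reused for every query point a.

import numpy as np

def precompute(X):
    # quantities that depend only on the set X (in the paper these come from root counts)
    T1 = np.sum(X ** 2)              # sum over x in X, i of x_i^2
    S = np.sum(X, axis=0)            # column sums: sum over x in X of x_i
    cnt = X.shape[0]                 # |X|, i.e. rc(P_X)
    return T1, S, cnt

def vssd(a, T1, S, cnt):
    # T1 - 2 * sum_i a_i * S_i + |X| * sum_i a_i^2
    return T1 - 2.0 * np.dot(S, a) + cnt * np.dot(a, a)

X = np.array([[1.0, 2.0], [2.0, 3.0], [8.0, 9.0]])
T1, S, cnt = precompute(X)
a = np.array([2.0, 2.0])
print(vssd(a, T1, S, cnt))                           # 87.0
print(np.sum(np.sum((X - a) ** 2, axis=1)))          # brute-force check: 87.0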
Algorithm (VSSDClus)
The algorithm has two phases. In the initial phase the VSSD is computed for the entire data
set. While the VSSD values are being computed, they are placed using an appropriate hash
function to aid the sorting in the next step. Next, the VSSD values are sorted and the differences
between each two consecutive items are computed. It is assumed that outlier differences
indicate a separation between two clusters. Statistical outliers are identified by applying the
standard mean + 3 standard deviations formula to the differences, and the data set is partitioned
at the outlier differences; a sketch of this phase is given after the step list below. [Sample 2-D
graph showing clear separation]
1. Compute the VSSD for all points in the data set and place the values in a hash map.
2. Sort the VSSD values within the hash map.
3. Compute the difference between VSSD_i and VSSD_{i+1} (i and i+1 in sorted order).
4. Identify differences > mean(difference) + 3 x standard deviation(difference).
5. Break the data into clusters using the large differences as partition boundaries.
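A minimal sketch of phase one (illustrative only: the VSSD values are computed here by brute force, whereas the paper obtains them in O(n) from pre-computed root counts; the data set used in the demo is an assumed example):

import numpy as np

def initial_partition(X):
    # Phase one: VSSD of every point against the whole data set, sorted,
    # and cut wherever a consecutive gap is a statistical outlier.
    vals = np.array([np.sum((X - a) ** 2) for a in X])
    order = np.argsort(vals)                        # sorted VSSD order
    diffs = np.diff(vals[order])                    # consecutive differences
    cutoff = diffs.mean() + 3.0 * diffs.std()       # mean + 3 standard deviations
    labels = np.empty(len(X), dtype=int)
    cluster = 0
    for rank, idx in enumerate(order):
        if rank > 0 and diffs[rank - 1] > cutoff:
            cluster += 1                            # outlier gap: a new cluster starts
        labels[idx] = cluster
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)),       # dense blob near the origin
               rng.normal(10.0, 0.5, (10, 2))])     # smaller blob far away
print(initial_partition(X))                          # the two blobs should separate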
In phase two of the algorithm, each item in the data set is confirmed or re-assigned based on its
VSSD with respect to each cluster. This is similar to the classical k-means algorithm, except that
instead of the mean, each cluster is represented by all the data points in the cluster, and instead
of the squared distance to the mean, the set squared distance is used to determine cluster
membership. A sketch of this phase follows the steps below.
Iterate until the maximum number of iterations is reached, there is no change in cluster sizes, or
the assignment oscillates:
1. Compute the VSSD of all points against each cluster C_i.
2. Re-assign each point a to the cluster with min_i VSSD(a, C_i).
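A matching sketch of phase two, under the same brute-force assumption (the paper evaluates VSSD(a, C_i) from per-cluster pre-computed root counts rather than by scanning each cluster; the oscillation check is omitted for brevity):

import numpy as np

def refine(X, labels, max_iter=20):
    # Phase two: re-assign each point to the cluster with the smallest set
    # squared distance, until the assignment stops changing.
    for _ in range(max_iter):
        new_labels = labels.copy()
        for i, a in enumerate(X):
            best_c, best_d = labels[i], np.inf
            for c in np.unique(labels):
                members = X[labels == c]
                d = np.sum((members - a) ** 2)       # VSSD(a, C_c)
                if d < best_d:
                    best_c, best_d = c, d
            new_labels[i] = best_c
        if np.array_equal(new_labels, labels):       # converged: no change
            break
        labels = new_labels
    return labels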
4.0 Experimental Results
To show the practical relevance of the new clustering approach, we present comparative
experimental results in this section. The approach is aimed at removing the need for the
parameter k and at scalability with respect to the cardinality of the data. To show the successful
elimination of k we use a few different synthetic data sets as well as real-world data sets with
known clusters. We also compare the results with a classical k-means algorithm to show the
relative difference in speed in obtaining an optimal solution. To show the linear scalability we
apply our approach to a large RSI image data set and report the actual computation time with
respect to data size.
We use the following quality measure, used extensively in text mining, to compare the quality
of the clusterings. Note that this measure can only be used with known clusters.
Let $C^* = \{C_1^*, \ldots, C_i^*, \ldots, C_l^*\}$ be the set of original (known) classes and
$C = \{C_1, \ldots, C_j, \ldots, C_k\}$ be some clustering of the data set. Then

$$prec(i, j) = \frac{|C_j \cap C_i^*|}{|C_j|}, \qquad recall(i, j) = \frac{|C_j \cap C_i^*|}{|C_i^*|}$$

$$F_{i,j} = \frac{2 \cdot prec(i, j) \cdot recall(i, j)}{prec(i, j) + recall(i, j)}$$

$$F = \sum_{i=1}^{l} \frac{|C_i^*|}{N} \max_{j=1,\ldots,k} \{ F_{i,j} \}$$

Note: F = 1 for a perfect clustering. The F-measure also indicates whether the selected number
of clusters is appropriate.
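A small sketch of this measure (assuming integer ground-truth and predicted labels; illustrative only):

import numpy as np

def f_measure(true_labels, pred_labels):
    # Weighted best F over the known classes; 1.0 for a perfect clustering.
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    N = len(true_labels)
    total = 0.0
    for t in np.unique(true_labels):
        truth = (true_labels == t)
        best = 0.0
        for p in np.unique(pred_labels):
            pred = (pred_labels == p)
            inter = np.sum(truth & pred)
            if inter == 0:
                continue
            prec = inter / np.sum(pred)              # prec(i, j)
            rec = inter / np.sum(truth)              # recall(i, j)
            best = max(best, 2 * prec * rec / (prec + rec))
        total += (np.sum(truth) / N) * best
    return total

print(f_measure([0, 0, 1, 1], [1, 1, 0, 0]))         # 1.0 (cluster labels permuted)
print(f_measure([0, 0, 1, 1], [0, 1, 0, 1]))         # 0.5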
Synthetic data
The following table shows the results for a few synthetically generated cluster data sets. The
motivation is to show the capability of the algorithm to independently find the natural clusters in
the data set. The classical k-means clustering algorithm, with k given, is used as a comparison.
The number of database scans (i.e., iterations) required to achieve an F-measure of 1.0 is shown
in the following table.
[Figure: three synthetic 2-D data sets, (a)-(c)]

Data Set                 (a)    (b)    (c)
VSSDClus iterations       2      2      6
K-means iterations        8      8     14
Iris Data
[data desc. Req.]

                VSSDClus   K-means (K=3)   K-means (K=4)   K-means (K=5)
Iterations          5            16              38              24
F-measure         0.84          0.80            0.74            0.69
KDD Network Intrusion Data
[data desc. Req.]

Subset with 6 clusters:
                VSSDClus   K-means (K=5)   K-means (K=6)   K-means (K=7)
Iterations          7            10              12              12
F-measure         0.81          0.81            0.81            0.81

Subset with 4 clusters:
                VSSDClus   K-means (K=3)   K-means (K=4)   K-means (K=5)
Iterations          9            16              12              16
F-measure         0.80          0.80            0.80            0.79

Subset with 2 clusters:
                VSSDClus   K-means (K=2)
Iterations          3             6
F-measure         0.90          0.90
RSI data
[OAKS data desc./Graph Req.]
5.0 Conclusion
Two major problems in unsupervised learning for identifying groups of objects based on
similarity in large datasets are scalability and the requirement of domain knowledge to determine
input parameters. Most existing approaches suggest the use of sampling to address the issue of
scalability, and require the use of domain knowledge, trial-and-error techniques, or exhaustive
searching to figure out the required input parameters.
In this paper we introduce a new clustering technique based on the set squared distance. This
technique is scalable and does not need prior knowledge of the expected number of partitions.
Efficient computation of the set squared distance using a vertical data structure enables this. We
show how a special ordering of the set squared distance, which is an indication of the density at
each data point, can be used to break the data into the "natural" clusters. We also show the
effectiveness of determining cluster membership based on the set squared distance to the
respective cluster.
We demonstrate the cluster quality and the resolution of the unknown k of our new technique
using data sets with known classes, and we show the scalability of the proposed technique with
respect to data set size using a large RSI data set.
References
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs,
NJ, 1988.
Osmar R. Zaïane, Andrew Foss, Chi-Hoon Lee, Weinan Wang, On Data Clustering Analysis:
Scalability, Constraints and Validation, in Proc. of the Sixth Pacific-Asia Conference on
Knowledge Discovery and Data Mining (PAKDD'02), pp 28-39, Taipei, Taiwan, May, 2002
MacQueen J. (1967) Some methods for classification and analysis of multivariate observations.
In Proc. 5th Berkeley Symp. Math. Statist. Prob.
Kaufman L. and Rousseeuw P. J. (1990) Finding Groups in Data: an Introduction to Cluster
Analysis. John Wiley & Sons.
Ng R. and Han J. (1994) Efficient and effective clustering method for spatial data mining. In
Proc. Conf. on Very Large Data Bases, pp. 144-155.
T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: an efficient data clustering method for
very large databases, Proc. ACM-SIGMOD Intl. Conf. Management of Data, 1996, pp. 103–
114.
S. Guha, R. Rastogi, and K. Shim, CURE: an efficient clustering algorithm for large
databases, Proc. ACM-SIGMOD Intl. Conf. Management of Data, 1998, pp. 73–84.
A. Hinneburg and D. A. Keim, Optimal gridclustering: towards breaking the curse of
dimensionality in high-dimensional clustering, Proc. of 25th Intl. Conf. Very Large
DataBases, 1999, pp. 506–517.
A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Multimedia Databases
with Noise. In Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining. AAAI Press,
1998.
Markus M. Breunig, Hans-Peter Kriegel, Peer Kröger, Jörg Sander, Data Bubbles: Quality
Preserving Performance Boosting for Hierarchical Clustering, ACM SIGMOD 2001, May 21-24,
Santa Barbara, California, USA.
Goddard Space Flight Center. NASA’s Earth Observing System. http://eospso.gsfc.nasa.gov.
A. Zomaya, T. El-Ghazawi, and O. Frieder. Parallel and distributed computing for data
mining. IEEE Concurrency, 7(4), 1999.
William Peter, John Chiochetti, Clare Giardina, New Unsupervised Clustering Algorithm for
Large Datasets, SIGKDD'03, August 24-27, 2003, Washington, DC, USA.
L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster
Analysis, John Wiley & Sons, Inc., New York, NY, 1990.
Ester M., Kriegel H.-P., Sander J. and Xu X. (1996) A density-based algorithm for
discovering clusters in large spatial databases with noise. In Proc. ACM-SIGKDD Int. Conf.
Knowledge Discovery and Data Mining, pp. 226-231.
Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander OPTICS: Ordering
Points To Identify the Clustering Structure Proc. ACM SIGMOD’99 Int. Conf. on
Management of Data, Philadelphia PA, 1999.
G. Hamerly, C. Elkan, Learning the k in k-means. Seventeenth Annual Conference on Neural
Information Processing Systems (NIPS) 2003, British Columbia, Canada.
Q. Ding, M. Khan, A. Roy, and W. Perrizo, The P-tree Algebra, Proceedings of the ACM
Symposium on Applied Computing, pp. 426-431, 2002.
J. Han, and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San
Francisco, CA, 2001.
J. A. Hartigan, Clustering Algorithms, John Wiley & Sons, New York, NY, 1975.
M. Khan, Q. Ding, and W. Perrizo, K-Nearest Neighbor Classification of Spatial Data
Streams using P-trees, Proceedings of the PAKDD, pp. 517-528, 2002.
E. M. Knorr and R. T. Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets.
Proceedings of the 24th International Conference on Very Large Data Bases (VLDB), pp.
392-403, 1998.
A. Perera, A. Denton, P. Kotala, W. Jockhec, W.V. Granda, and W. Perrizo, P-tree
Classification of Yeast Gene Deletion Data. SIGKDD Explorations, 4(2), pp.108-109, 2002.
W. Perrizo, Peano Count Tree Technology, Technical Report NDSU-CSOR-TR-01-1, 2001.
I. Rahal and W. Perrizo, An Optimized Approach for KNN Text Categorization using P-Trees.
Proceedings of the ACM Symposium on Applied Computing, pp. 613-617, 2004.
Dan Pelleg and Andrew W. Moore, X-means: Extending K-means with Efficient Estimation of
the Number of Clusters. Proceedings of the Seventeenth International Conference on Machine
Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2000, pp. 727-734.