Searching for Centers: An Efficient Approach to the
Clustering of Large Data Sets Using P-trees
Abstract
With the ever-increasing data-set sizes in most data
mining applications, speed remains a central goal in
clustering. We present an approach that avoids both
the time complexity of partition-based algorithms and
the storage requirements of density-based ones, while
being based on the same fundamental premises as
standard partition- and density-based algorithms. Our
idea is motivated by taking an unconventional
perspective that puts three of the most popular
clustering algorithms, k-medoids, k-means, and
center-defined DENCLUE into the same context. We
suggest an implementation of our idea that uses P-trees¹ for efficient value-based data access.

¹ P-tree technology is patented by North Dakota State University.
1. Introduction
Many things change in data mining applications, but one fact reliably stays the same: next year's problems will involve larger data sets and tougher performance requirements than this year's.
The datasets that are available to biological
applications keep growing continuously, since there is
commonly no reason to remove old data and
experiments worldwide contribute new data [1].
Network applications operate on massive amounts of data, and there is no limit in sight to the increase in network traffic, the growing detail of the information that should be kept and evaluated, and the increasing demands on the speed with which data
should be analyzed. The World-Wide Web constitutes
another data mining area with continuously growing
"data set" sizes. The list could be continued almost
indefinitely.
It is therefore of utmost importance to see where the
scaling behavior of standard algorithms may be
improved without losing their benefits, so as not to
make them obsolete over time. A clustering technique
that has caused much research in this direction is the
k-medoids [2] algorithm. Although it has a simple justification and useful clustering properties, its time complexity, which by default scales as the square of the number of data items, makes it unsuited to large data sets.
Many improvements have been implemented [3,4,5], such as CLARA [2] and CLARANS [3], but they don't address the fundamental issue, namely that the algorithm inherently depends on the combined choice of cluster centers, so its complexity must scale essentially as the square of the number of investigated sites. In this paper we analyze the origins of this unfavorable scaling and show how it can be eliminated
at a fundamental level. Our idea is to make the
criterion for a "good" cluster center independent of the
locations of all other cluster centers. At a fundamental
level this replaces the quadratic dependency on the
number of investigated sites by a linear dependency.
We note that our proposed solution is not entirely new
but can be seen as implemented in the density-based
clustering algorithm DENCLUE [6] albeit with a
different justification. This allows us to separate
representation issues from a more fundamental
complexity question when discussing the concept of
an influence function as introduced in DENCLUE.
2. Taking a Fresh Look at
Established Algorithms
Partition-based and density-based algorithms are
commonly seen as fundamentally and technically
distinct, and proposed combinations work on an
applied rather than a fundamental level [7]. We will
present three of the most popular techniques from both
categories in a context that allows us to see their
common idea independently of their implementation.
This will allow us to combine elements from each of
them and design an algorithm that is fast without
requiring any clustering-specific data structures.
The existing algorithms we consider in detail are the
k-medoids [2] and k-means [8] partitioning techniques
and the center-defined version of DENCLUE [6]. The
goal of these algorithms is to group a data item with a
cluster center that represents its properties well. The
clustering process has two parts that are strongly
related for the algorithms we review, but will be
separated for our clustering algorithm. The first part
consists of finding cluster centers, while the second specifies the boundaries of the clusters. We first look at
strategies that are used to determine cluster centers.
Since the k-medoids algorithm is commonly seen as
producing a useful clustering, we start by reviewing its
definition.
2.1. K-Medoids Clustering as a Search
for Equilibrium
A good clustering in k-medoids is defined through the
minimum of a cost function. The most common
choice of cost function is the sum of squared
Euclidean distances between each data item and its
closest cluster center. An alternative way of looking
at this definition borrows ideas from physics: We can
look at cluster centers as particles that are attracted to
the data points. The potential that describes the
attraction for each data item is taken to be a quadratic
function in the Euclidean distance as defined in the d-dimensional space of all attributes. The energy landscape surrounding a cluster center with position $X^{(m)}$ is the sum of the individual potentials of the data items at locations $X^{(i)}$:

$$E(X^{(m)}) = \sum_{i=1}^{N} \sum_{j=1}^{d} \left( x_j^{(i)} - x_j^{(m)} \right)^2$$
where N is the number of data items that are assumed
to influence the cluster center.
We defer the
discussion on the influence of cluster boundaries until
later. It can be seen that the potential that arises from
more than one data point will continue to be quadratic,
since the sum of quadratic functions is again a
quadratic function. We can calculate the location of
its minimum as follows:
$$\frac{\partial E(X^{(m)})}{\partial x_j^{(m)}} = -2 \sum_{i=1}^{N} \left( x_j^{(i)} - x_j^{(m)} \right) = 0$$
Therefore we can see that the minimum of the
potential is the mean of coordinates of the data points
to which it is attracted.
$$x_j^{(m)} = \frac{1}{N} \sum_{i=1}^{N} x_j^{(i)}$$
This result may seem surprising: it suggests that the potential minima, and thereby the equilibrium positions of the cluster centers in the k-medoids algorithm, should be the mean, or rather the data item closest to it, given the constraint that k-medoids cluster centers must be data items. This is unexpected because the k-medoids algorithm is known to be significantly more robust than k-means, which explicitly takes means as cluster centers. In order to
understand this seemingly inconsistent result we have
to remember that the k-medoids and k-means algorithms look not only for an equilibrium position of the cluster center within any one cluster, but rather for a minimum of the total energy of all cluster centers. To understand this we now have to look at how cluster boundaries would be modeled in the analogous physical system. Since data items don't attract cluster centers that are located outside of their cluster, we model their potential as being quadratic within a cluster and continuing as a constant outside a cluster. Constant potentials are irrelevant for the calculation of forces and can be ignored.

Figure 1: Energy landscape (black) and potential of individual data items (gray) in k-medoids and k-means

Cluster boundaries are given as the points of equal
distances to the closest cluster centers. This means that
for the k-means and k-medoids algorithms the energy
landscape depends on the cluster centers. The
difference between k-means and k-medoids lies in the
way the system of all cluster centers is updated. If
cluster centers change, the energy landscape will also
change. The k-means algorithm moves cluster centers
to the current mean of the data points and thereby
corresponds to a simple hill-climbing algorithm for the
minimization of the total energy of all cluster centers.
(Note that the "hill-climbing" refers to a search for
maxima whereas we are looking for minima.) For the
k-medoids algorithm the attempt is made to explore
the space of all cluster-center locations completely.
The reason why the k-medoids algorithm is so much
more robust than k-means therefore can be traced to
their different update strategies.
2.2. Replacing a Many-Body Problem
by a Single-Body Problem
We have now seen that the energy landscape that
cluster centers feel depends on the cluster boundaries,
and thereby on the location of all other cluster centers.
In the physics language the problem that has to be
solved is a many-body problem, because many cluster
centers are simultaneously involved in the
minimization. Recognizing the inherent complexity of
many-body problems we consider ways of redefining
our problem such that we can look at one cluster
center at a time, i.e., replacing the many-body problem
with a single-body problem. Our first idea may be to
simply ignore all but one cluster center while keeping
the quadratic potential of all data points. Clearly this
is not going to provide us with a useful energy
landscape: If a cluster center feels the quadratic
potential of all data points there will only be one
minimum in the energy landscape and that will be the
mean of all data points - a trivial result. Let us
therefore analyze what caused the non-trivial result in
the k-medoids / k-means case: Each cluster center
only interacted with data points that were close to it,
namely in the same cluster. A natural idea is therefore
to limit the range of attraction of data points
independently of any cluster shapes or locations.
Limiting the range of an attraction corresponds to
letting the potential approach a constant at large
distances. We therefore look for a potential that is
quadratic for small distances and approaches a
constant for large ones. A natural choice for such a
potential is a Gaussian function.
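To make this choice explicit (the expansion below is our addition), a Gaussian potential of width σ is indeed quadratic for distances r much smaller than σ and flattens out to a constant for r much larger than σ:

$$-e^{-\frac{r^2}{2\sigma^2}} \approx -1 + \frac{r^2}{2\sigma^2} \quad (r \ll \sigma), \qquad -e^{-\frac{r^2}{2\sigma^2}} \to 0 \quad (r \gg \sigma)$$

The constant offset is exactly the kind of shift that, as noted below, does not affect the optimization.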
Figure 2: Energy landscape (black) and potential of individual data items (gray) for a Gaussian influence function analogous to DENCLUE

The potential we have motivated can easily be identified with a Gaussian influence function in the density-based algorithm DENCLUE [6]. Note that the constant shifts and opposite sign that distinguish the potential arising from our motivation from the one used in DENCLUE do not affect the optimization problem. Similarly, we can identify the energy with the (negative of the) density landscape in DENCLUE. This observation allows us to draw immediate conclusions on the quality of cluster centers generated by our approach: DENCLUE cluster centers have been shown to be as useful as k-medoid ones [6]. We would not expect them to be identical because we are solving a slightly different problem, but it is not clear a priori which definition of a cluster center is going to result in a better clustering quality.

3. Idea

Having motivated a uniform view of partition clustering as a search for equilibrium of cluster centers in an energy landscape of attracting data items, we now proceed to describe our own algorithm. It is clear that the computational complexity of a problem that can be solved for each cluster center independently will be significantly smaller than the complexity of minimizing a function of all cluster centers. The state space that has to be searched for a k-medoid based algorithm must scale as the square of the number of sites that are considered valid cluster centers, because each new choice of one cluster center will change the cost or "energy" for all others. Decoupling cluster centers immediately reduces the complexity to being linear in the search space. Using a Gaussian influence function or potential achieves the goal of decoupling cluster centers while leading to results that have proven useful in the context of the density-based clustering method DENCLUE.

3.1. Searching for Equilibrium Locations of Cluster Centers

We view the clustering process as a search for equilibrium in an energy landscape that is given by the sum of the Gaussian influences of all data points $X^{(i)}$:

$$E(X) = -\sum_i e^{-\frac{d(X, X^{(i)})^2}{2\sigma^2}}$$

where the distance d is taken to be the Euclidean distance calculated in the d-dimensional space of all attributes:

$$d(X, X^{(i)}) = \sqrt{\sum_{j=1}^{d} \left( x_j - x_j^{(i)} \right)^2}$$
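The following minimal sketch (our illustration in Python; it evaluates the sums directly from the data points rather than through the P-tree machinery described later, and the data and σ are made up) shows how the energy landscape defined by these formulas can be evaluated:

```python
# Sketch: direct evaluation of the Gaussian energy landscape E(X).
import numpy as np

def energy(x, data, sigma):
    """E(X) = -sum_i exp(-d(X, X_i)^2 / (2 sigma^2)) for a query point x."""
    sq_dist = ((data - x) ** 2).sum(axis=1)          # squared distances d(X, X_i)^2
    return float(-np.exp(-sq_dist / (2.0 * sigma ** 2)).sum())

rng = np.random.default_rng(1)
# Two artificial clusters in a 2-dimensional attribute space.
data = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                  rng.normal(5.0, 0.5, size=(50, 2))])

# The energy near a cluster center is deeper (more negative) than far away from it.
print(energy(np.array([0.0, 0.0]), data, sigma=1.0))
print(energy(np.array([2.5, 2.5]), data, sigma=1.0))
```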
It is important to note that the improvement in efficiency from using this configuration-independent function rather than the k-medoids cost function is unrelated to the representation of the data points.
DENCLUE takes the density-based approach of
representing data points in the space of their attribute
values. This design choice does not automatically
follow from a configuration-independent influence
function.
In fact one could envision a very simple
implementation of this idea in which starting points
are chosen in a random or equidistant fashion and a
hill-climbing method is implemented that optimizes all
cluster center candidates simultaneously using one
database scan in each optimization step. In this paper
we will describe a method that uses a general-purpose data structure, namely a P-tree, which gives us fast value-based access to counts. The benefits of this
implementation are that starting positions can be
chosen efficiently, and optimization can be done for
one cluster center candidate at a time, allowing a more
informed choice of the number of candidates.
As a parameter for our minimization we have to choose the width of the Gaussian function, σ, which specifies the range for which the potential approximates a quadratic function. Two Gaussian functions have to be at least 2σ apart to be separated by a minimum. That means that the smallest clusters we can get have diameter 2σ. The number of clusters that our algorithm finds will be determined by σ rather than being predefined as for k-medoids / k-means. For areas in which data points are more widely spaced than 2σ, each data point would be considered an individual cluster. This is undesirable since widely spaced points are likely to be due to noise. We will exclude them and group the corresponding data points with the nearest larger cluster instead. In our algorithm this corresponds to ignoring starting points for which the potential minimum is not deeper than a given negative threshold.
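A small check of the 2σ separation rule (again our own sketch; the grid, σ, and separations are arbitrary):

```python
# Two Gaussian influences only produce separate energy minima when their
# centers are more than 2 sigma apart.
import numpy as np

def energy_1d(xs, centers, sigma):
    """E(x) = -sum_i exp(-(x - c_i)^2 / (2 sigma^2)), evaluated on a grid xs."""
    return -np.exp(-(xs[:, None] - centers[None, :]) ** 2
                   / (2 * sigma ** 2)).sum(axis=1)

def count_minima(values):
    """Number of interior local minima of a sampled curve."""
    return int(((values[1:-1] < values[:-2]) & (values[1:-1] < values[2:])).sum())

sigma = 1.0
xs = np.linspace(-6, 6, 2001)
for separation in (1.6, 3.0):                     # distance between the two centers
    centers = np.array([-separation / 2, separation / 2])
    print(separation, count_minima(energy_1d(xs, centers, sigma)))
# Expected: one minimum for separation 1.6 (< 2 sigma), two for separation 3.0 (> 2 sigma).
```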
3.2. Defining Cluster Boundaries
Our algorithm fundamentally differs from DENCLUE
in that we will not try to map out the entire space. We
will instead rely on estimates as to whether our
sampling of space is sufficient for finding all or nearly
all minima. That means that we replace the complete
mapping in DENCLUE with a search strategy. As a
consequence we will get no information on cluster
boundaries. Center-defined DENCLUE considers
data points as cluster members if they are density
attracted to the cluster center. This approach is
consistent in the framework of density-based
clustering but there are drawbacks to the definition.
For many applications it is hard to justify assigning a data point to a cluster other than the one defined by the cluster center it is closest to. If
one cluster has many members in the data set used
while clustering it will appear larger than its neighbors
with few members. Placing the cluster boundary
according to the attraction regions would only be
appropriate if we could be sure that the distribution
will remain the same for any future data set. This is a
stronger assumption than the general clustering
hypothesis that the cluster centers that represent data
points well will be the same.
We will therefore keep the k-means / k-medoids
definition of a cluster boundary that always places
data points with the cluster center they are closest to.
Not only does this approach avoid the expensive step of precisely mapping out the shape of clusters, it also allows us to determine cluster membership by a simple distance calculation, without the need to refer to an extensive map.
3.3. Selecting Starting Points
An important benefit of using proximity as the
definition for cluster membership is that we can
choose starting points for the optimization that are
already high in density and thereby reduce the number
of optimization steps.
Our heuristic for finding good starting points is as follows: We start by breaking up the n-dimensional hyper space into 2^n hyper cubes by dividing each dimension into two equally sized sections. We select
the hypercube with the largest total count of data
points while keeping track of the largest counts we are
ignoring. The process is iteratively repeated until we
reach the smallest granularity that our representation
affords. The result specifies our first starting point.
To find the next starting point we select the largest
count we have ignored so far and repeat the iterative
process.
We compare total counts, which are
commonly larger at higher levels. Therefore starting
points are likely to be in different high-level hyper
cubes, i.e. well separated. This is desirable since
points that have high counts but are close are likely to
belong to the same cluster. We continue the process
of deriving new starting points until minimization
consistently terminates in cluster centers that have
been discovered previously.
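A minimal sketch of this heuristic (our illustration: the counts come from a plain n-dimensional array rather than from P-trees, and all function and variable names are ours):

```python
# Sketch of the starting-point heuristic: follow the largest count down the
# hierarchy of hyper cubes, keeping track of the counts that were ignored.
import heapq
import itertools
import numpy as np

_tie = itertools.count()         # tie-breaker so equal counts never compare slices

def _descend(counts, cube, ignored):
    """Follow the sub-cube with the largest total count down to the finest
    granularity, remembering every sub-cube (and its count) that is ignored."""
    while any(sl.stop - sl.start > 1 for sl in cube):
        halves = [[slice(sl.start, (sl.start + sl.stop) // 2),
                   slice((sl.start + sl.stop) // 2, sl.stop)] for sl in cube]
        children = sorted(((int(counts[sub].sum()), sub)
                           for sub in itertools.product(*halves)),
                          key=lambda c: -c[0])
        cube = children[0][1]                                   # largest count wins
        for c, sub in children[1:]:
            if c > 0:
                heapq.heappush(ignored, (-c, next(_tie), sub))  # remember the rest
    return tuple(sl.start for sl in cube)

def starting_points(counts, num_points):
    """First starting point from the root; each further one from the largest
    count ignored so far, as described in the text."""
    ignored, root = [], tuple(slice(0, s) for s in counts.shape)
    points = [_descend(counts, root, ignored)]
    while ignored and len(points) < num_points:
        _, _, cube = heapq.heappop(ignored)
        points.append(_descend(counts, cube, ignored))
    return points

# Toy example: a 2-dimensional attribute space quantized to an 8x8 grid of counts.
grid = np.zeros((8, 8), dtype=int)
grid[1, 1], grid[6, 5] = 40, 25
print(starting_points(grid, num_points=2))        # [(1, 1), (6, 5)]
```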
4. Algorithm using P-Trees
We describe an implementation that uses P-trees, a
data structure that has been shown to be appropriate
for many data mining tasks [9,10,11].
4.1. A Summary of P-Tree Features
P-trees represent a non-key attribute in the domain
space of the key attributes of a relation. One P-tree
corresponds to one bit of one non-key attribute. It
maintains pre-computed counts in a hierarchical
fashion. Total counts for given non-key attribute
values or ranges can easily be computed by an "and"
operation on separate P-trees. This "and" operation
can also be used to derive counts at different levels
within the tree, corresponding to different values or
ranges of the key attribute.
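As a rough illustration of value-based counting through "and" operations (our simplification: flat, uncompressed bit vectors stand in for the hierarchical, compressed P-tree structure, and the attribute values are made up):

```python
# Simplified stand-in for P-trees: one boolean vector per bit of an attribute;
# counts for a given value are obtained by AND-ing the bit slices.
import numpy as np

def bit_slices(values, num_bits):
    """One boolean vector per bit of a non-key attribute, highest-order bit first."""
    return [((values >> b) & 1).astype(bool) for b in range(num_bits - 1, -1, -1)]

def count_value(slices, value, num_bits):
    """Count the tuples equal to `value` by AND-ing the bit slices
    (or their complements) that correspond to the bits of `value`."""
    mask = np.ones_like(slices[0], dtype=bool)
    for b, sl in zip(range(num_bits - 1, -1, -1), slices):
        mask &= sl if (value >> b) & 1 else ~sl
    return int(mask.sum())

attribute = np.array([3, 5, 3, 7, 2, 3], dtype=np.uint8)
slices = bit_slices(attribute, num_bits=3)
print(count_value(slices, 3, num_bits=3))         # 3 tuples have the value 3
```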
Two situations can be distinguished: The keys of a
relation may either be data mining relevant themselves
or they may only serve to distinguish tuples.
Examples of relations in which the keys are not data
mining relevant are data streams where time is the key
of the relation but will often not be included as a
dimension in data mining. Similarly, in spatial data
mining, geographical location is usually the key of the
relation but is commonly not included in the mining
process. In other situations the keys of the relation do
themselves contain data mining relevant information.
Examples of such relations are fact tables in data
warehouses, where one fact, such as "sales", is given
in the space of all attributes that are expected to affect
it. Similarly, in genomics, gene expression data is
commonly represented within the space of parameters
that are expected to influence it. In this case the P-tree
based representation is similar to density-based
representations such as the one used in DENCLUE.
One challenge of such a representation is that it has
high demands on storage. P-trees use efficient
compression in which subtrees that consist entirely of
0 values or entirely of 1 values are removed.
Additional problems may arise if data is represented in
the space of non-key attributes: information may be
lost because the same point in space may represent
many tuples. For a P-tree based approach we only
represent data in the space of the data mining relevant
attributes if these attributes are keys. In either case we
get fast value-based access to counts through "and"
operations.
P-trees not only give us access to counts based on the
values of attributes in individual tuples, they also
contain information on counts at any level in a
hierarchy of hyper cubes where each level corresponds
to a refinement by a factor of two in each attribute
dimension. Note that this hierarchy is not necessarily
represented as the tree structure of the P-tree. For
attributes that are not themselves keys to a relation,
the values in the hierarchy are derived as needed by
successive "and" operations, starting with the highest
order bit, and successively "and"ing with lower order
bits.
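A short sketch of this hierarchical use of the bit slices (same simplified stand-in as above, kept self-contained): AND-ing in one more high-order bit per step moves one level down the hierarchy, i.e. halves the value range under consideration.

```python
# Counts at coarser hierarchy levels via successive ANDs, highest-order bit first.
import numpy as np

def prefix_count(slices, prefix_bits):
    """Count the tuples whose high-order bits match `prefix_bits`;
    an empty prefix corresponds to the top of the hierarchy (all tuples)."""
    mask = np.ones_like(slices[0], dtype=bool)
    for bit, sl in zip(prefix_bits, slices):      # successive "and" operations
        mask &= sl if bit else ~sl
    return int(mask.sum())

attribute = np.array([3, 5, 3, 7, 2, 3], dtype=np.uint8)
slices = [((attribute >> b) & 1).astype(bool) for b in (2, 1, 0)]   # high bit first
print(prefix_count(slices, []))        # 6: whole range 0..7
print(prefix_count(slices, [0]))       # 4: values 0..3
print(prefix_count(slices, [0, 1]))    # 4: values 2..3
```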
4.2. The Algorithm
Our algorithm has two steps that are iterated for each
possible cluster center. The goal of the first step is to
find a good initial starting point by looking for an
attribute combination with a high count in an area that
has not previously been used. Note that in a situation
where the entire key of the relation is among the data
mining relevant attributes, the count of an individual
attribute combination can only be 0 or 1. In such a
case we stay at a higher level in the hierarchy when
determining a good starting point. In order to find a
point with a high count, we start at the highest level in
the hierarchy for the combination of all attributes that
we consider. At every level we select the highest
count while keeping track of other high counts that we
ignore. This gives us a suitable starting point in b
steps where b is the number of levels in the hierarchy
or the number of bits of the respective attributes. We
directly use the new starting point to find the nearest
cluster center.
As the minimization step we evaluate neighboring points, at a distance (step size) s, in the energy landscape.
This requires the fast evaluation of a superposition of
Gaussian functions. We intervalize distances and then
calculate counts within the respective intervals. We
use equality in the HOBit distance [11] to define
intervals. The HOBit distance between two integer
coordinates is equal to the number of digits by which
they have to be right shifted to make them identical.
The number of intervals that have distinct HOBit
distances is equal to the number of bits of the
represented numbers. For more than one attribute, the HOBit distance is defined as the maximum of the individual HOBit distances of the attributes.
The range of all data items with a HOBit distance smaller than or equal to d_H corresponds to a d-dimensional hyper cube that is part of the concept
hierarchy that P-trees represent. Therefore counts can
be efficiently calculated by "and"ing P-trees. The
calculation is done in analogy to a Podium Nearest
Neighbor classification algorithm using P-trees [12].
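A small sketch of the HOBit distance as defined above (our code; the multi-attribute case simply takes the maximum over the attributes):

```python
def hobit_distance(a, b):
    """Number of right shifts needed to make the integers a and b identical."""
    shifts = 0
    while a != b:
        a, b, shifts = a >> 1, b >> 1, shifts + 1
    return shifts

def hobit_distance_multi(x, y):
    """HOBit distance of two attribute vectors: the maximum over the attributes."""
    return max(hobit_distance(a, b) for a, b in zip(x, y))

print(hobit_distance(12, 15))                   # 2: 1100 and 1111 agree after 2 shifts
print(hobit_distance_multi((12, 3), (15, 3)))   # 2
```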
Once we have calculated the weighted number of
points for a given location in attribute space as well as
for 2 neighboring points in each dimension (distance
s) we can proceed in standard hill-climbing fashion.
We replace the central point with the neighbor that has the lowest energy, provided it lowers the energy. If no
point has lower energy, the step size s is reduced. If
the step size is already at its minimum we consider the
point a cluster center. If the cluster center has already
been found in a previous minimization we ignore it. If
we repeatedly rediscover old cluster centers we stop.
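A minimal sketch of this minimization loop (our illustration: the energy is evaluated directly as a sum of Gaussians rather than through the HOBit-interval counts described above, and σ, the step sizes, and the data are made up):

```python
# Sketch of the hill-climbing minimization with step-size refinement.
import numpy as np

def energy(x, data, sigma):
    """E(X) = -sum_i exp(-d(X, X_i)^2 / (2 sigma^2))."""
    return float(-np.exp(-((data - x) ** 2).sum(axis=1) / (2 * sigma ** 2)).sum())

def hill_climb(start, data, sigma, step=1.0, min_step=0.01):
    """Move to the lowest-energy axis neighbor at distance `step`;
    if no neighbor is lower, halve the step until it reaches `min_step`."""
    x = np.asarray(start, dtype=float)
    d = len(x)
    while step >= min_step:
        # 2 neighboring points in each dimension (distance `step`).
        neighbors = [x + s * step * np.eye(d)[j] for j in range(d) for s in (-1.0, 1.0)]
        best = min(neighbors, key=lambda n: energy(n, data, sigma))
        if energy(best, data, sigma) < energy(x, data, sigma):
            x = best                      # accept the lower-energy neighbor
        else:
            step /= 2.0                   # no improvement: refine the step size
    return x                              # candidate cluster center

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])
print(hill_climb([3.0, 3.0], data, sigma=1.0))    # converges near (4, 4)
```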
5. Conclusions
We have shown how the complexity of the k-medoids
clustering algorithm can be addressed at a
fundamental level and proposed an algorithm that
makes use of our suggested modification. In the k-medoids algorithm a cost function is calculated in
which the contribution of any one cluster center
depends on every other cluster center.
This
dependency can be avoided if the influence of far-away data items is limited in a configuration-independent fashion. We suggest using a Gaussian
function for this purpose and identify it with the
Gaussian influence in the density-based clustering
algorithm DENCLUE. Our derivation allows us to
separate the representation issues that distinguish
density-based algorithms from partition-based ones
from the fundamental complexity issues that follow
from the definition of the minimization problem. We
suggest an implementation that uses the most efficient
aspects of both approaches. Using P-trees in our
implementation allows us to further improve
efficiency.
References
[1] N. Goodman, S. Rozen, and L. Stein, "A glimpse at the DBMS challenges posed by the human genome project", 1994, http://citeseer.nj.nec.com/goodman94glimpse.html.
[2] L. Kaufman and P. J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis", New York: John Wiley & Sons, 1990.
[3] R. Ng and J. Han, "Efficient and effective clustering method for spatial data mining", In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB'94), pp. 144-155, Santiago, Chile, Sept. 1994.
[4] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification", In Proc. 4th Int. Symp. Large Spatial Databases (SSD'95), pp. 67-82, Portland, ME, Aug. 1995.
[5] P. Bradley, U. Fayyad, and C. Reina, "Scaling clustering algorithms to large databases", In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pp. 9-15, New York, Aug. 1998.
[6] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise", In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pp. 58-65, New York, Aug. 1998.
[7] M. Dash, H. Liu, and X. Xu, "1+1>2: Merging distance and density based clustering", http://citeseer.nj.nec.com/425805.html.
[8] J. MacQueen, "Some methods for classification and analysis of multivariate observations", Proc. 5th Berkeley Symp. Math. Statist. Prob., 1:281-297, 1967.
[9] Qin Ding, Maleq Khan, Amalendu Roy, and William Perrizo, "P-tree Algebra", ACM Symposium on Applied Computing (SAC'02), Madrid, Spain, 2002.
[10] Qin Ding, Qiang Ding, and William Perrizo, "Association rule mining on remotely sensed images using P-trees", PAKDD-2002, Taipei, Taiwan, 2002.
[11] Maleq Khan, Qin Ding, and William Perrizo, "K-nearest neighbor classification of spatial data streams using P-trees", PAKDD-2002, Taipei, Taiwan, May 2002.
[12] Willy Valdivia-Granda, Edward Deckard, William Perrizo, Qin Ding, Maleq Khan, Qiang Ding, and Anne Denton, "Biological systems and data mining for phylogenomic expression profiling", submitted to ACM SIGKDD 2002.