Searching for Centers: An Efficient Approach to the Clustering of Large Data Sets Using P-trees

Abstract

With ever-increasing data-set sizes in most data mining applications, speed remains a central goal in clustering. We present an approach that avoids both the time complexity of partition-based algorithms and the storage requirements of density-based ones, while resting on the same fundamental premises as standard partition- and density-based algorithms. Our idea is motivated by an unconventional perspective that puts three of the most popular clustering algorithms, k-medoids, k-means, and center-defined DENCLUE, into the same context. We suggest an implementation of our idea that uses P-trees¹ for efficient value-based data access.

1. Introduction

Many things change in data mining applications, but one fact reliably stays the same: next year's problems will involve larger data sets and tougher performance requirements than this year's. The data sets available to biological applications keep growing, since there is commonly no reason to remove old data and experiments worldwide contribute new data. Network applications operate on massive amounts of data, and there is no limit in sight to the increase in network traffic, the increase in the detail of information that should be kept and evaluated, and the demands on the speed with which data should be analyzed. The World-Wide Web constitutes another data mining area with continuously growing "data set" sizes. The list could be continued almost indefinitely.

It is therefore of utmost importance to see where the scaling behavior of standard algorithms may be improved without losing their benefits, so as not to make them obsolete over time. A clustering technique that has prompted much research in this direction is the k-medoids algorithm.
Although it has a simple justification and useful clustering properties, its default time complexity, which is proportional to the square of the number of data items, makes it unsuited to large data sets. Many improvements have been implemented [3,4,5], such as CLARA and CLARANS, but they do not address the fundamental issue: the algorithm inherently depends on the combined choice of cluster centers, and its complexity must therefore scale essentially as the square of the number of investigated sites. In this paper we analyze the origins of this unfavorable scaling and show how it can be eliminated at a fundamental level. Our idea is to make the criterion for a "good" cluster center independent of the locations of all other cluster centers. At a fundamental level this replaces the quadratic dependency on the number of investigated sites by a linear dependency.

We note that our proposed solution is not entirely new but can be seen as implemented in the density-based clustering algorithm DENCLUE, albeit with a different justification. This allows us to separate representation issues from a more fundamental complexity question when discussing the concept of an influence function as introduced in DENCLUE.

¹ P-tree technology is patented to North Dakota State University.

2. Taking a Fresh Look at Established Algorithms

Partition-based and density-based algorithms are commonly seen as fundamentally and technically distinct, and proposed combinations work on an applied rather than a fundamental level. We will present three of the most popular techniques from both categories in a context that allows us to see their common idea independently of their implementation. This will allow us to combine elements from each of them and design an algorithm that is fast without requiring any clustering-specific data structures.
The existing algorithms we consider in detail are the k-medoids and k-means partitioning techniques and the center-defined version of DENCLUE. The goal of these algorithms is to group a data item with a cluster center that represents its properties well. The clustering process has two parts that are strongly related for the algorithms we review, but will be separated for our clustering algorithm. The first part consists of finding cluster centers, while the second specifies the boundaries of the clusters. We first look at strategies that are used to determine cluster centers. Since the k-medoids algorithm is commonly seen as producing a useful clustering, we start by reviewing its definition.

2.1. K-Medoids Clustering as a Search for Equilibrium

A good clustering in k-medoids is defined through the minimum of a cost function. The most common choice of cost function is the sum of squared Euclidean distances between each data item and its closest cluster center. An alternative way of looking at this definition borrows ideas from physics: we can look at cluster centers as particles that are attracted to the data points. The potential that describes the attraction for each data item is taken to be a quadratic function of the Euclidean distance as defined in the d-dimensional space of all attributes. The energy landscape surrounding a cluster center with position $X^{(m)}$ is the sum of the individual potentials of data items at locations $X^{(i)}$:

$$E(X^{(m)}) = \sum_{i=1}^{N} \sum_{j=1}^{d} \left( x_j^{(i)} - x_j^{(m)} \right)^2$$

where N is the number of data items that are assumed to influence the cluster center. We defer the discussion of the influence of cluster boundaries until later. It can be seen that the potential that arises from more than one data point will continue to be quadratic, since the sum of quadratic functions is again a quadratic function. We can calculate the location of its minimum as follows:

$$\frac{\partial}{\partial x_j^{(m)}} E(X^{(m)}) = -2 \sum_{i=1}^{N} \left( x_j^{(i)} - x_j^{(m)} \right) = 0$$

Therefore the minimum of the potential is the mean of the coordinates of the data points to which the center is attracted:

$$x_j^{(m)} = \frac{1}{N} \sum_{i=1}^{N} x_j^{(i)}$$

This result suggests that the potential minima, and thereby the equilibrium positions for the cluster centers in the k-medoids algorithm, should be the mean, or rather the data item closest to it, given the constraint that k-medoids cluster centers must be data items. This may seem surprising, since the k-medoids algorithm is known to be significantly more robust than k-means, which explicitly takes means as cluster centers. In order to understand this seemingly inconsistent result we have to remember that the k-medoids and k-means algorithms look not only for an equilibrium position of the cluster center within any one cluster, but rather for a minimum of the total energy of all cluster centers. To understand this we now have to look at how cluster boundaries would be modeled in the analogous physical system. Since data items do not attract cluster centers that are located outside of their cluster, we model their potential as being quadratic within a cluster and continuing as a constant outside a cluster. Constant potentials are irrelevant for the calculation of forces and can be ignored.

Figure 1: Energy landscape (black) and potential of individual data items (gray) in k-medoids and k-means

Cluster boundaries are given as the points of equal distance to the closest cluster centers. This means that for the k-means and k-medoids algorithms the energy landscape depends on the cluster centers. The difference between k-means and k-medoids lies in the way the system of all cluster centers is updated. If cluster centers change, the energy landscape will also change.
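The derivation of the potential minimum in Section 2.1 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names `quadratic_energy` and `medoid` and the small data set are illustrative choices, not part of the original text. It evaluates the quadratic energy at the mean and shows that, within a single fixed cluster, the data item with the lowest energy is the one closest to the mean.

```python
import numpy as np

def quadratic_energy(center, points):
    """Sum of squared Euclidean distances from `center` to every row of `points`."""
    diffs = points - center
    return float(np.sum(diffs ** 2))

def medoid(points):
    """Data item with the lowest quadratic energy (brute force over all items)."""
    energies = [quadratic_energy(p, points) for p in points]
    return points[int(np.argmin(energies))]

# Illustrative data: four items near the origin and one outlier.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 4.0], [1.0, 1.0]])

# Unconstrained minimum of the quadratic energy: the coordinate-wise mean.
mean = points.mean(axis=0)

# Constrained minimum (centers must be data items): the item closest to the mean.
closest_to_mean = points[int(np.argmin(np.sum((points - mean) ** 2, axis=1)))]

print("mean:", mean, "energy at mean:", quadratic_energy(mean, points))
print("medoid:", medoid(points), "item closest to mean:", closest_to_mean)
```

The agreement between `medoid(points)` and `closest_to_mean` holds only for one cluster with fixed membership. It says nothing about how the full set of centers is updated, which is where k-medoids and k-means differ.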
The k-means algorithm moves cluster centers to the current mean of the data points and thereby corresponds to a simple hill-climbing algorithm for the minimization of the total energy of all cluster centers. (Note that the term "hill-climbing" refers to a search for maxima, whereas we are looking for minima.) For the k-medoids algorithm the attempt is made to explore the space of all cluster-center locations completely. The reason why the k-medoids algorithm is so much more robust than k-means can therefore be traced to their different update strategies.

2.2. Replacing a Many-Body Problem by a Single-Body Problem

We have now seen that the energy landscape that cluster centers feel depends on the cluster boundaries, and thereby on the location of all other cluster centers. In the language of physics, the problem that has to be solved is a many-body problem, because many cluster centers are simultaneously involved in the minimization. Recognizing the inherent complexity of many-body problems, we consider ways of redefining our problem such that we can look at one cluster center at a time, i.e., replacing the many-body problem with a single-body problem. Our first idea may be to simply ignore all but one cluster center while keeping the quadratic potential of all data points. Clearly this is not going to provide us with a useful energy landscape: if a cluster center feels the quadratic potential of all data points, there will only be one minimum in the energy landscape, and that will be the mean of all data points, a trivial result. Let us therefore analyze what caused the non-trivial result in the k-medoids / k-means case: each cluster center only interacted with data points that were close to it, namely in the same cluster. A natural idea is therefore to limit the range of attraction of data points independently of any cluster shapes or locations. Limiting the range of an attraction corresponds to letting the potential approach a constant at large distances.
We therefore look for a potential that is quadratic for small distances and approaches a constant for large ones. A natural choice for such a potential is a Gaussian function.

Figure 2: Energy landscape (black) and potential of individual data items (gray) for a Gaussian influence function analogous to DENCLUE

The potential we have motivated can easily be identified with a Gaussian influence function in the density-based algorithm DENCLUE. Note that the constant shift and opposite sign that distinguish the potential arising from our motivation from the one used in DENCLUE do not affect the optimization problem. Similarly, we can identify the energy with the (negative of the) density landscape in DENCLUE. This observation allows us to draw immediate conclusions on the quality of cluster centers generated by our approach: DENCLUE cluster centers have been shown to be as useful as k-medoid ones. We would not expect them to be identical because we are solving a slightly different problem, but it is not clear a priori which definition of a cluster center is going to result in a better clustering quality.

3. Idea

Having motivated a uniform view of partition clustering as a search for equilibrium of cluster centers in an energy landscape of attracting data items, we now proceed to describe our own algorithm. It is clear that the computational complexity of a problem that can be solved for each cluster center independently will be significantly smaller than the complexity of minimizing a function of all cluster centers. The state space that has to be searched for a k-medoid based algorithm must scale as the square of the number of sites that are considered valid cluster centers, because each new choice of one cluster center will change the cost or "energy" for all others. Decoupling cluster centers immediately reduces the complexity to being linear in the search space. Using a Gaussian influence function or potential achieves the goal of decoupling cluster centers while leading to results that have been proven useful in the context of the density-based clustering method DENCLUE.

3.1. Searching for Equilibrium Locations of Cluster Centers

We view the clustering process as a search for equilibrium in an energy landscape that is given by the sum of the Gaussian influences of all data points $X^{(i)}$:

$$E(X) = -\sum_{i} e^{-\frac{\left( d(X, X^{(i)}) \right)^2}{2\sigma^2}}$$

where the distance d is taken to be the Euclidean distance calculated in the d-dimensional space of all attributes:

$$d(X, X^{(i)}) = \sqrt{\sum_{j=1}^{d} \left( x_j - x_j^{(i)} \right)^2}$$

It is important to note that the improvement in efficiency gained by using this configuration-independent function rather than the k-medoids cost function is unrelated to the representation of the data points. DENCLUE takes the density-based approach of representing data points in the space of their attribute values. This design choice does not automatically follow from a configuration-independent influence function. In fact, one could envision a very simple implementation of this idea in which starting points are chosen in a random or equidistant fashion and a hill-climbing method is implemented that optimizes all cluster center candidates simultaneously using one database scan in each optimization step. In this paper we will describe a method that uses a general-purpose data structure, namely a P-tree, that gives us fast value-based access to counts. The benefits of this implementation are that starting positions can be chosen efficiently, and optimization can be done for one cluster center candidate at a time, allowing a more informed choice of the number of candidates.

As a parameter for our minimization we have to choose the width of the Gaussian function, $\sigma$. The width $\sigma$ specifies the range for which the potential approximates a quadratic function. Two Gaussian functions have to be at least $2\sigma$ apart to be separated by a minimum.
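The Gaussian energy landscape of Section 3.1 and the role of the width $\sigma$ can be illustrated with a minimal sketch, assuming NumPy. The names `gaussian_energy` and `has_barrier_between`, the finite-difference step `h`, and the one-dimensional two-point test case are assumptions made for illustration only. The second function checks whether the energy at the midpoint between two data items curves downward, i.e. whether the two items are separated in the landscape.

```python
import numpy as np

def gaussian_energy(x, points, sigma):
    """E(x) = -sum_i exp(-d(x, x_i)^2 / (2 sigma^2)) with Euclidean distance d."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    points = np.asarray(points, dtype=float).reshape(len(points), -1)
    sq_dist = np.sum((points - x) ** 2, axis=1)
    return float(-np.sum(np.exp(-sq_dist / (2.0 * sigma ** 2))))

def has_barrier_between(separation, sigma, h=1e-3):
    """True if two 1-D data items `separation` apart are divided by an energy barrier.

    Uses a central second difference at the midpoint: a negative value means
    the energy has a local maximum there, so each item keeps its own minimum.
    """
    pts = [[-separation / 2.0], [separation / 2.0]]
    e_mid = gaussian_energy([0.0], pts, sigma)
    e_plus = gaussian_energy([h], pts, sigma)
    e_minus = gaussian_energy([-h], pts, sigma)
    return e_plus + e_minus - 2.0 * e_mid < 0.0

print(has_barrier_between(3.0, sigma=1.0))  # items 3 sigma apart
print(has_barrier_between(1.0, sigma=1.0))  # items 1 sigma apart
```

In this sketch, items closer than $2\sigma$ merge into a single energy minimum, while items farther apart each retain their own.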
That means that the smallest clusters we can get have diameter $2\sigma$. The number of clusters that our algorithm finds will be determined by $\sigma$ rather than being predefined as for k-medoids / k-means. For areas in which data points are more widely spaced than $2\sigma$, each data point would be considered an individual cluster. This is undesirable since widely spaced points are likely to be due to noise. We will exclude them and group the corresponding data points with the nearest larger cluster instead. In our algorithm this corresponds to ignoring starting points for which the potential minimum is not deeper than a (negative) threshold value.

3.2. Defining Cluster Boundaries

Our algorithm fundamentally differs from DENCLUE in that we will not try to map out the entire space. We will instead rely on estimates as to whether our sampling of space is sufficient for finding all or nearly all minima. That means that we replace the complete mapping in DENCLUE with a search strategy. As a consequence we will get no information on cluster boundaries. Center-defined DENCLUE considers data points as cluster members if they are density attracted to the cluster center. This approach is consistent in the framework of density-based clustering, but there are drawbacks to the definition. For many applications it is hard to argue that a data point should be considered as belonging to a cluster other than the one defined by the cluster center it is closest to. If one cluster has many members in the data set used while clustering, it will appear larger than its neighbors with few members. Placing the cluster boundary according to the attraction regions would only be appropriate if we could be sure that the distribution will remain the same for any future data set. This is a stronger assumption than the general clustering hypothesis that the cluster centers that represent data points well will be the same.
We will therefore keep the k-means / k-medoids definition of a cluster boundary, which always places data points with the cluster center they are closest to. Not only does this approach avoid the expensive step of precisely mapping out the shape of clusters, it also allows us to determine cluster membership by a simple distance calculation, without the need to refer to an extensive map.

3.3. Selecting Starting Points

An important benefit of using proximity as the definition for cluster membership is that we can choose starting points for the optimization that are already high in density and thereby reduce the number of optimization steps. Our heuristic for finding good starting points is as follows: we start by breaking up the n-dimensional hyperspace into $2^n$ hypercubes by dividing each dimension into two equally sized sections. We select the hypercube with the largest total count of data points while keeping track of the largest counts we are ignoring. The process is iteratively repeated until we reach the smallest granularity that our representation affords. The result specifies our first starting point. To find the next starting point we select the largest count we have ignored so far and repeat the iterative process. We compare total counts, which are commonly larger at higher levels. Therefore starting points are likely to be in different high-level hypercubes, i.e., well separated. This is desirable since points that have high counts but are close to each other are likely to belong to the same cluster. We continue the process of deriving new starting points until minimization consistently terminates in cluster centers that have been discovered previously.

4. Algorithm using P-Trees

We describe an implementation that uses P-trees, a data structure that has been shown to be appropriate for many data mining tasks [9,10,11].

4.1. A Summary of P-Tree Features

P-trees represent a non-key attribute in the domain space of the key attributes of a relation.
One P-tree corresponds to one bit of one non-key attribute. It maintains pre-computed counts in a hierarchical fashion. Total counts for given non-key attribute values or ranges can easily be computed by an "and" operation on separate P-trees. This "and" operation can also be used to derive counts at different levels within the tree, corresponding to different values or ranges of the key attribute.

Two situations can be distinguished: the keys of a relation may either be data mining relevant themselves or they may only serve to distinguish tuples. Examples of relations in which the keys are not data mining relevant are data streams, where time is the key of the relation but will often not be included as a dimension in data mining. Similarly, in spatial data mining, geographical location is usually the key of the relation but is commonly not included in the mining process. In other situations the keys of the relation do themselves contain data mining relevant information. Examples of such relations are fact tables in data warehouses, where one fact, such as "sales", is given in the space of all attributes that are expected to affect it. Similarly, in genomics, gene expression data is commonly represented within the space of parameters that are expected to influence it. In this case the P-tree based representation is similar to density-based representations such as the one used in DENCLUE. One challenge of such a representation is that it has high demands on storage. P-trees use efficient compression in which subtrees that consist entirely of 0 values or entirely of 1 values are removed. Additional problems may arise if data is represented in the space of non-key attributes: information may be lost because the same point in space may represent many tuples. For a P-tree based approach we only represent data in the space of the data mining relevant attributes if these attributes are keys. In either case we get fast value-based access to counts through "and" operations.
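The idea of obtaining value and range counts by "and"ing one bit-level structure per attribute bit can be sketched without the tree itself. The following is a simplified stand-in, not a P-tree implementation: it uses flat bit vectors stored in Python integers and omits the hierarchical pre-computed counts and the compression of pure-0 and pure-1 subtrees. The names `bit_slices` and `count_prefix` are hypothetical.

```python
def bit_slices(values, n_bits):
    """Build one bit vector per attribute bit.

    Bit r of slices[b] is set iff row r has bit b set (b = 0 is least significant).
    """
    slices = [0] * n_bits
    for row, v in enumerate(values):
        for b in range(n_bits):
            if (v >> b) & 1:
                slices[b] |= 1 << row
    return slices

def count_prefix(slices, n_rows, value, n_high_bits):
    """Count rows whose `n_high_bits` most significant bits match those of `value`.

    With n_high_bits == len(slices) this is an exact-value count; with fewer
    bits it is a count over the corresponding value range.
    """
    n_bits = len(slices)
    all_rows = (1 << n_rows) - 1
    mask = all_rows
    for b in range(n_bits - 1, n_bits - 1 - n_high_bits, -1):
        # Use the slice itself for a 1 bit and its complement for a 0 bit.
        wanted = slices[b] if (value >> b) & 1 else all_rows & ~slices[b]
        mask &= wanted
    return bin(mask).count("1")

values = [5, 7, 5, 2, 4, 5]            # one 3-bit attribute, six tuples
slices = bit_slices(values, n_bits=3)
print(count_prefix(slices, len(values), 5, 3))  # tuples with value exactly 5
print(count_prefix(slices, len(values), 5, 2))  # tuples in the range 4..5
print(count_prefix(slices, len(values), 5, 1))  # tuples in the range 4..7
```

Dropping low-order bits from the "and" widens the matched range by a factor of two per bit, which corresponds to moving up one level in the hierarchy of hypercubes.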
P-trees not only give us access to counts based on the values of attributes in individual tuples, they also contain information on counts at any level in a hierarchy of hypercubes, where each level corresponds to a refinement by a factor of two in each attribute dimension. Note that this hierarchy is not necessarily represented as the tree structure of the P-tree. For attributes that are not themselves keys to a relation, the values in the hierarchy are derived as needed by successive "and" operations, starting with the highest-order bit and successively "and"ing with lower-order bits.

4.2. The Algorithm

Our algorithm has two steps that are iterated for each possible cluster center. The goal of the first step is to find a good initial starting point by looking for an attribute combination with a high count in an area that has not previously been used. Note that in a situation where the entire key of the relation is among the data mining relevant attributes, the count of an individual attribute combination can only be 0 or 1. In such a case we stay at a higher level in the hierarchy when determining a good starting point. In order to find a point with a high count, we start at the highest level in the hierarchy for the combination of all attributes that we consider. At every level we select the highest count while keeping track of other high counts that we ignore. This gives us a suitable starting point in b steps, where b is the number of levels in the hierarchy or the number of bits of the respective attributes. We directly use the new starting point to find the nearest cluster center.

As a minimization step we evaluate neighboring points, at a distance (step size) s, in the energy landscape. This requires the fast evaluation of a superposition of Gaussian functions. We intervalize distances and then calculate counts within the respective intervals. We use equality in the HOBit distance to define intervals.
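The starting-point search described earlier in this section (descend the hierarchy by always taking the highest count, remember the counts that were passed over, and resume from the largest remembered count for the next starting point) can be sketched as follows. This is a hedged illustration only: counts are obtained by scanning a list of integer points rather than by "and"ing P-trees, and the names `child_counts` and `starting_points`, the heap-based bookkeeping, and the sample data are assumptions.

```python
import heapq
from collections import Counter

def child_counts(points, cell, level, n_bits):
    """Counts of points in each child hypercube of `cell`.

    A cell at level L is the tuple of the L high-order bits of every coordinate.
    A linear scan stands in here for the P-tree "and" operations.
    """
    shift = n_bits - (level + 1)
    counts = Counter()
    for p in points:
        if all((c >> (shift + 1)) == q for c, q in zip(p, cell)):  # p lies in cell
            counts[tuple(c >> shift for c in p)] += 1
    return counts

def starting_points(points, n_bits, how_many):
    """Return up to `how_many` finest-level cells with high counts."""
    dims = len(points[0])
    # Max-heap (via negated counts) of cells that were passed over so far.
    ignored = [(-len(points), 0, (0,) * dims)]
    found = []
    while ignored and len(found) < how_many:
        _, level, cell = heapq.heappop(ignored)
        while level < n_bits:
            ranked = child_counts(points, cell, level, n_bits).most_common()
            best = ranked[0][0]
            for other, count in ranked[1:]:
                heapq.heappush(ignored, (-count, level + 1, other))
            cell, level = best, level + 1
        found.append(cell)
    return found

# Two dense groups and one isolated point, coordinates in 0..7 (3 bits each).
data = [(1, 1), (1, 1), (1, 0), (0, 1),
        (6, 6), (6, 6), (6, 6), (7, 6), (6, 7),
        (1, 6)]
print(starting_points(data, n_bits=3, how_many=2))
```

Because the remembered counts from high levels are usually larger than those from low levels, the second starting point in this sketch comes from a different top-level hypercube than the first, in line with the separation argument of Section 3.3.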
The HOBit distance between two integer coordinates is equal to the number of binary digits by which they have to be right-shifted to make them identical. The number of intervals that have distinct HOBit distances is equal to the number of bits of the represented numbers. For more than one attribute, the HOBit distance is defined as the maximum of the individual HOBit distances of the attributes. The range of all data items with a HOBit distance smaller than or equal to dH corresponds to a d-dimensional hypercube that is part of the concept hierarchy that P-trees represent. Therefore counts can be efficiently calculated by "and"ing P-trees. The calculation is done in analogy to a Podium Nearest Neighbor classification algorithm using P-trees.

Once we have calculated the weighted number of points for a given location in attribute space, as well as for two neighboring points in each dimension (at distance s), we can proceed in standard hill-climbing fashion. We replace the central point with the point that has the lowest energy, assuming it lowers the energy. If no point has lower energy, the step size s is reduced. If the step size is already at its minimum, we consider the point a cluster center. If the cluster center has already been found in a previous minimization, we ignore it. If we repeatedly rediscover old cluster centers, we stop.

5. Conclusions

We have shown how the complexity of the k-medoids clustering algorithm can be addressed at a fundamental level and proposed an algorithm that makes use of our suggested modification. In the k-medoids algorithm a cost function is calculated in which the contribution of any one cluster center depends on every other cluster center. This dependency can be avoided if the influence of faraway data items is limited in a configuration-independent fashion. We suggest using a Gaussian function for this purpose and identify it with the Gaussian influence in the density-based clustering algorithm DENCLUE.
Our derivation allows us to separate the representation issues that distinguish density-based algorithms from partition-based ones from the fundamental complexity issues that follow from the definition of the minimization problem. We suggest an implementation that uses the most efficient aspects of both approaches. Using P-trees in our implementation allows us to further improve efficiency.

References

[1] N. Goodman, S. Rozen, and L. Stein, "A glimpse at the DBMS challenges posed by the human genome project", 1994, http://citeseer.nj.nec.com/goodman94glimpse.html.
[2] L. Kaufman and P.J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis", New York: John Wiley & Sons, 1990.
[3] R. Ng and J. Han, "Efficient and effective clustering method for spatial data mining", In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB'94), p. 144-155, Santiago, Chile, Sept. 1994.
[4] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification", In Proc. 4th Int. Symp. Large Spatial Databases (SSD'95), p. 67-82, Portland, ME, Aug. 1995.
[5] P. Bradley, U. Fayyad, and C. Reina, "Scaling clustering algorithms to large databases", In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), p. 9-15, New York, Aug. 1998.
[6] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise", In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), p. 58-65, New York, Aug. 1998.
[7] M. Dash, H. Liu, and X. Xu, "1+1>2: Merging distance and density based clustering", http://citeseer.nj.nec.com/425805.html.
[8] J. MacQueen, "Some methods for classification and analysis of multivariate observations", Proc. 5th Berkeley Symp. Math. Statist. Prob., 1:281-297, 1967.
[9] Qin Ding, Maleq Khan, Amalendu Roy, and William Perrizo, "P-tree Algebra", ACM Symposium on Applied Computing (SAC'02), Madrid, Spain, 2002.
[10] Qin Ding, Qiang Ding, and William Perrizo, "Association Rule Mining on remotely sensed images using P-trees", PAKDD-2002, Taipei, Taiwan, 2002.
[11] Maleq Khan, Qin Ding, and William Perrizo, "K-Nearest Neighbor classification of spatial data streams using P-trees", PAKDD-2002, Taipei, Taiwan, May 2002.
[12] Willy Valdivia-Granda, Edward Deckard, William Perrizo, Qin Ding, Maleq Khan, Qiang Ding, and Anne Denton, "Biological systems and data mining for phylogenomic expression profiling", submitted to ACM SIGKDD 2002.