Vertical Set Inner Product (VSIP) Technology with
Predicate-trees
William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND, 58105, USA
[email protected]
ABSTRACT
In this paper, a Set Inner Product construct is used to identify high-quality centroids for partitioning clustering methods such as k-means or k-medoids (or as a very fast clustering method in and of itself). A strong advantage is that the number of centroids, k, need not be pre-specified but is determined effectively within the algorithm. The method can also be used to identify outliers. It is fast and scales well to very large data sets. The method applies to vertically structured data sets and uses vertical structures called Predicate-trees, or P-trees. We show that the method can identify high-quality centroids (which may need no further refinement) and outliers, and that it is fast and scalable.
General Terms
Algorithms, Performance.
Keywords
Clustering, Vertical Set Inner Products, P-tree.
1. INTRODUCTION
One of the primary data mining tasks is clustering, which aims to discover and understand the natural structures or groups in a data set [8]. The goal of clustering is to collect similar objects mutually exclusively and collectively exhaustively, achieving minimal dissimilarity within each cluster and maximal dissimilarity among clusters [4]. Many useful clustering methods, such as partitioning, hierarchical, density-based, grid-based, and model-based methods, have been proposed over the last decade [9][4].

This paper focuses on partitioning clustering methods. In a partitioning clustering problem, the aim is to partition a given set of n points in m-dimensional space into k groups, called clusters, so that points within each cluster are near each other. The best-known partitioning methods are k-means and k-medoids, and their variants. These methods are successful when the clusters are compact clouds that are rather well separated from one another. However, this approach has three main shortcomings. The first is the necessity for users to specify k, the number of clusters, in advance, which is impractical for many real applications because users may have no prior knowledge of the distribution of the data. The second is that the computational complexity of these methods is O(nkt), where n is the number of objects, k is the number of clusters, and t is the number of iterations, which is neither efficient nor scalable for large datasets (n is large). The third is that the initial points are chosen randomly; if the points selected are far from the means or the true centroids, both the quality of the clusters and the efficiency of the process deteriorate significantly (t is large).

In this paper, we introduce the concept of the Set Inner Product to cope with these problems. Our new approach is founded on the vertical construction of the Set Inner Product, through which high-quality centroids are identified for partitioning clustering methods, so users do not need to pre-define k at all. A second advantage of our method is that the construction of the Set Inner Product is itself a clustering process, and this process can detect outliers as well. The calculation of the Set Inner Product is based on a vertical structure, the P-tree, which makes our method efficient and scalable to very large datasets. We successfully applied our algorithm to several real-world datasets, showing that it produces high-quality centroids and that it outperforms existing approaches in terms of efficiency and scalability.
The remainder of the paper is organized as follows. Section 2
presents a brief introduction to Predicate Trees (P-Trees). Section
3 presents inner product formulas and examples. Section 4
presents an algorithm for calculating the Set Inner Product. Set
Inner Products are very useful in determining the local density of
points and therefore in choosing centroids and identifying
outliers. The Set Inner Product calculation is very fast and
scalable, using vertical data structures. Section 5 discusses
performance evaluation. Section 6 presents conclusions and future
work.
2. PREDICATE TREES (P-TREES)
The P-tree is a vertically partitioned, lossless, compressed representation of binary data [2][7] that is well suited for representing data sets that are normally stored in a horizontal fashion. As seen in [5][6], P-trees have been used successfully in data mining applications such as k-nearest-neighbor classification. Any data set can be represented as a relation R with attributes A1 through An, denoted R(A1, A2, ..., An), with the key normally being A1.
In this explanation the data is byte-based, and the P-tree converts it into a bit-based representation. There is no limitation on the data type size; 8 bits is used for simplicity and can be expanded to multiple bytes without loss of generality. Each 8-bit attribute is converted to 8 P-trees, one per bit position, each recursively giving the count of 1-bits in that bit position of the attribute. For a table with 3 attributes of 8 bits each, we generate 24 P-trees. A P-tree recursively subdivides a bit band into halves until the bit level (purity) is reached, recording the count of 1-bits under the predicate "all bits are 1." This is also called bit sequential (bSQ) format: all the bits of a byte are vertically partitioned into separate bit vectors, which are represented by level 0 of a P-tree.
The following example illustrates this idea for one bit slice of an attribute Ai, with bits numbered from 7 down to 0, bit 7 being the most significant. Given A17, read as attribute 1, bit slice 7, Figure 1 shows the conversion to a P-tree.

[Figure 1. P-tree of attribute A17: the bit slice A17 (11100001) and its P-tree P17, shown from level L3 (root) down to L0 (leaf level).]

We can see that at Level 3 (L3) the tree represents the predicate "all 1," and since the root is 0 we conclude that the bit slice is not all 1's. We denote a half as pure when it contains 2^level 1-bits, with the root having the highest level and the deepest leaf having level 0. The compression is visible at L1, where two branches are not expanded: when a sub-tree is pure, that branch does not need to be expanded. Reading the leaves of the tree, we see 11100001 at the leaf level; the original bit slice is recovered by the formula (leaf value) * 2^L, read from left to right.
P-trees allow for efficient processing of data at the bit level and can be used to perform fast AND, OR, and exclusive-OR operations on the data. Several optimization techniques have been developed and can be found in the references.
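To make the bSQ decomposition and the root counts concrete, here is a minimal Python sketch of the idea. It is our illustration rather than the authors' implementation: the helper names (bit_slices, p_and, root_count) are ours, and the compressed tree levels are omitted, so only the vertical bit vectors and the counts returned by ANDing them are shown.

def bit_slices(values, bits=8):
    """Vertically decompose a column of unsigned integers into `bits` bit
    vectors (bSQ format). slices[j][t] is bit j of the t-th value."""
    return {j: [(v >> j) & 1 for v in values] for j in range(bits - 1, -1, -1)}

def p_and(*slices):
    """Bitwise AND of bit vectors: the basic P-tree combination operation."""
    return [int(all(bits)) for bits in zip(*slices)]

def root_count(slice_):
    """rc(P): the number of 1-bits, i.e. the count reported at the P-tree root."""
    return sum(slice_)

# A tiny column of byte values for attribute A1.
A1 = [0b11100001, 0b10000000, 0b00000001, 0b11111111]
P1 = bit_slices(A1)                       # P1[7] plays the role of P_{1,7}
print(root_count(P1[7]))                  # rows with bit 7 set -> 3
print(root_count(p_and(P1[7], P1[0])))    # rows with bits 7 and 0 set -> 2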
3. VERTICAL SET INNER PRODUCT
The vertical set inner product primarily measures the total variation of a set of points X about a point a. The formula is defined as follows:

(X - a) \circ (X - a)
  = \sum_{x \in X} (x - a) \circ (x - a)
  = \sum_{x \in X} \sum_{i=1}^{n} (x_i - a_i)^2
  = \sum_{x \in X} \sum_{i=1}^{n} x_i^2 \; - \; 2 \sum_{x \in X} \sum_{i=1}^{n} x_i a_i \; + \; \sum_{x \in X} \sum_{i=1}^{n} a_i^2
  = T_1 - T_2 + T_3

where n is the number of attributes (dimensions), b is the number of bits per attribute, x_{i,j} and a_{i,j} denote bit j of x_i and a_i, P_{i,j} is the P-tree of bit slice j of attribute i, P_X is the mask P-tree of the set X, rc(\cdot) denotes the root count, and

T_1 = \sum_{x \in X} \sum_{i=1}^{n} x_i^2
    = \sum_{i=1}^{n} \Big[ \sum_{j=b-1}^{0} 2^{2j} \, rc(P_X \wedge P_{i,j})
      \; + \; \sum_{j=b-1}^{1} \sum_{l=j-1}^{0} 2^{j+l+1} \, rc(P_X \wedge P_{i,j} \wedge P_{i,l}) \Big]

T_2 = 2 \sum_{x \in X} \sum_{i=1}^{n} x_i a_i
    = 2 \sum_{i=1}^{n} a_i \sum_{j=b-1}^{0} 2^{j} \, rc(P_X \wedge P_{i,j})

T_3 = \sum_{x \in X} \sum_{i=1}^{n} a_i^2
    = rc(P_X) \sum_{i=1}^{n} \Big( \sum_{j=b-1}^{0} 2^{j} a_{i,j} \Big)^2
    = rc(P_X) \sum_{i=1}^{n} a_i^2

The quantity (X - a) \circ (X - a) therefore measures the sum of the squared lengths of the vectors connecting the points of X to a. Thus (X - a) \circ (X - a) / N, where N refers to the total number of points in X, essentially measures the total variation of the set of points in class X about a.
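As a sanity check on the formulas above, the following is a small, purely illustrative Python sketch (our own, not the paper's code) that computes (X - a) o (X - a) both vertically, from root counts of ANDed bit slices following T1 - T2 + T3, and by a direct horizontal scan. It assumes unsigned b-bit attributes and models rc(P_X ^ P_{i,j}) by simply counting matching rows.

def bit(v, j):
    return (v >> j) & 1

def set_inner_product_vertical(X, a, b=8):
    """(X - a) o (X - a) via T1 - T2 + T3, using root counts of ANDed bit slices."""
    n = len(a)                                  # number of attributes
    rcX = len(X)                                # rc(P_X); here X is the whole set
    # rc(P_X ^ P_ij) and rc(P_X ^ P_ij ^ P_il), modeled by counting matching rows.
    rc1 = {(i, j): sum(bit(x[i], j) for x in X)
           for i in range(n) for j in range(b)}
    rc2 = {(i, j, l): sum(bit(x[i], j) & bit(x[i], l) for x in X)
           for i in range(n) for j in range(b) for l in range(j)}
    T1 = sum(2 ** (2 * j) * rc1[i, j] for i in range(n) for j in range(b)) \
       + sum(2 ** (j + l + 1) * rc2[i, j, l]
             for i in range(n) for j in range(1, b) for l in range(j))
    T2 = 2 * sum(a[i] * 2 ** j * rc1[i, j] for i in range(n) for j in range(b))
    T3 = rcX * sum(ai ** 2 for ai in a)
    return T1 - T2 + T3

def set_inner_product_horizontal(X, a):
    """The same quantity by a direct scan of the horizontal records."""
    return sum(sum((xi - ai) ** 2 for xi, ai in zip(x, a)) for x in X)

X = [(3, 7, 250), (10, 12, 13), (200, 1, 5)]   # three rows, three 8-bit attributes
a = (9, 9, 9)
assert set_inner_product_vertical(X, a) == set_inner_product_horizontal(X, a)
print(set_inner_product_horizontal(X, a))

The dictionaries rc1 and rc2 depend only on X, which is what allows them to be pre-computed once and reused across many inner product calculations.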
4. ALGORITHM DESIGN
4.1 Algorithm
The Set Inner Product [1] algorithm produces a table of high-quality information, which can be used for clustering and outlier analysis. Creating this table is fast and efficient using vertical data structuring. Our choice of vertical data structure, the Predicate-tree or P-tree [2][7], allows us to build out rectangular neighborhoods of increasing radius until an upturn or downturn in density is discovered, and to do so in a scalable manner. A vector a is first randomly selected from the space X. The Set Inner Product of X about a is then calculated and inserted into a table. The table contains columns for the selected points a, the Set Inner Product about each a (which we denote as (X-a)o(X-a) for reasons which will become clear later), the radii to which the point a can be built out before the local density begins to change significantly, those local build-out densities, and the direction of change in the final local build-out density (up or down). A very important part of this algorithm is the pruning of X before iteratively picking the next value a (prune off the maximum built-out disk for which the density remains roughly constant). This pruning step facilitates the building of a table containing the essential local density information we need, without requiring a full scan of X. The table structure is shown in Table 1, and the algorithm is as follows (an illustrative sketch appears after the steps):
1. Select an a ∈ X.
2. Build out a disk D(a, r) of radius r about a, increasing r until the density changes significantly (this significance level is an input parameter).
3. Calculate the Set Inner Product about a, the build-out radii r, the build-out densities, and the direction of change of the final build-out density. Enter all of these into the table.
4. Prune all points in the largest disk of common density (these points have approximately the same local density as a).
5. Select another point from what remains of X.
6. Repeat from step 1 until X is empty.
7. Sort the table on the Set Inner Product column.
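The sketch below is a heavily simplified, purely illustrative rendering of these steps (our own, not the authors' implementation): it uses plain Euclidean disks and a naive ring-count density estimate in place of the P-tree build-out, and density_change_threshold is a hypothetical stand-in for the significance level mentioned in step 2.

import math, random

def dist(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def set_inner_product(X, a):
    return sum(sum((xi - ai) ** 2 for xi, ai in zip(x, a)) for x in X)

def build_table(points, radii=(5, 10, 15), density_change_threshold=3.0):
    X = list(points)
    table = []
    while X:
        a = random.choice(X)                              # step 1: pick a point a
        sip = set_inner_product(points, a)
        densities, last, direction = [], 0, ''
        for r in radii:                                   # step 2: grow the disk
            inside = sum(1 for x in X if dist(x, a) <= r)
            ring = inside - last                          # points added by this ring
            last = inside
            prev = densities[-1] if densities else ring
            densities.append(ring)
            ratio = ring / max(prev, 1)
            if ratio > density_change_threshold or ratio < 1 / density_change_threshold:
                direction = '+' if ring > prev else '-'   # step 3: record the change
                break
        table.append((a, sip, radii[:len(densities)], densities, direction))
        keep = len(densities) - (2 if direction else 1)   # largest ~constant-density disk
        X = [x for x in X if dist(x, a) > radii[keep]]    # step 4: prune that disk
        # steps 5-6: the loop then picks another point from what remains of X
    table.sort(key=lambda row: row[1])                    # step 7: sort on the SIP column
    return table

# Tiny 2-D example: three nearby points and one far-away outlier.
for row in build_table([(1, 1), (2, 2), (1, 2), (50, 50)]):
    print(row)

In the vertical implementation, disk membership and the associated counts would come from P-tree AND operations and root counts rather than from distance scans over the raw data.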
[Figure 2. An Example of the Algorithm. (Figure labels: Outlier, Deep Cluster, Data Set.)]
Table 1. Local density information table.
a
a7
a3
a101
(X-a)o(X-a)
38
38
38
44
44
46
46
46
Build-out
Radius
5
10
15
5
10
5
10
15
Build-out
Density
10
11
3
9
1
18
17
38
+/variation
-
+
Lower values of (X-a)o(X-a) are associated with points a that are "deep cluster points" (deep within clusters), while higher values of (X-a)o(X-a) are associated with outliers. Middle values of (X-a)o(X-a) are further examined using the other columns of the table to determine their nature.

The table can be sorted on the Set Inner Product column. The sorted table can then be used to quickly select a high-quality value for k (the number of centroids). Those centroids can be used to cluster the data set (putting each non-centroid point with its closest centroid point) or, if we wish to perform k-means clustering (with, likely, very few iterations), the method can be used as a pre-processing step for k-means or k-medoid clustering. The method can also be used to identify outliers (the contents of the disks centered at the bottom rows of the table, until ~98.5% of the set has been identified as outliers), or to detect cluster boundary points by examining the outer disk where the variation changes.
Figure 2 shows an example of the algorithm pictorially. The clusters are built based on the total variation, and the disks are built out with increasing radius r subject to the density threshold. As we build out the disks, when a major change in the density of the disks is observed, the information is added to the table and the disk is pruned. For example, if there is a data point in the first disk but no data points in the next disk as we increase the radius r, we consider that point an outlier, store the information in the table, and prune the disk; we do not build out further disks around that point. In the case of boundary points, we store the information in the table and prune the points based on the change in the variation of the disk.
5. EXPERIMENTAL RESULTS
In this section we report the experiments we conducted to evaluate the algorithm. Most of the compute time of the algorithm is spent computing the set inner product. We show that the use of the vertical P-tree data structure for computing the set inner product scales with respect to the data size. We compare the execution time of the set inner product calculation using a vertical approach (vertical data structure and horizontal bitwise AND operation) with a horizontal approach (horizontal data structure and vertical scan operation), and report execution time with respect to the size of the data set. The performance of both approaches was observed under different machine specifications, including an SGI Altix CC-NUMA machine. Table 2 summarizes the machines used for the experiments.
Table 2. The specification of machines used.

  Machine     Specification
  AMD1GB      AMD Athlon K7 1.4 GHz, 1 GB RAM
  P42GB       Intel P4 2.4 GHz, 2 GB RAM
  SGI Altix   SGI Altix CC-NUMA, 12 processors, shared memory (12 x 4 GB RAM)
[Figure 3. Average time running under different machines: time to compute the Set Inner Product (seconds) versus data set size (x 1024^2), for the series Horz-AMD-1G, Horz-P4-2G, Horz-SGI-48G, and Virt-AMD-1G; out-of-memory points are marked.]
The experimental data was generated from a set of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota (longitude 97°42'18"W), taken in 1998. The image contains three bands: red, green, and blue reflectance values. We use the original image of size 1024x1024 pixels (a cardinality of 1,048,576). Corresponding synchronized data for soil moisture, soil nitrate, and crop yield were also used for the experimental evaluation. Combining all bands and the synchronized data, we obtained a dataset with 6 dimensions. Additional datasets of different sizes were synthetically generated from the original data to study timing and scalability.
We observed the timing with respect to scalability when executing on machines with different hardware configurations. We were forced to increase the hardware configuration to accommodate the horizontal version of the set inner product calculation on the larger data sets, whereas we were able to compute the set inner product for the largest data set using the vertical P-tree data structure on the smallest machine (the AMD Athlon with 1 GB of memory). Table 3 presents the average time to compute the set inner product using the two techniques on the different machines, and Figure 3 further illustrates performance with respect to scalability.
Table 3. Average time (seconds) to compute the set inner product under different hardware configurations (* : out of memory).

  Dataset Size    Horizontal                                Vertical
  (x 1024^2)      AMD-1GB    P4-2GB    SGI Altix 12x4GB     AMD-1GB
  1               0.55       0.46      1.37                 0.00008
  2               1.10       0.91      2.08                 0.00008
  4               2.15       1.85      3.97                 0.00010
  8               *          3.79      8.48                 0.00010
  16              *          *         16.64                0.00010
  24              *          *         28.80                0.00010
As Figure 3 shows, the horizontal approach is very sensitive to the available memory on the machine. The vertical approach, in contrast, requires only about 0.0001 seconds on average to complete the calculation on all data sets, far less than the horizontal approach. This significant improvement in computation time is due to the reuse of root count values that are pre-computed when the P-trees are created. Although different vectors a are fed into the calculation for different data points, the pre-computed root counts can be used repeatedly; they need to be computed only once and can be reused regardless of how many inner product calculations are performed, as long as the dataset does not change. Notice also that the vertical approach tends to have a constant execution time even as the dataset size grows.
One may argue that the pre-calculation of root counts makes this comparison unfair. However, consider the time required to load the vertical data structure into memory and perform the one-time root count operations for the vertical approach, versus the time to load the horizontal records into memory, given in Table 4. The performance of the vertical approach is comparable to that of the horizontal approach; in fact, loading the horizontal records takes slightly longer than loading the P-trees and computing the root counts. This illustrates the ability of the P-tree data structure to load efficiently and compute the simple counts. These timings were measured on a P4 with 2 GB of memory.
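As a small usage illustration of this reuse (again our own sketch, not the authors' code), the root-count tables depend only on the data set X, so they can be computed once and shared across arbitrarily many inner product queries:

# Root counts depend only on X, so compute them once and reuse them for every a.
def precompute_root_counts(X, b=8):
    bit = lambda v, j: (v >> j) & 1
    n = len(X[0])
    rc1 = {(i, j): sum(bit(x[i], j) for x in X)
           for i in range(n) for j in range(b)}
    rc2 = {(i, j, l): sum(bit(x[i], j) & bit(x[i], l) for x in X)
           for i in range(n) for j in range(b) for l in range(j)}
    return len(X), rc1, rc2

def sip_from_counts(counts, a, b=8):
    rcX, rc1, rc2 = counts
    n = len(a)
    T1 = sum(4 ** j * rc1[i, j] for i in range(n) for j in range(b)) + \
         sum(2 ** (j + l + 1) * rc2[i, j, l]
             for i in range(n) for j in range(1, b) for l in range(j))
    T2 = 2 * sum(a[i] * 2 ** j * rc1[i, j] for i in range(n) for j in range(b))
    return T1 - T2 + rcX * sum(ai ** 2 for ai in a)

X = [(3, 7, 250), (10, 12, 13), (200, 1, 5)]
counts = precompute_root_counts(X)            # one-time cost while X is unchanged
for a in [(9, 9, 9), (0, 0, 0), (128, 64, 2)]:
    print(a, sip_from_counts(counts, a))      # every query reuses the same counts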
Table 4. Time (seconds) for computing root counts and loading the dataset.

  Dataset Size    Vertical: Root Count Pre-Computation    Horizontal: Dataset
  (x 1024^2)      and P-tree Loading                      Loading
  1                3.900                                   4.974
  2                8.620                                  10.470
  4               18.690                                  19.914
  8               38.450                                  39.646
6. CONCLUSION
???.
7. REFERENCES
[1] Abidin, T., and Perrizo, W. Vertical Set Inner Products Formula. http://midas.cs.ndsu.nodak.edu/~abidin/research/PSIPs.pdf
[2] Ding, Q., Khan, M., Roy, A., and Perrizo, W. (2002). The P-tree Algebra. Proceedings of the ACM Symposium on Applied Computing, 426-431.
[3] Weisstein, E. W., et al. Total Variation. From MathWorld – A Wolfram Web Resource. http://mathworld.wolfram.com/TotalVariation.html
[4] Han, J., and Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA.
[5] Khan, M., Ding, Q., and Perrizo, W. (2002). K-Nearest Neighbor Classification of Spatial Data Streams Using P-trees. Proceedings of the PAKDD, 517-528.
[6] Perera, A., Denton, A., Kotala, P., Jockheck, W., Granda, W. V., and Perrizo, W. (2002). P-tree Classification of Yeast Gene Deletion Data. SIGKDD Explorations, 4(2): 108-109.
[7] Perrizo, W. (2001). Peano Count Tree Technology. Technical Report NDSU-CSOR-TR-01-1.
[8] Baumgartner, C., Plant, C., Kailing, K., Kriegel, H.-P., and Kroger, P. (2004). Subspace Selection for Clustering High-Dimensional Data. Proceedings of the 4th IEEE International Conference on Data Mining.
[9] Bohm, C., Kailing, K., Kriegel, H.-P., and Kroger, P. (2004). Density Connected Clustering with Local Subspace Preferences. Proceedings of the 4th IEEE International Conference on Data Mining.
[10] Feder, T., and Greene, D. (1988). Optimal Algorithms for Approximate Clustering. Proceedings of the 20th ACM Symposium on Theory of Computing, Chicago, Illinois, 434-444.