Pillar PK-means Clustering for Big Data

Dr. William Perrizo, ?, ?
North Dakota State University ([email protected])

ABSTRACT: This paper describes an approach to the data mining technique called clustering using vertically structured data and a k-means partitioning methodology. The partitioning methodology is based on the scalar product with judiciously chosen unit vectors. In any k-means clustering method, choosing a good value for k and a good set of initial cluster centroid points is very important. Typically the user is left to guess at k and then stick with it, resulting in a subset being treated as one cluster when it is really several, or a single cluster being divided into many clusters just to meet the requirement that there be k clusters. In this work we do not predefine k (the best choice of k is revealed as the algorithm progresses). Also, in the process of determining an appropriate k, we naturally get a good set of initial centroid points (one for each of the k clusters). By good centroid points, we simply mean ones that cause the k-means algorithm to converge rapidly. For so-called "big data", the speed at which a clustering method can be trained is a critical issue. Many very good algorithms are unusable in the big data environment because they take an unacceptable amount of time; therefore, algorithm speed is very important. To address the speed issue, both for choosing a good set of initial centroids and for the rest of the algorithm, we use horizontal processing of vertically structured data rather than the ubiquitous vertical (scan) processing of horizontal (record) data. We use pTree, bit-level, vertical data structuring. pTree technology represents and processes data differently from the ubiquitous horizontal data technologies. In pTree technology, the data is structured column-wise (into bit slices, which may or may not be compressed into tree structures) and the columns are processed horizontally (typically across a few to a few hundred pTrees), while in horizontal technologies, data is structured row-wise and those rows are processed vertically (often down billions of rows).

Introduction

Unsupervised Machine Learning, or Clustering, is an important data mining technology for mining information out of new data sets, especially very large data sets. The assumption is usually that essentially nothing is yet known about the data set (therefore the process is "unsupervised"). The goal is often to partition the data set into subsets of "similar" or "correlated" records [4, 7]. There may be various levels of supervision available and, of course, that additional information should be used to advantage during the clustering process. For instance, it may be known that there are exactly k clusters or similarity subsets, in which case a method such as k-means clustering may be a very productive method. To mine an RGB image for, say, red cars, white cars, grass, pavement and bare ground, k would be five. It would make sense to use that supervising knowledge by employing k-means clustering starting with a mean set consisting of RGB vectors approximating the clusters as closely as one can guess, e.g., red_car=(150,0,0), white_car=(85,85,85), grass=(0,150,0), etc. That is to say, it is productive to view the level of supervision available to us as a continuum and not just the two extremes.
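To make the supervised-initialization idea concrete, the following is a minimal sketch, assuming Python with NumPy and a small synthetic pixel table; only the first three guessed means come from the text, the pavement and bare-ground guesses are hypothetical values added for illustration.

import numpy as np

# Hypothetical pixel table: rows of (R, G, B) values drawn around five colors.
rng = np.random.default_rng(0)
pixels = np.vstack([rng.normal(c, 10.0, (50, 3)) for c in
                    [(150, 0, 0), (85, 85, 85), (0, 150, 0), (90, 90, 110), (120, 100, 80)]])

# Guessed means for red_car, white_car, grass, pavement, bare_ground (k = 5).
means = np.array([(150, 0, 0), (85, 85, 85), (0, 150, 0), (100, 100, 120), (130, 110, 90)], float)

for _ in range(20):                                   # standard k-means rounds
    # assign each pixel to its nearest current mean (Euclidean distance)
    dists = np.linalg.norm(pixels[:, None, :] - means[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # recompute each cluster mean; keep the old mean if a cluster is empty
    new_means = np.array([pixels[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
                          for j in range(len(means))])
    if np.allclose(new_means, means):                 # stop when the means no longer move
        break
    means = new_means

Seeding with guessed means is the "supervised end" of the continuum; the rest of the paper addresses the other end, where neither k nor any initial means are known.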
pTrees are lossless, compressed and data-mining-ready data structures [9][10]. pTrees are lossless because the vertical bit-wise partitioning used in pTree technology guarantees that all information is retained completely; there is no loss of information in converting horizontal data to this vertical format. pTrees are compressed because, in this technology, segments of bit sequences which are either purely 1-bits or purely 0-bits are represented by a single bit. This compression saves a considerable amount of space, but more importantly it facilitates faster processing. pTrees are data-mining ready because the fast, horizontal data mining processes involved can be done without the need to decompress the structures first. pTree vertical data structures have been exploited in various domains and data mining algorithms, ranging from classification [1,2,3] and clustering [4,7] to association rule mining [9], as well as other data mining algorithms.

In this paper, we assume there is no supervising knowledge. We assume the data set is a table of numbers with n columns and N rows and that two rows are close if they are close in the Euclidean sense (i.e., as points of Rn). More general assumptions could be made (e.g., that there are categorical data columns as well, or that the similarity is based on L1 distance or some correlation-based similarity), but we feel that would only obscure the main points we want to make.

We structure the data vertically into columns of bits (possibly compressed into tree structures), called predicate Trees or pTrees. The simplest example of pTree structuring of a non-negative integer table is to slice it vertically into its bit-position slices. The main reason we do that is so that we can process across the [usually relatively few] vertical pTree structures rather than processing down the [usually very, very high number of] rows. Very often these days, data is called Big Data because there are many, many rows (billions or even trillions) while the number of columns, by comparison, is relatively small (tens, hundreds, thousands, or multiple thousands). Therefore processing across (bit) columns rather than down the rows has a clear speed advantage, provided that the column processing can be done very efficiently. That is where the advantage of our approach lies: in devising very efficient (in terms of time taken) algorithms for horizontal processing of vertical (bit) structures. Our approach also benefits greatly from the fact that, in general, modern computing platforms can do logical processing of (even massive) bit arrays very quickly [9,10].

Horizontal Clustering of Vertical Data Algorithms

We employ a distance-dominating functional (a functional assigns a non-negative integer to each row). By distance dominating, we simply mean that the distance between any two output functional values is always dominated by the distance between the two input vectors. A class of distance-dominating functionals we have found productive are those based on the dot or scalar product with a unit vector.

We let s◦t denote the dot or inner product Σ_i s_i t_i of two vectors s, t ∈ Rn, and ||s|| = √(Σ_i s_i²) the Euclidean norm. Note that |s◦t| ≤ ||s|| ||t||, and the distance between s and t is ||s−t||. We assert that the dot product applied to a vector, s, with a unit vector, d, is "distance dominating", that is, |s◦d − t◦d| ≤ ||s−t||. This follows since |s◦d − t◦d| = |(s−t)◦d| ≤ ||s−t|| ||d|| = ||s−t||, since ||d|| = 1.

The background algorithm we consider in this research is the very popular k-means clustering algorithm. Assume k initial centroid points for the k clusters have been chosen (possibly at random).
1. Place each point with the centroid point closest to it (this assumes some method of breaking ties).
2. Calculate the centroid (e.g., mean point) of each of these k new clusters.
3. If some stopping condition is realized (such as the density of all k clusters being sufficiently high), stop; otherwise, go to 1.
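The projection functional and the distance-domination bound above can be checked directly. The following is a minimal sketch, assuming Python with NumPy and a small randomly generated table (not data from the paper), of the functional Ld(x) = x◦d and a numeric check of the bound.

import numpy as np

def L_d(X, d):
    # Projection functional: dot product of each row of X with the unit vector d/||d||.
    d = np.asarray(d, dtype=float)
    return X @ (d / np.linalg.norm(d))

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))            # synthetic table: N = 1000 rows, n = 4 columns
d = np.array([1.0, 2.0, 0.0, -1.0])       # any nonzero direction; L_d normalizes it

vals = L_d(X, d)                          # one functional value per row

# Distance domination: |L_d(s) - L_d(t)| <= ||s - t|| for every pair of rows s, t.
lhs = abs(vals[0] - vals[1])
assert lhs <= np.linalg.norm(X[0] - X[1]) + 1e-12
# Consequence used by the clusterer below: a gap in the sorted L_d values implies a gap
# of at least that size in R^n, so gaps in L_d can safely separate clusters.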
Formulas

In each round after the initial round, the calculation of the centroid points (we'll assume centroid = mean) and the assignment of all points to their proper cluster is done using pTree horizontal processing of vertically structured data (pTrees).

Assume the dataset has been vertically sliced into column sets and that those column sets have been further vertically sliced into bit slices (this is the basic pTree structuring minus the compression into trees, which we will ignore for simplicity in this paper). Then the mean or average function, which computes the average of one complete table column by processing the vertical bit slices horizontally (with ANDs and ORs), goes as shown here.

The COUNT function is probably the simplest and most useful of all these aggregate functions. It is not necessary to write a special function for Count because the pTree RootCount function, which efficiently counts the number of 1-bits in a full slice, provides the mechanism to implement it. Given a pTree Pi, RootCount(Pi) returns the number of 1-bits in Pi. Count is RootCount(OR_{i=0..n} Pi).

The Sum aggregate function can total a column of numerical values.

Evaluating Sum with pTrees:
  total ← 0
  for i = 0 to n do
    total ← total + 2^i * RootCount(Pi)
  endfor
  return total

The Average aggregate (mean or expectation) gives the average value of a column and is calculated from Count and Sum: Average = Sum / Count.

The calculation of the Vector of Medians, or simply the median of any column, goes as follows. Median returns the median value in a column. Rank(K) returns the value that is the Kth largest value in a field. Therefore, for a very fast and quite accurate determination of the median value, use K = Roof(TableCardinality / 2).

Evaluating Median with pTrees:
  median ← 0, pos ← N/2, Pc ← pure-1 pTree (all rows)
  for i = n down to 0 do
    c ← RootCount(Pc AND Pi)
    if (c >= pos) then
      median ← median + 2^i
      Pc ← Pc AND Pi
    else
      pos ← pos − c
      Pc ← Pc AND NOT(Pi)
    endif
  endfor
  return median
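For concreteness, the following is a minimal sketch, assuming Python with NumPy and uncompressed bit slices (plain Boolean arrays standing in for pTrees, so RootCount is just a population count), of the Count, Sum, Average and Median computations described above.

import numpy as np

def bit_slices(col, nbits):
    # Vertical structuring: slice a non-negative integer column into bit-position slices.
    # P[i] is the (uncompressed) slice for bit position i; i = 0 is the least significant bit.
    return [((col >> i) & 1).astype(bool) for i in range(nbits)]

def root_count(p):
    # RootCount: the number of 1-bits in a slice.
    return int(np.count_nonzero(p))

def ptree_count(P):
    # Count, as defined above: RootCount of the OR of all bit slices
    # (the number of rows holding a nonzero value).
    return root_count(np.logical_or.reduce(P))

def ptree_sum(P):
    # Sum = sum over i of 2^i * RootCount(P_i).
    return sum((1 << i) * root_count(p) for i, p in enumerate(P))

def ptree_avg(P):
    # Average = Sum / Count.
    return ptree_sum(P) / ptree_count(P)

def ptree_median(P, N):
    # Descend from the most significant slice, keeping a candidate mask Pc,
    # to find the K-th largest value with K = Roof(N / 2).
    pos, median = -(-N // 2), 0
    Pc = np.ones(N, dtype=bool)              # pure-1 mask: all rows are candidates
    for i in reversed(range(len(P))):
        c = root_count(Pc & P[i])
        if c >= pos:                         # the K-th largest has a 1 in this bit position
            median += 1 << i
            Pc &= P[i]
        else:                                # it has a 0 here; skip the c larger values
            pos -= c
            Pc &= ~P[i]
    return median

col = np.array([7, 3, 14, 9, 2, 11, 5, 8])
P = bit_slices(col, 4)
print(ptree_sum(P), ptree_avg(P), ptree_median(P, len(col)))   # 59 7.375 8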
Finding the Pillars of X (choosing the k in k-means intelligently)

Let m1 be a point of X maximizing distance²(x, a) = (x − a)◦(x − a) over x ∈ X, where a = AverageVector(X). If m1 is an outlier, repeat until m1 is a non-outlier. A point, m1, found in this manner is called a non-outlier pillar of X with respect to a, or nop(X, a). Let m2 ← nop(X, m1). In general, if non-outlier pillars m1, ..., mi−1 have been chosen, choose mi from nop(X, {m1, ..., mi−1}), i.e., mi maximizes Σ_{k=1..i−1} distance²(x, mk) and is a non-outlier. A sketch of this pillar-selection step is given after the clusterer description below.

Pillar pk-means clusterer:

Assign each (object, class) pair a ClassWeight ∈ Reals (all CW are initially set to 0). Classes are numbered as they are revealed. As we identify each pillar, mj, compute the functional Ld = X◦(mj − mj−1).

1. For the next larger PCI in Ld(C), left to right:
  1.1a If it is followed by a PCD, Ck ← Avg(Ld⁻¹[PCI, PCD]). If Ck is the center of a sphere-gap (or barrel-gap), declare Classk and mask it off.
  1.1b If it is followed by another PCI, declare the next Classk = the sphere-gapped set around Ck = Avg(Ld⁻¹[(3·PCI1 + PCI2)/4, PCI2)). Mask it off.
2. For the next smaller PCD in Ld, from the left side:
  2.1a If it is preceded by a PCI, declare the next Classk = the subset of Ld⁻¹[PCI, PCD] sphere-gapped around Ck = its average. Mask it off.
  2.1b If it is preceded by another PCD, declare the next Classk = the subset of the same, sphere-gapped around Ck = Avg(Ld⁻¹[PCD2, (PCD1 + PCD2)/4]). Mask it off.
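The pillar-selection step above (farthest-point choices that skip outliers) can be sketched as follows. This is a minimal illustration assuming Python with NumPy, ordinary horizontal row processing for clarity rather than pTree processing, and a simple quantile-based outlier test; the paper does not pin down a specific outlier test or a stopping rule for the number of pillars, so both are assumptions here.

import numpy as np

def pick_non_outlier(X, scores, q):
    # Take the highest-scoring point that is not beyond the q-quantile of scores
    # (an assumed, crude stand-in for the paper's "non-outlier" requirement).
    cutoff = np.quantile(scores, q)
    order = np.argsort(scores)[::-1]
    for idx in order:
        if scores[idx] <= cutoff:
            return X[idx]
    return X[order[-1]]

def find_pillars(X, max_pillars=10, outlier_quantile=0.995):
    # m1: the non-outlier point farthest from the mean vector a.
    a = X.mean(axis=0)
    scores = ((X - a) ** 2).sum(axis=1)                 # distance^2(x, a) for every row
    pillars = [pick_non_outlier(X, scores, outlier_quantile)]
    # mi: the non-outlier point maximizing the sum of squared distances to m1..m(i-1).
    while len(pillars) < max_pillars:
        scores = sum(((X - m) ** 2).sum(axis=1) for m in pillars)
        pillars.append(pick_non_outlier(X, scores, outlier_quantile))
    return np.array(pillars)

Each consecutive pair of pillars then supplies a direction d = mj − mj−1 for the functional Ld used by the clusterer steps above.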
REFERENCES

[1] T. Abidin and W. Perrizo, "SMART-TV: A Fast and Scalable Nearest Neighbor Based Classifier for Data Mining," Proceedings of the 21st Association of Computing Machinery Symposium on Applied Computing (SAC-06), Dijon, France, April 23-27, 2006.

[2] T. Abidin, A. Dong, H. Li, and W. Perrizo, "Efficient Image Classification on Vertically Decomposed Data," Institute of Electrical and Electronic Engineers (IEEE) International Conference on Multimedia Databases and Data Management (MDDM-06), Atlanta, Georgia, April 8, 2006.

[3] M. Khan, Q. Ding, and W. Perrizo, "K-nearest Neighbor Classification on Spatial Data Streams Using P-trees," Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-02), pp. 517-528, Taipei, Taiwan, May 2002.

[4] Perera, T. Abidin, M. Serazi, G. Hamer, and W. Perrizo, "Vertical Set Squared Distance Based Clustering without Prior Knowledge of K," International Conference on Intelligent and Adaptive Systems and Software Engineering (IASSE-05), pp. 72-77, Toronto, Canada, July 20-22, 2005.

[5] Rahal, D. Ren, W. Perrizo, "A Scalable Vertical Model for Mining Association Rules," Journal of Information and Knowledge Management (JIKM), V3:4, pp. 317-329, 2004.

[6] D. Ren, B. Wang, and W. Perrizo, "RDF: A Density-Based Outlier Detection Method Using Vertical Data Representation," Proceedings of the 4th Institute of Electrical and Electronic Engineers (IEEE) International Conference on Data Mining (ICDM-04), pp. 503-506, Nov 1-4, 2004.

[7] E. Wang, I. Rahal, W. Perrizo, "DAVYD: An Iterative Density-Based Approach for Clusters with Varying Densities," International Society of Computers and their Applications (ISCA) International Journal of Computers and Their Applications, V17:1, pp. 1-14, March 2010.

[8] Qin Ding, Qiang Ding, W. Perrizo, "PARM - An Efficient Algorithm to Mine Association Rules from Spatial Data," Institute of Electrical and Electronic Engineers (IEEE) Transactions on Systems, Man, and Cybernetics, Volume 38, Number 6, ISSN 1083-4419, pp. 1513-1525, December 2008.

[9] Rahal, M. Serazi, A. Perera, Q. Ding, F. Pan, D. Ren, W. Wu, W. Perrizo, "DataMIME™," Association of Computing Machinery, Management of Data (ACM SIGMOD 04), Paris, France, June 2004.

[10] Treeminer Inc., The Vertical Data Mining Company, 175 Admiral Cochrane Drive, Suite 300, Annapolis, Maryland 21401, http://www.treeminer.com

[11] H. Wilkinson, The Algebraic Eigenvalue Problem, Oxford University Press, Oxford, 1965.

[12] M. S. Bazaraa, H. D. Sherali, C. M. Shetty, Nonlinear Programming: Theory and Algorithms, Third Edition, John Wiley and Sons, Inc., Hoboken, NJ, 2006.

[13] H. Stapleton, Linear Statistical Models, Second Edition, John Wiley and Sons, Inc., Hoboken, NJ, 2009.

[14] W. Perrizo, "Functional Analytic Unsupervised and Supervised Data Mining Technology," International Conference on Computer Applications in Industry and Engineering, Los Angeles, September 2013.

Appendix

A Predicate tree, or P-tree, is a vertical data representation that represents the data column-by-column rather than row-by-row (the relational data representation). It was initially developed for mining spatial data [2][4]. Since then it has been used for mining many other types of data [3][5].

The creation of P-trees typically starts by converting a relational table of horizontal records to a set of vertical, compressed P-trees by decomposing each attribute in the table into separate bit vectors (e.g., one for each bit position of a numeric attribute, or one bitmap for each category of a categorical attribute). Such vertical partitioning guarantees that no information is lost.

Figure: Construction of P-trees from attribute A1.

The P-trees are built from the top down, stopping any branch as soon as purity (either purely 1-bits or purely 0-bits) is reached (this gives the compression). For example, let R be a relational table consisting of three numeric attributes, R(A1, A2, A3). To convert it into P-trees, we convert the attribute values into binary, take vertical bit-slices of every attribute, and store them in separate files. Each bit slice is considered a P-tree, which represents the predicate that a particular bit position is one (or zero). The bit slice may be compressed by dividing it into binary trees recursively. Figure 2 depicts the conversion of a numerical attribute, A1, into P-trees.
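As a minimal sketch of the construction just described, assuming Python, a binary fan-out, and a toy attribute (actual pTree implementations use other fan-outs and many optimizations not shown here):

def bit_slice(values, bit):
    # One vertical bit slice of a numeric attribute: the bit at position `bit` of every row.
    return [(v >> bit) & 1 for v in values]

def build_ptree(bits):
    # Build a P-tree top-down over a bit slice, stopping a branch as soon as it is
    # pure (all 0s or all 1s); impure nodes split the slice in half (binary fan-out
    # chosen purely for illustration).
    if all(b == 1 for b in bits):
        return 1                          # pure-1 leaf
    if all(b == 0 for b in bits):
        return 0                          # pure-0 leaf
    mid = len(bits) // 2
    return (build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def root_count(node, size):
    # Count 1-bits directly from the compressed tree, without decompressing it.
    if node == 1:
        return size
    if node == 0:
        return 0
    left, right = node
    half = size // 2
    return root_count(left, half) + root_count(right, size - half)

A1 = [7, 7, 7, 7, 5, 2, 2, 2]             # a toy numeric attribute with 8 rows
P2 = build_ptree(bit_slice(A1, 2))        # P-tree for bit position 2 of A1
print(P2, root_count(P2, len(A1)))        # (1, ((1, 0), 0)) 5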