Linear, Spherical and Radial Functional Classification on Vertically Structured Data

Dr. William Perrizo, North Dakota State University ([email protected])

ABSTRACT: In this paper we describe an approach to the data mining technique called classification or prediction using vertically structured data and a functional partitioning methodology. The partitioning methodology is based on three functional approaches: the linear (scalar product) functional, the spherical functional, and the radial (distance-from-a-line) functional. In all three approaches, we map an n-dimensional vector space onto a 1-dimensional line in a variety of ways, each of which is "distance dominating" in some sense. By "distance dominating", we mean that the separation between two functional images, f(x) and f(y), is less than or equal to the distance between x and y. By using such functionals, we are guaranteed that a gap of sufficient size in the functional values (on the line) reveals a gap of at least that size in the vector space. This fact allows us to separate classes by applying the easily computed functionals. The functionals are easily computed because they are applied to vertically structured data. A prominent gap in the functional values of classified training set points provides good cut points for the separating hyperplane, spherical and tubular segments which will form a hull model to be used in the classification of future unclassified samples.

For so-called "big data", the speed at which a classification model can be trained is a critical issue. Many very good classification algorithms are unusable in the big data environment because the training step takes an unacceptable amount of time, so the speed of training is very important. To address the speed issue, in this paper we use horizontal processing of vertically structured data rather than the ubiquitous vertical (scan) processing of horizontal (record) data. We use pTree, bit-level, vertical data structuring. pTree technology represents and processes data differently from the ubiquitous horizontal data technologies. In pTree technology, the data is structured column-wise (into bit slices) and the columns are processed horizontally (typically across a few to a few hundred bit-level columns), while in horizontal technologies, data is structured row-wise and those rows are processed vertically (often down millions, even billions, of rows). pTrees are lossless, compressed and data-mining-ready data structures [9][10]. pTrees are lossless because the vertical bitwise partitioning used in pTree technology guarantees that all information is retained completely; there is no loss of information in converting horizontal data to this vertical format. pTrees are compressed because segments of bit sequences which are purely 1-bits or purely 0-bits are represented by a single bit within a tree structure. This compression saves a considerable amount of space, but, more importantly, it facilitates faster processing. pTrees are data-mining ready because the fast, horizontal data mining processes involved can be done without decompressing the structures first. pTree vertical data structures have been exploited in various domains and data mining algorithms, ranging from classification [1,2,3] and clustering [4,7] to association rule mining [9], as well as other data mining algorithms.
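To make the bit-slice structuring concrete, the following is a minimal sketch (not the authors' pTree implementation, and without the run-compression into trees described above): a column of non-negative integers is decomposed into vertical bit slices, and a predicate such as "value >= 8" is then answered by horizontal bitwise processing of a few slices rather than by a scan down the rows. The function name to_bit_slices and the toy data are illustrative only.

import numpy as np

def to_bit_slices(col, nbits):
    # Decompose an unsigned-integer column into vertical bit slices.
    # slices[0] holds the most significant bit, slices[-1] the least.
    return [((col >> k) & 1).astype(np.uint8) for k in range(nbits - 1, -1, -1)]

# Toy column of 4-bit values (one entry per row of the relation).
col = np.array([3, 12, 9, 15, 0, 8], dtype=np.uint8)
slices = to_bit_slices(col, 4)

# Horizontal processing example: "value >= 8" is answered by one bitwise test
# on the most significant slice -- no scan down the rows is needed.
mask_ge_8 = slices[0] == 1
print(np.nonzero(mask_ge_8)[0])   # rows 1, 2, 3, 5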
Speed improvements are very important in data mining because many quite accurate algorithms require an unacceptable amount of processing time to complete, even with today's powerful computing systems and efficient software platforms. In this paper, we evaluate the speed of functional-based data mining algorithms when using pTree technology.

Introduction

Supervised machine learning (classification or prediction) is one of the important data mining technologies for mining information out of large data sets. The usual assumption is that there is a very large table of data in which the "class" of each instance is given (the training data set) and another data set in which the classes are not known (the test data set). The task is to predict the class of each test set object based on the class information found in the training data set (hence "supervised" prediction) [1, 2, 3]. Unsupervised machine learning (clustering) is also an important data mining technology for mining information out of new data sets. The assumption in clustering is usually that essentially nothing is yet known about the data set (hence "unsupervised"). The goal of clustering is to partition the data set into subsets of "similar" or "correlated" objects [4, 7], often so that the result can be used as a classification training set for classifying future unclassified objects.

We note here that there may be various additional levels of supervision available in either classification or clustering and, of course, that additional information should be used to advantage during the machine learning process. That is to say, often the problem is neither purely supervised nor purely unsupervised. For instance, it may be known that there are exactly k similarity subsets, in which case a method such as k-means clustering may be productive. To mine an RGB image for, say, red cars, white cars, grass, pavement, bare ground, and other, k would be six. It would make sense to use that supervising knowledge by employing k-means clustering starting with a mean set consisting of RGB vectors approximating the clusters as closely as one can guess, e.g., red_car=(150,0,0), white_car=(85,85,85), grass=(0,150,0), etc. (a sketch of this seeding appears below). That is to say, we should view the level of supervision available to us as a continuum and not just the two extremes. The ultimate in supervising knowledge is a very large training set, which has enough class information in it to very accurately assign predicted classes to all test instances. We can think of a training set as a set of records that have been "classified" by an expert (human or machine) into similarity classes (and assigned a class label).

In this paper we assume there is an existing training set of classified objects which is large enough to fully characterize the classes. We will assume the training set is a subset of a vector space of non-negative integers with n columns and N rows, and that two rows are similar if they are close in the Euclidean sense. More general assumptions could be made (e.g., that there are categorical data columns as well, or that the similarity is based on L1 distance or some correlation-based similarity), but we feel that generalizing would only obscure the main points we want to make.
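Before turning to the data structuring, here is a minimal sketch of the seeded k-means illustration above, written in plain numpy. The red_car, white_car and grass seeds are the guesses from the text; the pavement, bare_ground and "other" seeds are hypothetical placeholders, and pixels is assumed to be an (N, 3) array of RGB values.

import numpy as np

# Guessed class means: red_car, white_car and grass come from the text;
# the remaining three seeds are assumed placeholders for illustration.
seeds = np.array([[150,   0,   0],    # red_car
                  [ 85,  85,  85],    # white_car
                  [  0, 150,   0],    # grass
                  [120, 120, 115],    # pavement (assumed)
                  [140, 110,  80],    # bare_ground (assumed)
                  [ 40,  40,  40]],   # other (assumed)
                 dtype=float)

def seeded_kmeans(pixels, means, iters=10):
    # Plain Lloyd's iterations started from the guessed means.
    means = means.copy()
    for _ in range(iters):
        # assign every pixel to its nearest current mean (Euclidean distance)
        dists = np.linalg.norm(pixels[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each mean from its assigned pixels (keep the old mean if empty)
        for k in range(len(means)):
            if np.any(labels == k):
                means[k] = pixels[labels == k].mean(axis=0)
    return labels, means

# Random stand-in pixel data for illustration.
pixels = np.random.randint(0, 256, size=(10000, 3)).astype(float)
labels, means = seeded_kmeans(pixels, seeds)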
We structure the data vertically into columns of bits (possibly compressed into tree structures), called predicate Trees or pTrees. The simplest example of pTree structuring of a non-negative integer table is to slice it vertically into its bit-position slices. The main reason we do that is so that we can process across the (usually relatively few) vertical pTree structures rather than processing down the (usually very numerous) rows. Very often these days, data is called Big Data because there are many, many rows (billions or even trillions), while the number of columns, by comparison, is relatively small (tens or hundreds, sometimes thousands, but seldom more than that). Therefore, processing across (bit) columns rather than down the rows has a clear speed advantage, provided that the column processing can be done very efficiently. That is where the advantage of our approach lies: in devising very time-efficient algorithms for horizontal processing of vertical (bit) structures. Our approach also benefits greatly from the fact that, in general, modern computing platforms can do logical processing of (even massive) bit arrays very quickly [9,10].

LSR: The Linear, Spherical, Radial Functional Algorithm for Horizontal Classification of Vertical Data

Algorithm. In this algorithm, we build a separate decision tree for each of a series of unit vectors, d, used in the dot product linear and radial functionals. The more unit vectors used, the better (the more hull segments we use to approximate classes and therefore the fewer false positives we get). So, given a unit vector, d, and a point, p, in the vector space X = (X1,...,Xn), and letting x o y denote the dot product of vectors x and y, we define the linear functional L_d = X o d, where X o d stands for the column of dot product results, one for each row, x, in X. The three functionals then are:

   L_d,p(X)   = (X - p) o d = X o d - p o d = L_d - p o d
   S_p(X)     = (X - p) o (X - p) = X o X + X o (-2p) + p o p
   R^2_d,p(X) = S_p - L^2_d,p
              = X o X + X o (-2p) + p o p - L^2_d + 2(p o d)(X o d) - (p o d)^2
              = L_(-2p + 2(p o d)d) + p o p - (p o d)^2 + X o X - L^2_d

The first, "Linear", functional can be viewed as a mapping of R^n to R^1 via the dot product with d, which maps a vector point to its "shadow" on the d-line made through perpendicular projection. In the pTree sense, we view this as a mapping of the PTreeSet for X (bit slices for n columns) onto the ScalarPTreeSet of dot products (bit slices of the one derived column of dot products). The second, "Spherical", functional produces the (bit slices of the) column of squared distances from the point p. The third, "Radial", functional produces the (bit slices of the) column of squared radial distances from the d-line through the point p. Assuming X is "Big Data" (say, with a trillion rows), the formulas derived above show that vertical structuring into pTrees can have tremendous benefit, since the ScalarPTreeSet X o X can be precomputed. Then, for each d and p chosen, the only calculations required are the scalar calculations p o d and p o p, and the ScalarPTreeSet calculations X o d, X o (-2p), L^2 and S - L^2 (a sketch of these column computations is given below).
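The following sketch illustrates the three functionals as column computations, using dense numpy columns to stand in for the ScalarPTreeSets (in the pTree setting each derived column would itself be stored as bit slices). The X o X column is computed once and reused; the toy matrix X is illustrative only.

import numpy as np

# X: an N x n matrix of non-negative integers (rows are training samples).
X = np.array([[51, 35, 14,  2],
              [70, 32, 47, 14],
              [63, 33, 60, 25]], dtype=float)

XoX = np.einsum('ij,ij->i', X, X)          # the X o X column, precomputed once

def L(d):                                  # linear: L_d = X o d
    return X @ d

def L_dp(d, p):                            # L_d,p = (X - p) o d = L_d - p o d
    return L(d) - p @ d

def S(p):                                  # spherical: squared distance from p
    return XoX - 2.0 * (X @ p) + p @ p

def R2(d, p):                              # radial: squared distance from the d-line
    return S(p) - L_dp(d, p) ** 2          # through p (d must be a unit vector)

d = np.array([1.0, 0.0, 0.0, 0.0])         # a unit vector (here e1)
p = X.mean(axis=0)                         # e.g. a class mean as the point p
print(L_dp(d, p), S(p), R2(d, p))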
The LSR algorithm builds a decision tree for each unit vector d, as follows.

LSR Decision Tree algorithm. Build a decision tree for each e_k (and possibly also for some or all arithmetic combinations of the e_k). Build branches until we have 100% true positives (no class duplication exists). Then y isa class=C if y isa class=C in every d-decision_tree; else y isa Other. We note that, for every node, we build a branch for each pair of classes in each interval. At the root of any d-decision_tree we calculate the minimum and maximum of the linear projection values L_d,p(X) within each class, C_k. Note that we have a bit-map mask for each class, C_k, and the L computation is done entirely through horizontal processing across the pTrees (or bit slices) of X. At the next level of any d-decision_tree, we build a branch on every interval formed by those minimums and maximums, using both S and R. At the level after that, we build a branch on every interval formed by the minimums and maximums of the previous node, using L with the unit vector that runs between each class mean pair and with the class mean of the first class of the pair as p. At all succeeding pairs of levels of any d-decision_tree, we build branches using the same pattern of L followed by S and R until there is purity (only one class). This ends the LSR Decision Tree algorithm.

We note that we can improve the accuracy with respect to false positives (FPs) by forming intervals not just with the minimum and maximum values at each stage, but with all Precipitous Count Changes (PCCs). A PCC is a value whose count changes by at least some fixed percentage from its predecessor value's count; we used 25% as the threshold. PCCs can be either increases or decreases: we use PCI for a PCC that is an increase by 25% or more and PCD for a decrease (a sketch of this cut-point search is given at the end of this section).

Finally, we note that for convex ("roundish") classes, the mathematical convex hull of the class points is the optimal hull model to use (fewest false positives); however, often one or more of the classes is not convex, e.g., the horseshoe-shaped class of points in 2-space indicated by "@"s in the figure below. Here, the convex hull is definitely not the optimal model for the class, since any unclassified sample in the interior of the "horseshoe" would be a false positive (classified as being in the "@" class falsely). On the other hand, with the serial application of the LSR decision tree algorithm, a much better fit is possible.

[Figure: a horseshoe-shaped class of "@" points in 2-space, comparing its convex hull with the tighter hull produced by the LSR decision tree; the cut points along the d-line are labeled minL = pci1L, pcd1L, pci2L and maxL = pcd2L.]

To facilitate this, we recommend using as many unit vectors, d, as possible, since each one contributes another set of edge segments to the hull around each class and therefore (potentially) eliminates more false positive classifications. The only additional cost of an additional unit vector, d, is the cost of calculating the dot product and radial functionals with respect to that d. Therefore, certainly, we recommend using all of the dimensional unit vectors, e_k = (0,...,0,1,0,...,0), of the standard basis for the vector space, as well as all sums and differences of those unit vectors. Ultimately, it would be best to include an entire covering grid of unit vectors for the vector space (covering all angles up to some level of approximation). There would, of course, be considerable overlap in doing this, but always the potential for an increase in the accuracy of the classifier.
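The PCC cut-point search described above can be sketched as follows. This is a minimal illustration under the assumption that counts are taken per distinct functional value (a binned histogram could be used instead); the function name pcc_cut_points and the toy data are illustrative only.

import numpy as np

def pcc_cut_points(values, threshold=0.25):
    # Return (PCIs, PCDs): values whose count rises/falls by at least
    # `threshold` relative to the preceding value's count.
    vals, counts = np.unique(np.asarray(values), return_counts=True)
    pcis, pcds = [], []
    for i in range(1, len(vals)):
        prev, cur = counts[i - 1], counts[i]
        if cur >= prev * (1 + threshold):
            pcis.append(vals[i])          # precipitous count increase (PCI)
        elif cur <= prev * (1 - threshold):
            pcds.append(vals[i])          # precipitous count decrease (PCD)
    return pcis, pcds

# Example: functional values L_d(x) of the training points of one class.
L_values = [43, 44, 44, 45, 45, 45, 45, 46, 46, 47]
print(pcc_cut_points(L_values))           # -> ([44, 45], [46, 47])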
Implementation

In this section we develop the decision trees on a dataset taken from the University of California Irvine Machine Learning Repository called IRIS, which consists of 4 measurements of iris flower samples (petal length, petal width, sepal length and sepal width), along with the class of each sample as one of 3 classes: setosa (we will use S), versicolor (we will use E) and virginica (we will use I). This dataset was selected because it is very commonly used for this purpose in the literature and because it provides "supervision", that is, the samples are already classified.

For d = e1 = (1,0,0,0), the root pseudo code is:

if     43 <= L1000(y)=y1 <  49   {y isa S}
elseif 49 <= L1000(y)=y1 <= 58   {y isa SEI}1
elseif 59 <= L1000(y)=y1 <= 70   {y isa EI}2
elseif 70 <  L1000(y)=y1 <= 79   {y isa I}
else                             {y isa Other}

The {y isa SEI}1 recursive step pseudo code:

if     0    <= R1000,AvgS(y) <= 99     {y isa S}
elseif 99   <  R1000,AvgS(y) <  393    {y isa Other}
elseif 393  <= R1000,AvgS(y) <= 1096   {y isa E}
elseif 1096 <  R1000,AvgS(y) <  1217   {y isa Other}
elseif 1217 <= R1000,AvgS(y) <= 1826   {y isa I}
else                                   {y isa Other}

The {y isa EI}2 recursive step pseudo code:

if     270  <= R1000,AvgS(y) <  792    {y isa I}
elseif 792  <= R1000,AvgS(y) <= 1558   {y isa EI}3
elseif 1558 <  R1000,AvgS(y) <= 2568   {y isa I}
else                                   {y isa Other}

The {y isa EI}3 recursive step pseudo code:

if     5.7  <= LAvgE-AvgI(y) <  13.6   {y isa E}
elseif 13.6 <= LAvgE-AvgI(y) <= 15.9   {y isa EI}4
elseif 15.9 <  LAvgE-AvgI(y) <= 16.6   {y isa I}
else                                   {y isa Other}

The {y isa EI}4 recursive step pseudo code:

if     22 <= RAvgE-AvgI,AvgE(y) <  31  {y isa E}
elseif 31 <= RAvgE-AvgI,AvgE(y) <= 35  {y isa EI}5
elseif 35 <  RAvgE-AvgI,AvgE(y) <= 54  {y isa I}
else                                   {y isa Other}

The {y isa EI}5 recursive step pseudo code:

if     LAvgE-AvgI,origin(y) = 1  {y isa E}
elseif LAvgE-AvgI,origin(y) = 6  {y isa I}
else                             {y isa Other}

REFERENCES

[1] T. Abidin and W. Perrizo, "SMART-TV: A Fast and Scalable Nearest Neighbor Based Classifier," Proc. ACM Symposium on Applied Computing (SAC), Dijon, France, April 23-27, 2006.
[2] T. Abidin, A. Dong, H. Li, and W. Perrizo, "Efficient Image Classification on Vertical Data," IEEE Int'l Conf. on Multimedia Databases and Data Management (MDDM), Atlanta, Georgia, April 8, 2006.
[3] M. Khan, Q. Ding, and W. Perrizo, "KNN Classification on Spatial Data Streams," Proc. Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD), pp. 517-528, Taipei, Taiwan, 2002.
[4] A. Perera, T. Abidin, M. Serazi, G. Hamer, and W. Perrizo, "Vertical Set Squared Distance Based Clustering without Prior Knowledge of K," Int'l Conf. on Intelligent Systems and Software Engineering (IASSE), pp. 72-77, Toronto, July 20-22, 2005.
[5] I. Rahal, D. Ren, and W. Perrizo, "A Scalable Vertical Model for ARM," Journal of Information and Knowledge Management, V3:4, 2004.
[6] D. Ren, B. Wang, and W. Perrizo, "RDF: A Density-based Outlier Detection Method Using Vertical Data Representation," IEEE Int'l Conf. on Data Mining (ICDM), pp. 503-506, Nov. 2004.
[7] E. Wang, I. Rahal, and W. Perrizo, "DAVYD: A Density-based Approach for Clusters with Varying Densities," ISCA Int'l Journal of Computers and Their Applications, V17:1, March 2010.
[8] Qin Ding, Qiang Ding, and W. Perrizo, "PARM - An Efficient Algorithm for ARM on Spatial Data," IEEE Transactions on Systems, Man, and Cybernetics, V38:6, pp. 1513-1525, Dec. 2008.
[9] I. Rahal, M. Serazi, A. Perera, Q. Ding, F. Pan, D. Ren, W. Wu, and W. Perrizo, "DataMIME(TM)," ACM SIGMOD, Paris, France, June 2004.
[10] Treeminer Inc., 175 Admiral Cochrane Drive, Suite 300, Annapolis, Maryland 21401, http://www.treeminer.com
[11] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Oxford University Press, 1965.
[12] M. S. Bazaraa, H. D. Sherali, and C. M. Shetty, Nonlinear Programming: Theory and Algorithms, John Wiley, Hoboken, NJ, 2006.
[13] H. Stapleton, Linear Statistical Models, John Wiley, Hoboken, NJ, 2009.