Vertical Set Square Distance Based Clustering without Prior Knowledge of K

Amal Perera, Taufik Abidin, Masum Serazi, William Perrizo
Computer Science Department, North Dakota State University, Fargo, ND 58105, USA
{amal.perera, taufik.abidin, md.serazi, william.perrizo}@ndsu.edu

Abstract
Clustering is the automated identification of groups of objects based on similarity. Two major research issues in clustering are scalability and the requirement of domain knowledge to determine input parameters. Most approaches suggest the use of sampling to address the issue of scalability; however, sampling does not guarantee the best solution and can cause a significant loss in accuracy. Most approaches also require domain knowledge, trial-and-error techniques, or exhaustive searching to determine the required input parameters. In this paper we introduce a new clustering technique based on the set square distance. Cluster membership is determined by the set square distance to the respective cluster. As in the case of the mean for k-means and the medoid for k-medoids, the cluster is represented by the entire cluster of points for each evaluation of membership. The set square distance for all n items can be computed efficiently in O(n) using a vertical data structure and a few pre-computed values. A special ordering of the set square distances is used to break the data into "natural" clusters, in contrast to the known k required by k-means or k-medoids partition clustering. Superior results are observed when the new clustering technique is compared with classical k-means clustering. To demonstrate the cluster quality and the resolution of the unknown k, data sets with known classes such as the Iris data, the UCI KDD network intrusion data, and synthetic data are used. The scalability of the proposed technique is demonstrated using a large RSI data set.

Keywords
Vertical Set Square Distance, P-trees, Clustering.

1. INTRODUCTION
Clustering is an important human activity. Built-in clustering models, continuously trained from early childhood, allow us to distinguish between different objects, for example to separate cats from dogs. Given a set of points in multidimensional space, the goal of clustering is to compute a partition of these points into sets, called clusters, such that points in the same cluster are more similar to each other than to points in different clusters. Clustering identifies dense and sparse regions and therefore helps discover the overall distribution of interesting patterns and correlations in the data. Automated clustering is very valuable for analyzing large data and has found applications in many areas such as data mining, search engine indexing, pattern recognition, image processing, trend analysis, and others [1][2].

A large number of clustering algorithms exist. In the clustering literature these algorithms are grouped into four categories: partitioning methods, hierarchical methods, density-based (connectivity) methods, and grid-based methods [1][3]. In partitioning methods the n objects in the original data set are broken into k partitions iteratively to achieve a certain optimal criterion. The most classical and popular partitioning methods are k-means [4] and k-medoids [5]. Each of the k clusters is represented by the center of gravity of the cluster in k-means or by a representative object of the cluster in k-medoids. Each object in the space is assigned to the closest cluster in each iteration.
All partition-based methods suffer from the requirement of providing k (the number of partitions) prior to clustering, are only able to identify spherical clusters, and may split large genuine clusters in order to optimize cluster quality [3]. A hierarchical clustering algorithm produces a representation of the nested grouping relationships among objects. If the clustering hierarchy is formed bottom up, each data object starts as a cluster by itself; small clusters are then merged into bigger clusters at each level of the hierarchy based on similarity, until at the top of the hierarchy all data objects are in one cluster. The major difference between hierarchical algorithms is how they measure the similarity between each pair of clusters. Hierarchical clustering algorithms require a termination condition to be set with some prior domain knowledge, and they typically have high computational complexity [3][8]. Density-based clustering methods attempt to separate the dense and sparse regions of objects in the data space [1]. For each point of a cluster, the density of data points in its neighborhood has to exceed some threshold [10]. Density-based clustering techniques allow arbitrarily shaped clusters to be discovered through a linking phase, but they suffer from the requirement of setting prior parameters, based on domain knowledge, to arrive at the best possible clustering. A grid-based approach divides the data space into a finite set of multidimensional grid cells, performs clustering in each cell, and then groups neighboring dense cells into clusters [1]. The choice of cell size and other parameters affects the final quality of the clustering.

In general, two of the most demanding challenges in clustering are scalability and a minimal requirement of domain knowledge to determine the input parameters [1]. In this work we describe a new clustering mechanism that is scalable and operates without an initial parameter specifying the expected number of clusters in the data set. We describe an efficient vertical technique to compute an influence-based density using the set square distance of each data point with respect to all other data points in the space. Natural partitions in the density values are used to initially partition the data set into clusters. Subsequently, the cluster membership of each data point is confirmed or reassigned by efficiently recomputing the set square distance with respect to each cluster.

2. RELATED WORK
Many clustering algorithms work well on small data sets containing fewer than 200 data objects [1]. The NASA Earth Observing System will deliver close to a terabyte of remote sensing data per day, and it is estimated that this coordinated series of satellites will generate petabytes of archived data in the next few years [12][13][14]. For real-world applications, the requirement is to cluster millions of records using scalable techniques [11]. A general strategy to scale up clustering algorithms is to draw a sample or to apply some form of data compression before applying the clustering algorithm to the resulting representative objects. This may lead to biased results [1][14]. CLARA [1] addresses the scalability issue by choosing a representative sample of the data set and then continuing with the classical k-medoids method; its effectiveness depends on the size of the sample.
CLARANS [6] is an example of a partition-based clustering technique that uses a randomized and bounded search strategy to achieve the optimal criterion. This is achieved by not fixing the sample to a specific subset of the data set for the entire clustering process; an exhaustive traversal of the search space is not performed to reach the final clustering. BIRCH [7] uses a tree structure that records sufficient statistics (a summary) for subsets of data that can be compressed and represented by that summary. Initial threshold parameters are required to obtain the best clustering and computational optimality in BIRCH.

Most clustering algorithms require the user to input certain parameters [1], and the clustering results are sensitive to those parameters. For example, DENCLUE [10] requires the user to input the cell size used to compute the influence function. DBSCAN [15] needs the neighborhood radius and the minimum number of points required to mark a neighborhood as a core object with respect to density. To address the requirement for parameters, OPTICS [16] computes an augmented cluster ordering for automatic and interactive cluster analysis. OPTICS stores sufficient additional information to enable the user to extract any density-based clustering without having to re-scan the data set. Parameter-less-ness comes at a cost: OPTICS has a time complexity of O(n log n) even when used with a spatial index that allows it to walk through the search space easily. Less expensive partition-based techniques suffer from the requirement of specifying the expected number of partitions (k) prior to clustering [1][3][26]. X-means [26] attempts to find k by repeatedly searching over different k values and testing each against a model based on the Bayesian Information Criterion (BIC). G-means [17] is another attempt to learn k, using a repeated top-down division of the data set until each individual cluster exhibits a Gaussian data distribution within a user-specified significance level. ACE [14] maps the search space to a grid using a suitable weighting function, similar to the particle-mesh method used in physics, and then uses a few agents to heuristically search through the mesh to identify the natural clusters in the data. The initial weighting costs only O(n), but the success of the technique depends on the agent-based heuristic search and the size of the grid cell. The authors suggest a linear weighting scheme based on neighboring grid cells and a variable grid cell size to avoid over-dependence on cell size for quality results; the linear weighting scheme adds more compute time to the process.

3. OUR APPROACH
Our approach attempts to address the problem of scalability in clustering with a partition-based algorithm that uses a vertical data structure (the P-tree1), which enables fast computation of counts. Three major inherent issues with partition-based algorithms are: the need to input k; the need to initialize the clusters in a way that leads to an optimal solution; and the choice of cluster representation (prototype) and the computation of membership for each cluster. We solve the first two problems based on the concept of formally modeling the influence of each data point, using a function first proposed for DENCLUE [10], together with an efficient technique to compute the total influence rapidly over the entire search space. Significantly large differences in the total influence are used to identify the natural clusters in the data set.
Data points with similar total influence are initially put together to form initial clusters, giving a better initialization in the search for an optimal clustering in the subsequent iterative process. Each cluster is represented by the entire cluster. The third issue above is solved with the use of a vertical data structure: we show an efficient technique that computes the membership of each data item by comparing the total influence of that item against each cluster.

1 Patents are pending on the P-tree technology. This work was partially supported by GSA Grant ACT#: K96130308.

3.1 Influence and Density
The influence function can be interpreted as a function that describes the impact of a data point within its neighborhood [10]. Examples of influence functions are the parabolic function, the square wave function, and the Gaussian function. The influence function can be applied to each data point, and an indication of the overall density of the data space can be calculated as the sum of the influence functions of all data points [10]. The density function that results from a Gaussian influence function, for a point a with respect to the data points x_i, is

f^D_{Gaussian}(a) = \sum_{i=1}^{n} e^{-d(a, x_i)^2 / (2\sigma^2)}

The Gaussian influence function is used in DENCLUE; since computing it for all n data points is O(n^2), DENCLUE uses a grid to compute the density locally [9][10]. The influence function should be radially symmetric about any point (in either variable), continuous, and differentiable. Some other influence functions are

f^D_{Power2m}(a) = \sum_{i=1}^{n} \sum_{j=1}^{m} W_j \, d(x_i, a)^{2j}        (1)

f^D_{Parabolic}(a) = \sum_{i=1}^{n} d(x_i, a)^2

We note that the power-2m function (1), with suitable coefficients W_j, is a truncation of the Maclaurin series of the Gaussian. Figure 1 shows the distribution of the density for the Gaussian (b) and the parabolic (c) influence functions.

Figure 1. Distribution of density based on influence function: (a) data set, (b) Gaussian, (c) parabolic.
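For reference, the parabolic density defined above can be evaluated directly by summing squared distances from each point to every other point. The short sketch below is a horizontal, NumPy-based baseline of our own (the function name is illustrative), not the paper's P-tree computation; it makes explicit the O(n^2) cost that the vertical technique described next is designed to avoid.

```python
import numpy as np

def parabolic_density(data):
    """Naive O(n^2) parabolic (set square distance) density:
    for each point a, the sum of d(x_i, a)^2 over all points x_i."""
    n = data.shape[0]
    density = np.empty(n)
    for idx in range(n):
        diffs = data - data[idx]            # differences of every point to point a
        density[idx] = np.sum(diffs ** 2)   # sum over all points and attributes
    return density
```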
Next we show how the density based on the parabolic influence function, hereafter denoted the Set Square Distance, can be computed efficiently using a vertical data structure. Vertical data representation consists of structures representing the data column-by-column rather than row-by-row (as in relational data). Predicate-trees (P-trees) are one choice of vertical data representation that can be used for data mining instead of the more common sets of relational records. P-trees [23] are a lossless, compressed, and data-mining-ready data structure. This data structure has been successfully applied in data mining applications ranging from classification and clustering with k-nearest neighbors, to classification with decision tree induction, to association rule mining [18][20][22][24][25]. A basic P-tree represents one attribute bit that is reorganized into a tree structure by recursive sub-division, while recording the predicate truth value regarding purity for each division. Each level of the tree contains truth bits that represent pure sub-trees and can be used for fast computation of counts. The construction is continued recursively down each tree path until a sub-division is reached that is entirely pure (which may or may not be at the leaf level). The basic and complement P-trees are combined using Boolean algebra operations to produce P-trees for values, entire tuples, value intervals, or any other attribute pattern. The root count of any pattern tree indicates the occurrence count of that pattern. The P-tree data structure thus provides a structure for counting patterns in an efficient manner.

Binary representation is intrinsically fundamental to vertical data structures. Let x be a numeric value of attribute A_1. Then the representation of x in b bits is

x_1 = \sum_{j=0}^{b-1} 2^j x_{1,j}

where x_{1,b-1} and x_{1,0} are the highest and lowest order bits, respectively.

The Vertical Set Square Distance (VSSD) for a point a with respect to a data set X is defined as follows [24]:

f_{VertSetSqrDist}(a, X) = \sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2
                         = \sum_{x \in X} \sum_{i=1}^{d} x_i^2 - 2 \sum_{x \in X} \sum_{i=1}^{d} x_i a_i + \sum_{x \in X} \sum_{i=1}^{d} a_i^2
                         = T_1 + T_2 + T_3

where

T_1 = \sum_{i=1}^{d} \left[ \sum_{j=0}^{b-1} 2^{2j} \, rc(P_X \wedge P_{i,j}) + \sum_{j=1}^{b-1} \sum_{l=0}^{j-1} 2^{j+l+1} \, rc(P_X \wedge P_{i,j} \wedge P_{i,l}) \right]

T_2 = -2 \sum_{i=1}^{d} a_i \sum_{j=0}^{b-1} 2^{j} \, rc(P_X \wedge P_{i,j})

T_3 = rc(P_X) \sum_{i=1}^{d} a_i^2

Here P_{i,j} denotes the P-tree for the jth bit of the ith attribute, rc(P) denotes the root count of a P-tree (the number of truth bits), and P_X denotes the P-tree (mask) for the subset X. In the above computation the count operations are independent of a, which allows them to be pre-computed once and reused when computing the VSSD of any number of data points against the same set X. This observation provides a technique to compute the VSSD influence-based density for all data points in O(n). Further, for a given cluster, the VSSD influence-based density of each data point can be computed efficiently to determine its cluster membership.

3.2 Algorithm (VSSDClust)
The algorithm has two phases. In the initial phase the VSSD is computed for the entire data set. While being computed, the VSSD values are placed on a heap (sorted). Next, the differences between consecutive VSSD values are computed. It is assumed that outlier differences indicate a separation between two clusters. Statistical outliers are identified by applying the standard mean + 3 standard deviations rule to the VSSD difference values. The ordered (by VSSD) data set is partitioned at the outlier differences to arrive at the initial clustering. The initial phase can be re-stated in the following steps:
1. Compute VSSD(a, DataSet) for all points in the data set and place the values in a sorted heap.
2. Compute the differences between VSSD(a, DataSet)_i and VSSD(a, DataSet)_{i+1} (i and i+1 in sorted order).
3. Identify differences > mean(differences) + 3 x standard deviation(differences).
4. Break the data set into clusters using the large differences as partition boundaries.

Figure 2 shows the sorted set square distances (VSSD) for 100 data points. The largest differences in VSSD are observed at cluster boundaries; this characteristic is used to find the natural partitions in the data set.

Figure 2. Distribution of sorted Set Square Distance.

In phase two of the algorithm, each item in the data set is confirmed or re-assigned based on the VSSD with respect to each cluster. This step is similar to the classical k-means algorithm, except that instead of the mean, each cluster is represented by all of its data points, and instead of the mean square distance, the set square distance determines cluster membership.
Phase 2: Iterate until (max iterations) or (no change in cluster sizes) or (oscillation):
1. Compute the VSSD of all points against each cluster C_i, i.e., VSSD(a, C_i).
2. Re-assign the cluster membership of a based on min{VSSD(a, C_i)}.
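To make the pre-computation idea concrete, the following sketch mirrors the T_1, T_2, T_3 decomposition and the two-phase VSSDClust procedure, but with ordinary NumPy column aggregates standing in for the P-tree root counts rc(P_X ∧ P_{i,j}). It is an illustrative rendering under that substitution rather than the paper's vertical implementation; the function names, the default iteration limit, and the simplified termination test (unchanged cluster sizes or the iteration cap, without the oscillation check) are our own.

```python
import numpy as np

def cluster_aggregates(X):
    """Pre-compute the quantities playing the roles of T1, T2, T3.
    Here they are plain sums; in the paper they come from P-tree root counts."""
    return {
        "sum_sq": float(np.sum(X ** 2)),   # T1: sum over x in X and i of x_i^2
        "col_sum": X.sum(axis=0),          # per-attribute sums, used in T2
        "count": X.shape[0],               # rc(P_X), used in T3
    }

def vssd(a, agg):
    """VSSD(a, X) = T1 - 2 * sum_i a_i * col_sum_i + |X| * sum_i a_i^2, O(d) per point."""
    return agg["sum_sq"] - 2.0 * np.dot(agg["col_sum"], a) + agg["count"] * np.dot(a, a)

def vssd_clust(data, max_iter=20):
    """Two-phase VSSDClust sketch on an (n, d) float array."""
    n = data.shape[0]
    # Phase 1: VSSD of every point against the whole data set, sorted.
    agg_all = cluster_aggregates(data)
    scores = np.array([vssd(a, agg_all) for a in data])
    order = np.argsort(scores)
    diffs = np.diff(scores[order])
    cut = diffs.mean() + 3.0 * diffs.std()        # mean + 3 std outlier threshold
    boundaries = np.where(diffs > cut)[0]         # large gaps mark cluster boundaries
    labels = np.empty(n, dtype=int)
    labels[order] = np.searchsorted(boundaries, np.arange(n), side="left")
    # Phase 2: confirm or re-assign membership by VSSD against each cluster.
    prev_sizes = None
    for _ in range(max_iter):
        aggs = [cluster_aggregates(data[labels == c]) for c in np.unique(labels)]
        labels = np.array([np.argmin([vssd(a, g) for g in aggs]) for a in data])
        sizes = tuple(np.bincount(labels))
        if sizes == prev_sizes:                   # stable cluster sizes: stop
            break
        prev_sizes = sizes
    return labels
```

Because the per-cluster aggregates are computed once per pass, each membership evaluation costs O(d) rather than O(|C_i| d); this is the same effect the root-count pre-computation achieves in the vertical setting.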
In order to maintain scalability, it is important to use an algorithm termination criterion that is not compute-intensive. The cluster sizes recorded at each iteration are used to track the progress of the algorithm. If there is no change in the number of data points in each cluster between subsequent iterations, the clustering has arrived at a stable solution. If the cluster results oscillate, it is also important to terminate. To avoid the algorithm running indefinitely, we additionally check for a maximum number of iterations.

4. EXPERIMENTAL RESULTS
To show the practical relevance of the new clustering approach, we present comparative experimental results in this section. The approach is aimed at eliminating the need for the parameter k and at achieving scalability with respect to the cardinality of the data. To show the successful elimination of k we use several synthetic data sets as well as real-world data sets with known clusters, and we compare the results with a classical k-means algorithm to show the relative difference in speed to obtain an optimal solution. To show the linear scalability we use a large RSI image data set with our approach and report the actual computation time required with respect to data size.

We use the following quality measure, used extensively in text mining, to compare the quality of clusterings. Note that this measure can only be used with known clusters and is computationally expensive. Let C* = {C_1*, ..., C_i*, ..., C_l*} be the original clusters and C = {C_1, ..., C_j, ..., C_k} be some clustering of the data set. Then

re(i, j) = |C_j \cap C_i^*| / |C_i^*|
pr(i, j) = |C_j \cap C_i^*| / |C_j|
F_{i,j} = 2 \cdot pr(i, j) \cdot re(i, j) / (pr(i, j) + re(i, j))
F = \sum_{i=1}^{l} (|C_i^*| / N) \max_{j=1..k} F_{i,j}

Note that F = 1 for a perfect clustering. The F-measure also indicates whether the selected number of clusters is appropriate.
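The F-measure above translates directly into code. The sketch below is our own rendering (the function name is illustrative); it computes F for a data set given the known classes C_i* and a produced clustering C_j.

```python
import numpy as np

def clustering_f_measure(true_labels, cluster_labels):
    """F = sum_i (|C_i*| / N) * max_j F_{i,j}, where
    F_{i,j} = 2 * pr(i,j) * re(i,j) / (pr(i,j) + re(i,j))."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    N = len(true_labels)
    F = 0.0
    for ci in np.unique(true_labels):            # original classes C_i*
        members_i = (true_labels == ci)
        best = 0.0
        for cj in np.unique(cluster_labels):     # produced clusters C_j
            members_j = (cluster_labels == cj)
            overlap = np.sum(members_i & members_j)
            if overlap == 0:
                continue
            re = overlap / members_i.sum()       # recall re(i, j)
            pr = overlap / members_j.sum()       # precision pr(i, j)
            best = max(best, 2 * pr * re / (pr + re))
        F += (members_i.sum() / N) * best
    return F
```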
Synthetic data: Table 1 shows the results for a few synthetically generated cluster data sets. The motivation is to show the capability of the algorithm to independently find the natural clusters in the data set. The classical k-means clustering algorithm with a given k is used for comparison. The number of database scans (i.e., iterations) required to achieve an F-measure of 1.0 (i.e., a perfect clustering) is shown. The three data sets appear as scatter plots in the original table.

Table 1. Synthetic data clustering results (iterations to reach F = 1.0)
            Data set 1      Data set 2      Data set 3
VSSD        2 iterations    2 iterations    6 iterations
K-means     8 iterations    8 iterations    14 iterations

Iris Plant Data: This is the Iris Plant data set from the UCI machine learning repository [27]. It was originally introduced by Fisher [28] and is frequently used as an example of the pattern recognition problem. It contains four-dimensional patterns (sepal length, sepal width, petal length, petal width) mapped into one of three classes (Iris setosa, Iris versicolor, and Iris virginica), with 50 sample patterns per class, totaling 150 sample patterns. Table 2 shows the results of the comparison. The k-means algorithm was executed for different values of K in an attempt to obtain better results. It is clearly observable that the VSSD-based clustering obtains comparable results at a lower computational cost.

Table 2. Iris Plant data clustering results
             VSSD    K-means (K=3)    K-means (K=4)    K-means (K=5)
Iterations   5       16               38               24
F-measure    0.84    0.80             0.74             0.69

KDD-99 Network Intrusion Data: This data set [27] was used for the Third International Knowledge Discovery and Data Mining Tools Competition, held in conjunction with the KDD-99 conference. It includes a wide variety of intrusions simulated in a military network environment. The original data set contains 31 attributes of information from a TCP dump; we used the 22 numeric attributes. Each data item identifies the categorical type of attack or intrusion, such as Satan, Smurf, IpSweep, PortSweep, Neptune, etc., or whether it is normal traffic.

We randomly sampled three data sets from the original data set to include 2, 4, and 6 clusters based on the intrusion type. Comparison results are shown in Table 3. Once again the results show a clear advantage for VSSD, which obtains comparable results at a much lower cost than the classical k-means algorithm.

Table 3. Network Intrusion data clustering results
6 clusters:   VSSD            7 iterations    F = 0.81
              K-means (K=5)   10 iterations   F = 0.81
              K-means (K=6)   12 iterations   F = 0.81
              K-means (K=7)   12 iterations   F = 0.81
4 clusters:   VSSD            9 iterations    F = 0.80
              K-means (K=3)   16 iterations   F = 0.80
              K-means (K=4)   12 iterations   F = 0.80
              K-means (K=5)   16 iterations   F = 0.79
2 clusters:   VSSD            3 iterations    F = 0.90
              K-means (K=2)   6 iterations    F = 0.90

RSI data: This RSI data set was generated from a set of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota (longitude 97°42'18"W), taken in 1998. The image contains three bands: red, green, and blue reflectance values. We use the original image of size 1024x1024 pixels (cardinality = 1,048,576 pixels). Corresponding synchronized data for soil moisture, soil nitrate, and crop yield were also used in the experimental evaluation to obtain a data set with 6 dimensions. Additional data sets of different sizes were synthetically generated from the original data set to study the timing and scalability of the VSSD technique presented in this paper. Figure 3 plots the clustering time against the data set size for 3 different types of machines. The main observation is the linear scalability of the approach up to 25 million rows with 6 columns of data.

Figure 3. Scalability results for VSSD clustering: clustering time (seconds) versus data set size (x 1,000,000 rows, 6 columns) for the AMD-Ath-1G, Intl-P4-2G, and SGI-4G machines.

It is important to note that this new clustering technique will fail to identify the natural clusters when the data points are distributed symmetrically throughout the entire space. In most real-world data sets we do not see this property, so it can be argued that the clustering technique is applicable to real-world data sets. In the case of complete symmetry the entire data set will be clustered as one partition. One possible solution is to ask the user for an upper bound on the number of clusters, randomly assign the original data points to that many clusters, and continue with the second phase of the proposed clustering process. A grid-based approach could also easily be wrapped on top of the new clustering technique described in this paper. The vertical P-tree data structure used in this work inherently has a built-in index into the attribute space; this can easily be used to compute the density values for each cell, followed by a step that links the cells to form global clusters.
5. CONCLUSION
Two major problems in unsupervised learning for identifying groups of similar objects in large data sets are scalability and the requirement of domain knowledge to determine input parameters. Most existing approaches suggest the use of sampling to address the issue of scalability, and require domain knowledge, trial-and-error techniques, or exhaustive searching to determine the required input parameters.

In this paper we introduce a new clustering technique based on the set square distance. This technique is scalable and does not need prior knowledge of the existing (expected) number of partitions. Efficient computation of the set square distance using a vertical data structure enables this. We show how a special ordering of the set square distances, which is an indication of the density at each data point, can be used to break the data into "natural" clusters. We also show the effectiveness of determining cluster membership based on the set square distance to the respective cluster. We demonstrate the cluster quality and the resolution of the unknown k of our new technique using data sets with known classes, and we show the scalability of the proposed technique with respect to data set size by using a large RSI data set.

6. REFERENCES
[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
[2] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1988.
[3] O. R. Zaïane, A. Foss, C. Lee, and W. Wang, On Data Clustering Analysis: Scalability, Constraints and Validation, in Proc. of the Sixth PAKDD'02, Taipei, Taiwan, pp. 28-39, May 2002.
[4] J. MacQueen, Some methods for classification and analysis of multivariate observations, in Proc. 5th Berkeley Symp. Math. Statist. Prob., 1967.
[5] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, J. Wiley & Sons, New York, NY, 1990.
[6] R. Ng and J. Han, Efficient and effective clustering methods for spatial data mining, in Proc. Conf. on VLDB, pp. 144-155, 1994.
[7] T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: an efficient data clustering method for very large databases, Proc. ACM-SIGMOD Intl. Conf. Management of Data, pp. 103-114, 1996.
[8] S. Guha, R. Rastogi, and K. Shim, CURE: an efficient clustering algorithm for large databases, Proc. ACM-SIGMOD Intl. Conf. Management of Data, pp. 73-84, 1998.
[9] A. Hinneburg and D. A. Keim, Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering, Proc. of 25th Intl. Conf. Very Large Data Bases, pp. 506-517, 1999.
[10] A. Hinneburg and D. A. Keim, An Efficient Approach to Clustering in Multimedia Databases with Noise, in Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, 1998.
[11] M. M. Breunig, H.-P. Kriegel, P. Kröger, and J. Sander, Data Bubbles: Quality Preserving Performance Boosting for Hierarchical Clustering, ACM SIGMOD, Santa Barbara, California, 2001.
[12] Goddard Space Flight Center, http://eospso.gsfc.nasa.gov, 2004.
[13] A. Zomaya, T. El-Ghazawi, and O. Frieder, Parallel and distributed computing for data mining, IEEE Concurrency, Vol. 7(4), 1999.
[14] W. Peter, J. Chiochetti, and C. Giardina, New Unsupervised Clustering Algorithm for Large Datasets, SIGKDD, Washington DC, USA, 2003.
[15] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in Proc. ACM-SIGKDD, pp. 226-231, 1996.
[16] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, OPTICS: Ordering Points To Identify the Clustering Structure, Proc. ACM SIGMOD'99 Int. Conf. on Management of Data, Philadelphia, PA, 1999.
[17] G. Hamerly and C. Elkan, Learning the k in k-means, Seventeenth Annual Conference on Neural Information Processing Systems (NIPS), British Columbia, Canada, 2003.
[18] Q. Ding, M. Khan, A. Roy, and W. Perrizo, The P-tree Algebra, Proceedings of the ACM Symposium on Applied Computing, pp. 426-431, 2002.
[19] J. A. Hartigan, Clustering Algorithms, John Wiley & Sons, New York, NY, 1975.
[20] M. Khan, Q. Ding, and W. Perrizo, K-Nearest Neighbor Classification of Spatial Data Streams using P-trees, Proceedings of the PAKDD, pp. 517-528, 2002.
[21] E. M. Knorr and R. T. Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets, Proceedings of the 24th International Conference on Very Large Data Bases (VLDB), pp. 392-403, 1998.
[22] A. Perera, A. Denton, P. Kotala, W. Jockheck, W. V. Granda, and W. Perrizo, P-tree Classification of Yeast Gene Deletion Data, SIGKDD Explorations, 4(2), pp. 108-109, 2002.
[23] W. Perrizo, Peano Count Tree Technology, Technical Report NDSU-CSOR-TR-01-1, 2001.
[24] T. Abidin, A. Perera, M. Serazi, and W. Perrizo, Vertical Set Square Distance: A Fast and Scalable Technique to Compute Total Variation in Large Datasets, CATA-2005, New Orleans, 2005.
[25] I. Rahal and W. Perrizo, An Optimized Approach for KNN Text Categorization using P-Trees, Proceedings of the ACM Symposium on Applied Computing, pp. 613-617, 2004.
[26] D. Pelleg and A. W. Moore, X-means: Extending K-means with Efficient Estimation of the Number of Clusters, Proc. of the 17th International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, pp. 727-734, 2000.
[27] UCI Machine Learning Data Repository, http://www.ics.uci.edu/~mlearn/MLSummary.html, 2004.
[28] R. A. Fisher, The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics 7, pp. 179-188, 1936.