A Vertical Approach to Computing Set Squared Distance

Taufik Abidin, Amal Perera, Masum Serazi, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105 USA
{taufik.abidin, amal.perera, md.serazi, william.perrizo}@ndsu.edu

Abstract

Recent advances in computing power, networks, information storage, and multimedia have led to a proliferation of stored data in applications such as bioinformatics, pattern recognition and image analysis, the World Wide Web, networking, banking, and market basket research. This explosive growth of data poses great challenges and generates an urgent need for efficient and scalable algorithms that can deal with massive datasets. In this paper, we propose a vertical approach (algorithm) to computing set squared distance, which measures the total variation of a set about a given point. Total variation analysis can be used in classification, clustering, and outlier detection algorithms. The proposed algorithm employs the P-tree(1) vertical data structure, one choice of vertical representation that has been experimentally shown to address the curse of scalability. The experimental results show that, compared to the conventional horizontal record-based approach, the proposed approach is fast and scalable when computing total variation over very large datasets.

Keywords: Vertical Data Structure, Vertical Set Squared Distance.

(1) P-tree technology is patented by North Dakota State University, United States Patent number 6,941,303.

1. INTRODUCTION

Distance (closeness) between data objects is frequently measured in data mining algorithms. For example, in distance-based classification such as k-nearest neighbors, the class label of an unclassified object is determined by the plurality of class labels among its k nearest neighbors. These k nearest neighbors are found based on the proximity between the unclassified object and the objects in the training set [2][6][9].
Similarly, in partitioning-based clustering algorithms such as k-means and k-medoids, the assignment of cluster membership to data points is based on the distance between the data points and the cluster representatives [5]. Distance is also measured in outlier analysis [7].

Several distance functions have been introduced in the literature, such as the Minkowski, Manhattan, Euclidean, and Max distances. The Minkowski distance is defined as

    $d_q(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^q \right)^{1/q}$

where q is a positive integer. This distance is the generalization of the Manhattan, Euclidean, and Max distances [4]. The Manhattan distance, $d_1(x, y) = \sum_{i=1}^{d} |x_i - y_i|$, is the Minkowski distance with q = 1; the Euclidean distance, $d_2(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$, is the Minkowski distance with q = 2; and the Max distance, $d_\infty(x, y) = \max_{i=1}^{d} |x_i - y_i|$, is the Minkowski distance with $q = \infty$.

In this paper, we propose a vertical approach to computing set squared distance. The set squared distance, defined as

    $\sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2$,

measures the cumulative squared separation of the points in a set X about a fixed point a, that is, the total variation of X about a. We denote the set squared distance as TV(X, a) in this paper. Figure 1 shows the hyper-parabolic graph of the total variations of equally distributed data points. The hyper-parabolic graph of the total variations is always minimized at the mean.

Figure 1. The hyper-parabolic graph of the total variations on equally distributed data.

In this paper, we focus our work on the scalability of the algorithm. Because data is growing at a tremendous rate, scalability has become a major issue, and newly developed algorithms should scale to very large datasets. To ensure scalability, the proposed algorithm employs the P-tree vertical data structure, which organizes data vertically and processes it horizontally through the fast and efficient logical operations AND, OR, and NOT.
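The distance family described above can be sketched in a few lines of Python. This is an illustration of ours, not part of the paper's implementation; the function names are our own.

```python
# Illustrative sketch of the Minkowski distance family: q = 1 gives the
# Manhattan distance, q = 2 the Euclidean distance, and the limiting
# case q -> infinity gives the Max distance.

def minkowski(x, y, q):
    """d_q(x, y) = (sum_i |x_i - y_i|^q)^(1/q)."""
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1.0 / q)

def manhattan(x, y):
    return minkowski(x, y, 1)

def euclidean(x, y):
    return minkowski(x, y, 2)

def max_distance(x, y):
    # As q grows, the largest per-dimension separation dominates the sum.
    return max(abs(xi - yi) for xi, yi in zip(x, y))

x, y = (7, 6), (2, 3)
print(manhattan(x, y))     # 8.0
print(max_distance(x, y))  # 5
```

The Max distance is computed directly rather than as a large-q limit, since raising to a huge q only approximates the maximum numerically.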
Using the P-tree vertical data structure, the need for repeated dataset scans, as commonly required by horizontal record-based approaches, can be eliminated. We demonstrate the scalability of the proposed algorithm empirically through several experiments.

Throughout the paper, we use the term VSSD to refer to the Vertical Set Squared Distance, a vertical approach to computing set squared distance, and the term HSSD to refer to the Horizontal Set Squared Distance, a horizontal approach to computing set squared distance. It is worth noting that the horizontal approach, implemented for comparison, uses a scanning approach, which scans the entire dataset every time a total variation is computed. Scanning is the common approach used with conventional horizontal record-based structures. In addition, the use of indexes for the horizontal approach would not help much in this context.

The rest of the paper is organized as follows: in Section 2, we discuss P-tree vertical data representation; in Section 3, we discuss the derivation of the vertical set squared distance and a strategy for retaining count values; in Section 4, we report the experimental results; and in Section 5, we conclude the work and give some directions for future work.

2. VERTICAL DATA REPRESENTATION

Vertical data representation represents and processes data differently from horizontal data representation. In vertical data representation, the data is structured column-by-column and processed horizontally through ANDing or ORing, while in horizontal data representation, the data is structured row-by-row and processed vertically through scanning or through indexes.

The P-tree vertical data structure [11] is one choice of vertical data representation. It is a lossless, compressed, and data-mining-ready data structure. P-trees are lossless because the vertical bit-wise partitioning guarantees that the original data values can be recovered completely.
The compressed property is obtained when the vertical bit sequences are constructed into tree structures and there exist segments of bit sequences that are either pure-1 or pure-0. P-trees are data-mining ready because they address the curse of cardinality, or curse of scalability, one of the major issues in data mining. The P-tree vertical data structure has been used in various data mining algorithms [1][6][8][9][10] and has been experimentally shown to have great potential to address the curse of scalability.

The construction of the P-tree vertical data structure starts by converting the dataset, normally arranged in a relation R(A1, A2, ..., Ad) of horizontal records, into binary. Each attribute in the relation is vertically partitioned (projected), and for each bit position in the attribute, a vertical bit sequence (containing 0s and 1s) is created. During partitioning, the relative order of the data is retained to ensure convertibility. In 0-dimensional P-trees, the vertical bit sequences are left uncompressed and are not constructed into predicate trees. The size of 0-dimensional P-trees is equal to the cardinality of the dataset. In 1-dimensional compressed P-trees, the vertical bit sequences are constructed into predicate trees. In this compressed form, ANDing operations can be accelerated. The 1-dimensional P-trees are constructed by recursively halving the vertical bit sequences and recording the truth of the "purely 1-bits" predicate in each half. A predicate of 1 indicates that the bits in that half are all 1s, and a predicate of 0 indicates otherwise. Figure 2 gives some insight into how the 1-dimensional P-trees of a single attribute A1 are constructed: the values (4, 2, 2, 7, 5, 1, 6, 3) are converted to binary (100, 010, 010, 111, 101, 001, 110, 011), each bit position is projected into a bit sequence, giving P11 = 10011010, P12 = 01110011, and P13 = 00011101, and each sequence is then built into a predicate tree.

Figure 2.
The illustration of P-tree vertical data structure.

In the P-tree data structure, range queries, values, or any other patterns can be obtained using a combination of the Boolean algebra operators AND, OR, and NOT. Besides these operators, the COUNT operation is also very important in this structure; it counts the number of 1-bits in a basic or complement P-tree. For example, using the P-trees P11, P12, and P13 from Figure 2, the count values resulting from COUNT(P11), COUNT(P12), and COUNT(P13) are 4, 5, and 4, respectively. Refer to [3] for complete information about the P-tree algebra and its operations.

3. THE PROPOSED APPROACH

3.1. Vertical Set Squared Distance

Let R(A1, A2, ..., Ad) be a relation in d-dimensional space. A numerical value v of attribute Ai can be written in b-bit binary representation as follows:

    $v = v_{i(b-1)} \cdots v_{i0} = \sum_{j=0}^{b-1} 2^j v_{ij}$, where $v_{ij}$ is either 0 or 1    (1)

The first subscript corresponds to the attribute to which v belongs, and the second subscript corresponds to the bit order. The summation on the right-hand side of the equation is equal to the numerical value of v in base 10. Now let x be a vector in d-dimensional space. The binary representation of x in b bits can be written as follows:

    $x = (x_{1(b-1)} \cdots x_{10},\ x_{2(b-1)} \cdots x_{20},\ \ldots,\ x_{d(b-1)} \cdots x_{d0})$    (2)

Let X be a set of vectors in a relation R, $x \in X$, and let a be the vector being examined. The total variation of X about a, denoted TV(X, a), measures the cumulative squared separation of the vectors in X about a.
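The two ideas above can be sketched in Python: vertically partitioning an attribute into bit sequences per equation (1), and building a 1-dimensional P-tree by recursive halving. This is a minimal sketch of ours; the helper names are not the paper's API.

```python
# Sketch: vertical bit partitioning (equation (1)) and a simple
# purity-based predicate tree with a COUNT that never touches raw bits.

def bit_slices(values, b):
    """Project each of the b bit positions into a bit sequence (MSB first)."""
    return [[(v >> j) & 1 for v in values] for j in range(b - 1, -1, -1)]

def build_ptree(bits):
    """Recursively halve a bit sequence, stopping at pure-1/pure-0 segments."""
    if all(bit == 1 for bit in bits):
        return ("pure1", len(bits))
    if all(bit == 0 for bit in bits):
        return ("pure0", len(bits))
    mid = len(bits) // 2
    return ("mixed", build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def ptree_count(node):
    """COUNT: the number of 1-bits, read off the purity predicates."""
    if node[0] == "pure1":
        return node[1]
    if node[0] == "pure0":
        return 0
    return ptree_count(node[1]) + ptree_count(node[2])

A1 = [4, 2, 2, 7, 5, 1, 6, 3]   # the attribute from Figure 2
P11, P12, P13 = bit_slices(A1, 3)
print(P11)                                                      # [1, 0, 0, 1, 1, 0, 1, 0]
print([ptree_count(build_ptree(p)) for p in (P11, P12, P13)])   # [4, 5, 4]
```

The counts 4, 5, and 4 match the COUNT(P11), COUNT(P12), and COUNT(P13) values quoted in the text; losslessness follows because each value can be rebuilt as 4·P11[i] + 2·P12[i] + P13[i].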
The total variation can be measured quickly and scalably using the vertical set squared distance, derived as follows:

    $TV(X, a) = \sum_{x \in X} (x - a) \circ (x - a) = \sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2$    (3)

    $= \sum_{x \in X} \sum_{i=1}^{d} x_i^2 \;-\; 2 \sum_{x \in X} \sum_{i=1}^{d} x_i a_i \;+\; \sum_{x \in X} \sum_{i=1}^{d} a_i^2 \;=\; T_1 - 2 T_2 + T_3$

The first term expands as:

    $T_1 = \sum_{x \in X} \sum_{i=1}^{d} x_i^2 = \sum_{x \in X} \sum_{i=1}^{d} \left( \sum_{j=b-1}^{0} 2^j x_{ij} \right) \left( \sum_{k=b-1}^{0} 2^k x_{ik} \right) = \sum_{x \in X} \sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k} x_{ij} x_{ik}$    (4)

Commuting the summation over $x \in X$ further inside, to first process all vectors that belong to X vertically and then process each attribute horizontally, we get:

    $T_1 = \sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k} \sum_{x \in X} x_{ij} x_{ik}$    (5)

Let PX be a P-tree mask of the set X that can quickly identify the data points in X. PX is a bit pattern of 1s and 0s in which a 1 indicates that the point at that bit position belongs to X, while a 0 indicates otherwise. Using this mask, equation (5) can be simplified by substituting $\sum_{x \in X} x_{ij} x_{ik}$ with $COUNT(PX \wedge P_{ij} \wedge P_{ik})$. Recalling that the COUNT operation counts the number of 1-bits in the pattern, the simplified form of equation (5) can be written as:

    $T_1 = \sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k} \, COUNT(PX \wedge P_{ij} \wedge P_{ik})$    (6)

Similarly, for the terms $T_2$ and $T_3$, the derivation using the mask is straightforward, as shown in equations (7) and (8), respectively:

    $T_2 = \sum_{i=1}^{d} a_i \sum_{x \in X} x_i = \sum_{i=1}^{d} a_i \sum_{j=b-1}^{0} 2^j \, COUNT(PX \wedge P_{ij})$    (7)

    $T_3 = \sum_{x \in X} \sum_{i=1}^{d} a_i^2 = COUNT(PX) \sum_{i=1}^{d} a_i^2$    (8)

Therefore, the VSSD is defined to be:

    $TV(X, a) = \sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k} \, COUNT(PX \wedge P_{ij} \wedge P_{ik}) \;-\; 2 \sum_{i=1}^{d} a_i \sum_{j=b-1}^{0} 2^j \, COUNT(PX \wedge P_{ij}) \;+\; COUNT(PX) \sum_{i=1}^{d} a_i^2$    (9)

Observe that the COUNT operations in equation (9) are independent of a. These include the single-operand COUNT(PX), the two-operand counts $COUNT(PX \wedge P_{ij})$, and the three-operand counts $COUNT(PX \wedge P_{ij} \wedge P_{ik})$. For a dataset that is rarely changed (static), such as a training set used in classification, this independence of the COUNT operations is an advantage.
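The derivation above can be checked numerically: the count-based form of equation (9) must agree with the brute-force sum of squared distances in equation (3). The sketch below is ours (the counts are held in plain dictionaries rather than P-trees, and the names are hypothetical), using the class C1 and C2 points from Table 1.

```python
# Sketch: TV(X, a) computed from the three count families of equation
# (9), checked against the direct definition of equation (3).

def counts(points, b):
    """Precompute COUNT(PX), COUNT(PX^Pij), COUNT(PX^Pij^Pik)."""
    d = len(points[0])
    cnt1 = {(i, j): sum((x[i] >> j) & 1 for x in points)
            for i in range(d) for j in range(b)}
    cnt2 = {(i, j, k): sum(((x[i] >> j) & 1) & ((x[i] >> k) & 1)
                           for x in points)
            for i in range(d) for j in range(b) for k in range(b)}
    return len(points), cnt1, cnt2

def vssd(pre, a, b):
    """Equation (9): T1 - 2*T2 + T3, reusing precomputed counts."""
    n, cnt1, cnt2 = pre
    d = len(a)
    t1 = sum(2 ** (j + k) * cnt2[i, j, k]
             for i in range(d) for j in range(b) for k in range(b))
    t2 = sum(a[i] * 2 ** j * cnt1[i, j]
             for i in range(d) for j in range(b))
    t3 = n * sum(ai * ai for ai in a)
    return t1 - 2 * t2 + t3

C1 = [(7, 6), (6, 3), (7, 5), (7, 2)]                  # class C1, Table 1
C2 = [(2, 6), (3, 3), (3, 4), (4, 5), (1, 4), (6, 5)]  # class C2, Table 1
a = (2, 3)
print(vssd(counts(C1, 3), a, 3))                         # 105
print(sum((x - 2) ** 2 + (y - 3) ** 2 for x, y in C1))   # 105
print(vssd(counts(C2, 3), a, 3))                         # 42
```

Once `counts` has run, `vssd` performs no pass over the data at all, which is the source of the O(1) per-point cost discussed in Section 3.3.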
It allows us to precompute the count values in advance, retain them, and reuse them repeatedly during the computation of total variations. The reusability of the count values speeds up the computation of total variation significantly.

3.2. A Strategy for Retaining Count Values

For illustration, consider a dataset containing 10 data points, as shown in Table 1. The dataset has two feature attributes, X and Y, and a class attribute containing two values, C1 and C2. The binary values of each data point are included in the table for clarity. The last two columns on the right side of the table are the masks of the two classes, denoted PX1 and PX2. In this example, we want to measure the total variation of each class about a fixed point, and thus the dataset is subdivided into two sets. The first set is the collection of data points in class C1, and the second set is the collection of data points in class C2.

If the entire dataset is considered as a single set, without subdividing it into classes, we do not need a P-tree mask because there is only one set. In such a case, COUNT(PX) can be replaced with the cardinality of the dataset, $\sum_{i=1}^{d} \sum_{j=b-1}^{0} COUNT(PX \wedge P_{ij})$ simplifies to $\sum_{i=1}^{d} \sum_{j=b-1}^{0} COUNT(P_{ij})$, and $\sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} COUNT(PX \wedge P_{ij} \wedge P_{ik})$ becomes $\sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} COUNT(P_{ij} \wedge P_{ik})$.

Table 1. Example dataset.

    X  Y  CLASS  X in Binary  Y in Binary  PX1  PX2
    7  6  C1     111          110          1    0
    2  6  C2     010          110          0    1
    6  3  C1     110          011          1    0
    3  3  C2     011          011          0    1
    3  4  C2     011          100          0    1
    7  5  C1     111          101          1    0
    7  2  C1     111          010          1    0
    4  5  C2     100          101          0    1
    1  4  C2     001          100          0    1
    6  5  C2     110          101          0    1

The count values that need to be retained are the values returned by these COUNT operations. They are stored separately in three files. The first file holds the count values returned by COUNT(PX), the second file holds the count values returned by $COUNT(PX \wedge P_{ij})$, and the last file holds the count values returned by $COUNT(PX \wedge P_{ij} \wedge P_{ik})$.
The count values in each file are organized in an appropriate order. For example, for the COUNT(PX) operation, the count value of the first set is placed first in the file, followed by the count value of the second set, and so forth. The count values of $COUNT(PX \wedge P_{ij})$ are organized the same way, i.e., the count values of the first set are stored first, followed by the count values of the next set. The number of count values resulting from $COUNT(PX \wedge P_{ij})$ for each set is equal to the sum of the bit lengths of the dimensions. This number is needed to correctly retrieve the count values of a set when a total variation is computed. The same strategy is used when storing the $COUNT(PX \wedge P_{ij} \wedge P_{ik})$ count values. The number of count values returned by this operation for each set is equal to the sum of the squared bit lengths of the dimensions. Table 2 summarizes the count values of classes C1 and C2.

When the sets are not static, retaining the count values in advance is not really beneficial, since the members of the sets change dynamically. Retaining the count values can still be done in such a case, although it does not give as much advantage as when the sets are static.

Table 2. The count values of each class. Subscript i indexes the attributes; j and k are bit positions.

    CLASS C1 (COUNT(PX) = 4)           CLASS C2 (COUNT(PX) = 6)
    i  j  k  PX^Pij^Pik  PX^Pij        i  j  k  PX^Pij^Pik  PX^Pij
    1  2  2  4           4             1  2  2  2           2
    1  2  1  4                         1  2  1  1
    1  2  0  3                         1  2  0  0
    1  1  2  4           4             1  1  2  1           4
    1  1  1  4                         1  1  1  4
    1  1  0  3                         1  1  0  2
    1  0  2  3           3             1  0  2  0           3
    1  0  1  3                         1  0  1  2
    1  0  0  3                         1  0  0  3
    2  2  2  2           2             2  2  2  5           5
    2  2  1  1                         2  2  1  1
    2  2  0  1                         2  2  0  2
    2  1  2  1           3             2  1  2  1           2
    2  1  1  3                         2  1  1  2
    2  1  0  1                         2  1  0  1
    2  0  2  1           2             2  0  2  2           3
    2  0  1  1                         2  0  1  1
    2  0  0  2                         2  0  0  3

3.3. Complexity Analysis

The cost of the VSSD lies in the computation of the count values. When a dataset is subdivided into several subsets, the complexity is O(k·d·b²), where k is the number of subsets, d is the number of feature attributes, and b is the average bit length.
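The file layout described above can be sketched as follows. This is our own illustration of the strategy, with the three "files" simplified to in-memory lists and the counts computed directly from the Table 1 points rather than from P-trees.

```python
# Sketch of the retention strategy: per set, append COUNT(PX) to the
# first file, the d*b one-operand counts to the second, and the d*b^2
# three-operand counts to the third, concatenated set after set.

d, b = 2, 3
classes = [
    [(7, 6), (6, 3), (7, 5), (7, 2)],                  # class C1
    [(2, 6), (3, 3), (3, 4), (4, 5), (1, 4), (6, 5)],  # class C2
]

file_px, file_pxpij, file_pxpijpik = [], [], []
for pts in classes:
    file_px.append(len(pts))                       # COUNT(PX)
    for i in range(d):
        for j in range(b - 1, -1, -1):             # MSB first
            file_pxpij.append(sum((p[i] >> j) & 1 for p in pts))
    for i in range(d):
        for j in range(b - 1, -1, -1):
            for k in range(b - 1, -1, -1):
                file_pxpijpik.append(
                    sum(((p[i] >> j) & 1) & ((p[i] >> k) & 1) for p in pts))

# d*b = 6 and d*b^2 = 18 counts per set: the offsets used to locate one
# set's counts when its total variation is computed.
print(file_px)           # [4, 6]
print(file_pxpij[:6])    # C1's COUNT(PX^Pij): [4, 4, 3, 2, 3, 2]
```

The per-set block sizes (6 and 18 here) play exactly the role described in the text: knowing them lets the reader of the file skip directly to the counts of the set whose total variation is requested.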
However, when the entire dataset is considered as a single set, the complexity is reduced to O(d·b²). The choice of whether to consider the entire dataset as a single set or to subdivide it into several subsets depends on the situation; note that computing the count values is more expensive when the dataset is subdivided.

The complexity of computing a total variation using the VSSD is O(1). This constant complexity is obtained because the same count values can be reused for any input point. The computation of a total variation is just a matter of taking the right count values and solving equation (9), but without the COUNT operations. For example, let a = (2, 3) be the point being examined, and let the count values of classes C1 and C2 be as listed in Table 2. The total variation of class C1 about a, denoted TV(C1, a), can be computed quickly as:

    $TV(C_1, a) = \sum_{i=1}^{2} \sum_{j=2}^{0} \sum_{k=2}^{0} 2^{j+k} \, COUNT(PX \wedge P_{ij} \wedge P_{ik}) - 2 \sum_{i=1}^{2} a_i \sum_{j=2}^{0} 2^j \, COUNT(PX \wedge P_{ij}) + COUNT(PX) \sum_{i=1}^{2} a_i^2$

    = (2^4·4 + 2^3·4 + 2^2·3 + 2^3·4 + 2^2·4 + 2^1·3 + 2^2·3 + 2^1·3 + 2^0·3
    + 2^4·2 + 2^3·1 + 2^2·1 + 2^3·1 + 2^2·3 + 2^1·1 + 2^2·1 + 2^1·1 + 2^0·2)
    − 2·(2·(2^2·4 + 2^1·4 + 2^0·3) + 3·(2^2·2 + 2^1·3 + 2^0·2))
    + 4·(2^2 + 3^2)
    = 105

Similarly, the total variation of class C2 about a, denoted TV(C2, a), can be computed as:

    TV(C2, a)
    = (2^4·2 + 2^3·1 + 2^2·0 + 2^3·1 + 2^2·4 + 2^1·2 + 2^2·0 + 2^1·2 + 2^0·3
    + 2^4·5 + 2^3·1 + 2^2·2 + 2^3·1 + 2^2·2 + 2^1·1 + 2^2·2 + 2^1·1 + 2^0·3)
    − 2·(2·(2^2·2 + 2^1·4 + 2^0·3) + 3·(2^2·5 + 2^1·2 + 2^0·3))
    + 6·(2^2 + 3^2)
    = 42

4. EXPERIMENTAL RESULTS

In this section, we report the experimental results. The objective is to compare the efficiency and scalability of the VSSD, employing a vertical approach (vertical data structure with horizontal bitwise operations), and the HSSD, utilizing a horizontal approach (horizontal data structure with vertical scans).
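The worked example for class C1 can be replayed by plugging the retained counts straight into equation (9), with no COUNT operations at all. This sketch is ours; the counts are hard-coded in (i, j, k) order with j, k running 2, 1, 0, matching the values we read off Table 2.

```python
# Replaying TV(C1, a) for a = (2, 3) from retained counts only.

jk = [(j, k) for j in (2, 1, 0) for k in (2, 1, 0)]   # 9 (j, k) pairs
c1_pair = [4, 4, 3, 4, 4, 3, 3, 3, 3,   # i = 1: COUNT(PX ^ P1j ^ P1k)
           2, 1, 1, 1, 3, 1, 1, 1, 2]   # i = 2: COUNT(PX ^ P2j ^ P2k)
c1_single = [4, 4, 3,                    # i = 1: COUNT(PX ^ P1j), j = 2,1,0
             2, 3, 2]                    # i = 2: COUNT(PX ^ P2j), j = 2,1,0
c1_px, a = 4, (2, 3)                     # COUNT(PX) and the query point

t1 = sum(2 ** (j + k) * c for (j, k), c in zip(jk * 2, c1_pair))
t2 = sum(a[i] * 2 ** j * c
         for i in range(2)
         for j, c in zip((2, 1, 0), c1_single[3 * i: 3 * i + 3]))
t3 = c1_px * (a[0] ** 2 + a[1] ** 2)
print(t1, t2, t3)          # 257 102 52
print(t1 - 2 * t2 + t3)    # 105
```

Changing a only changes the arithmetic, not the counts, which is exactly why the per-point cost is O(1).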
The HSSD is defined as shown in equation (3). Both the VSSD and the HSSD were implemented in the C++ programming language. The P-tree API [12] was also used in the implementation of the VSSD. The performance of the algorithms was observed on several machines with different specifications, including an SGI Altix CC-NUMA machine. The specifications of the machines are listed in Table 3.

Table 3. The specification of the machines.

    Machine Name  Specification                     Memory Size
    AMD           AMD Athlon K7 1.4GHz processor    1GB
    P4            Intel P4 2.4GHz processor         3.8GB
    SGI Altix     SGI Altix CC-NUMA, 12 processors  Shared memory (12 x 4GB)

4.1. Datasets

The datasets were taken from a set of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota (longitude 97° 42' 18" W), taken in 1998. The images contain three bands: red, green, and blue reflectance values. The values are between 0 and 255, which can be represented in binary using 8 bits. The original image is 1024x1024 pixels (a cardinality of 1,048,576), depicted in Figure 3. Corresponding synchronized data for soil moisture, soil nitrate, and crop yield were also used, with crop yield selected as the class attribute. Combining all bands and the synchronized data, we obtained a dataset with 6 dimensions (five feature attributes and one class attribute).

Additional datasets with different cardinalities were synthetically generated from the original dataset to study the speed and scalability of the methods. Both speed and scalability were evaluated with respect to dataset size. Because the cardinality of the original dataset (1,048,576 records) is relatively small, we super-sampled it using a simple image processing tool to produce five larger datasets, with cardinalities of 2,097,152, 4,194,304 (2048x2048 pixels), 8,388,608, 16,777,216 (4096x4096 pixels), and 25,160,256 (5016x5016 pixels).
We categorized the crop yield attribute into four categories to simulate various subsets in the datasets: low yield, with intensity between 0 and 63; medium low yield, with intensity between 64 and 127; medium high yield, with intensity between 128 and 191; and high yield, with intensity between 192 and 255.

Figure 3. The original image of the dataset.

4.2. Computation Time and Scalability

Our first observation evaluated the performance of the VSSD and the HSSD when running on the different machines. We found that the VSSD is significantly faster than the HSSD on all machines. The VSSD takes only 0.0004 seconds on average to compute the total variations of each set (low yield, medium low yield, medium high yield, and high yield) about 5 tested points. As discussed in Section 3.3, the cost of the VSSD lies in the computation of the count values. However, this computation is extremely fast because the COUNT operations simply count the number of 1-bits in the bit patterns. For each dataset, the COUNT operations were executed 1,280 times, derived from 4 x 5 x 8^2, i.e., the complexity of computing the count values, O(k·d·b²). Table 4 summarizes the amount of time the VSSD needs to run all COUNT operations on the different datasets and machines. Notice that when running on the AMD machine, the VSSD needs only 0.4 seconds on average to finish a single COUNT operation for the dataset of size 25,160,256, while on the P4 machine it needs only 0.15 seconds on average for the same dataset. The COUNT operation was even faster when the VSSD ran on the SGI Altix: it takes 183.81 seconds to complete all 1,280 COUNT operations, or on average only 0.14 seconds per COUNT operation.

Table 4. Time for VSSD to compute all count values.
    Dataset Sizes   Time (Seconds)
                    AMD      P4       SGI Altix
    1,048,576       14.57    5.05     4.39
    2,097,152       36.32    11.05    9.19
    4,194,304       75.89    24.03    21.73
    8,388,608       147.79   49.69    50.25
    16,777,216      305.22   97.59    121.73
    25,160,256      513.98   192.07   183.81

The computation of a total variation is very fast for the VSSD once the count values have been obtained; it is a matter of taking the appropriate count values and completing the summation in equation (9), but without the COUNT operations. We report only the time to compute the total variations when the VSSD ran on the AMD machine, because the same time was also observed on the other machines.

In contrast, the HSSD takes more time to compute the total variations. The time is linear in the dataset size and differs on each machine. For example, on the AMD machine, the HSSD takes 79.86 and 132.17 seconds on average to compute the total variations for the datasets of size 16,777,216 and 25,160,256, respectively. On the P4 machine, the HSSD takes 98.62 and 155.06 seconds on average for the same datasets. The same phenomenon was found when the HSSD ran on the SGI Altix machine: the average time to compute the total variations for the dataset of size 16,777,216 is twice that for the dataset of size 8,388,608. Table 5 shows the average time to compute the total variations on the different machines, and Figure 4 illustrates the time trend. The times in the table show a clear advantage for the proposed approach.

Table 5. The average time to compute the total variations under different machines.
    Dataset Sizes   Average Time to Compute the Total Variations (Seconds)
                    HSSD                               VSSD
                    AMD      P4       SGI Altix       AMD
    1,048,576       5.30     6.14     6.79            0.0004
    2,097,152       10.58    12.27    13.84           0.0004
    4,194,304       18.40    24.73    27.64           0.0004
    8,388,608       36.85    50.15    55.10           0.0004
    16,777,216      79.86    98.62    109.76          0.0004
    25,160,256      132.17   155.06   164.95          0.0004

Figure 4. Time trend to compute total variations.

It is important to note that the significant disparity in computation time between the VSSD and the HSSD is due to the capability of the VSSD to reuse the same count values once they are computed. As a result, the VSSD tends to take constant time when computing total variations, even as the dataset sizes vary. On the other hand, the HSSD must scan the dataset each time the total variations are computed; thus, the time to compute a total variation is linear in the cardinality of the dataset.

Our second observation compared the time to load the datasets into memory. We found that loading the datasets is more efficient when they are organized in the P-tree vertical structure than when they are organized in a horizontal structure; see Table 6. The reason is that when the datasets are organized in the P-tree vertical structure, they are stored in binary and hence can be loaded efficiently. On the other hand, when the datasets are organized in a horizontal structure, which is not in binary format, it takes more time to load them into memory.

Table 6. Loading time comparison.
    Dataset Sizes   Average Loading Time (Seconds)
                    Loading P-trees and Count Values   Loading Horizontal Datasets
                    AMD    P4     SGI Altix            AMD      P4       SGI Altix
    1,048,576       0.11   0.04   0.04                 31.65    24.12    25.92
    2,097,152       0.25   0.09   0.06                 63.22    48.21    51.96
    4,194,304       0.47   0.16   0.12                 118.61   97.87    103.98
    8,388,608       0.95   0.35   0.26                 243.84   202.69   208.61
    16,777,216      1.87   0.67   0.55                 489.59   389.96   415.43
    25,160,256      3.33   0.95   0.82                 784.57   588.27   625.33

5. CONCLUSION

We have introduced a vertical approach to computing set squared distance (total variation) and evaluated its performance. The empirical results clearly show that the VSSD is fast and scalable to very large datasets. The independence of the COUNT operations from the data point being examined is the major factor that makes the computation of total variation using the VSSD extremely fast, and the proposed approach is scalable due to the use of the P-tree vertical data structure. For future work, we will apply total variation analysis in classification and clustering algorithms. The efficiency of computing total variation using the vertical approach provides a window of opportunity for developing fast and scalable data mining techniques.

6. REFERENCES

[1] T. Abidin and W. Perrizo, "SMART-TV: A Fast and Scalable Nearest Neighbor Based Classifier for Data Mining," Proceedings of the 21st ACM Symposium on Applied Computing (SAC-06), Dijon, France, April 23-27, 2006.
[2] T.M. Cover and P.E. Hart, "Nearest Neighbor Pattern Classification," IEEE Transactions on Information Theory, IT-13, 1967, pp. 21-27.
[3] Q. Ding, M. Khan, A. Roy, and W. Perrizo, "The P-tree Algebra," Proceedings of the ACM Symposium on Applied Computing, pp. 426-431, 2002.
[4] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann, San Francisco, CA, 2001.
[5] J. A. Hartigan, "Clustering Algorithms," John Wiley & Sons, New York, NY, 1975.
[6] M. Khan, Q. Ding, and W. Perrizo, "K-nearest Neighbor Classification on Spatial Data Stream Using P-trees,"
Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 517-528, Taipei, Taiwan, May 2002.
[7] E.M. Knorr and R.T. Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets," Proceedings of the 24th International Conference on Very Large Data Bases (VLDB), pp. 392-403, 1998.
[8] A. Perera, T. Abidin, M. Serazi, G. Hamer, and W. Perrizo, "Vertical Set Square Distance Based Clustering without Prior Knowledge of K," Proceedings of the 14th International Conference on Intelligent and Adaptive Systems and Software Engineering (IASSE-05), July 20-22, 2005, Toronto, Canada.
[9] A. Perera, et al., "P-tree Classification of Yeast Gene Deletion Data," SIGKDD Explorations, 4(2), pp. 108-109, 2002.
[10] I. Rahal, D. Ren, and W. Perrizo, "A Scalable Vertical Model for Mining Association Rules," Journal of Information and Knowledge Management (JIKM), Vol. 3, No. 4, pp. 317-329, 2004.
[11] W. Perrizo, "Peano Count Tree Technology," Technical Report NDSU-CSOR-TR-01-1, 2001.
[12] P-tree Application Programming Interface (P-tree API) Documentation, http://midas.cs.ndsu.nodak.edu/~datasurg/ptree/