RDF: A Density-based Outlier Detection Method using Vertical Data Representation

Dongmei Ren, Baoying Wang, William Perrizo
Computer Science Department, North Dakota State University, Fargo, ND 58105, USA
[email protected]

Abstract

Outlier detection can lead to discovering unexpected and interesting knowledge, which is critically important in areas such as the monitoring of criminal activities in electronic commerce, credit card fraud, etc. In this paper, we develop an efficient density-based outlier detection method for large datasets. Our contributions are: a) we introduce a relative density factor (RDF); b) based on RDF, we propose an RDF-based outlier detection method which efficiently prunes the data points that are deep within clusters and detects outliers only within the remaining small subset of the data; c) the performance of our method is further improved by means of a vertical data representation, P-trees. We tested our method with NHL and NBA data. Our method shows an order of magnitude speed improvement compared to contemporary approaches.

1. Introduction

The problem of mining rare events, deviant objects, and exceptions is critically important in many domains, such as electronic commerce, networks, surveillance, and health monitoring. Outlier mining is drawing more and more attention. Current outlier mining approaches can be classified into five categories: statistic-based [1], distance-based [2][3][4], density-based [5][6], clustering-based [7], and deviation-based [8][9].

Density-based outlier detection approaches are attracting the most attention for KDD in large databases. Breunig et al. proposed a density-based approach to mining outliers over datasets with different densities and arbitrary shapes [5]. Their notion of outliers is local, in the sense that the outlier degree of an object is determined by taking into account the clustering structure in a bounded neighborhood of the object. The method does not suffer from the local density problem, so it can mine outliers over non-uniformly distributed datasets. However, the method needs three scans, and the neighborhood search computation is costly, which makes the method inefficient. Another density-based approach was introduced by Papadimitriou and Kitagawa [6] using the local correlation integral (LOCI). This method selects a point as an outlier if its multi-granularity deviation factor (MDEF) deviates by more than three times the standard deviation of MDEF in a neighborhood. However, the cost of computing the standard deviation is high.

In this paper, we propose an efficient density-based outlier detection method using a vertical data model, P-trees¹. We introduce a novel local density measurement, the relative density factor (RDF). RDF indicates the degree to which the density of a point P contrasts with those of its neighbors, and we take RDF as an outlierness measurement. Based on RDF, our method prunes the data points that are deep within clusters and detects outliers only within the remaining small subset of the data, which makes our method efficient. The performance of our algorithm is also enhanced significantly by means of P-trees. Our method was tested over NHL and NBA datasets. Experiments show that our method has an order of magnitude speed improvement, with comparable accuracy, over the current state-of-the-art density-based outlier detection approaches.

¹ Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT# K96130308.

2. Review of P-trees

In previous work, we proposed a novel vertical data structure, the P-tree.
In the P-tree approach, we decompose the attributes of relational tables into separate files by bit position and compress the vertical bit files using a data-mining-ready structure called the P-tree. Instead of processing horizontal data vertically, we process these vertical P-trees horizontally through fast logical operations. Since P-trees compress the data remarkably and P-tree logical operations scale extremely well, this vertical data structure has the potential to address non-scalability with respect to data size. In this section, we briefly review the features of P-trees that will be used in this paper, including their optimized logical operations.

Given a data set with d attributes, X = (A1, A2, ..., Ad), and the binary representation of the j-th attribute Aj as bj.m bj.m-1 ... bj.i ... bj.1 bj.0, we decompose each attribute into bit files, one file for each bit position [10]. Each bit file is converted into a P-tree. Logical AND, OR and NOT are the most frequently used P-tree operations; they facilitate efficient neighborhood search, pruning, and computation of RDF.

Calculation of inequality P-trees Px≥v and Px<v: Let x be a value of an attribute of the data set X, represented by the bits xm ... x0; let Pm, Pm-1, ..., P0 be the P-trees for the vertical bit files of that attribute, and P'm, P'm-1, ..., P'0 the P-trees for the complements of those bit files. Let v = bm ... bi ... b0, where bi is the i-th binary bit of v. Then

  Px≥v = Pm opm Pm-1 opm-1 ... op1 P0,   (a)
  Px<v = P'm opm P'm-1 opm-1 ... opk+1 P'k,  0 ≤ k ≤ m,   (b)

where in (a) opi is ∧ if bi = 1 and ∨ otherwise; in (b) opi is ∨ if bi = 1 and ∧ otherwise, and k is the rightmost bit position at which bi = 1. Here ∧ stands for AND and ∨ for OR. The operators are right binding, which means they associate from right to left; e.g., P2 op2 P1 op1 P0 is equivalent to (P2 op2 (P1 op1 P0)).

High Order Bit (HOBit) metric: The HOBit metric [12] is a bitwise distance function. It measures distance based on the most significant consecutive matching bit positions, starting from the left. Assume Ai is an attribute in a tabular data set R(A1, A2, ..., An), with values represented as binary numbers x = x(m)x(m-1)...x(1)x(0).x(-1)...x(-n). Let X and Y be the Ai values of two tuples/samples. The HOBit similarity between X and Y is defined as

  m(X, Y) = max{ s | x(m-i) ⊕ y(m-i) = 0 for all 0 ≤ i < s },

i.e., the number of consecutive matching bit positions counted from the most significant bit, where ⊕ denotes the XOR (exclusive OR) operation. Correspondingly, the HOBit dissimilarity is defined as dm(X, Y) = Nbit - m, where Nbit is the number of bits of the attribute.
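To make these vertical primitives concrete, the following minimal Python sketch mimics the vertical layout with plain integers as uncompressed bit vectors: bit j of each mask corresponds to record j, so AND/OR/NOT over a whole column become single integer operations. It deliberately omits the quadrant compression that gives real P-trees their space savings, and all names (bit_slices, p_greater_equal, hobit_dissimilarity) are ours, for illustration only.

def bit_slices(values, nbits):
    """Decompose a column of nbits-wide integers into vertical bit files.
    slices[i] has record-bit j set iff bit i of values[j] is 1."""
    slices = [0] * nbits
    for j, v in enumerate(values):
        for i in range(nbits):
            if (v >> i) & 1:
                slices[i] |= (1 << j)
    return slices

def p_greater_equal(slices, v, nbits, nrecords):
    """Evaluate Px>=v with the right-binding recurrence of formula (a):
    opi is AND when bit i of v is 1, OR when it is 0.
    Right binding = accumulate from the low-order end upward."""
    full = (1 << nrecords) - 1     # all-ones mask: TRUE for every record
    acc = full
    for i in range(nbits):         # i = 0 .. m, low bit first
        if (v >> i) & 1:
            acc = slices[i] & acc  # opi = AND for bi = 1
        else:
            acc = slices[i] | acc  # opi = OR  for bi = 0
    return acc

def hobit_dissimilarity(x, y, nbits):
    """HOBit dissimilarity dm = Nbit - m, where m counts the consecutive
    matching bits from the most significant position downward."""
    m = 0
    for i in range(nbits - 1, -1, -1):
        if ((x >> i) & 1) != ((y >> i) & 1):
            break
        m += 1
    return nbits - m

For example, with values = [3, 5, 2, 7] and nbits = 3, p_greater_equal(bit_slices(values, 3), 4, 3, 4) returns the mask 0b1010: exactly records 1 and 3 (values 5 and 7) satisfy x ≥ 4.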
3. RDF-based Outlier Detection Using P-trees

In this section, we first introduce some definitions related to outlier detection, and then propose an RDF-based outlier detection method. The performance of the algorithm is enhanced significantly by means of the bitwise vertical data structure, P-trees, and its optimized logical operations.

3.1. Outlier Definitions

From the density point of view, a point P is an outlier if its density is much lower than those of its neighbors. Based on this intuition, we propose the following definitions.

Definition 1 (neighborhood) The neighborhood of a data point P with radius r is defined as the set Nbr(P, r) = {x ∈ X | |P - x| ≤ r}, where |P - x| is the distance between P and x. It is also called the r-neighborhood of P. The points in this neighborhood are called the neighbors of P, or direct r-neighbors of P. The number of neighbors of P is denoted N(Nbr(P, r)). The indirect neighbors of P are those points that lie within the r-neighborhood of the direct neighbors of P but are not themselves direct neighbors of P.

Definition 2 (density factor) Given a data point P and the neighborhood radius r, the density factor (DF) of P, denoted DF(P, r), is a measurement of the local density around P. It is defined as

  DF(P, r) = N(Nbr(P, r)) / r^d,   (1)

where d is the number of dimensions. The neighborhood density factor of the point P, denoted DFnbr(P, r), is the average density factor of the neighbors of P:

  DFnbr(P, r) = [ DF(q1, r) + DF(q2, r) + ... + DF(qN, r) ] / N(Nbr(P, r)),

where qi, i = 1, 2, ..., N = N(Nbr(P, r)), are the neighbors of P. The relative density factor (RDF) of the point P, denoted RDF(P, r), is the ratio of the neighborhood density factor of P over its own density factor:

  RDF(P, r) = DFnbr(P, r) / DF(P, r).   (2)

RDF indicates the degree to which the density of the point P contrasts with those of its neighbors. We take RDF as an outlierness measurement: it indicates the degree to which a point can be an outlier in the view of the whole dataset.

Definition 3 (outliers) Based on RDF, we define the outliers of a dataset X as the subset with RDF greater than t, where t is an RDF threshold chosen per application. The outlier set is denoted Ols(X, t) = {x ∈ X | RDF(x, r) > t}.
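As a concrete reading of Definitions 1-3 and equations (1) and (2), the sketch below computes DF and RDF naively over an explicit point array. It is for illustration only: it costs O(n²) per dataset, which is exactly the cost the P-tree version in Section 3.2 is designed to avoid, and it assumes the L-infinity distance (P-tree neighborhoods are axis-aligned, so the max-norm is a natural stand-in; the paper's |P - x| is generic). The helper names are ours.

import numpy as np

def neighbors(X, p, r):
    """Nbr(P, r): indices of points of X within distance r of p
    (L-infinity norm, an assumption of this sketch)."""
    return np.where(np.max(np.abs(X - p), axis=1) <= r)[0]

def density_factor(X, p, r):
    """Equation (1): DF(P, r) = N(Nbr(P, r)) / r^d."""
    d = X.shape[1]
    return len(neighbors(X, p, r)) / r ** d

def relative_density_factor(X, p, r):
    """Equation (2): RDF(P, r) = DFnbr(P, r) / DF(P, r), where DFnbr
    averages DF over the neighbors of P (P itself included, since |P-P| = 0)."""
    idx = neighbors(X, p, r)
    df_nbr = sum(density_factor(X, X[i], r) for i in idx) / len(idx)
    return df_nbr / density_factor(X, p, r)

# Definition 3 in one line: Ols(X, t) = {x in X | RDF(x, r) > t}
# X = np.random.rand(500, 2)
# outliers = [i for i in range(len(X))
#             if relative_density_factor(X, X[i], 0.05) > 2.0]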
3.2. RDF-based Outlier Detection with Pruning

Given a dataset X and an RDF threshold t, RDF-based outlier detection proceeds in two interleaved phases: a "zoom-out" procedure and a "zoom-in" procedure. Detection starts with the "zoom-out" procedure, which calls the "zoom-in" procedure when necessary; the "zoom-in" procedure in turn calls the "zoom-out" procedure as appropriate.

"Zoom-out" process: The procedure starts with an arbitrary point P and a small neighborhood radius r, and calculates the RDF of the point. There are three possible local data distributions with regard to the value of RDF, shown in Figure 1, where α is a small number and β is a large number. In our experiments, we chose α < 0.3 and β > 12, which leads to a good balance between accuracy and pruning speed.

Figure 1. Three different local data distributions: (a) RDF = 1 ± α; (b) RDF ≤ 1/β; (c) RDF ≥ β.

In case (a), neither the point P nor its direct and indirect neighbors are outliers; the local neighbors are distributed uniformly. The "zoom-in" procedure is called to quickly reach points located on a cluster boundary, or outlier points.

In case (b), the point P is highly likely to be a center point of a cluster. We prune all neighbors of P while calculating the RDF of each of the indirect r-neighbors. If the RDF of a point is larger than the threshold t, the point is inserted into the outlier set together with its RDF value.

In case (c), the RDF is large, so P is inserted into the outlier set, and we prune all the indirect neighbors of P.

"Zoom-in" process: The "zoom-in" procedure is a pruning process based on neighborhood expansion. We calculate DF and observe the change in DF values. First we increase the radius from r to 2r, compute DF(P, 2r) and compare it with DF(P, r). If DF(P, r) is close to DF(P, 2r), the whole 2r-neighborhood has uniform density; we therefore keep enlarging the radius (e.g., doubling or quadrupling it) until a significant change is observed. When a significant decrease in DF is observed, a cluster boundary and potential outliers have been reached, so the "zoom-out" procedure is called to detect outliers at a finer scale. Figure 2 shows this case: all 4r-neighbors are pruned off, and the "zoom-out" procedure detects outliers over the points in the 4r to 6r ring. When a significant increase in DF is observed, we pick a point with a high DF value, which is likely to be in a denser cluster, and call the "zoom-in" procedure again to prune off all points in that dense cluster.

Figure 2. "Zoom-in" process followed by "zoom-out".

As we can see, our method runs the "zoom-out" outlier detection only over a small candidate set: the boundary points and the outliers. This subset is, as a whole, much smaller than the original dataset, which is where the performance of our algorithm comes from.

Both the "zoom-in" and "zoom-out" procedures are further improved by using the P-tree data structure and its optimized logical operations. The speed improvement comes from three sources: a) P-trees support the "zoom-in" process on the fly using the HOBit metric; b) P-trees make neighborhood search very efficient through their logical operations; c) a P-tree can serve as a self-index for the unprocessed dataset, the clustered dataset, and the outlier set, so pruning is executed efficiently by P-tree logical operations.

"Zoom-in" using the HOBit metric: Given a point P, we define the neighbors of P hierarchically, based on the HOBit dissimilarity between P and its neighbors; we call them ξ-neighbors. The ξ-neighbors are the neighbors within ξ bits of dissimilarity, where ξ = 1, 2, ..., 8 if P is an 8-bit value. The basic calculations in the procedure are computing DF(P, ξ) for each ξ-neighborhood and pruning the neighborhood. HOBit dissimilarity is calculated by means of P-tree AND. For a data point P, let P = b1,1 b1,2 ... bn,m, where bi,j is the j-th bit value in the i-th attribute column of P. The attribute P-tree for the ξ-HOBit neighbors of P in the i-th attribute is defined by

  Pvi,ξ = Ppi,1 ∧ Ppi,2 ∧ ... ∧ Ppi,m-ξ,

and the ξ-neighborhood P-tree for P over all dimensions is calculated by

  PNp,ξ = Pv1,ξ ∧ Pv2,ξ ∧ ... ∧ Pvn,ξ.

The density factor DF(P, ξ) of the ξ-neighborhood is simply the root count of PNp,ξ divided by the neighborhood radius r. The neighborhood pruning is accomplished by

  PU = PU ∧ PN'p,ξ,

where PU is a P-tree representing the unprocessed points of the dataset, and PN'p,ξ represents the complement of PNp,ξ.

"Zoom-out" using inequality P-trees: In the "zoom-out" procedure, we use inequality P-trees to search for the neighborhood upon which the RDF is calculated. The direct-neighborhood P-tree of a given point P within radius r, denoted PDNp,r, is the P-tree representation of its direct neighbors, calculated by

  PDNp,r = Px>p-r ∧ Px≤p+r.

The root count of PDNp,r is equal to N(Nbr(P, r)). Accordingly, DF(P, r) and RDF(P, r) are calculated by equations (1) and (2), respectively. Using P-tree AND operations, the pruning is executed as follows: in case RDF(P, r) = 1 ± α, non-outlier points are pruned by PU = PU ∧ PDN'p,r ∧ PIN'p,r, where PINp,r denotes the P-tree of the indirect neighbors of P; in case RDF ≤ 1/β, the dataset is pruned by PU = PU ∧ PDN'p,r; in case RDF ≥ β, the dataset is pruned by PU = PU ∧ PDN'p,r ∧ PIN'p,r.
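The sketch below illustrates these two pruning primitives in the same uncompressed-bitmask style as the Section 2 sketch, reusing bit_slices() and p_greater_equal() from there. As before, the function names and the uncompressed representation are our illustrative assumptions, not the paper's implementation; real P-trees perform the same logic on compressed trees.

def xi_neighborhood(slices_per_attr, p, nbits, xi, nrecords):
    """PNp,xi: records that agree with point p on the top (nbits - xi) bits
    of every attribute, i.e. HOBit dissimilarity <= xi in each dimension."""
    full = (1 << nrecords) - 1
    pn = full
    for slices, pv in zip(slices_per_attr, p):   # one bit-slice list per attribute
        for i in range(nbits - 1, xi - 1, -1):   # bit positions nbits-1 .. xi
            # keep records whose bit i equals p's bit i (Ppi,j or its complement)
            pn &= slices[i] if (pv >> i) & 1 else (full & ~slices[i])
    return pn

def direct_neighborhood(slices, pv, r, nbits, nrecords):
    """PDNp,r = Px>p-r AND Px<=p+r for a single attribute; assumes the interval
    [p-r, p+r] stays inside [0, 2^nbits - 1] so the inequality masks are valid."""
    full = (1 << nrecords) - 1
    gt = p_greater_equal(slices, pv - r + 1, nbits, nrecords)          # x > p-r
    le = full & ~p_greater_equal(slices, pv + r + 1, nbits, nrecords)  # x <= p+r
    return gt & le

def prune(pu, pn, nrecords):
    """PU = PU AND PN': remove a whole neighborhood from the unprocessed set
    in one logical operation, with PU acting as a self-index."""
    full = (1 << nrecords) - 1
    return pu & (full & ~pn)

For multi-attribute data, the full PDNp,r is the AND of direct_neighborhood() over all attributes, mirroring the per-dimension AND used for PNp,ξ.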
We ran the methods on a 1400-MHz AMD machine with 1 GB of main memory and Debian Linux version 4.0. The datasets we used are the National Hockey League (NHL 96) dataset and the NBA dataset. Due to space limitations, we only show our results on the NHL dataset in this paper; the results on the NBA dataset lead to the same conclusions in terms of speed and scalability.

Figure 3 shows that our method has an order of magnitude improvement in speed over the aLOCI method. Figure 4 shows that our method is the most scalable of the three: when the data size is large, e.g., 16384, our method starts to outperform the other two methods.

Figure 3. Run time comparison of LOF, aLOCI and RDF (run time in seconds):

  Data size | LOF     | aLOCI  | RDF
  256       | 0.23    | 0.17   | 0.58
  1024      | 1.92    | 1.87   | 2.1
  4096      | 38.79   | 35.81  | 8.34
  16384     | 103.19  | 87.34  | 37.82
  65536     | 1813.43 | 985.39 | 108.91

Figure 4. Scalability comparison of LOF, aLOCI and RDF.

5. Conclusion

In this paper, we propose a density-based outlier detection method based on a novel local density measurement, RDF. The method can efficiently mine outliers over large datasets and scales well as data size increases. A vertical data representation, P-trees, is used to speed up the process further. Our method was tested over the NHL and NBA datasets. Experiments show that our method has an order of magnitude speed improvement, with comparable accuracy, over the current state-of-the-art density-based outlier detection approaches.

6. References

[1] V. Barnett and T. Lewis, Outliers in Statistical Data, John Wiley & Sons, NY, 1994.
[2] E. M. Knorr and R. T. Ng, "A Unified Notion of Outliers: Properties and Computation", Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining, 1997, pp. 219-222.
[3] E. M. Knorr and R. T. Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets", Proc. 24th Int. Conf. on Very Large Data Bases, 1998.
[4] S. Ramaswamy, R. Rastogi and K. Shim, "Efficient Algorithms for Mining Outliers from Large Data Sets", Proc. 2000 ACM SIGMOD Int. Conf. on Management of Data, 2000, ISSN 0163-5808.
[5] M. M. Breunig, H.-P. Kriegel, R. T. Ng and J. Sander, "LOF: Identifying Density-Based Local Outliers", Proc. 2000 ACM SIGMOD Int. Conf. on Management of Data, Dallas, TX, 2000.
[6] S. Papadimitriou, H. Kitagawa, P. B. Gibbons and C. Faloutsos, "LOCI: Fast Outlier Detection Using the Local Correlation Integral", Proc. 19th Int. Conf. on Data Engineering, Bangalore, India, 2003.
[7] A. K. Jain, M. N. Murty and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, 31(3):264-323, 1999.
[8] A. Arning, R. Agrawal and P. Raghavan, "A Linear Method for Deviation Detection in Large Databases", Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996, pp. 164-169.
[9] S. Sarawagi, R. Agrawal and N. Megiddo, "Discovery-Driven Exploration of OLAP Data Cubes", Proc. EDBT '98.
[10] Q. Ding, M. Khan, A. Roy and W. Perrizo, "The P-tree Algebra", Proc. ACM Symposium on Applied Computing, 2002.
[11] W. Perrizo, "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.
[12] F. Pan, B. Wang, Y. Zhang, D. Ren, X. Hu and W. Perrizo, "Efficient Density Clustering for Spatial Data", Proc. PKDD 2003.