RDF: A Density-based Outlier Detection Method Using Vertical Data Representation
Dongmei Ren, Baoying Wang, William Perrizo
North Dakota State University, U.S.A.
Introduction

Related Work
• Breunig et al. [6] first proposed a density-based approach to mining outliers over datasets with different densities.
• Papadimitriou & Kitagawa [7] introduced the local correlation integral (LOCI), but it is not efficient.
Contributions of this paper
1. A relative density factor (RDF): RDF expresses the same amount of information as LOF (local outlier factor) [6] and MDEF (multi-granularity deviation factor) [7], but RDF is easier to compute.
2. An RDF-based outlier detection method: it efficiently prunes the data points which are deep in clusters, and detects outliers only within the remaining small subset of the data.
3. A vertical data representation in P-Trees: P-Trees improve the efficiency of the method further.
Definitions
Definition 1: Disk Neighborhood --- DiskNbr(x, r)
Given a point x and radius r, the disk neighborhood of x is defined as the set DiskNbr(x, r) = {x′ ∈ X | d(x, x′) ≤ r}, where d(x, x′) is the distance between x and x′.
[Figure: direct and indirect neighbors of x --- the direct disk neighborhood around x and the disk neighborhoods of its neighbors]
Definition 2: Density of DiskNbr(x, r) --- Dens(x, r)
Dens(x, r) = |DiskNbr(x, r)| / r^dim, where dim is the number of dimensions.
Definitions (Continued)
[Figure: point x with its direct neighbors inside radius r and its indirect neighbors in the ring between r and 2r]
Definition 3: Relative Density Factor (RDF) of point x with radius r --- RDF(x, r)

RDF(x, r) = AVG_{q ∈ DiskNbr(x,r)} Dens(q, r) / Dens(x, r)
          = Σ_{q ∈ DiskNbr(x,r)} |DiskNbr(q, r)| / |DiskNbr(x, r)|²

Special case: RDF between DiskNbr(x, r) and {DiskNbr(x, 2r) − DiskNbr(x, r)}:

RDF(x, r) = (|DiskNbr(x, 2r)| − |DiskNbr(x, r)|) / (|DiskNbr(x, r)| · (2^dim − 1))

RDF is used to measure outlierness. Outliers are points with high RDF values.
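To make the definitions concrete, here is a minimal sketch of DiskNbr, Dens, and RDF in Python with a brute-force neighborhood search; the function names and the NumPy-based distance computation are illustrative assumptions, not the paper's P-Tree implementation.

```python
import numpy as np

def disk_nbr(X, x, r):
    """Indices of points within distance r of x (the disk neighborhood).
    Note that x itself is always included."""
    return np.where(np.linalg.norm(X - x, axis=1) <= r)[0]

def dens(X, x, r):
    """Density of DiskNbr(x, r): neighbor count divided by r^dim."""
    return len(disk_nbr(X, x, r)) / r ** X.shape[1]

def rdf(X, x, r):
    """Relative density factor: average density of x's neighbors over the
    density of x, which simplifies to the sum of |DiskNbr(q, r)| over all
    neighbors q, divided by |DiskNbr(x, r)| squared."""
    nbrs = disk_nbr(X, x, r)
    total = sum(len(disk_nbr(X, X[q], r)) for q in nbrs)
    return total / len(nbrs) ** 2
```

A point deep inside a uniform cluster sees neighbors about as dense as itself, so its RDF stays near 1; a point just outside a cluster boundary has few neighbors of its own while those neighbors sit in denser regions, pushing its RDF well above 1.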
The Proposed Outlier Detection Method
Given a dataset X, the proposed outlier detection method proceeds from a start point x by alternating two procedures:
• Find Outliers
• Prune Non-outliers
[Figure: starting from a point p, the neighborhood expands over radii r, 2r, 4r, 6r; points deep in the cluster are pruned as non-outliers and the remaining points are flagged as outliers]
Our method prunes non-outliers (points deep in clusters) efficiently, then finds outliers over the remaining small subset of the data, which consists of points on cluster boundaries and real outliers.
Finding Outliers
Three possible distributions with regard to RDF (see the sketch after this list):
(a) 1/(1+ε) ≤ RDF ≤ (1+ε): prune all neighbors and call the "Pruning Non-outliers" procedure;
(b) RDF < 1/(1+ε): prune all direct neighbors of x, and calculate RDF for each indirect neighbor;
(c) RDF > (1+ε): x is an outlier; prune the indirect neighbors of x.
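As a sketch, the three-way split might look as follows, reusing the rdf helper from the earlier block; the case labels, the eps parameter name, and the return convention are illustrative assumptions.

```python
def classify_point(X, x, r, eps):
    """Return which of the three RDF distributions applies to x."""
    value = rdf(X, x, r)
    if value > 1 + eps:
        return "outlier"            # case (c): flag x, prune its indirect neighbors
    if value < 1 / (1 + eps):
        return "check_indirect"     # case (b): prune direct neighbors, test indirect ones
    return "prune_neighborhood"     # case (a): density is flat, hand off to non-outlier pruning
```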
Finding Outliers using P-Trees
P-Tree based direct neighbors --- PDN(x, r)
For a point x, let X = (x1, x2, …, xn), where each attribute xi = (xi,m−1, …, xi,0) and xi,j is the jth bit value of the ith attribute.
• For the ith attribute, PDN(xi, r) = P_{x′ > xi−r} AND P_{x′ ≤ xi+r}
• For multiple attributes, PDN(x, r) = AND_{i = 0, …, n−1} PDN(xi, r)
• |DiskNbr(x, r)| = rc(PDN(x, r)), where rc is the root count of the P-tree
P-Tree based indirect neighbors --- PIN(x, r)
PIN(x, r) = (OR_{q ∈ DiskNbr(x,r)} PDN(q, r)) AND PDN′(x, r)
Pruning is done by P-Tree ANDing according to the three distributions above (a bit-mask sketch follows):
(a): PU = PU AND PDN′(x, r) AND PIN′(x, r)
(b): PU = PU AND PDN′(x, r)
(c): PU = PU AND PIN′(x, r)
where PU is a P-tree representing the unprocessed data points and ′ denotes the complement.
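A P-Tree behaves like a compressed bit vector over the dataset with fast AND, OR, complement, and a root count rc() that tallies the set bits. The sketch below imitates that logic with plain Python integers as masks (one bit per point); it reproduces the behavior, not the compression, and the box-shaped per-attribute range predicate is exactly what ANDing the PDN(xi, r) predicates yields.

```python
import numpy as np

def rc(mask):
    """Root count: the number of points (set bits) in the mask."""
    return bin(mask).count("1")

def bits(mask, n):
    """Indices of the set bits, i.e. the points selected by the mask."""
    return [i for i in range(n) if (mask >> i) & 1]

def direct_mask(X, x, r):
    """PDN-style mask: the per-attribute predicate xi - r < x' <= xi + r,
    ANDed across all attributes (an L-infinity box around x)."""
    inside = np.all((X > x - r) & (X <= x + r), axis=1)
    mask = 0
    for i in np.where(inside)[0]:
        mask |= 1 << int(i)
    return mask

def indirect_mask(X, x, r):
    """PIN-style mask: the OR of the direct masks of x's neighbors,
    ANDed with the complement of x's own direct mask."""
    n = len(X)
    universe = (1 << n) - 1
    pdn = direct_mask(X, x, r)
    union = 0
    for q in bits(pdn, n):
        union |= direct_mask(X, X[q], r)
    return union & (universe & ~pdn)
```

Case (b)'s pruning step, for instance, is then just `pu &= universe & ~direct_mask(X, X[i], r)`.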
Pruning Non-outliers
The pruning is a neighborhood-expanding process. It calculates the RDF between {DiskNbr(x, 2kr) − DiskNbr(x, kr)} and DiskNbr(x, kr), and prunes based on the value of RDF, where k is an integer. The three cases are listed below (a sketch of the loop follows the list).
[Figure: the neighborhood of the start point x expands over radii r, 2r, 4r, pruning non-outliers as it grows]
• 1/(1+ε) ≤ RDF ≤ (1+ε) (density stays constant): continue expanding the neighborhood by doubling the radius;
• RDF < 1/(1+ε) (significant decrease of density): stop expanding, prune DiskNbr(x, kr), and call the "Finding Outliers" procedure;
• RDF > (1+ε) (significant increase of density): stop expanding and call "Pruning Non-outliers" again.
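A bare-bones version of the expanding loop, reusing disk_nbr from the earlier sketch and the ring-over-disk special case of RDF, might read as follows; the max_doublings guard and the returned case labels are illustrative assumptions.

```python
def expand_and_prune(X, x, r, eps, max_doublings=10):
    """Double the radius while the ring/disk density ratio stays flat;
    return the final radius and which case stopped the expansion."""
    k = 1
    for _ in range(max_doublings):
        inner = len(disk_nbr(X, x, k * r))
        outer = len(disk_nbr(X, x, 2 * k * r))
        # The special-case RDF: ring density relative to disk density.
        ratio = (outer - inner) / (inner * (2 ** X.shape[1] - 1))
        if ratio < 1 / (1 + eps):   # density drops sharply: prune the disk
            return k * r, "find_outliers"
        if ratio > 1 + eps:         # density jumps: restart pruning from here
            return k * r, "prune_again"
        k *= 2                      # density roughly constant: keep expanding
    return k * r, "expansion_limit"
```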
Pruning Non-outliers Using P-Trees
We define ξ-neighbors: the neighbors with ξ bits of dissimilarity from x, e.g. ξ = 1, 2, …, 8 if x is an 8-bit value.
For a point x, let X = (x1, x2, …, xn), where each attribute xi = (xi,m−1, …, xi,0) and xi,j is the jth bit value of the ith attribute. For the ith attribute, the ξ-neighbors of x are calculated by

Pxi = AND_{j = ξ, …, m−1} Pxi,i,j, where Pxi,i,j = Pi,j if xi,j = 1, and Pxi,i,j = P′i,j if xi,j = 0

PXξ = AND_{i = 0, …, n−1} Pxi

The pruning is accomplished by:
PU = PU AND PX′ξ, where PX′ξ is the complement set of PXξ.
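Matching on all but the low-order ξ bits of each attribute is a shift-and-compare; the sketch below, again with integer masks, is one plausible reading of the ξ-neighborhood under the assumption of m-bit unsigned attribute values.

```python
def xi_neighbor_mask(points, x, xi):
    """Mask of points that agree with x on the high (m - xi) bits of every
    attribute, i.e. lie within xi bits of dissimilarity per attribute."""
    mask = 0
    for idx, p in enumerate(points):
        if all(a >> xi == b >> xi for a, b in zip(p, x)):
            mask |= 1 << idx
    return mask

# Example with one 8-bit attribute: xi = 2 ignores the two low-order bits,
# so the first two points match x and the third does not (mask = 0b11).
points = [(0b10110101,), (0b10110111,), (0b01000001,)]
print(bin(xi_neighbor_mask(points, (0b10110100,), xi=2)))
```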
RDF-based Outlier Detection Process
Algorithm: RDF-based Outlier Detection using P-Trees
Input: dataset X, radius r, distribution parameter ε.
Output: an outlier set Ols.
// PU --- unprocessed points represented by P-Trees
// |PU| --- number of points in PU
// Ols --- the outlier set
// Build up P-Trees for dataset X
PU ← createP-Trees(X);
WHILE |PU| > 0 DO
  x ← PU.first;  // pick an arbitrary unprocessed point x
  Ols ← FindOutliers(x, r, ε);
ENDWHILE
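Tying the earlier sketches together, the outer loop might be transcribed as below; the worklist set stands in for the P-Tree mask PU, and the per-case pruning is deliberately simplified (case (c) should also drop indirect neighbors, as the procedures that follow spell out).

```python
def detect_outliers(X, r, eps):
    """Outer loop: classify each unprocessed point by its RDF case and
    prune the corresponding neighborhood (cf. FindOutliers below)."""
    unprocessed = set(range(len(X)))
    outliers = []
    while unprocessed:
        i = next(iter(unprocessed))               # arbitrary start point
        nbrs = set(disk_nbr(X, X[i], r).tolist())
        case = classify_point(X, X[i], r, eps)
        if case == "outlier":                     # case (c)
            outliers.append(i)
            unprocessed.discard(i)
        elif case == "prune_neighborhood":        # case (a): drop the whole disk
            unprocessed -= nbrs
        else:                                     # case (b): keep indirect neighbors
            unprocessed -= nbrs | {i}
    return outliers
```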
"Find Outliers" and "Prune Non-outliers" Procedures
Algorithm: FindOutliers
Input: point x, radius r, distribution parameter ε
Output: pruned dataset PU
// PDN(x): direct neighbors of x
// PIN(x): indirect neighbors of x
// rdf: relative density factor
PDN(x, r) ← P_{X ≤ x+r} AND P_{X > x−r};
sum ← 0;
FOR each point q in PDN(x, r)
  PDN(q, r) ← P_{X ≤ q+r} AND P_{X > q−r};
  sum ← sum + |PDN(q, r)|;
ENDFOR
rdf ← sum / |PDN(x, r)|²;
CASE 1/(1+ε) ≤ rdf ≤ (1+ε):
  PU ← PU AND PDN′(x) AND PIN′(x);
  PruneNonOutliers(x, r, ε);
CASE rdf < 1/(1+ε):
  PU ← PU AND PDN′(x);
  FOR each point q in PIN(x)
    FindOutliers(q, r, ε);
  ENDFOR
CASE rdf > (1+ε):
  Ols ← Ols OR {x};  // add point x into the outlier set
  PU ← PU AND PIN′(x);
Algorithm: PruneNonOutliers
Input: point x, radius r, distribution parameter ε, dataset X
Output: pruned dataset PU
// Pi,j: P-tree for the jth bit of the ith attribute of X
// PNxξ: ξ-neighborhood of point x
// n: number of attributes; m: number of bits in each attribute
// P′i,j: complement of Pi,j
FOR i = 0 TO n−1
  FOR j = 0 TO m−1
    IF xi,j = 1 THEN Pxi,i,j ← Pi,j ELSE Pxi,i,j ← P′i,j
  ENDFOR
ENDFOR
ξ ← 0;
PNx0 ← AND of Pxi,i,j over all i and j;  // exact match on every bit
DO
  ξ ← ξ + 1;
  PX ← 1;
  FOR i = 0 TO n−1
    Pxi ← 1;
    FOR j = ξ TO m−1
      Pxi ← Pxi AND Pxi,i,j;  // match on the high m−ξ bits
    ENDFOR
    PX ← PX AND Pxi;
  ENDFOR
  PNxξ ← PX;
  rdf ← (rc(PNxξ) − rc(PNxξ−1)) / rc(PNxξ−1)²;
WHILE 1/(1+ε) ≤ rdf ≤ (1+ε)  // keep expanding while the density stays constant
q ← {PNx′ξ−1 AND PNxξ};  // the newly added ring of points
IF rdf < 1/(1+ε)
  PU ← PU AND PNx′ξ−1;  // pruning
  FindOutliers(q, r, ε);
ELSE IF rdf > (1+ε)
  PruneNonOutliers(q, r, ε);
ENDIF
Experimental Study
• Dataset: NHL data set (1996)
• Compared with LOF (Local Outlier Factor method) and aLOCI (approximate Local Correlation Integral method)

Run time and scalability comparison of LOF, aLOCI, and RDF (run time in seconds vs. data size):

Data Size | LOF     | aLOCI  | RDF
256       | 0.23    | 0.17   | 0.58
1024      | 1.92    | 1.87   | 2.1
4096      | 38.79   | 35.81  | 8.34
16384     | 103.19  | 87.34  | 37.82
65536     | 1813.43 | 985.39 | 108.91

[Figure: run-time and scalability comparison charts of LOF, aLOCI, and RDF over the data sizes above]

• Starting from 16,384 points, RDF outperforms LOF and aLOCI in terms of both scalability and speed.
References
1. V. Barnett, T. Lewis, "Outliers in Statistical Data", John Wiley & Sons.
2. E. M. Knorr, R. T. Ng, "A Unified Notion of Outliers: Properties and Computation", Proc. 3rd International Conference on Knowledge Discovery and Data Mining, 1997, pp. 219-222.
3. E. M. Knorr, R. T. Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets", Proc. Very Large Data Bases Conference, 1998, pp. 24-27.
4. E. M. Knorr, R. T. Ng, "Finding Intentional Knowledge of Distance-Based Outliers", Proc. Very Large Data Bases Conference, 1999, pp. 211-222.
5. S. Ramaswamy, R. Rastogi, K. Shim, "Efficient Algorithms for Mining Outliers from Large Datasets", Proc. 2000 ACM SIGMOD International Conference on Management of Data, 2000.
6. M. M. Breunig, H.-P. Kriegel, R. T. Ng, J. Sander, "LOF: Identifying Density-based Local Outliers", Proc. ACM SIGMOD 2000 International Conference on Management of Data, Dallas, TX, 2000.
7. S. Papadimitriou, H. Kitagawa, P. B. Gibbons, C. Faloutsos, "LOCI: Fast Outlier Detection Using the Local Correlation Integral", Proc. 19th International Conference on Data Engineering, March 5-8, 2003, Bangalore, India.
8. A. K. Jain, M. N. Murty, P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, 31(3):264-323, 1999.
9. A. Arning, R. Agrawal, P. Raghavan, "A Linear Method for Deviation Detection in Large Databases", Proc. 2nd International Conference on Knowledge Discovery and Data Mining, 1996, pp. 164-169.
10. S. Sarawagi, R. Agrawal, N. Megiddo, "Discovery-Driven Exploration of OLAP Data Cubes", EDBT 1998.
11. Q. Ding, M. Khan, A. Roy, W. Perrizo, "The P-tree Algebra", Proc. ACM SAC Symposium on Applied Computing, 2002.
12. W. Perrizo, "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.
13. M. Khan, Q. Ding, W. Perrizo, "k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees", Proc. PAKDD 2002, Springer-Verlag LNAI 2776, 2002.
14. B. Wang, F. Pan, Y. Cui, W. Perrizo, "Efficient Quantitative Frequent Pattern Mining Using Predicate Trees", CAINE 2003.
15. F. Pan, B. Wang, Y. Zhang, D. Ren, X. Hu, W. Perrizo, "Efficient Density Clustering for Spatial Data", PKDD 2003.
Thank you!
Determination of Parameters
Determination of r
• Breunig et al. [6] show that choosing MinPts = 10-30 works well in general (the MinPts-neighborhood).
• Choosing MinPts = 20, we get the average radius of the 20-neighborhood, r_average.
• In our algorithm, r = r_average = 0.5. A sketch for computing r_average follows.
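One plausible way to compute r_average with MinPts = 20 is the brute-force sketch below; the function name and the all-pairs distance matrix (fine for small datasets, O(n²) memory) are illustrative assumptions rather than the paper's procedure.

```python
import numpy as np

def average_knn_radius(X, k=20):
    """Mean distance to the k-th nearest neighbor over all points,
    i.e. the average radius of the k-neighborhood."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    d.sort(axis=1)            # column 0 is each point's distance to itself
    return d[:, k].mean()     # column k is the k-th nearest neighbor

# r_average = average_knn_radius(X, k=20)
```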
Determination of ε
• The selection of ε is a tradeoff between accuracy and speed: the larger ε is, the faster the algorithm works; the smaller ε is, the more accurate the results are.
• We chose ε = 0.8 experimentally and got the same result (the same outliers) as Breunig's method, but much faster.
• The results shown in the experimental part are based on ε = 0.8.