RDF: A Density-based Outlier Detection Method Using Vertical Data Representation
Dongmei Ren, Baoying Wang, William Perrizo
North Dakota State University, U.S.A.
Introduction

Related Work
• Breunig et al. [6] first proposed a density-based approach to mining outliers over datasets with different densities.
• Papadimitriou & Kitagawa [7] introduced the local correlation integral (LOCI), but it is not efficient.
Contributions of this paper
1. A relative density factor (RDF): RDF expresses the same amount of information as LOF (local outlier factor) [6] and MDEF (multi-granularity deviation factor) [7], but RDF is easier to compute.
2. An RDF-based outlier detection method: it efficiently prunes the data points which are deep in clusters, and detects outliers only within the remaining small subset of the data.
3. A vertical data representation in P-Trees: P-Trees improve the efficiency of the method further.
Definitions
Definition 1: Disk Neighborhood --- DiskNbr(x, r)
Given a point x and radius r, the disk neighborhood of x is defined as the set DiskNbr(x, r) = {x′ ∈ X | d(x, x′) ≤ r}, where d(x, x′) is the distance between x and x′.
[Figure: direct and indirect neighbors of x --- the direct disk neighborhood around x and the disk neighborhoods of its neighbors]
Definition 2: Density of DiskNbr(x, r) --- Dens(x, r)
Dens(x, r) = |DiskNbr(x, r)| / r^dim, where dim is the number of dimensions.
Definitions (Continued)
[Figure: point x with its direct neighbors inside radius r and its indirect neighbors in the ring between r and 2r]
Definition 3: Relative Density Factor (RDF) of point x with radius r --- RDF(x, r)

RDF(x, r) = AVG_{q ∈ DiskNbr(x,r)} Dens(q, r) / Dens(x, r)
          = Σ_{q ∈ DiskNbr(x,r)} |DiskNbr(q, r)| / |DiskNbr(x, r)|²

Special case: RDF between DiskNbr(x, r) and {DiskNbr(x, 2r) − DiskNbr(x, r)}:

RDF(x, r) = (|DiskNbr(x, 2r)| − |DiskNbr(x, r)|) / (|DiskNbr(x, r)| · (2^dim − 1))

RDF is used to measure outlierness. Outliers are points with high RDF values.
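To make the definitions concrete, here is a minimal sketch of DiskNbr, Dens, and RDF in Python with a brute-force neighborhood search; the function names and the NumPy-based distance computation are illustrative assumptions, not the paper's P-Tree implementation.

```python
import numpy as np

def disk_nbr(X, x, r):
    """Indices of points within distance r of x (the disk neighborhood).
    Note that x itself is always included."""
    return np.where(np.linalg.norm(X - x, axis=1) <= r)[0]

def dens(X, x, r):
    """Density of DiskNbr(x, r): neighbor count divided by r^dim."""
    return len(disk_nbr(X, x, r)) / r ** X.shape[1]

def rdf(X, x, r):
    """Relative density factor: average density of x's neighbors over the
    density of x, which simplifies to the sum of |DiskNbr(q, r)| over all
    neighbors q, divided by |DiskNbr(x, r)| squared."""
    nbrs = disk_nbr(X, x, r)
    total = sum(len(disk_nbr(X, X[q], r)) for q in nbrs)
    return total / len(nbrs) ** 2
```

A point deep inside a uniform cluster sees neighbors about as dense as itself, so its RDF stays near 1; a point just outside a cluster boundary has few neighbors of its own while those neighbors sit in denser regions, pushing its RDF well above 1.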
The Proposed Outlier Detection Method
Given a dataset X, the proposed outlier detection method proceeds from a start point x by alternating two procedures:
• Find Outliers
• Prune Non-outliers
[Figure: starting from a point p, the neighborhood expands over radii r, 2r, 4r, 6r; points deep in the cluster are pruned as non-outliers and the remaining points are flagged as outliers]
Our method prunes non-outliers (points deep in clusters) efficiently, then finds outliers over the remaining small subset of the data, which consists of points on cluster boundaries and real outliers.
Finding Outliers
Three possible distributions with regard to RDF (see the sketch after this list):
(a) 1/(1+ε) ≤ RDF ≤ (1+ε): prune all neighbors and call the "Pruning Non-outliers" procedure;
(b) RDF < 1/(1+ε): prune all direct neighbors of x, and calculate RDF for each indirect neighbor;
(c) RDF > (1+ε): x is an outlier; prune the indirect neighbors of x.
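As a sketch, the three-way split might look as follows, reusing the rdf helper from the earlier block; the case labels, the eps parameter name, and the return convention are illustrative assumptions.

```python
def classify_point(X, x, r, eps):
    """Return which of the three RDF distributions applies to x."""
    value = rdf(X, x, r)
    if value > 1 + eps:
        return "outlier"            # case (c): flag x, prune its indirect neighbors
    if value < 1 / (1 + eps):
        return "check_indirect"     # case (b): prune direct neighbors, test indirect ones
    return "prune_neighborhood"     # case (a): density is flat, hand off to non-outlier pruning
```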
Finding Outliers using P-Trees
P-Tree based direct neighbors --- PDN(x, r)
For a point x, let X = (x1, x2, …, xn), where each attribute xi = (xi,m−1, …, xi,0) and xi,j is the jth bit value of the ith attribute.
• For the ith attribute, PDN(xi, r) = P_{x′ > xi−r} AND P_{x′ ≤ xi+r}
• For multiple attributes, PDN(x, r) = AND_{i = 0, …, n−1} PDN(xi, r)
• |DiskNbr(x, r)| = rc(PDN(x, r)), where rc is the root count of the P-tree
P-Tree based indirect neighbors --- PIN(x, r)
PIN(x, r) = (OR_{q ∈ DiskNbr(x,r)} PDN(q, r)) AND PDN′(x, r)
Pruning is done by P-Tree ANDing according to the three distributions above (a bit-mask sketch follows):
(a): PU = PU AND PDN′(x, r) AND PIN′(x, r)
(b): PU = PU AND PDN′(x, r)
(c): PU = PU AND PIN′(x, r)
where PU is a P-tree representing the unprocessed data points and ′ denotes the complement.
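A P-Tree behaves like a compressed bit vector over the dataset with fast AND, OR, complement, and a root count rc() that tallies the set bits. The sketch below imitates that logic with plain Python integers as masks (one bit per point); it reproduces the behavior, not the compression, and the box-shaped per-attribute range predicate is exactly what ANDing the PDN(xi, r) predicates yields.

```python
import numpy as np

def rc(mask):
    """Root count: the number of points (set bits) in the mask."""
    return bin(mask).count("1")

def bits(mask, n):
    """Indices of the set bits, i.e. the points selected by the mask."""
    return [i for i in range(n) if (mask >> i) & 1]

def direct_mask(X, x, r):
    """PDN-style mask: the per-attribute predicate xi - r < x' <= xi + r,
    ANDed across all attributes (an L-infinity box around x)."""
    inside = np.all((X > x - r) & (X <= x + r), axis=1)
    mask = 0
    for i in np.where(inside)[0]:
        mask |= 1 << int(i)
    return mask

def indirect_mask(X, x, r):
    """PIN-style mask: the OR of the direct masks of x's neighbors,
    ANDed with the complement of x's own direct mask."""
    n = len(X)
    universe = (1 << n) - 1
    pdn = direct_mask(X, x, r)
    union = 0
    for q in bits(pdn, n):
        union |= direct_mask(X, X[q], r)
    return union & (universe & ~pdn)
```

Case (b)'s pruning step, for instance, is then just `pu &= universe & ~direct_mask(X, X[i], r)`.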
Pruning Non-outliers
The pruning is a neighborhood-expanding process. It calculates the RDF between {DiskNbr(x, 2kr) − DiskNbr(x, kr)} and DiskNbr(x, kr), and prunes based on the value of RDF, where k is an integer. The three cases are listed below (a sketch of the loop follows the list).
[Figure: the neighborhood of the start point x expands over radii r, 2r, 4r, pruning non-outliers as it grows]
• 1/(1+ε) ≤ RDF ≤ (1+ε) (density stays constant): continue expanding the neighborhood by doubling the radius;
• RDF < 1/(1+ε) (significant decrease of density): stop expanding, prune DiskNbr(x, kr), and call the "Finding Outliers" procedure;
• RDF > (1+ε) (significant increase of density): stop expanding and call "Pruning Non-outliers" again.
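A bare-bones version of the expanding loop, reusing disk_nbr from the earlier sketch and the ring-over-disk special case of RDF, might read as follows; the max_doublings guard and the returned case labels are illustrative assumptions.

```python
def expand_and_prune(X, x, r, eps, max_doublings=10):
    """Double the radius while the ring/disk density ratio stays flat;
    return the final radius and which case stopped the expansion."""
    k = 1
    for _ in range(max_doublings):
        inner = len(disk_nbr(X, x, k * r))
        outer = len(disk_nbr(X, x, 2 * k * r))
        # The special-case RDF: ring density relative to disk density.
        ratio = (outer - inner) / (inner * (2 ** X.shape[1] - 1))
        if ratio < 1 / (1 + eps):   # density drops sharply: prune the disk
            return k * r, "find_outliers"
        if ratio > 1 + eps:         # density jumps: restart pruning from here
            return k * r, "prune_again"
        k *= 2                      # density roughly constant: keep expanding
    return k * r, "expansion_limit"
```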
Pruning Non-outliers Using P-Trees
We define ξ-neighbors: the neighbors with ξ bits of dissimilarity from x, e.g. ξ = 1, 2, …, 8 if x is an 8-bit value.
For a point x, let X = (x1, x2, …, xn), where each attribute xi = (xi,m−1, …, xi,0) and xi,j is the jth bit value of the ith attribute. For the ith attribute, the ξ-neighbors of x are calculated by

Pxi = AND_{j = ξ, …, m−1} Pxi,i,j, where Pxi,i,j = Pi,j if xi,j = 1, and Pxi,i,j = P′i,j if xi,j = 0

PXξ = AND_{i = 0, …, n−1} Pxi

The pruning is accomplished by:
PU = PU AND PX′ξ, where PX′ξ is the complement set of PXξ.
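Matching on all but the low-order ξ bits of each attribute is a shift-and-compare; the sketch below, again with integer masks, is one plausible reading of the ξ-neighborhood under the assumption of m-bit unsigned attribute values.

```python
def xi_neighbor_mask(points, x, xi):
    """Mask of points that agree with x on the high (m - xi) bits of every
    attribute, i.e. lie within xi bits of dissimilarity per attribute."""
    mask = 0
    for idx, p in enumerate(points):
        if all(a >> xi == b >> xi for a, b in zip(p, x)):
            mask |= 1 << idx
    return mask

# Example with one 8-bit attribute: xi = 2 ignores the two low-order bits,
# so the first two points match x and the third does not (mask = 0b11).
points = [(0b10110101,), (0b10110111,), (0b01000001,)]
print(bin(xi_neighbor_mask(points, (0b10110100,), xi=2)))
```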
RDF-based Outlier Detection Process
Algorithm: RDF-based Outlier Detection using P-Trees
Input: dataset X, radius r, distribution parameter ε.
Output: an outlier set Ols.
// PU --- unprocessed points represented by P-Trees
// |PU| --- number of points in PU
// Ols --- the outlier set
// Build up P-Trees for dataset X
PU ← createP-Trees(X);
WHILE |PU| > 0 DO
  x ← PU.first;  // pick an arbitrary unprocessed point x
  Ols ← FindOutliers(x, r, ε);
ENDWHILE
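Tying the earlier sketches together, the outer loop might be transcribed as below; the worklist set stands in for the P-Tree mask PU, and the per-case pruning is deliberately simplified (case (c) should also drop indirect neighbors, as the procedures that follow spell out).

```python
def detect_outliers(X, r, eps):
    """Outer loop: classify each unprocessed point by its RDF case and
    prune the corresponding neighborhood (cf. FindOutliers below)."""
    unprocessed = set(range(len(X)))
    outliers = []
    while unprocessed:
        i = next(iter(unprocessed))               # arbitrary start point
        nbrs = set(disk_nbr(X, X[i], r).tolist())
        case = classify_point(X, X[i], r, eps)
        if case == "outlier":                     # case (c)
            outliers.append(i)
            unprocessed.discard(i)
        elif case == "prune_neighborhood":        # case (a): drop the whole disk
            unprocessed -= nbrs
        else:                                     # case (b): keep indirect neighbors
            unprocessed -= nbrs | {i}
    return outliers
```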
"Find Outliers" and "Prune Non-outliers" Procedures
Algorithm: FindOutliers
Input: point x, radius r, distribution parameter ε
Output: pruned dataset PU
// PDN(x): direct neighbors of x
// PIN(x): indirect neighbors of x
// rdf: relative density factor
PDN(x, r) ← P_{X ≤ x+r} AND P_{X > x−r};
sum ← 0;
FOR each point q in PDN(x, r)
  PDN(q, r) ← P_{X ≤ q+r} AND P_{X > q−r};
  sum ← sum + |PDN(q, r)|;
ENDFOR
rdf ← sum / |PDN(x, r)|²;
CASE 1/(1+ε) ≤ rdf ≤ (1+ε):
  PU ← PU AND PDN′(x) AND PIN′(x);
  PruneNonOutliers(x, r, ε);
CASE rdf < 1/(1+ε):
  PU ← PU AND PDN′(x);
  FOR each point q in PIN(x)
    FindOutliers(q, r, ε);
  ENDFOR
CASE rdf > (1+ε):
  Ols ← Ols OR {x};  // add point x into the outlier set
  PU ← PU AND PIN′(x);
Algorithm: PruneNonOutliers
Input: point x, radius r, distribution parameter ε, dataset X
Output: pruned dataset PU
// Pi,j: P-tree for the jth bit of the ith attribute of X
// PNxξ: ξ-neighborhood of point x
// n: number of attributes; m: number of bits in each attribute
// P′i,j: complement of Pi,j
FOR i = 0 TO n−1
  FOR j = 0 TO m−1
    IF xi,j = 1 THEN Pxi,i,j ← Pi,j ELSE Pxi,i,j ← P′i,j
  ENDFOR
ENDFOR
ξ ← 0;
PNx0 ← AND of Pxi,i,j over all i and j;  // exact match on every bit
DO
  ξ ← ξ + 1;
  PX ← 1;
  FOR i = 0 TO n−1
    Pxi ← 1;
    FOR j = ξ TO m−1
      Pxi ← Pxi AND Pxi,i,j;  // match on the high m−ξ bits
    ENDFOR
    PX ← PX AND Pxi;
  ENDFOR
  PNxξ ← PX;
  rdf ← (rc(PNxξ) − rc(PNxξ−1)) / rc(PNxξ−1)²;
WHILE 1/(1+ε) ≤ rdf ≤ (1+ε)  // keep expanding while the density stays constant
q ← {PNx′ξ−1 AND PNxξ};  // the newly added ring of points
IF rdf < 1/(1+ε)
  PU ← PU AND PNx′ξ−1;  // pruning
  FindOutliers(q, r, ε);
ELSE IF rdf > (1+ε)
  PruneNonOutliers(q, r, ε);
ENDIF
Experimental Study
• Dataset: NHL data set (1996)
• Compared with LOF (Local Outlier Factor method) and aLOCI (approximate Local Correlation Integral method)

Run time and scalability comparison of LOF, aLOCI, and RDF (run time in seconds vs. data size):

Data Size | LOF     | aLOCI  | RDF
256       | 0.23    | 0.17   | 0.58
1024      | 1.92    | 1.87   | 2.1
4096      | 38.79   | 35.81  | 8.34
16384     | 103.19  | 87.34  | 37.82
65536     | 1813.43 | 985.39 | 108.91

[Figure: run-time and scalability comparison charts of LOF, aLOCI, and RDF over the data sizes above]

• Starting from 16,384 points, RDF outperforms LOF and aLOCI in terms of both scalability and speed.
References
1. V. Barnett, T. Lewis, "Outliers in Statistical Data", John Wiley & Sons.
2. E. M. Knorr, R. T. Ng, "A Unified Notion of Outliers: Properties and Computation", Proc. 3rd International Conference on Knowledge Discovery and Data Mining, 1997, pp. 219-222.
3. E. M. Knorr, R. T. Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets", Proc. Very Large Data Bases Conference, 1998, pp. 24-27.
4. E. M. Knorr, R. T. Ng, "Finding Intentional Knowledge of Distance-Based Outliers", Proc. Very Large Data Bases Conference, 1999, pp. 211-222.
5. S. Ramaswamy, R. Rastogi, K. Shim, "Efficient Algorithms for Mining Outliers from Large Datasets", Proc. 2000 ACM SIGMOD International Conference on Management of Data, 2000.
6. M. M. Breunig, H.-P. Kriegel, R. T. Ng, J. Sander, "LOF: Identifying Density-based Local Outliers", Proc. ACM SIGMOD 2000 International Conference on Management of Data, Dallas, TX, 2000.
7. S. Papadimitriou, H. Kitagawa, P. B. Gibbons, C. Faloutsos, "LOCI: Fast Outlier Detection Using the Local Correlation Integral", Proc. 19th International Conference on Data Engineering, March 5-8, 2003, Bangalore, India.
8. A. K. Jain, M. N. Murty, P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, 31(3):264-323, 1999.
9. A. Arning, R. Agrawal, P. Raghavan, "A Linear Method for Deviation Detection in Large Databases", Proc. 2nd International Conference on Knowledge Discovery and Data Mining, 1996, pp. 164-169.
10. S. Sarawagi, R. Agrawal, N. Megiddo, "Discovery-Driven Exploration of OLAP Data Cubes", EDBT 1998.
11. Q. Ding, M. Khan, A. Roy, W. Perrizo, "The P-tree Algebra", Proc. ACM SAC Symposium on Applied Computing, 2002.
12. W. Perrizo, "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.
13. M. Khan, Q. Ding, W. Perrizo, "k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees", Proc. PAKDD 2002, Springer-Verlag LNAI 2776, 2002.
14. B. Wang, F. Pan, Y. Cui, W. Perrizo, "Efficient Quantitative Frequent Pattern Mining Using Predicate Trees", CAINE 2003.
15. F. Pan, B. Wang, Y. Zhang, D. Ren, X. Hu, W. Perrizo, "Efficient Density Clustering for Spatial Data", PKDD 2003.
Thank you!
Determination of Parameters
Determination of r
• Breunig et al. [6] show that choosing MinPts = 10-30 works well in general (the MinPts-neighborhood).
• Choosing MinPts = 20, we get the average radius of the 20-neighborhood, r_average.
• In our algorithm, r = r_average = 0.5. A sketch for computing r_average follows.
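One plausible way to compute r_average with MinPts = 20 is the brute-force sketch below; the function name and the all-pairs distance matrix (fine for small datasets, O(n²) memory) are illustrative assumptions rather than the paper's procedure.

```python
import numpy as np

def average_knn_radius(X, k=20):
    """Mean distance to the k-th nearest neighbor over all points,
    i.e. the average radius of the k-neighborhood."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    d.sort(axis=1)            # column 0 is each point's distance to itself
    return d[:, k].mean()     # column k is the k-th nearest neighbor

# r_average = average_knn_radius(X, k=20)
```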
Determination of ε
• The selection of ε is a tradeoff between accuracy and speed: the larger ε is, the faster the algorithm works; the smaller ε is, the more accurate the results are.
• We chose ε = 0.8 experimentally and got the same result (the same outliers) as Breunig's method, but much faster.
• The results shown in the experimental part are based on ε = 0.8.