RDF: A Density-based Outlier Detection Method Using Vertical Data
A Vertical Outlier Detection Algorithm
with Clusters as by-product
Dongmei Ren, Imad Rahal, and William Perrizo
Computer Science and Operations Research
North Dakota State University
Outline

- Background
- Related Work
- The Proposed Work
  - Contributions of this Paper
  - Review of the P-Tree technology
  - Approach
- Conclusion
Background

- An outlier is something that deviates from standard behavior
- Detecting outliers can reveal interesting anomalies
- Critically important in information-based areas such as:
  - Criminal activities in electronic commerce
  - Intrusions in networks
  - Unusual cases in health monitoring
  - Pest infestations in agriculture
Background (cont’d)

Related Work

- Su et al. (2001) proposed the initial work on cluster-based outlier detection:
  - Small clusters are considered outliers.
- He et al. (2003) introduced two new definitions:
  - cluster-based local outlier
  - outlier factor
  - They proposed an outlier detection algorithm, CBLOF (Cluster-Based Local Outlier Factor).
    - Outlier detection is tightly coupled with the clustering process (faster than Su's approach).
    - However, the clustering process takes precedence over the outlier detection process.
    - Not highly efficient.
The Proposed Work

- Improve the efficiency and scalability (with respect to data size) of the cluster-based outlier detection process
  - Density-based clustering
- Our contributions:
  - Local Connective Factor (LCF)
    - Used to measure the membership of data points in clusters
  - LCF-based outlier detection method
    - Efficiently detects outliers and groups data into clusters in a one-time process
    - Does not require a beforehand clustering process, the first step in contemporary cluster-based outlier detection methods
The Proposed Work (cont'd)

- A vertical data representation, the P-tree
  - Performance is further improved by using this vertical data representation.
Review of P-Trees

- Traditionally, data have been represented horizontally and processed vertically (i.e., row by row)
  - Great for query processing
  - Not as good when one is interested in collective data properties
  - Scales poorly to very large datasets (performance)
- Our previous work: a vertical data structure, the P-Tree (Ding et al., 2001)
  - Decompose relational tables into separate vertical bit slices by bit position
  - Compress (when possible) each of the vertical slices into a P-tree
  - Due to the compression and scalable logical P-Tree operations, this vertical data structure addresses the problem of non-scalability with respect to size
Review of P-Trees (cont'd)

[Figure: construction of P-Trees. An example dataset (values 2, 3, 2, 2, 5, 2, 7, 7) is decomposed into 3-bit vertical slices; each slice is compressed into a P1-tree whose nodes are Pure-1, Pure-0, or Mixed, and the trees support AND, OR, and NOT operations. A sketch of the bit-slicing idea follows below.]
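As a rough illustration (not the authors' implementation), the sketch below decomposes an integer column into vertical bit slices and applies bitwise AND, OR, and NOT; the compression of each slice into Pure-0/Pure-1/Mixed quadrant nodes is omitted.

```python
# Minimal sketch of vertical bit-slicing with uncompressed bit vectors; the
# actual P-tree compresses each slice into a quadrant tree of Pure-0/Pure-1/Mixed nodes.
from typing import List

def bit_slices(column: List[int], m: int) -> List[List[int]]:
    """Decompose an m-bit integer column into m vertical bit slices.
    slices[j][t] is bit j (j = 0 is the least significant bit) of tuple t."""
    return [[(v >> j) & 1 for v in column] for j in range(m)]

def AND(a, b): return [x & y for x, y in zip(a, b)]
def OR(a, b):  return [x | y for x, y in zip(a, b)]
def NOT(a):    return [1 - x for x in a]

if __name__ == "__main__":
    col = [2, 3, 2, 2, 5, 2, 7, 7]      # the 3-bit example values from the slide
    P = bit_slices(col, m=3)
    print(P[2], P[1], P[0])             # high-order slice first
```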
Review of P-Trees (cont'd)

- Value P-tree, for a relation R(A1, A2, …, An)
  - Every Ai has bit slices (a1, a2, …, am)
  - Pi,j represents all tuples having a 1 in bit aj of Ai
  - e.g., P(Ai = 101) = Pi,1 & P'i,2 & Pi,3 (bits numbered left to right)
- Tuple P-tree
  - P(111, 101, …) = P(A1 = 111) & P(A2 = 101) & …
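The following sketch (ours, with uncompressed 0/1 lists standing in for P-trees) shows how value and tuple P-trees can be assembled from bit slices by bitwise AND; slices are ordered left to right to match the slide's P1, P'2, P3 example.

```python
# Illustrative sketch of value and tuple P-trees over uncompressed bit vectors.
from typing import List

def value_ptree(slices: List[List[int]], pattern: str) -> List[int]:
    """P(Ai = pattern): AND the slice where the pattern bit is 1,
    and the complemented slice where it is 0. slices[0] is the leftmost bit."""
    result = [1] * len(slices[0])
    for P, b in zip(slices, pattern):
        result = [r & (p if b == "1" else 1 - p) for r, p in zip(result, P)]
    return result

def tuple_ptree(value_ptrees: List[List[int]]) -> List[int]:
    """P(v1, v2, ...): AND of the per-attribute value P-trees."""
    result = [1] * len(value_ptrees[0])
    for vp in value_ptrees:
        result = [r & p for r, p in zip(result, vp)]
    return result
```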
Review of P-Trees (cont'd)

- The inequality P-tree (Pan, 2003):
  - Represents the data points within a data set D satisfying an inequality predicate on an attribute x, such as x ≥ v or x ≤ v.
- Calculation of Px≥v (sketched below):
  - Let x be an m-bit attribute in D, and Pm-1, …, P0 be the P-trees for the vertical bit slices of x.
  - Let v = bm-1…bi…b0, where bi is the ith binary bit of v, and let Px≥v be the predicate tree for x ≥ v.
  - Px≥v = Pm-1 opm-1 … Pi opi Pi-1 … op1 P0, i = 0, 1, …, m-1, where:
    - opi is AND if bi = 1, and OR otherwise;
    - the operators are right binding, e.g. Px≥101 = P2 AND (P1 OR P0).
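A minimal sketch of the Px≥v construction, with plain 0/1 lists standing in for P-trees. One small assumption of ours: the right-binding chain starts at the rightmost 1-bit of v, so that trailing zeros of v are handled.

```python
# Sketch of the inequality predicate tree P(x >= v) following the slide's
# right-binding rule (AND where b_i = 1, OR where b_i = 0).
from typing import List

def p_ge(slices: List[List[int]], v: int, m: int) -> List[int]:
    """P(x >= v). slices[i] is the bit-i slice, i = 0 being the least significant bit."""
    if v == 0:
        return [1] * len(slices[0])            # every value satisfies x >= 0
    k = (v & -v).bit_length() - 1              # rightmost 1-bit of v (our assumption)
    result = slices[k]
    for i in range(k + 1, m):
        if (v >> i) & 1:                       # op_i is AND when b_i = 1
            result = [a & b for a, b in zip(slices[i], result)]
        else:                                  # op_i is OR when b_i = 0
            result = [a | b for a, b in zip(slices[i], result)]
    return result

# e.g. for a 3-bit attribute and v = 0b101 this evaluates P2 AND (P1 OR P0).
```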
Review of P-Trees (cont'd)

- Calculation of Px≤v:
  - Px≤v = P'm-1 opm-1 … P'i opi P'i-1 … opk+1 P'k,  k ≤ i ≤ m-1,
  - where k is the rightmost bit position with value "0", i.e. bk = 0 and bj = 1 for all j < k.
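A companion sketch for Px≤v, the dual construction over complemented slices. The slide does not restate the operator rule here, so we assume the dual of the Px≥v rule: AND where bi = 0, OR where bi = 1.

```python
# Sketch of P(x <= v): complemented slices P'_i, chain running down to the
# rightmost 0-bit of v (k), as on the slide.
from typing import List

def p_le(slices: List[List[int]], v: int, m: int) -> List[int]:
    """P(x <= v). slices[i] is the bit-i slice, i = 0 being the least significant bit."""
    comp = [[1 - b for b in s] for s in slices]            # P'_i = NOT P_i
    if v == (1 << m) - 1:
        return [1] * len(slices[0])                        # v is all ones: every value qualifies
    k = next(i for i in range(m) if ((v >> i) & 1) == 0)   # rightmost 0-bit of v
    result = comp[k]
    for i in range(k + 1, m):
        if (v >> i) & 1:                                   # OR where b_i = 1 (assumed dual rule)
            result = [a | b for a, b in zip(comp[i], result)]
        else:                                              # AND where b_i = 0
            result = [a & b for a, b in zip(comp[i], result)]
    return result
```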
Review of P-Trees (cont'd)

- High Order Bit (HOBit) Distance Metric
  - A bitwise distance function.
  - Measures distance based on the most significant consecutive matching bit positions, starting from the left.
  - Assume Ai is an attribute in a data set, and let X and Y be the Ai values of two tuples. The HOBit distance between X and Y is defined as
    m(X, Y) = max { i + 1 | xi ⊕ yi = 1 },
    where xi and yi are the ith bits of X and Y respectively, and ⊕ denotes XOR (exclusive OR).
  - e.g. X = 1101 0011, Y = 1100 1001, so m = 5
  - Correspondingly, the HOBit similarity is defined as
    dm(X, Y) = BitNum - max { i + 1 | xi ⊕ yi = 1 }.
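A one-line sketch of the HOBit distance on integers; bit i = 0 is the least significant, and the distance is 0 when X = Y.

```python
# Sketch of the HOBit distance m(X, Y) = max{ i+1 | x_i XOR y_i = 1 }.
def hobit_distance(x: int, y: int) -> int:
    return (x ^ y).bit_length()          # position of the highest differing bit, plus one

def hobit_similarity(x: int, y: int, bit_num: int) -> int:
    return bit_num - hobit_distance(x, y)

if __name__ == "__main__":
    X, Y = 0b11010011, 0b11001001        # the slide's example
    print(hobit_distance(X, Y))           # -> 5
```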
Definitions

- Definition 1: Disk Neighborhood --- DiskNbr(x, r)
  - Given a point x and radius r, the disk neighborhood of x is defined as the set DiskNbr(x, r) = { y ∈ X | d(x, y) ≤ r }, where d(x, y) is the distance between x and y.
  - Direct and indirect neighbors of x with distance r

[Figure: a point x with its direct disk neighborhood and the surrounding indirect disk neighborhood.]

- Definition 2: Density of DiskNbr(x, r) --- DensDiskNbr(x, r) (Breunig, 2000, density-based)
  - DensDiskNbr(x, r) = |DiskNbr(x, r)| / r
- Definition 3: Density of a cluster --- Denscluster(R), defined as the total number of points in the cluster divided by the radius R of the cluster:
  - Denscluster(R) = |Cluster(R)| / R
Definitions (cont'd)

- Definition 4: Local Connective Factor (LCF), with respect to a DiskNbr(x, r) and the closest cluster (with radius R) to x:
  - LCF(x, r) = DensDiskNbr(x, r) / Denscluster(R)
  - The LCF of the point x, denoted LCF(x, r), is the ratio of DensDiskNbr(x, r) over Denscluster(R).
  - LCF indicates to what degree point x is connected with the closest cluster to x (illustrated below).
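As an illustration with made-up numbers (not from the paper): if DiskNbr(x, r) contains 6 points with r = 2, then DensDiskNbr(x, r) = 6/2 = 3; if the closest cluster contains 120 points within radius R = 10, then Denscluster(R) = 120/10 = 12, so LCF(x, r) = 3/12 = 0.25, i.e. x's neighborhood is only a quarter as dense as the cluster.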
The Outlier Detection Method

- Given a dataset X, the proposed outlier detection method proceeds in two stages:
  - Neighborhood Merging
    - Groups non-outliers (points with consistent density) into clusters
  - LCF-based Outlier Detection
    - Finds outliers over the remaining subset of the data, which consists of points on cluster boundaries and real outliers.

[Figure: a start point x whose neighborhood radius expands through r, 2r, 4r, 8r; neighborhood merging in the dense region, outlier finding beyond it.]
Neighborhood Merging

- The user chooses an r; the process picks an arbitrary point x, calculates DensDiskNbr(x, r), increases the radius from r to 2r, calculates DensDiskNbr(x, 2r), and observes the ratio between the two (see the sketch below).
  - If the ratio is in the range [1/(1+ε), (1+ε)] (Breunig, 2000), the expansion and the merging continue by increasing the radius to 4r, 8r, ...
  - If the ratio is outside the range, the expansion stops. Point x and its (k·r)-neighbors are merged into one cluster.

[Figure: the start point x with radii r, 2r, 4r; the non-outlier points are merged.]

- All points in the 2r-neighborhood are grouped together.
- The process then calls "LCF-based outlier detection" and mines outliers over the set of points in {DiskNbr(x, 4r) - DiskNbr(x, 2r)}.
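A minimal sketch of the expansion loop, assuming plain point lists and Euclidean distance rather than the paper's P-tree machinery; the function names (disk_nbr, merge_neighborhood) are illustrative, not the paper's.

```python
# Sketch of neighborhood merging: double the radius while the density ratio
# stays in [1/(1+eps), 1+eps], then merge everything inside the final radius.
import math
from typing import List, Tuple

Point = Tuple[float, ...]

def dist(a: Point, b: Point) -> float:
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def disk_nbr(data: List[Point], x: Point, r: float) -> List[Point]:
    return [y for y in data if dist(x, y) <= r]

def merge_neighborhood(data: List[Point], x: Point, r: float, eps: float):
    """Return the merged cluster around x and the final radius reached."""
    radius = r
    dens = len(disk_nbr(data, x, radius)) / radius
    while True:
        new_radius = 2 * radius
        nbrs = disk_nbr(data, x, new_radius)
        new_dens = len(nbrs) / new_radius
        ratio = new_dens / dens if dens > 0 else float("inf")
        if not (1 / (1 + eps) <= ratio <= (1 + eps)):
            break                                  # density changed too much: stop expanding
        radius, dens = new_radius, new_dens
        if len(nbrs) == len(data):
            break                                  # everything is already inside
    return disk_nbr(data, x, radius), radius
```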
Neighborhood Merging using P-Trees

- ξ-neighbors (Xξ): the neighbors within HOBit distance ξ of x, e.g. ξ = 0, 1, 2, ..., 8 if x is an 8-bit value
- For point x, let X = (x0, x1, x2, …, xn-1), where
  - xi = (xi,m-1, …, xi,0),
  - xi,j is the jth bit value in the ith attribute.
- The ξ-neighbors of xi are given by the value P-tree Pxi computed on the high-order m-ξ bits of xi.
- The ξ-neighbors of X are given by the tuple P-tree: AND of Pxi (on the high-order m-ξ bits) over all attributes xi (sketched below).
- The neighbors are merged into one cluster by
  - PC = PC OR PXξ,
  - where PC is a P-tree representing the currently processed cluster, and PXξ is the P-tree representing the ξ-neighbors of x.
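A sketch of the ξ-neighbor computation and cluster merging, with 0/1 lists standing in for P-trees. It assumes the HOBit interpretation above: two values are ξ-neighbors when their high-order m-ξ bits agree.

```python
# Sketch of xi-neighbor tuple P-trees and of the cluster union PC = PC OR PX_xi.
from typing import List

def xi_neighbors(columns: List[List[int]], x: List[int], m: int, xi: int) -> List[int]:
    """Tuple P-tree (as a 0/1 list) of points agreeing with x on the top m - xi bits
    of every attribute. columns[i] holds attribute i for all tuples."""
    mask = ((1 << m) - 1) ^ ((1 << xi) - 1)        # keep the high-order m - xi bits
    result = [1] * len(columns[0])
    for i, col in enumerate(columns):
        result = [r & int((v & mask) == (x[i] & mask)) for r, v in zip(result, col)]
    return result

def merge_into_cluster(pc: List[int], px_xi: List[int]) -> List[int]:
    """PC = PC OR PX_xi."""
    return [a | b for a, b in zip(pc, px_xi)]
```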
"LCF-based Outlier Detection"

- For the points in {DiskNbr(x, 4r) - DiskNbr(x, 2r)}, the "LCF-based outlier detection" process finds the outlier points and starts a new cluster if necessary.

[Figure: the current cluster around the start point, with radii r, 2r, 4r, 8r; some points are merged into the current cluster, some are outliers, and some form a new cluster.]

- Search for all the direct neighbors of x.
- For each direct neighbor:
  - get its direct neighbors (those will be the indirect neighbors of x).
- Calculate the LCF w.r.t. the current cluster for the resulting neighborhood.
"LCF-based Outlier Detection" (cont'd)

- 1/(1+ε) ≤ LCF ≤ (1+ε): x and its neighbors can be merged into the current cluster.
- LCF > (1+ε): point x is in a new cluster with higher density; start a new cluster and call the neighborhood merging procedure.
- LCF < 1/(1+ε): point x and its neighbors can either be in a cluster with low density or be outliers.
  - Get the indirect neighbors of x recursively.
  - If the number of all neighbors is larger than some threshold t (Papadimitriou, 2003, t = 20), call neighborhood merging for another cluster.
  - If fewer than t neighbors are found, identify this small number of points as outliers (see the decision-rule sketch below).
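A compact sketch of the decision rule; the names and helper signatures are ours, not the paper's. dens_nbr is the neighborhood density, dens_cluster the density of the current (closest) cluster, and t the neighbor-count threshold.

```python
# Sketch of the LCF decision rule from the slide above.
from enum import Enum

class Action(Enum):
    MERGE_INTO_CURRENT = 1
    START_NEW_CLUSTER = 2
    MARK_AS_OUTLIERS = 3

def lcf_decision(dens_nbr: float, dens_cluster: float, n_neighbors: int,
                 eps: float, t: int = 20) -> Action:
    lcf = dens_nbr / dens_cluster
    if 1 / (1 + eps) <= lcf <= (1 + eps):
        return Action.MERGE_INTO_CURRENT      # consistent density: join the current cluster
    if lcf > (1 + eps):
        return Action.START_NEW_CLUSTER       # denser region: start a new cluster
    # lcf < 1/(1+eps): sparse region, either a low-density cluster or outliers
    return Action.START_NEW_CLUSTER if n_neighbors > t else Action.MARK_AS_OUTLIERS
```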
"LCF-based Outlier Detection" using P-Trees

- Direct neighbors represented by a P-Tree --- DT-Pxr
  - Let X = (x0, x1, x2, …, xn-1), where
    - xi = (xi,m-1, …, xi,0),
    - xi,j is the jth bit value in the ith attribute.
  - For attribute xi:  Pxi,r = P(xi value > xi - r) & P(xi value ≤ xi + r)
  - The r-neighbors are in the P-tree DT-Pxr = AND of Pxi,r over all attributes xi (sketched below)
  - |DiskNbr(x, r)| = rootCount(DT-Pxr)
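A sketch of the direct-neighbor P-tree with plain 0/1 lists standing in for P-trees; in the paper each per-attribute interval would itself be the AND of two inequality P-trees.

```python
# Sketch of DT-Pxr and rootCount over uncompressed bit vectors.
from typing import List

def direct_nbr_ptree(columns: List[List[int]], x: List[int], r: int) -> List[int]:
    """DT-Pxr: tuples whose every attribute lies in (x_i - r, x_i + r]."""
    result = [1] * len(columns[0])
    for i, col in enumerate(columns):
        interval = [int(x[i] - r < v <= x[i] + r) for v in col]
        result = [a & b for a, b in zip(result, interval)]
    return result

def root_count(ptree: List[int]) -> int:
    """|DiskNbr(x, r)| = rootCount(DT-Pxr)."""
    return sum(ptree)
```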

"LCF-based Outlier Detection" using P-Trees (cont'd)

- Indirect neighbors represented by a P-Tree --- IN-Pxr
  - IN-Pxr = (OR over q ∈ DiskNbr(x, r) of DT-Pqr) AND (NOT(DT-Pxr)), where NOT is the complement operation on the P-Tree (sketched below)
- Outliers are inserted into the outlier set by an OR operation:
  - when LCF < 1/(1+ε) and fewer than t (= 20) neighbors are found,
  - POls = POls OR DT-Pxr OR IN-Pxr, where POls is the outlier set represented by a P-Tree.
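A sketch of the indirect-neighbor computation and the outlier-set update, again with 0/1 lists standing in for P-trees; direct_nbr_ptree is repeated from the previous sketch so the block is self-contained.

```python
# Sketch of IN-Pxr = (OR_{q in DiskNbr(x,r)} DT-Pqr) AND NOT(DT-Pxr)
# and of POls = POls OR DT-Pxr OR IN-Pxr.
from typing import List

def direct_nbr_ptree(columns: List[List[int]], x: List[int], r: int) -> List[int]:
    """DT-Pxr, as in the previous sketch: every attribute within (x_i - r, x_i + r]."""
    result = [1] * len(columns[0])
    for i, col in enumerate(columns):
        result = [a & int(x[i] - r < v <= x[i] + r) for a, v in zip(result, col)]
    return result

def indirect_nbr_ptree(columns: List[List[int]], points: List[List[int]],
                       x: List[int], r: int) -> List[int]:
    dt_x = direct_nbr_ptree(columns, x, r)
    acc = [0] * len(dt_x)
    for q, flag in zip(points, dt_x):                 # q ranges over the direct neighbors of x
        if flag:
            dt_q = direct_nbr_ptree(columns, q, r)
            acc = [a | b for a, b in zip(acc, dt_q)]
    return [a & (1 - d) for a, d in zip(acc, dt_x)]   # drop the direct neighbors themselves

def add_outliers(pols: List[int], dt_x: List[int], in_x: List[int]) -> List[int]:
    """POls = POls OR DT-Pxr OR IN-Pxr."""
    return [p | d | i for p, d, i in zip(pols, dt_x, in_x)]
```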
Preliminary Experimental Study

- Compared with:
  - MST (2001): Su et al.'s two-phase clustering-based outlier detection algorithm, denoted MST; the first approach to perform cluster-based outlier detection.
  - CBLOF (2003): He et al.'s CBLOF (cluster-based local outlier factor) method, which is faster.
- NHL data set (1996)
- Run time and scalability comparison
Preliminary Experimental Study (cont'd)

Comparison of run time by data size:

  Data size |   256 |  1024 |  4096 |  16384 |   65536
  MST       |  5.89 |  10.9 | 98.03 | 652.92 | 2501.43
  CBLOF     |  0.13 |   1.1 | 15.33 |  87.34 |  385.39
  LCF       |  0.55 |  2.12 |  7.98 |  28.63 |   71.91

- Run time comparison: scalability is the best among the three algorithms
- The outlier sets found were largely the same
Conclusion

- An outlier detection method with clusters as a by-product
  - Efficiently detects outliers and groups data into clusters in a one-time process
  - Does not require a beforehand clustering process, the first step in current cluster-based outlier detection methods; eliminating the pre-clustering step makes the outlier detection process faster
- A vertical data representation, the P-tree
  - The performance of our method is further improved by using the vertical P-tree representation
- Parameter tuning is really important
  - ε, r, t
Future directions

- Study more parameter-tuning effects
- Better quality of clusters … Gamma measure
- Boundary-point testing
References

1. V. Barnett and T. Lewis, "Outliers in Statistical Data", John Wiley & Sons.
2. Knorr, Edwin M. and Raymond T. Ng, "A Unified Notion of Outliers: Properties and Computation", 3rd International Conference on Knowledge Discovery and Data Mining Proceedings, 1997, pp. 219-222.
3. Knorr, Edwin M. and Raymond T. Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets", Very Large Data Bases Conference Proceedings, 1998, pp. 24-27.
4. Knorr, Edwin M. and Raymond T. Ng, "Finding Intensional Knowledge of Distance-Based Outliers", Very Large Data Bases Conference Proceedings, 1999, pp. 211-222.
5. Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim, "Efficient Algorithms for Mining Outliers from Large Data Sets", Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, ISSN 0163-5808.
6. Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander, "LOF: Identifying Density-Based Local Outliers", Proc. ACM SIGMOD 2000 Int. Conf. on Management of Data, Dallas, TX, 2000.
7. Spiros Papadimitriou, Hiroyuki Kitagawa, Phillip B. Gibbons, and Christos Faloutsos, "LOCI: Fast Outlier Detection Using the Local Correlation Integral", 19th International Conference on Data Engineering, March 5-8, 2003, Bangalore, India.
8. Jiang, M.F., S.S. Tseng, and C.M. Su, "Two-Phase Clustering Process for Outliers Detection", Pattern Recognition Letters, Vol. 22, No. 6-7, 2001, pp. 691-700.
9. Z. He, X. Xu, and S. Deng, "Discovering Cluster-Based Local Outliers", Pattern Recognition Letters, Volume 24, Issue 9-10, June 2003, pp. 1641-1650.
10. He, Z., Xu, X., and Deng, S., "Squeezer: An Efficient Algorithm for Clustering Categorical Data", Journal of Computer Science and Technology, 2002.
11. A.K. Jain, M.N. Murty, and P.J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, 31(3):264-323, 1999.
12. Arning, Andreas, Rakesh Agrawal, and Prabhakar Raghavan, "A Linear Method for Deviation Detection in Large Databases", 2nd International Conference on Knowledge Discovery and Data Mining Proceedings, 1996, pp. 164-169.
13. S. Sarawagi, R. Agrawal, and N. Megiddo, "Discovery-Driven Exploration of OLAP Data Cubes", EDBT'98.
14. Q. Ding, M. Khan, A. Roy, and W. Perrizo, "The P-tree Algebra", Proceedings of the ACM SAC Symposium on Applied Computing, 2002.
15. W. Perrizo, "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.
16. M. Khan, Q. Ding, and W. Perrizo, "k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees", Proc. of PAKDD 2002, Springer-Verlag LNAI 2776, 2002.
17. Wang, B., Pan, F., Cui, Y., and Perrizo, W., "Efficient Quantitative Frequent Pattern Mining Using Predicate Trees", CAINE 2003.
18. Pan, F., Wang, B., Zhang, Y., Ren, D., Hu, X., and Perrizo, W., "Efficient Density Clustering for Spatial Data", PKDD 2003.
19. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers.
Thank you!
Determination of Parameters

- Determination of r
  - Breunig et al. show that choosing MinPts = 10-30 works well in general [6] (MinPts-neighborhood).
  - Choosing MinPts = 20, we get the average radius of the 20-neighborhood, r_average.
  - In our algorithm, r = r_average = 0.5.
- Determination of ε
  - The selection of ε is a tradeoff between accuracy and speed. The larger ε is, the faster the algorithm works; the smaller ε is, the more accurate the results are.
  - We chose ε = 0.8 experimentally and got the same result (the same outliers) as Breunig's, but much faster.
  - The results shown in the experimental part are based on ε = 0.8.
- We coded all the methods.
- For the running environment, see the paper.