RDF: A Density-based Outlier Detection Method using Vertical Data
Representation
Dongmei Ren, Baoying Wang, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105, USA
[email protected]
Abstract
Outlier detection can lead to the discovery of unexpected and interesting knowledge, which is critically important in areas such as the monitoring of criminal activity in electronic commerce, credit card fraud, etc. In this paper, we develop an efficient density-based outlier detection method for large datasets. Our contributions are: a) we introduce a relative density factor (RDF); b) based on RDF, we propose an RDF-based outlier detection method that efficiently prunes the data points that are deep inside clusters and detects outliers only within the remaining small subset of the data; c) the performance of our method is further improved by means of a vertical data representation, P-trees. We tested our method on NHL and NBA data. Our method shows an order of magnitude speed improvement over contemporary approaches.
1. Introduction
The problem of mining rare events, deviant objects, and exceptions is critically important in many domains, such as electronic commerce, networks, surveillance, and health monitoring. Outlier mining is drawing more and more attention. Current outlier mining approaches can be classified into five categories: statistics-based [1], distance-based [2][3][4], density-based [5][6], clustering-based [7], and deviation-based [8][9]. Density-based outlier detection approaches are attracting the most attention for KDD in large databases.
Breunig et al. proposed a density-based approach to mining outliers over datasets with different densities and arbitrary shapes [5]. Their notion of outliers is local in the sense that the outlier degree of an object is determined by taking into account the clustering structure in a bounded neighborhood of the object. The method does not suffer from the local density problem, so it can mine outliers over non-uniformly distributed datasets. However, the method needs three scans, and the neighborhood search is computationally expensive, which makes the method inefficient. Another density-based approach was introduced by Papadimitriou and Kitagawa [6] using the local correlation integral (LOCI). This method selects a point as an outlier if its multi-granularity deviation factor (MDEF) deviates by more than three times the standard deviation of MDEF in its neighborhood. However, the cost of computing the standard deviation is high.
In this paper, we propose an efficient density-based outlier detection method using a vertical data model, P-trees¹. We introduce a novel local density measurement, the relative density factor (RDF). RDF indicates the degree to which the density of a point P contrasts with those of its neighbors. We take RDF as an outlierness measurement. Based on RDF, our method prunes the data points that are deep inside clusters and detects outliers only within the remaining small subset of the data, which makes our method efficient. Also, the performance of our algorithm is enhanced significantly by means of P-trees. Our method was tested over NHL and NBA datasets. Experiments show that our method achieves an order of magnitude speed improvement, with comparable accuracy, over current state-of-the-art density-based outlier detection approaches.
2. Review of P-trees
In previous work, we proposed a novel vertical data structure, the P-tree. In the P-tree approach, we decompose the attributes of relational tables into separate files by bit position and compress the vertical bit files using a data-mining-ready structure called the P-tree. Instead of processing horizontal data vertically, we process these vertical P-trees horizontally through fast logical operations. Since P-trees compress the data remarkably and P-tree logical operations scale extremely well, this vertical data structure has the potential to address non-scalability with respect to data size. In this section, we briefly review some useful features of P-trees that will be used in this paper, including their optimized logical operations.¹

¹ Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308.
Given a data set with d attributes, X = (A_1, A_2, ..., A_d), and the binary representation of the j-th attribute A_j as b_{j,m} b_{j,m-1} ... b_{j,i} ... b_{j,1} b_{j,0}, we decompose each attribute into bit files, one file for each bit position [10]. Each bit file is converted into a P-tree. Logical AND, OR and NOT are the most frequently used P-tree operations; they facilitate efficient neighborhood search, pruning, and computation of RDF.
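As a minimal illustration of this vertical decomposition (not of the P-tree structure itself, which additionally compresses each bit file quadrant by quadrant), one attribute column can be bit-sliced into plain integer bitmaps. The representation and all names below are our own stand-in, used throughout the sketches in this paper:

```python
# Minimal sketch: vertical bit-slicing of one attribute column into
# uncompressed integer bitmaps (a stand-in for P-trees). Bit j of
# slices[i] is set iff data point j has bit i set in this attribute.

def bit_slices(values, nbits):
    """Decompose a column of nbits-wide integers into vertical bit files."""
    slices = [0] * nbits                  # slices[i] covers bit position i
    for j, v in enumerate(values):
        for i in range(nbits):
            if (v >> i) & 1:
                slices[i] |= 1 << j       # point j contributes bit j
    return slices                         # LSB file first, MSB file last

values = [5, 3, 7, 2]                     # one 3-bit attribute column
P = bit_slices(values, 3)
print([bin(p) for p in P])                # ['0b111', '0b1110', '0b101']
```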
Calculation of inequality P-trees P_{x≥v} and P_{x<v}: Let x be a data point within a data set X, represented as an m-bit value; let P_m, P_{m-1}, ..., P_0 be the P-trees for the vertical bit files of X, and let P'_m, P'_{m-1}, ..., P'_0 be their complements. Let v = b_m ... b_i ... b_0, where b_i is the i-th binary bit of v. Then

P_{x≥v} = P_m op_m ... P_i op_i P_{i-1} ... op_1 P_0,   i = 0, 1, ..., m      (a)
P_{x<v} = P'_m op_m ... P'_i op_i P'_{i-1} ... op_{k+1} P'_k,   k ≤ i ≤ m     (b)

In (a), op_i is AND (∧) if b_i = 1 and OR (∨) otherwise; in (b), op_i is AND (∧) if b_i = 0 and OR (∨) otherwise. In both (a) and (b), the operators are right binding, which means they associate from right to left; e.g., P_2 op_2 P_1 op_1 P_0 is equivalent to (P_2 op_2 (P_1 op_1 P_0)).
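Formulas (a) and (b) transcribe almost directly into a right-to-left fold over the bit slices. The sketch below operates on the integer-bitmap stand-ins from the previous sketch; the explicit all-ones/empty bases of the fold are our own way of handling the trailing bits of v:

```python
def p_geq(P, v, nbits, full):
    """P_{x>=v}: fold the bit slices right-to-left, AND where b_i = 1,
    OR where b_i = 0, as in formula (a). `full` is the all-ones mask."""
    acc = full                         # x >= 0 holds on an empty suffix
    for i in range(nbits):             # right binding: fold from the LSB up
        acc = (P[i] & acc) if (v >> i) & 1 else (P[i] | acc)
    return acc

def p_less(P, v, nbits, full):
    """P_{x<v}: the same fold over the complement slices P', with the
    AND/OR roles of b_i swapped, as in formula (b)."""
    acc = 0                            # strict < fails on an empty suffix
    for i in range(nbits):
        comp = full & ~P[i]            # complement slice P'_i
        acc = (comp | acc) if (v >> i) & 1 else (comp & acc)
    return acc

# With the slices P for [5, 3, 7, 2] from above (nbits = 3, full = 0b1111):
# p_geq(P, 3, 3, 0b1111) == 0b0111   -> points with values 5, 3, 7 satisfy x >= 3
# p_less(P, 3, 3, 0b1111) == 0b1000  -> only the point with value 2 satisfies x < 3
```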
High Order Bit Metric (HOBit): The HOBit metric [12] is a bitwise distance function. It measures distance based on the most significant consecutive matching bit positions, starting from the left. Assume A_i is an attribute in a tabular data set R(A_1, A_2, ..., A_n) and its values are represented as binary numbers, x = x(m) x(m-1) ... x(1) x(0) . x(-1) ... x(-n). Let X and Y be the A_i values of two tuples/samples. The HOBit similarity between X and Y is defined by

m(X, Y) = max { s | x(m-i) ⊕ y(m-i) = 0 for all 0 ≤ i < s },

where x(i) and y(i) are the i-th bits of X and Y respectively, and ⊕ denotes the XOR (exclusive OR) operation; i.e., m is the number of most significant consecutive bits on which X and Y agree. Correspondingly, the HOBit dissimilarity is defined by

dm(X, Y) = Nbit − m,

where Nbit is the number of bits of the attribute.
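A scalar version of the metric (our own illustration; the P-tree version in Section 3 computes it with ANDs over bit slices) simply scans bits from the most significant end:

```python
def hobit_similarity(x, y, nbits):
    """Count of most-significant consecutive bits on which x and y agree."""
    m = 0
    for i in range(nbits - 1, -1, -1):     # scan from the MSB downward
        if ((x ^ y) >> i) & 1:             # first mismatching bit ends the run
            break
        m += 1
    return m

def hobit_dissimilarity(x, y, nbits):
    """d_m(X, Y) = Nbit - m(X, Y)."""
    return nbits - hobit_similarity(x, y, nbits)

print(hobit_dissimilarity(0b1011, 0b1001, 4))   # 2 leading bits match -> 2
```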
3. RDF-based Outlier Detection Using P-trees
In this section, we first introduce some definitions related to outlier detection. We then propose an RDF-based outlier detection method. The performance of the algorithm is enhanced significantly by means of the bitwise vertical data structure, P-trees, and its optimized logical operations.
3.1. Outlier Definitions
From the density point of view, a point P is an outlier if its density is much lower than those of its neighbors. Based on this intuition, we propose the following definitions related to outliers.
Definition 1 (neighborhood)
The neighborhood of a data point P with radius r is defined as the set Nbr(P, r) = { x ∈ X | |P − x| ≤ r }, where |P − x| is the distance between P and x. It is also called the r-neighborhood. The points in this neighborhood are called the neighbors of P, or direct r-neighbors of P. The number of neighbors of P is denoted N(Nbr(P, r)). Indirect neighbors of P are those points that are within the r-neighborhood of the direct neighbors of P but are not themselves direct neighbors of P.
Definition 2 (density factor)
Given a data point P and the neighborhood radius r, the density factor (DF) of P is a measurement of the local density around P, denoted DF(P, r). It is defined as (note that d is the number of dimensions)

DF(P, r) = N(Nbr(P, r)) / r^d.    (1)

The neighborhood density factor of the point P, denoted DFnbr(P, r), is the average density factor of the neighbors of P:

DFnbr(P, r) = Σ_{i=1}^{N(Nbr(P,r))} DF(q_i, r) / N(Nbr(P, r)),

where the q_i are the neighbors of P, i = 1, 2, ..., N(Nbr(P, r)).
The relative density factor (RDF) of the point P, denoted RDF(P, r), is the ratio of the neighborhood density factor of P over its density factor (DF):

RDF(P, r) = DFnbr(P, r) / DF(P, r).    (2)

RDF indicates the degree to which the density of the point P contrasts with those of its neighbors. We take RDF as an outlierness measurement, which indicates the degree to which a point can be an outlier in the view of the whole dataset.
Definition 3 (outliers)
Based on RDF, we define outliers as the subset of the dataset X with RDF > t, where t is an RDF threshold chosen case by case. The outlier set is denoted Ols(X, t) = { x ∈ X | RDF(x, r) > t }.
3.2. RDF-based Outlier Detection with Pruning
Given a dataset X and an RDF threshold t, RDF-based outlier detection proceeds in two phases: a “zoom-out” procedure and a “zoom-in” procedure. The detection process starts with the “zoom-out” procedure, which calls the “zoom-in” procedure when necessary; conversely, the “zoom-in” procedure calls the “zoom-out” procedure as needed.
The “zoom-out” process: The procedure starts with an arbitrary point P and a small neighborhood radius r, and calculates the RDF of the point. There are three possible local data distributions with regard to the value of RDF, shown in Figure 1, where α is a small number and β is a large number. In our experiments, we chose α < 0.3 and β > 12, which leads to a good balance between accuracy and pruning speed.
Figure 1. Three local data distributions: (a) RDF = 1 ± α; (b) RDF ≤ 1/β; (c) RDF ≥ β.

In case (a), neither the point P nor its direct and indirect neighbors are outliers; the local neighbors are distributed uniformly. The “zoom-in” procedure is called to quickly reach points located on the cluster boundary, or outlier points.
In case (b), the point P is highly likely to be a center point of a cluster. We prune all neighbors of P while calculating the RDF of each of the indirect r-neighbors; if the RDF of such a point is larger than the threshold t, that point is inserted into the outlier set together with its RDF value.
In case (c), the RDF is large, so P is inserted into the outlier set, and we prune all the indirect neighbors of P. The dispatch over these three cases is sketched below.
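To make the case analysis concrete, here is a hedged control-flow sketch of the two-phase detection loop. It is a naive horizontal stand-in, not the P-tree implementation: the helper names, the worklist of indices, and the simplified handling of case (a) (the real method expands the radius there rather than skipping the point) are our own assumptions, and it reuses RDF() and math from the sketch in Section 3.1:

```python
ALPHA, BETA = 0.3, 12.0       # bounds chosen per the paper's experiments

def detect_outliers(X, r, t):
    """Two-phase RDF detection, naive horizontal sketch (not P-trees)."""
    live = set(range(len(X)))                  # unprocessed point indices
    outliers = {}
    while live:
        p = next(iter(live))                   # arbitrary starting point
        rdf = RDF(X, X[p], r)                  # from the Section 3.1 sketch
        direct = {i for i in live if math.dist(X[p], X[i]) <= r}
        indirect = {i for i in live
                    if any(math.dist(X[j], X[i]) <= r for j in direct)} - direct
        if abs(rdf - 1.0) <= ALPHA:            # case (a): uniform region;
            live.discard(p)                    #   the real method zooms in here
        elif rdf <= 1.0 / BETA:                # case (b): likely cluster core;
            live -= direct                     #   prune the direct neighborhood
        elif rdf >= BETA:                      # case (c): outlier
            if rdf > t:
                outliers[p] = rdf
            live -= indirect | {p}             #   prune its indirect neighbors
        else:                                  # borderline value: move on
            live.discard(p)
    return outliers
```

Note that with β = 12 a point's neighborhood must be roughly an order of magnitude sparser than its neighbors' before case (c) fires; the toy cluster from the Section 3.1 sketch is too small to trigger it.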
The “zoom-in” process: The “zoom-in” procedure is a pruning process based on neighborhood expansion. We calculate DF and observe the change in its value. First we increase the radius from r to 2r, compute DF(P, 2r) and compare it with DF(P, r). If DF(P, r) is close to DF(P, 2r), the whole 2r-neighborhood has roughly uniform density, so we keep increasing the radius (e.g., doubling or quadrupling it) until a significant change is observed. When a significant decrease of DF is observed, the cluster boundary and potential outliers have been reached, and the “zoom-out” procedure is called to detect outliers at a finer scale; Figure 2 shows this case, where all the 4r-neighbors are pruned off and the “zoom-out” procedure detects outliers over the points in the 4r-6r ring. When a significant increase of DF is observed, we pick a point with a high DF value, which is likely to be in a denser cluster, and call the “zoom-in” procedure again to prune off all points in that dense cluster.

Figure 2. The “zoom-in” process followed by “zoom-out”.

As we can see, our method runs the “zoom-out” outlier detection only over small candidate outlier sets: the boundary points and the outliers. This subset of the data as a whole is much smaller than the original dataset, which is where the performance gain of our algorithm lies.
Both the “zoom-in” and “zoom-out” procedures can be further improved by using the P-tree data structure and its optimized logical operations. The speed improvement lies in: a) P-trees allow the “zoom-in” process to be carried out on the fly using the HOBit metric; b) P-trees support very efficient neighborhood search through their logical operations; c) P-trees can serve as a self-index for the unprocessed dataset, the clustered dataset and the outlier set, so pruning is executed efficiently by P-tree logical operations.
“Zoom-in” using the HOBit metric: Given a point P, we define the neighbors of P hierarchically based on the HOBit dissimilarity between P and its neighbors, denoted ξ-neighbors. The ξ-neighbors are the neighbors within ξ bits of dissimilarity, where ξ = 1, 2, ..., 8 if P is an 8-bit value. The basic calculations in the procedure are computing DF(P, ξ) for each ξ-neighborhood and pruning the neighborhood. HOBit dissimilarity is calculated by means of P-tree ANDs. For any data point P, let P = b_{1,1} b_{1,2} ... b_{n,m}, where b_{i,j} is the j-th bit value in the i-th attribute column of P. The attribute P-tree for the ξ-HOBit neighbors of P in the i-th attribute is then defined by

Pv_{i,ξ} = P_{p_{i,1}} ∧ P_{p_{i,2}} ∧ ... ∧ P_{p_{i,m-ξ}},

and the ξ-neighborhood P-tree for P over all dimensions is calculated by

PN_{p,ξ} = Pv_{1,ξ} ∧ Pv_{2,ξ} ∧ ... ∧ Pv_{n,ξ}.
The density factor DF(P, ξ) of the ξ-neighborhood is simply the root count of PN_{p,ξ} divided by r^d, per equation (1), where r is the radius covered by ξ bits. Neighborhood pruning is accomplished by

PU = PU ∧ PN'_{p,ξ},

where PU is a P-tree representing the unprocessed points of the dataset and PN'_{p,ξ} is the complement of PN_{p,ξ}. A sketch of this masking appears below.
“Zoom-out” using inequality P-trees: In the “zoom-out” procedure, we use inequality P-trees to search for the neighborhood, upon which the RDF is calculated. The direct neighborhood P-tree of a given point P within radius r, denoted PDN_{p,r}, is the P-tree representation of its direct neighbors. It is calculated by

PDN_{p,r} = P_{x > p−r} ∧ P_{x ≤ p+r}.

The root count of PDN_{p,r} is equal to N(Nbr(P, r)). Accordingly, DF(P, r) and RDF(P, r) are calculated from equations (1) and (2) respectively.
Using P-tree AND operations, the pruning is carried out as follows, where PIN_{p,r} denotes the indirect-neighborhood P-tree: in case RDF(P, r) = 1 ± α, we prune non-outlier points by PU = PU ∧ PDN'_{p,r} ∧ PIN'_{p,r}; in case RDF < 1/β, the dataset is pruned by PU = PU ∧ PDN'_{p,r}; in case RDF > β, the dataset is pruned by PU = PU ∧ PDN'_{p,r} ∧ PIN'_{p,r}. A sketch of these masks appears below.
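The direct-neighborhood mask and the case-based pruning can be sketched with the inequality folds from Section 2 (again on the integer-bitmap stand-ins; the one-attribute restriction, the integer-radius handling, and the PIN mask for indirect neighbors are our own simplifications):

```python
def direct_neighborhood_mask(P, p, r, nbits, full):
    """PDN_{p,r} for one attribute: x > p - r AND x <= p + r, built from
    the p_geq / p_less folds of Section 2 (integer values assumed)."""
    lo = max(p - r + 1, 0)             # x > p - r   <=>   x >= p - r + 1
    return p_geq(P, lo, nbits, full) & p_less(P, p + r + 1, nbits, full)

# Case-based pruning of the unprocessed mask PU (PIN = indirect neighbors):
#   RDF = 1 +/- alpha : PU &= ~(PDN | PIN) & full   # uniform region
#   RDF <= 1/beta     : PU &= ~PDN & full           # cluster core
#   RDF >= beta       : PU &= ~(PDN | PIN) & full   # outlier found

# With the slices P for [5, 3, 7, 2] from Section 2 (nbits = 3, full = 0b1111):
# direct_neighborhood_mask(P, 3, 1, 3, 0b1111) == 0b0010   # only x = 3 is in (2, 4]
```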
4. Experimental Study
In this section, we experimentally compare our method (RDF) with two current approaches: LOF (local outlier factor) and aLOCI (approximate local correlation integral). LOF was the first approach to density-based outlier detection; aLOCI is so far the fastest approach in the density-based family. We compare the three methods in terms of run time and scalability with respect to data size, and show that our approach is efficient and highly scalable.
We ran the methods on a 1400-MHz AMD machine with 1 GB of main memory and Debian Linux version 4.0. The datasets we used are the National Hockey League 1996 (NHL 96) dataset and an NBA dataset. Due to space limitations, we only show our results on the NHL dataset in this paper; the results on the NBA dataset support the same conclusions in terms of speed and scalability.
Figure 3 shows that our method achieves an order of magnitude improvement in speed over the aLOCI method. Figure 4 shows that our method is the most scalable among the three: when the data size is large, e.g., 16384 points, our method clearly outperforms the other two.

Figure 3. Run time comparison of LOF, aLOCI and RDF (run time in seconds):

Data size | LOF     | aLOCI  | RDF
256       | 0.23    | 0.17   | 0.58
1024      | 1.92    | 1.87   | 2.1
4096      | 38.79   | 35.81  | 8.34
16384     | 103.19  | 87.34  | 37.82
65536     | 1813.43 | 985.39 | 108.91

Figure 4. Scalability comparison of LOF, aLOCI and RDF (run time versus data size).
5. Conclusion
In this paper, we propose a density-based outlier detection method built on a novel local density measurement, RDF. The method can efficiently mine outliers over large datasets and scales well with increasing data size. A vertical data representation, P-trees, is used to speed up the process further. Our method was tested over NHL and NBA datasets. Experiments show that our method achieves an order of magnitude speed improvement, with comparable accuracy, over current state-of-the-art density-based outlier detection approaches.
6. References
[1] V. Barnett and T. Lewis, Outliers in Statistical Data, John Wiley & Sons, NY, 1994.
[2] E. M. Knorr and R. T. Ng, "A Unified Notion of Outliers: Properties and Computation", Proc. 3rd International Conference on Knowledge Discovery and Data Mining (KDD), 1997, pp. 219-222.
[3] E. M. Knorr and R. T. Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets", Proc. 24th International Conference on Very Large Data Bases (VLDB), 1998.
[4] S. Ramaswamy, R. Rastogi and K. Shim, "Efficient Algorithms for Mining Outliers from Large Data Sets", Proc. 2000 ACM SIGMOD International Conference on Management of Data, 2000.
[5] M. M. Breunig, H.-P. Kriegel, R. T. Ng and J. Sander, "LOF: Identifying Density-Based Local Outliers", Proc. ACM SIGMOD 2000 International Conference on Management of Data, Dallas, TX, 2000.
[6] S. Papadimitriou, H. Kitagawa, P. B. Gibbons and C. Faloutsos, "LOCI: Fast Outlier Detection Using the Local Correlation Integral", Proc. 19th International Conference on Data Engineering (ICDE), Bangalore, India, 2003.
[7] A. K. Jain, M. N. Murty and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, 31(3):264-323, 1999.
[8] A. Arning, R. Agrawal and P. Raghavan, "A Linear Method for Deviation Detection in Large Databases", Proc. 2nd International Conference on Knowledge Discovery and Data Mining (KDD), 1996, pp. 164-169.
[9] S. Sarawagi, R. Agrawal and N. Megiddo, "Discovery-Driven Exploration of OLAP Data Cubes", Proc. EDBT, 1998.
[10] Q. Ding, M. Khan, A. Roy and W. Perrizo, "The P-tree Algebra", Proceedings of the ACM Symposium on Applied Computing (SAC), 2002.
[11] W. Perrizo, "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.
[12] F. Pan, B. Wang, Y. Zhang, D. Ren, X. Hu and W. Perrizo, "Efficient Density Clustering for Spatial Data", Proc. PKDD, 2003.